<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An Industrial Case Study on Fault Detection Effectiveness of Combinatorial Robustness Testing</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Konrad</forename><surname>Fögen</surname></persName>
							<email>foegen@swc.rwth-aachen.de</email>
						</author>
						<author>
							<persName><forename type="first">Horst</forename><surname>Lichter</surname></persName>
							<email>lichter@swc.rwth-aachen.de</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="laboratory">Research Group Software Construction</orgName>
								<orgName type="institution">RWTH Aachen University</orgName>
								<address>
									<settlement>Aachen</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">An Industrial Case Study on Fault Detection Effectiveness of Combinatorial Robustness Testing</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">609EA9F3C347CA36EA8624C2B8AA49B4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Software Testing</term>
					<term>Combinatorial Testing</term>
					<term>Robustness Testing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Combinatorial robustness testing (CRT) is an extension of combinatorial testing (CT) that separates test suites into valid and strong invalid test inputs. Until now, only one controlled experiment using artificial test scenarios was conducted to compare CRT with CT. The results indicate advantages of CRT when much exception handling is involved. However, it is unclear whether these advantages also hold in the real world. In this paper, we present the results of a case study conducted to compare the fault detection effectiveness of CRT and CT by testing an industrial system with 31 validation rules and 13 injected faults.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Robustness is an important property of software. It describes "the degree to which a system [...] can function correctly in the presence of [invalid inputs]" <ref type="bibr" target="#b0">[1]</ref>. Invalid inputs are caused by external faults, i.e. faults in other systems or mistakes made by users interacting with the system. Examples are inputs to the system under test (SUT) that contain invalid values like a string value when a numerical value is expected, or invalid value combinations like a begin date that lies after the end date. When invalid inputs remain undetected, they can propagate to failures in the SUT resulting in abnormal behavior or crashes <ref type="bibr" target="#b1">[2]</ref>.</p><p>Developers attempt to improve the robustness of systems by implementing exception handling (EH) to detect and recover from invalid inputs. Unfortunately, EH is itself a significant source of faults (cf. <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>). Therefore, it is important to test the exceptional behavior as well.</p><p>Combinatorial testing (CT) is a black-box test method that is based on an input parameter model (IPM) <ref type="bibr" target="#b4">[5]</ref>. When considering the exceptional behavior, an IPM must describe invalid values and invalid value combinations that trigger EH. Unfortunately, invalid values and invalid value combinations can cause input masking (cf. <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>). When a SUT is stimulated with an invalid input, the EH is expected to detect it, to respond with an error message, and to terminate the SUT without resuming the normal behavior. 
Consequently, the remaining values and value combinations of the test input remain untested as they are masked.</p><p>To avoid input masking, combinatorial robustness testing (CRT) is developed as an extension to CT using a robustness input parameter model (RIPM), an extension of an IPM with additional semantic information to annotate values and value combinations as invalid <ref type="bibr" target="#b6">[7]</ref>. With this semantic information, valid test inputs can be selected that do not cover any invalid value or invalid value combination. Further on, strong invalid test inputs can be selected that contain exactly one invalid value or one invalid value combination.</p><p>Due to the separation of valid and strong invalid test inputs, the input masking effect can be avoided when testing the normal behavior and the exceptional behavior. However, in comparison to CT which does not separate valid and strong invalid test inputs, CRT requires effort to model the additional semantic information.</p><p>Despite the presence of input masking, CT can still be effective in detecting faults as a previous controlled experiment indicates <ref type="bibr" target="#b7">[8]</ref>. Nevertheless, the fault detection effectiveness (FDE) of CT decreases for systems with much EH. Even for high testing strengths and large test suites, the FDE of CT deteriorates. For systems with much EH, CRT is a promising approach that can achieve a higher FDE while requiring fewer test inputs than CT <ref type="bibr" target="#b6">[7]</ref>. For systems with little EH, CRT is at least as effective as CT.</p><p>However, the current assessment is solely based on one controlled experiment with artificial test scenarios (cf. <ref type="bibr" target="#b6">[7]</ref>). Therefore, our objective is to further compare CRT with CT guided by the following two research questions. To answer these research questions, we conducted a case study. According to Kitchenham et al. 
<ref type="bibr" target="#b8">[9]</ref>, a case study helps to evaluate the benefits of methods and tools in industrial settings. When applied to compare methods and tools, a case study is of explanatory nature "seeking an explanation of a situation or a problem" <ref type="bibr" target="#b9">[10]</ref>. As Runeson &amp; Höst state, a case study "will never provide conclusions with statistical significance" <ref type="bibr" target="#b9">[10]</ref>. But it "can provide sufficient information to help you judge if specific technologies will benefit your own organization or project" <ref type="bibr" target="#b8">[9]</ref>. Since a case study has, by definition, a higher degree of realism than a controlled experiment <ref type="bibr" target="#b9">[10]</ref>, a case study that compares CRT with CT can provide additional insights that complement and extend the findings of the previously conducted controlled experiment.</p></div>
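To illustrate the input masking effect described above, here is a minimal sketch; the SUT, its parameters, and the error codes are invented for illustration only:

```python
# Hypothetical SUT with two validation checks. Exception handling stops
# at the first invalid value, so the rest of the test input is masked.
def validate(order):
    if order["amount"] < 0:
        return "ERR_AMOUNT"  # EH terminates processing here ...
    if order["begin"] > order["end"]:
        return "ERR_DATES"   # ... so this check is never reached
    return "SUCCESS"

# The invalid amount masks the invalid date combination: a fault in the
# date check could not be observed with this test input.
result = validate({"amount": -1, "begin": 5, "end": 1})  # "ERR_AMOUNT"
```

Because the date check is never reached, the date combination in this test input is effectively untested, which is exactly the situation CRT avoids by separating valid and strong invalid test inputs.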
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RQ 1</head><p>The paper is structured as follows. Section 2 introduces basic concepts of CT and CRT. Related work is discussed in Section 3. Next, the design of the case study is introduced (Section 4) and its results are presented (Section 5). Afterwards, threats to validity are discussed (Section 6) before the paper is concluded in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>In the following, CT and CRT are briefly introduced. For more information, please refer to <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b6">7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Combinatorial Testing</head><p>CT is a black-box test method <ref type="bibr" target="#b4">[5]</ref>. It is based on an input parameter model (IPM) which declares 𝑛 parameters; each parameter is associated with a non-empty set of values. A schema is a set of parameter-value pairs for 𝑑 distinct parameters <ref type="bibr" target="#b11">[12]</ref>. A schema with 𝑑 = 𝑛 parameter-value pairs is a test input. A schema 𝑎 covers another schema 𝑏 if and only if schema 𝑎 includes all parameter-value pairs of schema 𝑏.</p><p>Real-world systems are often constrained and certain values should not be combined into schemata and test inputs <ref type="bibr" target="#b4">[5]</ref>. These schemata are irrelevant because they are not of any interest for the test. Test inputs that cover irrelevant schemata are irrelevant as well and their test results have no informative value. Hence, they should be excluded from testing.</p><p>Constraint handling is often used to exclude irrelevant schemata <ref type="bibr" target="#b12">[13]</ref>. Therefore, irrelevant schemata are explicitly modeled by a set of logical expressions (called exclusion-constraints). A schema is relevant if it satisfies all exclusion-constraints. A schema is irrelevant if at least one exclusion-constraint remains unsatisfied.</p><p>A coverage criterion is a condition that must be satisfied by a test suite. A test selection strategy describes how values are combined into test inputs such that a given coverage criterion is satisfied <ref type="bibr" target="#b10">[11]</ref>. Test suites resulting from a test selection strategy that supports constraint handling, e.g. IPOG-C <ref type="bibr" target="#b12">[13]</ref>, satisfy the 𝑡-wise relevant coverage criterion. 
This criterion is satisfied if the relevant test inputs of a test suite cover all relevant schemata of degree 𝑑 = 𝑡 that are described by an IPM <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b4">5]</ref>.</p></div>
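The coverage check behind this criterion can be sketched as follows; this is a simplified illustration of the definitions above, not the IPOG-C algorithm, and the model and predicate names are invented:

```python
from itertools import combinations, product

# A schema is a set of parameter-value pairs for t distinct parameters.
def schemas(ipm, t):
    for params in combinations(sorted(ipm), t):
        for values in product(*(ipm[p] for p in params)):
            yield dict(zip(params, values))

# A test input covers a schema if it includes all its parameter-value pairs.
def covers(test_input, schema):
    return all(test_input.get(p) == v for p, v in schema.items())

# t-wise relevant coverage: every relevant schema of degree t is covered
# by at least one test input of the suite.
def satisfies_t_wise(ipm, suite, t, is_relevant=lambda s: True):
    return all(any(covers(ti, s) for ti in suite)
               for s in schemas(ipm, t) if is_relevant(s))
```

For example, for an IPM with three binary parameters, the two test inputs `{a: 0, b: 0, c: 0}` and `{a: 1, b: 1, c: 1}` satisfy 1-wise but not 2-wise coverage.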
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Combinatorial Robustness Testing</head><p>To avoid input masking, CRT is developed as an extension to CT that separates valid and invalid test inputs <ref type="bibr" target="#b6">[7]</ref>. To better separate the concepts, we say that CT relies on IPMs while CRT relies on robustness input parameter models (RIPM). A RIPM contains additional error-constraints, a further set of constraints that annotates relevant schemata as invalid. A relevant schema is also a valid schema if it satisfies all error-constraints. A relevant schema is an invalid schema if at least one error-constraint remains unsatisfied. Further on, an invalid schema is a strong invalid schema if exactly one error-constraint remains unsatisfied.</p><p>Test selection strategies like ROBUSTA <ref type="bibr" target="#b6">[7]</ref> not only consider exclusion-constraints to exclude irrelevant schemata but also consider error-constraints and exclude invalid schemata from valid test inputs. Further on, strong invalid test inputs are selected such that each invalid value and invalid value combination that is modeled by error-constraints appears in strong invalid test inputs.</p><p>Valid test inputs are selected to satisfy 𝑡-wise valid coverage. The 𝑡-wise valid coverage criterion is an extension of the 𝑡-wise relevant coverage criterion. It is satisfied if all valid schemata with a degree of 𝑑 = 𝑡 that are described by a RIPM are covered at least once by a valid test input.</p><p>Strong invalid test inputs are selected to satisfy 𝑏-wise strong invalid coverage where 𝑏 denotes the robustness interaction degree. Without robustness interaction (𝑏 = 0), the coverage criterion is called single error coverage (cf. <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b6">7]</ref>). It is satisfied if each invalid schema that is described by an error-constraint appears in a strong invalid test input. 
With robustness interaction (𝑏 ≥ 1), each described invalid schema is combined with all valid schemata of degree 𝑑 = 𝑏. The coverage criterion is satisfied if all combinations of invalid schemata and 𝑏-sized valid schemata are covered by strong invalid test inputs.</p><p>Following these brief introductions of CT and CRT, the conceptual difference between the two approaches should become clear. CT and CRT use the same parameters and values. But CT does not distinguish between valid and invalid schemata. Instead, both types of schemata are mixed and the FDE purely relies on the combinatorics, i.e. different testing strengths 𝑡. In contrast, CRT distinguishes valid and invalid schemata to avoid the effect of input masking. Here too the FDE relies on combinatorics but the avoidance of input masking has an additional influence.</p><p>8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020)</p><p>CRT requires the effort to model error-constraints. Test selection strategies that consider error-constraints also become more complex. This raises the question whether the avoidance of input masking outweighs the additional effort and complexity of CRT. Until now, only artificial test scenarios are used to compare CT with CRT (cf. <ref type="bibr" target="#b6">[7]</ref>) and it remains unclear if indicated advantages of CRT can be transferred to real-world scenarios. Therefore, this case study was conducted.</p></div>
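The classification into valid, invalid, and strong invalid test inputs can be sketched as follows (a simplified illustration of the definitions from [7]; the error-constraints shown are invented examples):

```python
# A relevant test input is valid if it satisfies all error-constraints,
# strong invalid if exactly one error-constraint is unsatisfied, and
# invalid otherwise.
def classify(test_input, error_constraints):
    violated = [c for c in error_constraints if not c(test_input)]
    if not violated:
        return "valid"
    return "strong invalid" if len(violated) == 1 else "invalid"

# Hypothetical error-constraints: one annotates a single invalid value,
# the other annotates an invalid value combination.
error_constraints = [
    lambda ti: ti["amount"] >= 0,         # a negative amount is invalid
    lambda ti: ti["begin"] <= ti["end"],  # begin after end is invalid
]
```

Because a strong invalid test input violates exactly one error-constraint, exactly one exception-handling path is triggered, which is what avoids the input masking effect.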
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Work</head><p>To the best of our knowledge, Sherwood <ref type="bibr" target="#b5">[6]</ref> first mentioned invalid values in the context of CATS, a test selection strategy and tool for CT. Cohen et al. <ref type="bibr" target="#b13">[14]</ref> and Czerwonka <ref type="bibr" target="#b14">[15]</ref> also acknowledged the necessity to separate valid and strong invalid test inputs. They also published test selection strategies and tools whose IPMs contain semantic information to distinguish relevant from irrelevant schemata and to distinguish valid from invalid values. However, invalid value combinations are not directly supported. Therefore, we proposed ROBUSTA and the structure of RIPMs with error-constraints <ref type="bibr" target="#b6">[7]</ref>.</p><p>Many studies exist that demonstrate the usefulness and effectiveness of CT (cf. <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>). But most studies do not distinguish between relevance and validness and focus on testing the normal behavior.</p><p>One case study by Wojciak &amp; Tzoref-Brill <ref type="bibr" target="#b18">[19]</ref> reports on applying CT and also considers testing with invalid inputs. They report that single error coverage was not sufficient because EH depended on interactions between invalid and valid values. In particular, "the same [exception] would often be handled differently depending on the firmware in control [...] or depending on the configuration of the system". A further remark is concerned with the ratio of valid versus invalid test inputs: "Since a lot of attention was given to [robustness] testing [...] where full recovery in the presence of [exceptions] was expected, the [test suite] contained a ratio of up to 2:1 [invalid test inputs vs. valid test inputs]."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Case Study Design</head><p>In this section, the case under analysis and the data collection procedure are introduced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Case Under Analysis</head><p>The case is a development project conducted by an IT service provider of an insurance company, in which new software was developed to manage the life-cycle of life insurance contracts. One subsystem of the software is concerned with the validation of insurance application data according to a set of validation rules and with forwarding the data when it satisfies the validation rules. It is the same project that we analyzed in a previous case study (cf. <ref type="bibr" target="#b17">[18]</ref>).</p><p>Altogether, 31 validation rules are defined to check insurance application data. The order of the validation rules is predefined and all validation rules are traversed for each insurance application. Whenever a validation rule is not satisfied by an insurance application, a corresponding error code is returned and the remaining validation rules are skipped. If all validation rules are satisfied, the subsystem returns SUCCESS and the insurance application data is further processed. However, the further processing is out of scope for this case study.</p><p>Each validation rule is built as an implication consisting of two parts:</p><formula xml:id="formula_0">isApplicable(application) ⇒ isValid(application)</formula><p>The first part determines whether a given validation rule is applicable to the insurance application data or not. If a rule is applicable, the insurance application must not violate the rule, i.e. isValid(application). Otherwise, the validation rule is ignored.</p><p>Because details of the case are confidential, a generic example is given to provide further illustration of validation rules. The example depicts two validation rules to define maximum sums that can be insured depending on the permissions of the insurance agents. The first validation rule is applicable to all applications created by insurance agents with the highest level of permission. 
The second validation rule is applicable to all applications that are created by insurance agents with a lower permission level.</p><p>The distinction between the two validation rules is made by the first part of the implication: Rule 1: isApplicable(application) ∶ application.agent.permission = highest_level; Rule 2: isApplicable(application) ∶ application.agent.permission ≠ highest_level.</p><p>The second part of the implication is used to enforce the maximum insured sum. As an application may consist of several partial contracts, the individual insured sums of all partial contracts are collected first. Afterwards, it is checked whether the total sum exceeds the threshold. While the structure of both rules' isValid() parts is the same, different values for the maximum_insured_sum constant are used:</p><formula xml:id="formula_1">isValid(application) ∶ total_sum = ∑ partial_contract.insured_sum ∧ total_sum ≤ maximum_insured_sum</formula><p>This example shows that many parameters may be involved in a validation rule, that intermediate calculations may be required, and that intermediate calculations may be reused in different validation rules. Therefore, all validation rules should be tested thoroughly.</p><p>For this case study, we consider the current set of validation rules as correct and treat them as our specification. By browsing the source code repository, we have identified 13 changes that have been made to the validation rules in order to correct them. Each change documents a fault that existed previously but was fixed prior to release. Based on these 13 changes, we reconstructed 13 implementation versions, each of which contains one fault.</p><p>The 13 faults can also be classified according to our robustness fault classification (cf. <ref type="bibr" target="#b6">[7]</ref>). Five faults can only be detected by invalid test inputs, while eight faults can be detected by both valid and invalid test inputs. 
Two of the five faults that can only be detected by invalid test inputs can be classified as faults in error-signaling. To reveal them, invalid test inputs must trigger the EH, which then responds with an incorrect error code. The other three faults can be classified as faults in error-detection conditions. The conditions are too weak and do not detect invalid test inputs. Hence, the SUT incorrectly continues with its normal behavior.</p><p>The remaining eight faults can be detected by both valid and invalid test inputs. They are faults in error-detection conditions. Four of these faults have conditions that are too strong and therefore incorrectly detect exception occurrences for valid test inputs. The other four faults have characteristics of being too weak and too strong at the same time because wrong parameters with similar characteristics are used in the exception condition. As a consequence, an invalid test input may not violate the condition (too weak) while a valid test input may not satisfy the condition (too strong).</p></div>
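The traversal of the validation rules described above can be sketched as follows; the rule bodies follow the generic maximum-insured-sum example, with invented thresholds, field names, and error codes:

```python
# Hypothetical thresholds for the two permission levels.
HIGHEST_LEVEL = "highest"
MAX_SUM = {"highest": 1_000_000, "other": 250_000}

def rule1(app):  # applicable to agents with the highest permission level
    if app["agent_permission"] == HIGHEST_LEVEL:
        return sum(app["partial_sums"]) <= MAX_SUM["highest"]
    return True  # not applicable -> rule is ignored

def rule2(app):  # applicable to agents with a lower permission level
    if app["agent_permission"] != HIGHEST_LEVEL:
        return sum(app["partial_sums"]) <= MAX_SUM["other"]
    return True

# Rules are checked in a fixed order; the first applicable-but-violated
# rule returns its error code and the remaining rules are skipped.
RULES = [("ERR_SUM_HIGHEST", rule1), ("ERR_SUM_OTHER", rule2)]

def validate(app):
    for error_code, rule in RULES:
        if not rule(app):
            return error_code
    return "SUCCESS"
```

Note how the early return mirrors the behavior of the subsystem: once one rule fails, all subsequent rules remain unexercised by that test input.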
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Data Collection Procedure</head><p>Data collection refers to the measurement and calculation of metric values from test execution. Therefore, metrics are defined in this section. Furthermore, the modeling of the IPM and RIPM as well as the selection and execution of test inputs is described.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Metrics</head><p>The resources available from the software development project are not directly analyzed and compared. Instead, they are used to reconstruct the implementation versions for test execution and to create a RIPM and an IPM that represent variations of insurance application data.</p><p>Based on the RIPM and IPM, test inputs are selected using a CT and a CRT test selection strategy. Then, the test inputs are executed on the 13 reconstructed implementations to assess the effectiveness.</p><p>A common metric to assess the effectiveness is fault detection effectiveness (FDE) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b15">16]</ref>. In this case study, the FDE of a fault is the fraction of the 20 selected test suites that detect it, and the average fault detection effectiveness (AFDE) is the mean of the FDE values over all 13 faults.</p></div>
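Under this reading of the metrics, the computation is a simple sketch:

```python
# FDE of one fault: fraction of executed test suites that detect it.
def fde(detections):  # detections: one bool per executed test suite
    return sum(detections) / len(detections)

# AFDE: mean of the FDE values over all considered faults.
def afde(fde_values):
    return sum(fde_values) / len(fde_values)

# A fault detected by one out of 20 test suites has an FDE of 0.05.
single_detection = fde([True] + [False] * 19)
```

A fault with an FDE of 1.00 is detected by every test suite and is therefore considered reliably detected.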
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Modeling of IPM and RIPM</head><p>Since the FDE and AFDE metrics highly depend on the quality of the RIPM and IPM, a systematic modeling approach is necessary. We model the IPM first and later extend it with error-constraints to get a RIPM. The IPM is modeled iteratively for one validation rule at a time. In each iteration, parameters and values are added to ensure that test inputs with the following three characteristics can be selected: (1) test inputs that are not applicable; (2) test inputs that are applicable and valid; (3) test inputs that are applicable but not valid. In addition, some exclusion-constraints are introduced to ensure syntactic correctness of selected test inputs. The IPM is considered complete once it contains all parameters and values necessary to satisfy branch coverage of each validation rule.</p><p>For the RIPM, the modeling of additional error-constraints is required. The error-constraints are modeled iteratively and we add new or update existing ones until the separation of valid and strong invalid test inputs conforms to the responses of the SUT, i.e. the SUT returns SUCCESS for each valid test input and the SUT returns an error code for each strong invalid test input.</p><p>In total, the IPM and RIPM consist of 32 parameters and 106 values. Most parameters have two, three, or four values each. But two parameters have six values each and one parameter even has nine values. Three exclusion-constraints, each of which restricts combinations of two parameters, are required to ensure syntactical correctness of the insurance applications. Furthermore, the RIPM contains 31 error-constraints. 15 error-constraints annotate single values as invalid. 
The remaining 16 error-constraints annotate schemata with 2, 3, or 5 values.</p><p>The complete IPM and RIPM are described below in exponential notation. For parameters and values, 𝑥^𝑦 refers to 𝑦 parameters with 𝑥 values each. For exclusion- and error-constraints, 𝑥^𝑦 refers to 𝑦 constraints over 𝑥 parameters.</p><p>Parameters &amp; Values:</p><formula xml:id="formula_2">9^1 6^2 5^1 4^8 3^8 2^12</formula><p>Exclusion-Constraints: 2^3. Error-Constraints: 5^2 3^6 2^8 1^15.</p></div>
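As a sanity check of the exponential notation, the declared shape reproduces the reported totals of 32 parameters and 106 values:

```python
# {number_of_values: number_of_parameters}, from 9^1 6^2 5^1 4^8 3^8 2^12.
ipm_shape = {9: 1, 6: 2, 5: 1, 4: 8, 3: 8, 2: 12}
num_parameters = sum(ipm_shape.values())               # 32 parameters
num_values = sum(x * y for x, y in ipm_shape.items())  # 106 values
```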
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3.">Selecting and Executing Test Inputs</head><p>After creating the IPM and RIPM, both models are used to select sets of test inputs. Since we compare CRT with CT, two different test selection strategies are used. ROBUSTA is used to select test inputs for the RIPM and IPOG-C is used to select test inputs for the IPM.</p><p>To compare the FDE and AFDE of CRT with CT, test suites that satisfy different coverage criteria are used. We apply IPOG-C to select test suites that satisfy 𝑡-wise relevant coverage for 𝑡 ∈ {1, ..., 5}. Furthermore, we apply ROBUSTA to select test suites that satisfy 𝑡-wise valid coverage with 𝑡 ∈ {1, ..., 3} and that satisfy 𝑏-wise strong invalid coverage with 𝑏 ∈ {0, 1}.</p><p>To reduce the effect of accidental fault detection caused by ordering, the parameters and values of the input parameter models are randomly reordered and 20 different model variants are used to select test suites for each coverage criterion.</p><p>Table <ref type="table" target="#tab_3">1</ref> depicts the average sizes of test suites that satisfy the different coverage criteria. Since ROBUSTA encompasses two coverage criteria (𝑡-wise valid coverage and 𝑏-wise strong invalid coverage), the test suites are considered both separately and combined.</p><p>The largest test suite is selected by IPOG-C, which is required to satisfy 𝑡-wise relevant coverage with 𝑡 = 5 (15023.70 test inputs). The second-largest test suite is also selected by IPOG-C to satisfy 𝑡-wise relevant coverage with 𝑡 = 4 (2813.45 test inputs). 
The third-largest test suite is selected by ROBUSTA and satisfies 𝑡-wise valid coverage with 𝑡 = 3 and 𝑏-wise strong invalid coverage with 𝑏 = 1 (2224.30 test inputs).</p><p>When comparing the test suite sizes of 𝑡-wise relevant coverage of IPOG-C with 𝑡-wise valid coverage of ROBUSTA, it can be seen that the error-constraints drastically reduce the number of valid test inputs.</p><p>After test input selection, the test suites are used to stimulate the SUT in 13 different versions. Therefore, the 13 reconstructed implementations, each of which contains one fault, are stimulated with the selected test suites.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results &amp; Discussion</head><p>In this section, the case study results regarding the computed FDE and AFDE values are reported and discussed. As the FDE values show, 𝑡-wise relevant coverage is not able to detect all faults reliably. The FDE values increase as the testing strength 𝑡 grows. But even with 𝑡 = 5 (15023.70 test inputs), only 7 faults are detected reliably (FDE value of 1.00). Further on, fault no. 10 remains undetected (FDE value of 0) and faults nos. 9 and 13 are only detected by one out of 20 test suites (FDE value of 0.05).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Fault Detection Effectiveness</head><p>The CRT coverage criteria are characterized by avoiding the invalid input masking effect. Since all invalid schemata are excluded by 𝑡-wise valid coverage, the faults nos. 9 to 13 cannot be detected. But for all other faults, 𝑡-wise valid coverage has higher FDE values for the same testing strength 𝑡 when compared to 𝑡-wise relevant coverage. Because invalid input masking is avoided, a testing strength of 𝑡 = 2 is sufficient to detect faults nos. 1 to 8 reliably (FDE values of 1.00).</p><p>Using 𝑏-wise strong invalid coverage with 𝑏 = 0, 11 out of 13 faults can already be detected reliably and the two remaining faults have high FDE values of 0.90 and 0.80. The effectiveness of robustness interactions is even higher and all faults can be detected reliably with 𝑏 = 1.</p><p>Four faults that have too strong error detection conditions and that actually require valid test inputs to be detected are also reliably detected by 𝑏-wise strong invalid coverage. We could observe that a strong invalid test input that is expected to violate the error detection condition of the 𝑙-th validation rule is also expected to satisfy all prior validation rules from 1 to 𝑙 − 1. Therefore, strong invalid test inputs can be considered as "partially valid" test inputs that are able to accidentally detect faults that require valid test inputs. This effect is strengthened by robustness interactions because more test inputs are selected and more interactions are covered by them.</p><p>ROBUSTA combines 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage and the FDE values show that test suites for both coverage criteria complement each other. Since valid and strong invalid test inputs are able to detect faults nos. 1 to 8, the FDE values are complemented by the combination of both test suites. 
For faults nos. 9 to 13, the FDE values are not complemented by the combination of both test suites. This is because test suites that only satisfy 𝑡-wise valid coverage cannot detect these faults. Therefore, the FDE values of the combined test suites are the same as the FDE values of the test suites that satisfy 𝑏-wise strong invalid coverage.</p><p>In order to detect all faults reliably, test suites that satisfy 𝑏-wise strong invalid coverage must be selected because faults nos. 9 to 13 remain undetected otherwise. Either robustness interaction (𝑏 &gt; 0) or the combination of 𝑏-wise strong invalid coverage with 𝑡-wise valid coverage is required to reliably detect faults nos. 1 to 8. Even though 𝑡 = 1 is only sufficient to detect three of the first eight faults reliably, the combination with 𝑏-wise strong invalid coverage improves the FDE and all faults can be detected reliably.</p><p>The discussion of the FDE shows which coverage criteria are appropriate to reliably detect different types of faults. Next, we discuss the AFDE over all 13 faults.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Average Fault Detection Effectiveness</head><p>Because AFDE values are average values over a set of faults, AFDE allows making general statements about both the effectiveness and the efficiency of coverage criteria. First, we discuss the effectiveness in terms of AFDE values of different coverage criteria. Test suites that satisfy 𝑡-wise relevant coverage reach a maximum AFDE value of 0.62 (with 𝑡 = 5). In direct comparison, test suites that satisfy 𝑡-wise valid coverage reach a maximum AFDE value of 0.62 as well. The same AFDE value can be reached because they prevent invalid input masking. However, the AFDE value cannot be further improved by increasing the testing strength because faults nos. 1 to 8 are already detected reliably and faults nos. 9 to 13 cannot be detected by valid test inputs. Comparing the two coverage criteria for each testing strength individually shows that the AFDE value of 𝑡-wise valid coverage is always higher than the AFDE value of 𝑡-wise relevant coverage.</p><p>For 𝑏-wise strong invalid coverage, the lowest AFDE value is 0.98 (no robustness interactions) which is always higher than the AFDE values of 𝑡-wise relevant and valid coverage. Furthermore, 𝑏-wise strong invalid coverage with robustness interactions has an AFDE value of 1 and therefore detects all faults reliably.</p><p>Overall, the combination of 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage performs the best and always detects all faults reliably.</p><p>When putting the AFDE values in relation to test suite sizes, it can be noted that 𝑡-wise relevant coverage has the worst efficiency as it requires 15023.70 test inputs for an AFDE value of 0.62. 
In contrast, 𝑡-wise valid coverage requires only 48.30 test inputs for an AFDE value of 0.62.</p><p>The best efficiency is offered by the combination of 𝑡-wise valid coverage with 𝑡 = 1 and 𝑏-wise strong invalid coverage with 𝑏 = 0, which requires 308.00 test inputs for an AFDE value of 1.00. When using an AFDE value of 0.92 as a lower boundary (12 out of 13 faults), 𝑏-wise strong invalid coverage with 𝑏 = 0 is sufficient and requires only 301.00 test inputs for an AFDE value of 0.98.</p><p>This discussion of efficiency is, of course, influenced by the characteristics of the 13 faults and cannot be generalized. As a more general observation, 𝑡-wise relevant coverage requires more test inputs to reach an AFDE value similar to that of 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, or the combination of both. At the same time, the combination of 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage always has an AFDE value of 1.00 while using at most 2224.30 test inputs. This finding is also consistent with our prior experimental evaluation (cf. <ref type="bibr" target="#b6">[7]</ref>).</p><p>Therefore, we draw the conclusion that 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, and the combination of both perform as well as or better than 𝑡-wise relevant coverage in terms of effectiveness and efficiency. However, these findings are derived from only one particular case. Therefore, we do not consider this to hold for all SUTs, but only for SUTs with many validation rules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Threats to Validity</head><p>We compare the effectiveness of CRT using an implementation of the ROBUSTA test selection strategy with CT using an implementation of the IPOG-C test selection strategy. To ensure unbiased implementations, both follow the guidelines of Kleine &amp; Simos <ref type="bibr" target="#b19">[20]</ref>. Furthermore, the source code of the test selection strategies is published as part of the coffee4j open-source test automation framework<ref type="foot" target="#foot_0">1</ref>.</p><p>The effectiveness of CRT and CT highly depends on the IPM and RIPM. Furthermore, the effectiveness depends on the faults that are considered in this case study.</p><p>Unfortunately, details of the case, i.e. the source code of the validation rules and detailed descriptions of the faults, are confidential. To improve transparency and reproducibility, we describe the faults and make the characteristics of the IPM and RIPM explicit.</p><p>To avoid any bias, both the IPM and the RIPM are modeled systematically and share the same set of parameters and values. To prevent falsified results due to accidental fault triggering, the orders of parameters and values are randomized and 20 different variants are used in test input selection. All presented FDE values are average values.</p><p>Since this is a case study with only one case, it is difficult to generalize the findings <ref type="bibr" target="#b9">[10]</ref>. Furthermore, it has to be noted that the archival data of this case study is only a snapshot and the ground truth, i.e. the existing and previously existing faults, is unknown. Hence, the data can be biased towards simpler faults that are easier to detect. To prevent too far-reaching conclusions, we describe the characteristics of the SUT and limit our conclusions to similar systems with many validation rules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>CRT extends CT to generate separate test suites with valid and strong invalid test inputs in order to avoid input masking caused by EH. Therefore, CRT requires additional effort to model error-constraints and introduces additional complexity to test selection strategies because error-constraints must be considered. This raises the question of the usefulness of CRT and whether the avoidance of input masking outweighs the additional effort and complexity. Until now, only artificial test scenarios were used to compare CT with CRT, and it remained unclear whether the indicated advantages of CRT transfer to real-world scenarios.</p><p>In this paper, we therefore present the results of a case study based on a real-world system with 31 validation rules and 13 previously existing faults. To compare CT with CRT, we construct an IPM and a RIPM, select test inputs, and stimulate 13 implementations of the real-world system, each of which contains one of the 13 previously existing faults. For the subsequent discussion, we introduce the FDE and AFDE metrics.</p><p>To summarize the findings of this case study, we discuss both research questions individually.</p><p>Research Question 1: Our results indicate that the CRT test method is applicable in real-world test scenarios. This case study demonstrated that RIPMs with 32 parameters and 31 error-constraints can be constructed. Furthermore, the ROBUSTA test selection strategy is capable of selecting test suites for RIPMs with 32 parameters and 31 error-constraints.</p><p>Research Question 2: The comparison of CRT with CT is consistent with the findings of our previously conducted controlled experiment with artificial test scenarios (cf. <ref type="bibr" target="#b6">[7]</ref>). Since the case under analysis contains much EH, CRT performs better than CT in terms of FDE. 
Furthermore, it requires fewer test inputs to achieve better AFDE values than CT.</p><p>Therefore, we draw the conclusion that 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, and the combination of both perform as well as or better than 𝑡-wise relevant coverage in terms of effectiveness and efficiency.</p><p>However, the FDE and AFDE values are influenced by the characteristics of the 13 faults and cannot be generalized. Therefore, we do not consider this to hold for all SUTs, but only for SUTs with much EH.</p><p>In future work, we plan to conduct further case studies to learn more about the FDE of CRT and CT.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>A test suite 𝑇 is denoted as failing for a test scenario 𝑆𝐶 if at least one of the test inputs 𝜏 ∈ 𝑇 detects the fault in 𝑆𝐶: failing(𝑇 , 𝑆𝐶) = 1 if ∃𝜏 ∈ 𝑇 that fails for 𝑆𝐶, and 0 otherwise. Using the failing function, the FDE is defined as the ratio between the number of test suites 𝑇 of a test suite family 𝑇 * that fail for a test scenario 𝑆𝐶 and the number of all test suites in the family 𝑇 * . In this case study, each family of test suites contains 20 different variants. In other words, the FDE is based on 20 randomized test suites that all satisfy the same coverage criterion for the same IPM or RIPM and all test the same test scenario. Furthermore, the average fault detection effectiveness (AFDE) denotes the average FDE over a family of test scenarios 𝑆𝐶 * . In our case study, the family of test scenarios 𝑆𝐶 * consists of the 13 reconstructed implementations. The AFDE represents the average effectiveness of CRT and CT equally distributed over the 13 faults.</figDesc><table><row><cell>FDE(𝑇 * , 𝑆𝐶) =</cell><cell>∑ 𝑇 ∈𝑇 * failing(𝑇 , 𝑆𝐶) / |𝑇 * |</cell></row></table><note>AFDE(𝑇 * , 𝑆𝐶 * ) = ∑ 𝑆𝐶∈𝑆𝐶 * FDE(𝑇 * , 𝑆𝐶) / |𝑆𝐶 * |</note></figure>
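The failing, FDE, and AFDE definitions above can be sketched in a few lines of Python (an editor's illustrative sketch, not part of the paper's artifact; representing test suites as lists of test inputs and scenarios as fault-detection predicates is an assumption):

```python
def failing(test_suite, scenario):
    """1 if at least one test input in the suite detects the fault in the scenario, else 0."""
    # scenario is a predicate: scenario(tau) is True if test input tau triggers the fault.
    return 1 if any(scenario(tau) for tau in test_suite) else 0

def fde(suite_family, scenario):
    """Fault detection effectiveness: fraction of failing test suites in the family."""
    return sum(failing(suite, scenario) for suite in suite_family) / len(suite_family)

def afde(suite_family, scenario_family):
    """Average FDE over a family of test scenarios (here: faulty implementations)."""
    return sum(fde(suite_family, sc) for sc in scenario_family) / len(scenario_family)
```

In the case study's setting, the suite family would hold the 20 randomized test suites per coverage criterion, and the scenario family the 13 reconstructed faulty implementations.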
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 1</head><label>1</label><figDesc>Test suite sizes of test suites for different coverage criteria</figDesc><table><row><cell>Coverage Criteria</cell><cell>t</cell><cell>b</cell><cell>Size</cell></row><row><cell>𝑡-wise relevant</cell><cell>1</cell><cell>-</cell><cell>9.00</cell></row><row><cell>coverage</cell><cell>2</cell><cell>-</cell><cell>68.10</cell></row><row><cell></cell><cell>3</cell><cell>-</cell><cell>480.10</cell></row><row><cell></cell><cell>4</cell><cell>-</cell><cell>2813.45</cell></row><row><cell></cell><cell>5</cell><cell>-</cell><cell>15023.70</cell></row><row><cell>𝑡-wise valid coverage</cell><cell>1</cell><cell>-</cell><cell>7.00</cell></row><row><cell></cell><cell>2</cell><cell>-</cell><cell>48.30</cell></row><row><cell></cell><cell>3</cell><cell>-</cell><cell>267.95</cell></row><row><cell>𝑏-wise strong</cell><cell>-</cell><cell>0</cell><cell>301.00</cell></row><row><cell>invalid coverage</cell><cell>-</cell><cell>1</cell><cell>1956.35</cell></row><row><cell>𝑡-wise valid coverage</cell><cell>1</cell><cell>0</cell><cell>308.00</cell></row><row><cell>and 𝑏-wise strong</cell><cell>1</cell><cell>1</cell><cell>1963.35</cell></row><row><cell>invalid coverage</cell><cell>2</cell><cell>0</cell><cell>349.30</cell></row><row><cell></cell><cell>2</cell><cell>1</cell><cell>2004.65</cell></row><row><cell></cell><cell>3</cell><cell>0</cell><cell>568.95</cell></row><row><cell></cell><cell>3</cell><cell>1</cell><cell>2224.30</cell></row></table><note>one fault are tested to determine which test suite is able to detect which fault. The results are discussed in the following section.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 2</head><label>2</label><figDesc>Table 2 lists the FDE values of all test suite families applied to all 13 implementations. For better readability, + is used to indicate an FDE value of 1.00. Faults nos. 1 to 8 can all be detected by both valid and invalid test inputs, while faults nos. 9 to 13 can only be detected by invalid test inputs. Again, the shown FDE value is an average value for one test suite family with 20 different test suites that are created by randomizing the order of parameters and values before selecting test inputs. As an example, in the first row for fault no. 3, an FDE value of 0.05 means that one out of 20 test suites detected the fault at least once per test suite.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 2</head><label>2</label><figDesc>FDE values for different coverage criteria</figDesc><table><row><cell>Coverage</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="5">FDE values for faults nos. 1 to 13</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>AFDE</cell></row><row><cell>Criteria</cell><cell>t</cell><cell>b</cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell><cell>5</cell><cell>6</cell><cell>7</cell><cell>8</cell><cell>9</cell><cell>10</cell><cell>11</cell><cell>12</cell><cell>13</cell><cell>values</cell></row><row><cell>𝑡-wise relevant</cell><cell>1</cell><cell>-</cell><cell>0</cell><cell>0</cell><cell cols="2">0.05 0.05</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell cols="2">0.25 0.05</cell><cell>0</cell><cell>0.03</cell></row><row><cell>coverage</cell><cell>2</cell><cell>-</cell><cell cols="5">0.10 0.10 0.45 0.20 0.10</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell cols="2">0.65 0.20</cell><cell>0</cell><cell>0.14</cell></row><row><cell></cell><cell>3</cell><cell>-</cell><cell cols="2">0.75 0.75</cell><cell>+</cell><cell>+</cell><cell cols="4">0.65 0.05 0.10 0.05</cell><cell>0.05</cell><cell>0</cell><cell>+</cell><cell>0.65</cell><cell>0</cell><cell>0.47</cell></row><row><cell></cell><cell>4</cell><cell>-</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell cols="3">0.15 0.10 0.05</cell><cell>0</cell><cell>0</cell><cell>+</cell><cell>+</cell><cell>0</cell><cell>0.56</cell></row><row><cell></cell><cell>5</cell><cell>-</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell cols="3">0.50 0.35 0.15</cell><cell>0.05</cell><cell>0</cell><cell>+</cell><cell>+</cell><cell>0.05</cell><cell>0.62</cell></row><row><cell>𝑡-wise valid</cell><cell>1</cell><cell>-</cell><cell 
cols="2">0.75 0.75</cell><cell>+</cell><cell>+</cell><cell cols="2">0.50 0.50</cell><cell>+</cell><cell>0.80</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0.48</cell></row><row><cell>coverage</cell><cell>2</cell><cell>-</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0.62</cell></row><row><cell></cell><cell>3</cell><cell>-</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0.62</cell></row><row><cell>b-wise strong</cell><cell>-</cell><cell>0</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell cols="2">0.90 0.80</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>0.98</cell></row><row><cell>invalid</cell><cell>-</cell><cell>1</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row><row><cell>𝑡-wise valid</cell><cell>1</cell><cell>0</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row><row><cell>coverage 
and</cell><cell>1</cell><cell>1</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row><row><cell>b-wise</cell><cell>2</cell><cell>0</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row><row><cell>strong invalid</cell><cell>2</cell><cell>1</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row><row><cell>coverage</cell><cell>3</cell><cell>0</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row><row><cell></cell><cell>3</cell><cell>1</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell><cell>+</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 2</head><label>2</label><figDesc>Table 2 lists the AFDE values for test suites that satisfy different coverage criteria. Afterwards, we discuss the efficiency in terms of AFDE values in relation to test suite sizes (listed in Table 1). The AFDE values reflect what we discussed before since they aggregate FDE values. Because of the invalid input masking effect, test suites that satisfy 𝑡-wise relevant coverage only reach an AFDE value of 0.62.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">See https://coffee4j.github.io for more information.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title/>
	</analytic>
	<monogr>
		<title level="j">IEEE Standard Glossary of Software Engineering Terminology</title>
		<imprint>
			<biblScope unit="volume">610</biblScope>
			<biblScope unit="page" from="12" to="1990" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
	<note>IEEE Std 610.12-1990</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Basic concepts and taxonomy of dependable and secure computing</title>
		<author>
			<persName><forename type="first">A</forename><surname>Avižienis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Laprie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Randell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Landwehr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Dependable Sec. Comput</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="11" to="33" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Are the classes that use exceptions defect prone?</title>
		<author>
			<persName><forename type="first">C</forename><surname>Marinescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Workshop on Principles of Software Evolution and the 7th annual ERCIM Workshop on Software Evolution, EVOL/IWPSE 2011</title>
				<meeting>the 12th International Workshop on Principles of Software Evolution and the 7th annual ERCIM Workshop on Software Evolution, EVOL/IWPSE 2011<address><addrLine>Szeged, Hungary</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">September 5-6, 2011</date>
			<biblScope unit="page" from="56" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Exception handling defects: An empirical study</title>
		<author>
			<persName><forename type="first">P</forename><surname>Sawadpong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">B</forename><surname>Allen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Williams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 14th International Symposium on High-Assurance Systems Engineering</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="90" to="97" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A survey of combinatorial testing</title>
		<author>
			<persName><forename type="first">C</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Leung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page">29</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Effective testing of factor combinations</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">B</forename><surname>Sherwood</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third International Conference on Software Testing, Analysis and Review</title>
				<meeting>the Third International Conference on Software Testing, Analysis and Review<address><addrLine>Washington, DC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="151" to="166" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Combinatorial robustness testing with negative test cases</title>
		<author>
			<persName><forename type="first">K</forename><surname>Fögen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lichter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th IEEE International Conference on Software Quality, Reliability and Security, QRS 2019</title>
				<meeting>the 19th IEEE International Conference on Software Quality, Reliability and Security, QRS 2019<address><addrLine>Sofia, Bulgaria</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">July 22-26, 2019</date>
			<biblScope unit="page" from="34" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An experiment to compare combinatorial testing in the presence of invalid values</title>
		<author>
			<persName><forename type="first">K</forename><surname>Fögen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lichter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th International Workshop on Quantitative Approaches to Software Quality co-located with 26th Asia-Pacific Software Engineering Conference (APSEC 2019)</title>
				<meeting>the 7th International Workshop on Quantitative Approaches to Software Quality co-located with 26th Asia-Pacific Software Engineering Conference (APSEC 2019)<address><addrLine>Putrajaya, Malaysia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-12-02">December 2, 2019</date>
			<biblScope unit="page" from="27" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Case studies for method and tool evaluation</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">A</forename><surname>Kitchenham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pickard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Pfleeger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Softw</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="52" to="62" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Guidelines for conducting and reporting case study research in software engineering</title>
		<author>
			<persName><forename type="first">P</forename><surname>Runeson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Höst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Empirical Software Engineering</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="131" to="164" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Combination testing strategies: a survey</title>
		<author>
			<persName><forename type="first">M</forename><surname>Grindal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Offutt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F</forename><surname>Andler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Softw. Test., Verif. Reliab</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="167" to="199" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The minimal failure-causing schema of combinatorial testing</title>
		<author>
			<persName><forename type="first">C</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Leung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Softw. Eng. Methodol</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An efficient algorithm for constraint handling in combinatorial test generation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Borazjany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kacker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Kuhn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013</title>
				<meeting><address><addrLine>Luxembourg, Luxembourg</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">March 18-22, 2013</date>
			<biblScope unit="page" from="242" to="251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The AETG system: An approach to testing based on combinatiorial design</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Dalal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Fredman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">C</forename><surname>Patton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Software Eng</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="437" to="444" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Pairwise testing in real world</title>
		<author>
			<persName><forename type="first">J</forename><surname>Czerwonka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">24th Pacific Northwest Software Quality Conference</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">200</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Practical combinatorial interaction testing: Empirical findings on efficiency and early fault detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Petke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Harman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yoo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Software Eng</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="901" to="924" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">An empirical comparison of combinatorial testing, random testing and adaptive random testing</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Changhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Petke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Harman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Software Engineering</title>
		<imprint>
			<biblScope unit="page" from="1" to="1" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A case study on robustness fault characteristics for combinatorial testing -results and challenges</title>
		<author>
			<persName><forename type="first">K</forename><surname>Fögen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lichter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Workshop on Quantitative Approaches to Software Quality co-located with 25th Asia-Pacific Software Engineering Conference (APSEC 2018)</title>
				<meeting>the 6th International Workshop on Quantitative Approaches to Software Quality co-located with 25th Asia-Pacific Software Engineering Conference (APSEC 2018)<address><addrLine>Nara, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-12-04">December 4, 2018</date>
			<biblScope unit="page" from="22" to="29" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">System level combinatorial testing in practice -the concurrent maintenance case study</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wojciak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tzoref-Brill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014</title>
				<meeting><address><addrLine>Cleveland, Ohio, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014-03-31">March 31 - April 4, 2014</date>
			<biblScope unit="page" from="103" to="112" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">An efficient design and implementation of the in-parameter-order algorithm</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kleine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">E</forename><surname>Simos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematics in Computer Science</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="51" to="67" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m">International Workshop on Quantitative Approaches to Software Quality</title>
				<meeting><address><addrLine>QuASoQ</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
