8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020)

An Industrial Case Study on Fault Detection Effectiveness of Combinatorial Robustness Testing

Konrad Fögen, Horst Lichter
Research Group Software Construction, RWTH Aachen University, Aachen, Germany
https://www.swc.rwth-aachen.de
foegen@swc.rwth-aachen.de (K. Fögen), lichter@swc.rwth-aachen.de (H. Lichter), ORCID 0000-0002-3440-1238 (H. Lichter)

QuASoQ 2020: 8th International Workshop on Quantitative Approaches to Software Quality, December 01, 2020, Singapore. © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract

Combinatorial robustness testing (CRT) is an extension of combinatorial testing (CT) that separates test suites with valid and strong invalid test inputs. Until now, only one controlled experiment using artificial test scenarios was conducted to compare CRT with CT. Its results indicate advantages of CRT when much exception handling is involved, but it is unclear whether these advantages also hold in the real world. In this paper, we present the results of a case study conducted to compare the fault detection effectiveness of CRT and CT by testing an industrial system with 31 validation rules and 13 injected faults.

Keywords

Software Testing, Combinatorial Testing, Robustness Testing

1. Introduction

Robustness is an important property of software. It describes "the degree to which a system [...] can function correctly in the presence of [invalid inputs]" [1]. Invalid inputs are caused by external faults, i.e. faults in other systems or faults made by users interacting with a system. Examples are inputs to the system under test (SUT) that contain invalid values, like a string value when a numerical value is expected, or invalid value combinations, like a begin date which is after the end date. When invalid inputs remain undetected, they can propagate to failures in the SUT, resulting in abnormal behavior or crashes [2].

Developers attempt to improve the robustness of systems by implementing exception handling (EH) to detect and recover from invalid inputs. Unfortunately, EH is itself a significant source of faults (cf. [3, 4]). Therefore, it is important to test the exceptional behavior as well.

Combinatorial testing (CT) is a black-box test method that is based on an input parameter model (IPM) [5]. When considering the exceptional behavior, an IPM must describe invalid values and invalid value combinations that trigger EH. Unfortunately, invalid values and invalid value combinations can cause input masking (cf. [6, 7, 8]). When a SUT is stimulated with an invalid input, the EH is expected to detect it, to respond with an error message, and to terminate the SUT without resuming the normal behavior. Consequently, the remaining values and value combinations of the test input remain untested as they are masked.

To avoid input masking, combinatorial robustness testing (CRT) was developed as an extension to CT. It uses a robustness input parameter model (RIPM), an extension of an IPM with additional semantic information that annotates values and value combinations as invalid [7]. With this semantic information, valid test inputs can be selected which do not cover any invalid value or invalid value combination. Further on, strong invalid test inputs can be selected which contain exactly one invalid value or one invalid value combination.

Due to the separation of valid and strong invalid test inputs, the input masking effect can be avoided when testing the normal behavior and the exceptional behavior. However, in comparison to CT, which does not separate valid and strong invalid test inputs, CRT requires effort to model the additional semantic information.

Despite the presence of input masking, CT can still be effective in detecting faults, as a previous controlled experiment indicates [8]. Nevertheless, the fault detection effectiveness (FDE) of CT decreases for systems with much EH. Even for high testing strengths and large test suites, the FDE of CT deteriorates. For systems with much EH, CRT is a promising approach that can achieve a higher FDE while requiring fewer test inputs than CT [7]. For systems with little EH, CRT is at least as effective as CT.

However, the current assessment is solely based on one controlled experiment with artificial test scenarios (cf. [7]). Therefore, our objective is to further compare CRT with CT, guided by the following two research questions.

RQ 1: Is the CRT test method applicable in real-world test scenarios?

RQ 2: How does the CRT test method compare with CT in real-world test scenarios?
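The input masking effect described above can be illustrated with a small sketch. The SUT, its parameter names, and its error messages are hypothetical, not taken from the studied system: once the first invalid value triggers the exception handling, the remaining values of the same test input are never exercised.

```python
def sut(amount, currency, date):
    """Toy SUT: exception handling terminates processing on the
    first invalid value, so later parameters go unchecked."""
    checked = []
    if amount < 0:
        return "ERROR: invalid amount"   # EH fires here and masks the rest
    checked.append("amount")
    if currency not in {"EUR", "USD"}:
        return "ERROR: invalid currency"
    checked.append("currency")
    checked.append("date")               # normal behavior reached
    return f"OK, checked: {checked}"

# A test input with two invalid values: the invalid currency is masked
# because the invalid amount already triggered the exception handling.
print(sut(-5, "XXX", "2020-12-01"))  # only the amount is ever inspected
```

Any value combination involving the currency or date of such a test input therefore contributes nothing to coverage of the actually executed behavior.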
To answer these research questions, we conducted a case study. According to Kitchenham et al. [9], a case study helps to evaluate the benefits of methods and tools in industrial settings. When applied to compare methods and tools, a case study is of explanatory nature, "seeking an explanation of a situation or a problem" [10]. As Runeson & Höst state, a case study "will never provide conclusions with statistical significance" [10], but it can provide sufficient information "to help you judge if specific technologies will benefit your own organization or project" [9]. Since a case study has, by definition, a higher degree of realism than a controlled experiment [10], a case study that compares CRT with CT can provide additional insights that complement and extend the findings of the previously conducted controlled experiment.

The paper is structured as follows. Section 2 introduces basic concepts of CT and CRT. Related work is discussed in Section 3. Next, the design of the case study is introduced (Section 4) and its results are presented (Section 5). Afterwards, threats to validity are discussed (Section 6) before the paper is concluded in Section 7.

2. Background

In the following, CT and CRT are briefly introduced. For more information, please refer to [11, 5, 7].

2.1. Combinatorial Testing

CT is a black-box test method [5]. It is based on an input parameter model (IPM) which declares 𝑛 parameters, and each parameter is associated with a non-empty set of values. A schema is a set of parameter-value pairs for 𝑑 distinct parameters [12]. A schema with 𝑑 = 𝑛 parameter-value pairs is a test input. A schema 𝑎 covers another schema 𝑏 if and only if schema 𝑎 includes all parameter-value pairs of schema 𝑏.

Real-world systems are often constrained, and certain values should not be combined to schemata and test inputs [5]. These schemata are irrelevant because they are not of any interest for the test. Test inputs that cover irrelevant schemata are irrelevant as well, and their test results have no informative value. Hence, they should be excluded from testing.

Constraint handling is often used to exclude irrelevant schemata [13]. Therefore, irrelevant schemata are explicitly modeled by a set of logical expressions (called exclusion-constraints). A schema is relevant if it satisfies all exclusion-constraints. A schema is irrelevant if at least one exclusion-constraint remains unsatisfied.

A coverage criterion is a condition that must be satisfied by a test suite. A test selection strategy describes how values are combined to test inputs such that a given coverage criterion is satisfied [11]. Test suites resulting from a test selection strategy that supports constraint handling, e.g. IPOG-C [13], satisfy the 𝑡-wise relevant coverage criterion. This criterion is satisfied if the relevant test inputs of a test suite cover all relevant schemata of degree 𝑑 = 𝑡 that are described by an IPM [11, 5].

2.2. Combinatorial Robustness Testing

To avoid input masking, CRT was developed as an extension to CT that separates valid and invalid test inputs [7]. To better separate the concepts, we say that CT relies on IPMs while CRT relies on robustness input parameter models (RIPMs). A RIPM contains additional error-constraints, another set of constraints that annotate relevant schemata as invalid. A relevant schema is a valid schema if it satisfies all error-constraints. A relevant schema is an invalid schema if at least one error-constraint remains unsatisfied. Further on, an invalid schema is a strong invalid schema if exactly one error-constraint remains unsatisfied.

Test selection strategies like ROBUSTA [7] not only consider exclusion-constraints to exclude irrelevant schemata, they also consider error-constraints and exclude invalid schemata from valid test inputs. Further on, strong invalid test inputs are selected such that each invalid value and invalid value combination that is modeled by error-constraints appears in strong invalid test inputs.

Valid test inputs are selected to satisfy 𝑡-wise valid coverage. The 𝑡-wise valid coverage criterion is an extension of the 𝑡-wise relevant coverage criterion. It is satisfied if all valid schemata with a degree of 𝑑 = 𝑡 that are described by a RIPM are covered at least once by a valid test input.

Strong invalid test inputs are selected to satisfy 𝑏-wise strong invalid coverage, where 𝑏 denotes the robustness interaction degree. Without robustness interaction (𝑏 = 0), the coverage criterion is called single error coverage (cf. [11, 7]). It is satisfied if each invalid schema that is described by an error-constraint appears in a strong invalid test input. With robustness interaction (𝑏 ≥ 1), each described invalid schema is combined with all valid schemata of degree 𝑑 = 𝑏. The coverage criterion is satisfied if all combinations of invalid schemata and 𝑏-sized valid schemata are covered by strong invalid test inputs.

Following these brief introductions of CT and CRT, the conceptual difference between the two approaches should become clear. CT and CRT use the same parameters and values. But CT does not distinguish between valid and invalid schemata. Instead, both types of schemata are mixed, and the FDE purely relies on the combinatorics, i.e. different testing strengths 𝑡. In contrast, CRT distinguishes valid and invalid schemata to avoid the effect of input masking. Here too, the FDE relies on combinatorics, but the avoidance of input masking has an additional influence.

CRT requires the effort to model error-constraints. Test selection strategies that consider error-constraints also become more complex. This raises the question whether the avoidance of input masking outweighs the additional effort and complexity of CRT. Until now, only artificial test scenarios were used to compare CT with CRT (cf. [7]), and it remains unclear whether the indicated advantages of CRT can be transferred to real-world scenarios. Therefore, this case study was conducted.
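The RIPM semantics just introduced can be sketched in a few lines. This is an illustrative encoding with made-up parameters and error-constraints, not the paper's tooling: a test input is valid if it violates no error-constraint, invalid if it violates at least one, and strong invalid if it violates exactly one.

```python
# Error-constraints as predicates over a test input (a dict mapping
# parameter -> value); a constraint is violated when it returns True.
error_constraints = [
    lambda ti: ti["amount"] == "negative",                      # invalid single value
    lambda ti: ti["begin"] == "late" and ti["end"] == "early",  # invalid value pair
]

def classify(test_input):
    """Classify a relevant test input according to the RIPM semantics."""
    violated = sum(1 for c in error_constraints if c(test_input))
    if violated == 0:
        return "valid"
    return "strong invalid" if violated == 1 else "invalid"

print(classify({"amount": "positive", "begin": "early", "end": "late"}))  # valid
print(classify({"amount": "negative", "begin": "early", "end": "late"}))  # strong invalid
print(classify({"amount": "negative", "begin": "late", "end": "early"}))  # invalid
```

The "exactly one violation" rule is what lets strong invalid test inputs probe one piece of exception handling at a time without masking.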
3. Related Work

To the best of our knowledge, Sherwood [6] first mentioned invalid values in the context of CATS, a test selection strategy and tool for CT. Cohen et al. [14] and Czerwonka [15] also acknowledged the necessity to separate valid and strong invalid test inputs. They also published test selection strategies and tools, and their IPMs contain semantic information to distinguish relevant from irrelevant schemata and to distinguish valid from invalid values. However, invalid value combinations are not directly supported. Therefore, we proposed ROBUSTA and the structure of RIPMs with error-constraints [7].

Many studies exist that demonstrate the usefulness and effectiveness of CT (cf. [16, 17, 18]). But most studies do not distinguish between relevance and validness and focus on testing the normal behavior.

One case study by Wojciak & Tzoref-Brill [19] reports on applying CT and also considers testing with invalid inputs. They report that single error coverage was not sufficient because EH depended on interactions between invalid and valid values. In particular, "the same [exception] would often be handled differently depending on the firmware in control [...] or depending on the configuration of the system". A further remark is concerned with the ratio of valid versus invalid test inputs: "Since a lot of attention was given to [robustness] testing [...] where full recovery in the presence of [exceptions] was expected, the [test suite] contained a ratio of up to 2:1 [invalid test inputs vs. valid test inputs]."

4. Case Study Design

In this section, the case under analysis and the data collection procedure are introduced.

4.1. Case Under Analysis

The case is a development project conducted by an IT service provider of an insurance company, where a new software was developed to manage the life-cycle of life insurance contracts. One subsystem of the software is concerned with the validation of insurance application data according to a set of validation rules and with forwarding the data when it satisfies the validation rules. It is the same project which we analyzed in a previous case study (cf. [18]).

Altogether, 31 validation rules are defined to check insurance application data. The order of the validation rules is predefined, and all validation rules are traversed for each insurance application. Whenever a validation rule is not satisfied by an insurance application, a corresponding error code is returned and the remaining validation rules are skipped. If all validation rules are satisfied, the subsystem returns SUCCESS and the insurance application data is further processed. However, the further processing is out of scope for this case study.

Each validation rule is built as an implication consisting of two parts:

    isApplicable(application) ⇒ isValid(application)

The first part determines whether a given validation rule is applicable to the insurance application data or not. If a rule is applicable, the insurance application must not violate the rule, i.e. isValid(application). Otherwise, the validation rule is ignored.

Because details of the case are confidential, a generic example is given to further illustrate validation rules. The example depicts two validation rules that define maximum sums which can be insured depending on the permissions of the insurance agents. The first validation rule is applicable to all applications created by insurance agents with the highest level of permission. The second validation rule is applicable to all applications that are created by insurance agents with a lower permission level. The distinction between the two validation rules is made by the first part of the implication:

    Rule 1: isApplicable(application): application.agent.permission = highest_level
    Rule 2: isApplicable(application): application.agent.permission ≠ highest_level

The second part of the implication is used to enforce the maximum insured sum. As an application may consist of several partial contracts, the individual insured sums of all partial contracts are collected first. Afterwards, it is checked whether the total sum exceeds the threshold. While the structure of both rules' isValid() parts is the same, different values for the maximum_insured_sum constant are used:

    isValid(application): total_sum = ∑ partial.insured_sum  (over all partial ∈ application)
                          total_sum ≤ maximum_insured_sum

This example shows that many parameters may be involved in a validation rule, that intermediate calculations may be required, and that intermediate calculations may be reused in different validation rules. Therefore, all validation rules should be tested thoroughly.

For this case study, we consider the current set of validation rules as correct and treat them as our specification. By browsing the source code repository, we identified 13 changes that have been made to the validation rules in order to correct them. Each change documents a fault that existed previously but was fixed prior to release. Based on these 13 changes, we reconstructed 13 implementation versions of which each contains one fault.
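The implication structure of the two example rules can be sketched as follows. Permission labels, thresholds, and field names are invented for illustration; the real constants are confidential.

```python
from dataclasses import dataclass, field

@dataclass
class Application:
    agent_permission: str                 # e.g. "highest" or "standard" (assumed labels)
    partial_sums: list = field(default_factory=list)

def make_rule(applicable_when, maximum_insured_sum):
    """Build a rule of the form isApplicable(app) => isValid(app):
    a non-applicable rule is trivially satisfied."""
    def check(app):
        if not applicable_when(app):
            return True                   # rule ignored for this application
        total = sum(app.partial_sums)     # collect sums of all partial contracts
        return total <= maximum_insured_sum
    return check

# Thresholds are illustrative only.
rule1 = make_rule(lambda a: a.agent_permission == "highest", 1_000_000)
rule2 = make_rule(lambda a: a.agent_permission != "highest", 250_000)

app = Application("standard", [100_000, 200_000])
print(rule1(app), rule2(app))  # True (not applicable), False (300000 > 250000)
```

The shared isValid() shape with differing constants mirrors how intermediate calculations (here, the total sum) are reused across rules.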
The 13 faults can be classified according to our robustness fault classification (cf. [7]). Five faults can only be detected by invalid test inputs, while eight faults can be detected by both valid and invalid test inputs. Two of these five faults can be classified as faults in error-signaling. To reveal them, invalid test inputs must trigger EH which responds with an incorrect error code. The other three faults can be classified as faults in error-detection conditions. The conditions are too weak and do not detect invalid test inputs. Hence, the SUT incorrectly continues with its normal behavior.

The remaining eight faults can be detected by both valid and invalid test inputs. They are faults in error-detection conditions. Four of these faults have conditions that are too strong and therefore incorrectly detect exception occurrences for valid test inputs. The other four faults have characteristics of being too weak and too strict at the same time because wrong parameters with similar characteristics are used in the exception condition. As a consequence, an invalid test input may not violate the condition (too weak) while a valid test input may not satisfy the condition (too strong).

4.2. Data Collection Procedure

Data collection refers to the measurement and calculation of metric values from test execution. Therefore, metrics are defined in this section. Furthermore, the modeling of the IPM and RIPM as well as the selection and execution of test inputs is described.

4.2.1. Metrics

The resources available from the software development project are not directly analyzed and compared. Instead, they are used to reconstruct the implementation versions for test execution and to create a RIPM and an IPM that represent variations of insurance application data. Based on the RIPM and IPM, test inputs are selected using a CT and a CRT test selection strategy. Then, the test inputs are executed on the 13 reconstructed implementations to assess the effectiveness.

A common metric to assess the effectiveness is fault detection effectiveness (FDE) [11, 16]. A test suite 𝑇 is denoted as failing for a test scenario 𝑆𝐶 if at least one of the test inputs 𝜏 ∈ 𝑇 detects the fault in 𝑆𝐶.

    failing(𝑇, 𝑆𝐶) = 1 if ∃𝜏 ∈ 𝑇 that fails for 𝑆𝐶, and 0 otherwise

Using the failing function, FDE is defined as the ratio between the number of test suites 𝑇 of a test suite family 𝑇* that fail for a test scenario 𝑆𝐶 and the number of all test suites in the family 𝑇*. In this case study, each family of test suites contains 20 different variants. In other words, the FDE is based on 20 randomized test suites that all satisfy the same coverage criterion for the same IPM or RIPM, and they all test the same test scenario.

    FDE(𝑇*, 𝑆𝐶) = ( ∑_{𝑇 ∈ 𝑇*} failing(𝑇, 𝑆𝐶) ) / |𝑇*|

Further on, the average fault detection effectiveness (AFDE) denotes the average FDE over a family of test scenarios 𝑆𝐶*. In our case study, the family of test scenarios 𝑆𝐶* consists of the 13 reconstructed implementations. The AFDE represents the average effectiveness of CRT and CT equally distributed over the 13 faults.

    AFDE(𝑇*, 𝑆𝐶*) = ( ∑_{𝑆𝐶 ∈ 𝑆𝐶*} FDE(𝑇*, 𝑆𝐶) ) / |𝑆𝐶*|

4.2.2. Modeling of IPM and RIPM

Since the FDE and AFDE metrics highly depend on the quality of the RIPM and IPM, a systematic modeling approach is necessary. We model the IPM first and later extend it with error-constraints to get a RIPM.

The IPM is modeled iteratively for one validation rule at a time. In each iteration, parameters and values are added to ensure that test inputs with the following three characteristics can be selected: (1) test inputs that are not applicable; (2) test inputs that are applicable and valid; (3) test inputs that are applicable but not valid. In addition, some exclusion-constraints are introduced to ensure syntactic correctness of selected test inputs. The IPM is considered complete once it contains all parameters and values necessary to satisfy branch coverage of each validation rule.

For the RIPM, the modeling of additional error-constraints is required. The error-constraints are modeled iteratively, and we add new or update existing ones until the separation of valid and strong invalid test inputs conforms to the responses of the SUT, i.e. the SUT returns SUCCESS for each valid test input and an error code for each strong invalid test input.

In total, the IPM and RIPM consist of 32 parameters and 106 values. Most parameters have two, three, or four values each. But two parameters have six values each, and one parameter even has nine values. Three exclusion-constraints, of which each restricts combinations of two parameters, are required to ensure syntactical correctness of the insurance applications. Furthermore, the RIPM contains 31 error-constraints. 15 error-constraints annotate single values as invalid. The remaining 16 error-constraints annotate schemata with 2, 3, or 5 values. The complete IPM and RIPM are described below in exponential notation. For parameters and values, x^y refers to y parameters with x values. For exclusion- and error-constraints, x^y refers to y constraints over x parameters.

    Parameters & Values:   9^1 6^2 5^1 4^8 3^8 2^12
    Exclusion-Constraints: 2^3
    Error-Constraints:     5^2 3^6 2^8 1^15

Table 1
Test suite sizes of test suites for different coverage criteria

    Coverage Criteria                 t    b        Size
    t-wise relevant coverage          1    -        9.00
                                      2    -       68.10
                                      3    -      480.10
                                      4    -     2813.45
                                      5    -    15023.70
    t-wise valid coverage             1    -        7.00
                                      2    -       48.30
                                      3    -      267.95
    b-wise strong invalid coverage    -    0      301.00
                                      -    1     1956.35
    t-wise valid coverage and         1    0      308.00
    b-wise strong invalid coverage    1    1     1963.35
                                      2    0      349.30
                                      2    1     2004.65
                                      3    0      568.95
                                      3    1     2224.30
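The failing, FDE, and AFDE definitions from Section 4.2.1 translate directly into code. The following sketch uses invented toy data (four suites, two faulty versions) only to exercise the formulas:

```python
def fde(suite_family, scenario, fails):
    """FDE: fraction of suites in the family that fail for (i.e. detect
    the fault in) the scenario; fails(suite, scenario) -> bool."""
    return sum(1 for T in suite_family if fails(T, scenario)) / len(suite_family)

def afde(suite_family, scenarios, fails):
    """AFDE: mean FDE over all test scenarios (faulty versions)."""
    return sum(fde(suite_family, sc, fails) for sc in scenarios) / len(scenarios)

# Illustrative data: a suite "fails" for a scenario if it contains at
# least one of the test inputs that detect that scenario's fault.
detecting = {"f1": {1, 2}, "f2": {2}}
suites = [{1}, {2}, {3}, {1, 2}]
fails = lambda T, sc: bool(T & detecting[sc])

print(fde(suites, "f1", fails))           # 3 of 4 suites -> 0.75
print(afde(suites, ["f1", "f2"], fails))  # (0.75 + 0.5) / 2 = 0.625
```

In the study, suite_family corresponds to the 20 randomized suites per coverage criterion and scenarios to the 13 reconstructed implementations.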
4.2.3. Selecting and Executing Test Inputs

After creating the IPM and RIPM, both models are used to select sets of test inputs. Since we compare CRT with CT, two different test selection strategies are used: ROBUSTA is used to select test inputs for the RIPM, and IPOG-C is used to select test inputs for the IPM.

To compare the FDE and AFDE of CRT with CT, test suites that satisfy different coverage criteria are used. We apply IPOG-C to select test suites that satisfy 𝑡-wise relevant coverage for 𝑡 ∈ {1, ..., 5}. Furthermore, we apply ROBUSTA to select test suites that satisfy 𝑡-wise valid coverage with 𝑡 ∈ {1, ..., 3} and that satisfy 𝑏-wise strong invalid coverage with 𝑏 ∈ {0, 1}.

To reduce the effect of accidental fault detection caused by ordering, the order of parameters and values of the input parameter models is randomly reordered, and 20 different model variants are used to select test suites for each coverage criterion.

Table 1 depicts the average sizes of test suites that satisfy the different coverage criteria. Since ROBUSTA encompasses two coverage criteria (𝑡-wise valid coverage and 𝑏-wise strong invalid coverage), the test suites are considered both separately and combined.

The largest test suite is selected by IPOG-C, which is required to satisfy 𝑡-wise relevant coverage with 𝑡 = 5 (15023.70 test inputs). The second-largest test suite is also selected by IPOG-C to satisfy 𝑡-wise relevant coverage with 𝑡 = 4 (2813.45 test inputs). The third-largest test suite is selected by ROBUSTA and satisfies 𝑡-wise valid coverage with 𝑡 = 3 and 𝑏-wise strong invalid coverage with 𝑏 = 1 (2224.30 test inputs). When comparing the test suite sizes of 𝑡-wise relevant coverage of IPOG-C with 𝑡-wise valid coverage of ROBUSTA, it can be seen that the error-constraints drastically reduce the number of valid test inputs.

After test input selection, the test suites are used to stimulate the SUT in 13 different versions. Therefore, the 13 reconstructed implementations, of which each contains one fault, are tested to determine which test suite is able to detect which fault. The results are discussed in the following section.

5. Results & Discussion

In this section, the case study results regarding the computed FDE and AFDE values are reported and discussed.

5.1. Fault Detection Effectiveness

Table 2 lists the FDE values of all test suite families applied to all 13 implementations. For better readability, + is used to indicate an FDE value of 1.00. The faults nos. 1 to 8 can all be detected by both valid and invalid test inputs, while the faults nos. 9 to 13 can only be detected by invalid test inputs. Again, the shown FDE value is an average value for one test suite family with 20 different test suites that are created by randomizing the order of parameters and values before selecting test inputs. As an example, in the first row for fault no. 3, an FDE value of 0.05 means that one out of 20 test suites detected the fault at least once.

Table 2
FDE values for different coverage criteria (+ indicates an FDE value of 1.00)

    Coverage Criteria   t  b |    1    2    3    4    5    6    7    8    9   10   11   12   13 | AFDE
    t-wise relevant     1  - |    0    0 0.05 0.05    0    0    0    0    0    0 0.25 0.05    0 | 0.03
    coverage            2  - | 0.10 0.10 0.45 0.20 0.10    0    0    0    0    0 0.65 0.20    0 | 0.14
                        3  - | 0.75 0.75    +    + 0.65 0.05 0.10 0.05 0.05    0    + 0.65    0 | 0.47
                        4  - |    +    +    +    +    + 0.15 0.10 0.05    0    0    +    +    0 | 0.56
                        5  - |    +    +    +    +    + 0.50 0.35 0.15 0.05    0    +    + 0.05 | 0.62
    t-wise valid        1  - | 0.75 0.75    +    + 0.50 0.50    + 0.80    0    0    0    0    0 | 0.48
    coverage            2  - |    +    +    +    +    +    +    +    +    0    0    0    0    0 | 0.62
                        3  - |    +    +    +    +    +    +    +    +    0    0    0    0    0 | 0.62
    b-wise strong       -  0 |    +    +    +    +    +    + 0.90 0.80    +    +    +    +    + | 0.98
    invalid coverage    -  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
    t-wise valid and    1  0 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
    b-wise strong       1  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
    invalid coverage    2  0 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
                        2  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
                        3  0 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
                        3  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +

As can be observed, 𝑡-wise relevant coverage is not able to detect all faults reliably. The FDE values increase when testing strength 𝑡 grows. But even with 𝑡 = 5 (15023.70 test inputs), only 7 faults are detected reliably (FDE value of 1.00). Further on, fault no. 10 remains undetected (FDE value of 0), and faults nos. 9 and 13 are only detected by one out of 20 test suites (FDE value of 0.05).

The CRT coverage criteria are characterized by avoiding the invalid input masking effect. Since all invalid schemata are excluded by 𝑡-wise valid coverage, the faults nos. 9 to 13 cannot be detected. But for all other faults, 𝑡-wise valid coverage has higher FDE values for the same testing strength 𝑡 when compared to 𝑡-wise relevant coverage. Because invalid input masking is avoided, a testing strength of 𝑡 = 2 is sufficient to detect faults nos. 1 to 8 reliably (FDE values of 1.00).

Using 𝑏-wise strong invalid coverage with 𝑏 = 0, 11 out of 13 faults can already be detected reliably, and the two remaining faults have high FDE values of 0.90 and 0.80. The effectiveness of robustness interactions is even higher, and all faults can be detected reliably with 𝑏 = 1.
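The coverage requirements behind 𝑏-wise strong invalid coverage can be sketched by enumeration. This is a simplified illustration over a made-up model, not the ROBUSTA selection algorithm: each invalid value must appear together with every valid schema of size 𝑏 over the remaining parameters, and 𝑏 = 0 degenerates to single error coverage.

```python
from itertools import combinations, product

# Hypothetical model: parameter -> (valid values, invalid values).
model = {"p1": (["a", "b"], ["X"]), "p2": (["c", "d"], []), "p3": (["e"], ["Y"])}

def strong_invalid_requirements(b):
    """Combinations that strong invalid test inputs must cover: one
    invalid value plus a size-b valid schema over other parameters."""
    reqs = []
    for p, (_, invalid) in model.items():
        for iv in invalid:
            others = [q for q in model if q != p]
            for combo in combinations(others, b):
                valid_lists = [[(q, v) for v in model[q][0]] for q in combo]
                for valid_schema in product(*valid_lists):
                    reqs.append({(p, iv), *valid_schema})
    return reqs

print(len(strong_invalid_requirements(0)))  # 2 invalid values -> 2 requirements
print(len(strong_invalid_requirements(1)))  # 7 for this model
```

This also shows why 𝑏 = 1 suites grow so much larger than 𝑏 = 0 suites in Table 1: every invalid value is multiplied by the valid values of the other parameters.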
Four faults that have too strong error-detection conditions and that actually require valid test inputs to be detected are also reliably detected by 𝑏-wise strong invalid coverage. We could observe that a strong invalid test input that is expected to violate the error-detection condition of the 𝑙-th validation rule is also expected to satisfy all prior validation rules from 1 to 𝑙 − 1. Therefore, strong invalid test inputs can be considered as "partially-valid" test inputs that are able to accidentally detect faults that require valid test inputs. This effect is strengthened by robustness interactions because more test inputs are selected and more interactions are covered by them.

ROBUSTA combines 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage, and the FDE values show that test suites for both coverage criteria complement each other. Since valid and strong invalid test inputs are able to detect faults nos. 1 to 8, the FDE values are complemented by the combination of both test suites. For faults nos. 9 to 13, the FDE values are not complemented by the combination of both test suites. This is because test suites that only satisfy 𝑡-wise valid coverage cannot detect these faults. Therefore, the FDE values of the combined test suites are the same as the FDE values of the test suites that satisfy 𝑏-wise strong invalid coverage.

In order to detect all faults reliably, 𝑏-wise strong invalid coverage must be selected because faults nos. 9 to 13 remain undetected otherwise. Either robustness interaction (𝑏 > 0) or the combination of 𝑏-wise strong invalid coverage with 𝑡-wise valid coverage is required to reliably detect faults nos. 1 to 8. Even though 𝑡 = 1 is only sufficient to detect three of the first eight faults reliably, the combination with 𝑏-wise strong invalid coverage improves the FDE, and all faults can be detected reliably.

The discussion of the FDE shows which coverage criteria are appropriate to reliably detect different types of faults. Next, we discuss the AFDE over all 13 faults.

5.2. Average Fault Detection Effectiveness

Because AFDE values are average values over a set of faults, AFDE allows making general statements about both the effectiveness and the efficiency of coverage criteria. First, we discuss the effectiveness in terms of the AFDE values of different coverage criteria, which are listed in Table 2. Afterwards, we discuss the efficiency in terms of AFDE values in relation to test suite sizes (listed in Table 1).

The AFDE values reflect what we discussed before since they aggregate FDE values. Because of the invalid input masking effect, test suites that satisfy 𝑡-wise relevant coverage only reach an AFDE value of 0.62.

In direct comparison, test suites that satisfy 𝑡-wise valid coverage reach a maximum AFDE value of 0.62 as well. The same AFDE value can be reached because they prevent invalid input masking. However, the AFDE value cannot be further improved by increasing the testing strength because faults nos. 1 to 8 are already detected reliably and faults nos. 9 to 13 cannot be detected by valid test inputs. Comparing the two coverage criteria for each testing strength individually shows that the AFDE value of 𝑡-wise valid coverage is always higher than the AFDE value of 𝑡-wise relevant coverage.

For 𝑏-wise strong invalid coverage, the lowest AFDE value is 0.98 (no robustness interactions), which is always higher than the AFDE values of 𝑡-wise relevant and valid coverage. Furthermore, 𝑏-wise strong invalid coverage with robustness interactions has an AFDE value of 1 and therefore detects all faults reliably. Overall, the combination of 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage performs best and always detects all faults reliably.

When putting the AFDE values in relation to test suite sizes, it can be noted that 𝑡-wise relevant coverage has the worst efficiency, as it requires 15023.70 test inputs for an AFDE value of 0.62. In contrast, 𝑡-wise valid coverage only requires 48.30 test inputs for an AFDE value of 0.62. The best efficiency is offered by the combination of 𝑡-wise valid coverage with 𝑡 = 1 and 𝑏-wise strong invalid coverage with 𝑏 = 0, which requires 308.00 test inputs for an AFDE value of 1.00. When using an AFDE value of 0.92 as a lower boundary (12 out of 13 faults), 𝑏-wise strong invalid coverage with 𝑏 = 0 is sufficient and only requires 301.00 test inputs for an AFDE value of 0.98.

This discussion about efficiency is, of course, influenced by the characteristics of the 13 faults and cannot be generalized. But as more general statements, it can be observed that 𝑡-wise relevant coverage requires more test inputs to reach an AFDE value similar to that of 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, or the combination of both. At the same time, the combination of 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage always has an AFDE value of 1.00 while at most 2224.30 test inputs are used. This finding is also consistent with our prior experimental evaluation (cf. [7]).

Therefore, we draw the conclusion that 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, and the combination of both perform as well as or better than 𝑡-wise relevant coverage in terms of effectiveness and efficiency. However, the findings are only derived from one particular case. Therefore, we do not consider this to be true for all SUTs but for SUTs with many validation rules.

6. Threats to Validity

We compare the effectiveness of CRT using an implementation of the ROBUSTA test selection strategy with CT using an implementation of the IPOG-C test selection strategy. To ensure unbiased implementations, both follow the guidelines of Kleine & Simos [20]. Further on, the source code of the test selection strategies is published as part of the coffee4j open-source test automation framework¹.

The effectiveness of CRT and CT highly depends on the IPM and RIPM. Furthermore, the effectiveness depends on the faults that are considered in this case study. Unfortunately, details of the case, i.e. the source code of the validation rules and detailed descriptions of the faults, are confidential. To improve transparency and reproducibility, we describe the faults and make the characteristics of the IPM and RIPM explicit.

To avoid any bias, both the IPM and RIPM are modeled systematically and share the same set of parameters and values. To prevent falsified results due to accidental fault triggering, the orders of parameters and values are randomized, and 20 different variants are used in test input selection. All presented FDE values are average values.

Since this is a case study with only one case, it is difficult to generalize the findings [10]. Further on, it has to be noted that the archival data of this case study is only a snapshot, and the ground truth, i.e. the existing and previously existing faults, is unknown. Hence, the data can be biased towards simpler faults that are easier to detect. To prevent too far-reaching conclusions, we describe the characteristics of the SUT and also limit our conclusions to similar systems with many validation rules.

7. Conclusion

CRT extends CT to generate separate test suites with valid and strong invalid test inputs in order to avoid input masking that is caused by EH. Therefore, CRT requires additional effort to model error-constraints and introduces additional complexity to test selection strategies because error-constraints must be considered. This raises the question about the usefulness of CRT and whether the avoidance of input masking outweighs the additional effort and complexity. Until now, only artificial test scenarios were used to compare CT with CRT, and it remained unclear whether the indicated advantages of CRT can be transferred to real-world scenarios.

In this paper, we therefore present the results of a case study based on a real-world system with 31 validation rules and 13 previously existing faults. To compare CT with CRT, we construct an IPM and a RIPM, select test inputs, and stimulate 13 implementations of the real-world system, of which each implementation contains one of the 13 previously existing faults. For the subsequent discussion, we introduce the FDE and AFDE metrics.

To summarize the findings of this case study, we discuss both research questions individually.

Research Question 1: Our results indicate that the CRT test method is applicable in real-world test scenarios. This case study demonstrated that RIPMs with 32 parameters and 31 error-constraints can be constructed. Further on, the ROBUSTA test selection strategy is capable of selecting test suites for RIPMs with 32 parameters and 31 error-constraints.

Research Question 2: The comparison of CRT with CT is consistent with the findings of our previously conducted controlled experiment with artificial test scenarios (cf. [7]). Since the case under analysis has much EH, CRT performs better than CT in terms of FDE. Further on, it requires fewer test inputs to achieve better AFDE values than CT.

Therefore, we draw the conclusion that 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, and the combination of both perform as well as or better than 𝑡-wise relevant coverage in terms of effectiveness and efficiency.

¹ See https://coffee4j.github.io for more information.

References

… Workshop on Quantitative Approaches to Software Quality co-located with 26th Asia-Pacific Software Engineering Conference (APSEC 2019), Putrajaya, Malaysia, December 2, 2019, pp. 27–36.
[9] B. A. Kitchenham, L. Pickard, S. L. Pfleeger, Case studies for method and tool evaluation, IEEE Softw. 12 (1995) 52–62.
[10] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical Software Engineering 14 (2009) 131–164.
[11] M. Grindal, J. Offutt, S. F. Andler, Combination testing strategies: a survey, Softw. Test., Verif. Reliab. 15 (2005) 167–199.
[12] C. Nie, H.
Leung, The minimal failure-causing Although, the FDE and AFDE values are influenced by schema of combinatorial testing, ACM Trans. Softw. the characteristics of the 13 faults and cannot be general- Eng. Methodol. 20 (2011) 15:1–15:38. ized. Therefore, we do not consider this to be true for all [13] L. Yu, Y. Lei, M. N. Borazjany, R. Kacker, D. R. Kuhn, SUTs but for SUTs with much EH. An efficient algorithm for constraint handling in In future work, we plan to conduct further case studies combinatorial test generation, in: Sixth IEEE In- to learn more about the FDE of CRT and CT. ternational Conference on Software Testing, Ver- ification and Validation, ICST 2013, Luxembourg, Luxembourg, March 18-22, 2013, 2013, pp. 242–251. References [14] D. M. Cohen, S. R. Dalal, M. L. Fredman, G. C. Patton, The AETG system: An approach to testing based on [1] IEEE, IEEE Standard Glossary of Software Engi- combinatiorial design, IEEE Trans. Software Eng. neering Terminology, IEEE Std 610.12-1990 (1990). 23 (1997) 437–444. [2] A. Avižienis, J. Laprie, B. Randell, C. E. Landwehr, [15] J. Czerwonka, Pairwise testing in real world, in: Basic concepts and taxonomy of dependable and 24th Pacific Northwest Software Quality Confer- secure computing, IEEE Trans. Dependable Sec. ence, volume 200, Citeseer, 2006. Comput. 1 (2004) 11–33. [16] J. Petke, M. B. Cohen, M. Harman, S. Yoo, Practical [3] C. Marinescu, Are the classes that use exceptions combinatorial interaction testing: Empirical find- defect prone?, in: Proceedings of the 12th Interna- ings on efficiency and early fault detection, IEEE tional Workshop on Principles of Software Evolu- Trans. Software Eng. 41 (2015) 901–924. tion and the 7th annual ERCIM Workshop on Soft- [17] H. Wu, n. changhai, J. Petke, Y. Jia, M. Harman, ware Evolution, EVOL/IWPSE 2011, Szeged, Hun- An empirical comparison of combinatorial testing, gary, September 5-6, 2011., 2011, pp. 56–60. random testing and adaptive random testing, IEEE [4] P. Sawadpong, E. 
B. Allen, B. J. Williams, Exception Transactions on Software Engineering (2018) 1–1. handling defects: An empirical study, in: 2012 IEEE [18] K. Fögen, H. Lichter, A case study on robust- 14th International Symposium on High-Assurance ness fault characteristics for combinatorial test- Systems Engineering, 2012, pp. 90–97. ing - results and challenges, in: Proceedings of [5] C. Nie, H. Leung, A survey of combinatorial testing, the 6th International Workshop on Quantitative ACM Comput. Surv. 43 (2011) 11:1–11:29. Approaches to Software Quality co-located with [6] G. B. Sherwood, Effective testing of factor combi- 25th Asia-Pacific Software Engineering Conference nations, in: Proceedings of the Third International (APSEC 2018), Nara, Japan, December 4, 2018., 2018, Conference on Software Testing, Analysis and Re- pp. 22–29. view, Washington, DC, 1994, pp. 151–166. [19] P. Wojciak, R. Tzoref-Brill, System level combina- [7] K. Fögen, H. Lichter, Combinatorial robustness torial testing in practice - the concurrent mainte- testing with negative test cases, in: Proceedings of nance case study, in: Seventh IEEE International the 19th IEEE International Conference on Software Conference on Software Testing, Verification and Quality, Reliability and Security, QRS 2019, Sofia, Validation, ICST 2014, March 31 2014-April 4, 2014, Bulgaria, July 22-26, 2019, 2019, pp. 34–45. Cleveland, Ohio, USA, 2014, pp. 103–112. [8] K. Fögen, H. Lichter, An experiment to compare [20] K. Kleine, D. E. Simos, An efficient design and im- combinatorial testing in the presence of invalid plementation of the in-parameter-order algorithm, values, in: Proceedings of the 7th International Mathematics in Computer Science 12 (2018) 51–67. 36
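As a supplementary illustration of the FDE and AFDE metrics discussed above: the paper averages detection outcomes over randomized test-suite variants (20 in the study) and then over the faults under test. The following minimal Python sketch mirrors that computation; the function and variable names are ours and the outcome data is hypothetical, not taken from the case study.

```python
from statistics import mean

def fde(detections):
    """FDE of one fault: the fraction of randomized test-suite
    variants in which the fault is detected."""
    return mean(1.0 if detected else 0.0 for detected in detections)

def afde(detections_per_fault):
    """AFDE: the mean FDE over all faults under test."""
    return mean(fde(d) for d in detections_per_fault)

# Hypothetical outcomes for 3 faults across 4 randomized variants:
outcomes = [
    [True, True, True, True],      # detected in every variant -> FDE 1.0
    [True, False, True, True],     # detected in 3 of 4 variants -> FDE 0.75
    [False, False, False, False],  # never detected -> FDE 0.0
]
print(afde(outcomes))  # prints ~0.583 (the mean of 1.0, 0.75, and 0.0)
```

Under this reading, a fault that valid test inputs cannot reach (such as faults nos. 9 to 13 above) contributes an FDE of 0 to a valid-only test suite, which is why the AFDE of 𝑡-wise valid coverage plateaus at 0.62 regardless of testing strength.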