<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Industrial Case Study on Fault Detection Effectiveness of Combinatorial Robustness Testing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konrad Fögen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Lichter</string-name>
          <email>lichter@swc.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <kwd-group>
          <kwd>Software Testing</kwd>
          <kwd>Combinatorial Testing</kwd>
          <kwd>Robustness Testing</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group Software Construction, RWTH Aachen University</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proceedings</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proceedings</institution>
          ,
          <addr-line>CEUR-WS.org</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>Combinatorial robustness testing (CRT) is an extension of combinatorial testing (CT) to separate test suites with valid and strong invalid test inputs. Until now, only one controlled experiment using artificial test scenarios was conducted to compare CRT with CT. The results indicate advantages of CRT when much exception handling is involved. But it is unclear if these advantages also hold in the real world. In this paper, we present the results of a case study conducted to compare the fault detection effectiveness of CRT and CT by testing an industrial system with 31 validation rules and 13 injected faults.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Robustness is an important property of software. It describes “the degree to which a system [...] can function correctly in the presence of [invalid inputs]” [<xref ref-type="bibr" rid="ref9">1</xref>]. Invalid inputs are caused by external faults, i.e. faults in other systems or components. Examples are inputs to the system under test (SUT) that contain invalid values, like a string value when a numerical value is expected, or invalid value combinations, like a begin date which is after the end date. When invalid inputs remain undetected, they can propagate to failures in the SUT, resulting in abnormal behavior or crashes [2].
      </p>
      <p>Developers attempt to improve the robustness of systems by implementing exception handling (EH) to detect and recover from invalid inputs. Unfortunately, EH is itself a significant source of faults (cf. [3, 4]). Therefore, it is important to test the exceptional behavior as well.</p>
      <p>Combinatorial testing (CT) is a black-box test method that is based on an input parameter model (IPM) [5]. When considering the exceptional behavior, an IPM must describe invalid values and invalid value combinations that trigger EH. Unfortunately, invalid values and invalid value combinations can cause input masking (cf. [<xref ref-type="bibr" rid="ref10 ref3">6, 7, 8</xref>]).</p>
      <p>When a SUT is stimulated with an invalid input, the EH is expected to detect it, to respond with an error message, and to terminate the SUT without resuming the normal behavior. Consequently, the remaining values and value combinations of the test input remain untested as they are masked.</p>
      <p>To avoid input masking, combinatorial robustness testing (CRT) is developed as an extension to CT using a robustness input parameter model (RIPM), which extends an IPM with additional semantic information to annotate values and value combinations as invalid [<xref ref-type="bibr" rid="ref3">7</xref>].</p>
      <p>With this semantic information, valid test inputs can be selected which contain no invalid value or invalid value combination. Further on, strong invalid test inputs can be selected which contain exactly one invalid value or one invalid value combination.</p>
      <p>Due to the separation of valid and strong invalid test inputs, the input masking effect can be avoided when testing the normal behavior and the exceptional behavior.</p>
      <p>However, in comparison to CT, which does not separate valid and strong invalid test inputs, CRT requires effort to model the additional semantic information.</p>
      <p>Despite the presence of input masking, CT can still be effective in detecting faults, as a previous controlled experiment indicates [<xref ref-type="bibr" rid="ref10">8</xref>]. Nevertheless, the fault detection effectiveness (FDE) of CT decreases for systems with much EH. Even for high testing strengths and large test suites, the FDE of CT deteriorates. For systems with much EH, CRT is a promising approach that can achieve a higher FDE while requiring fewer test inputs than CT [<xref ref-type="bibr" rid="ref3">7</xref>]. For systems with little EH, CRT is at least as effective as CT.</p>
      <p>However, the current assessment is solely based on one controlled experiment with artificial test scenarios (cf. [7]). Therefore, our objective is to further compare CRT with CT, guided by the following two research questions.</p>
      <p>RQ 1: Is the CRT test method applicable in real-world test scenarios?</p>
      <p>RQ 2: How does the CRT test method compare with CT in real-world test scenarios?</p>
      <sec id="sec-1-11">
        <p>To answer these research questions, we conducted a case study. According to Kitchenham et al. [9], a case study helps to evaluate the benefits of methods and tools in industrial settings. When applied to compare methods and tools, a case study is of explanatory nature, “seeking an explanation of a situation or a problem” [10]. As Runeson &amp; Höst state, a case study “will never provide conclusions with statistical significance” [10]. But it can “provide sufficient information to help you judge if specific technologies will benefit your own organization or project” [9]. Since a case study has, by definition, a higher degree of realism than a controlled experiment [10], a case study that compares CRT with CT can provide additional insights that complement and extend the findings of the previously conducted controlled experiment.</p>
        <p>The paper is structured as follows. Section 2
introduces basic concepts of CT and CRT. Related work is
discussed in Section 3. Next, the design of the case study
is introduced (Section 4) and its results are presented
(Section 5). Afterwards, threats to validity are discussed
(Section 6) before the paper is concluded in Section 7.</p>
      </sec>
      <sec id="sec-1-12">
        <p>To avoid input masking, CRT is developed as an extension to CT that separates valid and invalid test inputs [7]. To better separate the concepts, we say that CT relies on IPMs while CRT relies on robustness input parameter models (RIPM). A RIPM contains additional error-constraints, which is another set of constraints to annotate relevant schemata as invalid. A relevant schema is a valid schema if it satisfies all error-constraints. A relevant schema is an invalid schema if at least one error-constraint remains unsatisfied. Further on, an invalid schema is a strong invalid schema if exactly one error-constraint remains unsatisfied.</p>
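<p>The classification above can be sketched in code. The following Python fragment is an illustrative sketch only: the parameter names and the two error-constraints are invented, not taken from the studied system.</p>

```python
# Sketch: classifying relevant schemata as valid, invalid, or strong invalid.
# A schema is a dict of parameter-value pairs; error-constraints are predicates
# that a schema must satisfy to be valid. All names here are illustrative.

def classify(schema, error_constraints):
    """Return 'valid', 'strong invalid', or 'invalid' for a relevant schema."""
    unsatisfied = sum(1 for c in error_constraints if not c(schema))
    if unsatisfied == 0:
        return "valid"
    if unsatisfied == 1:
        return "strong invalid"  # exactly one error-constraint violated
    return "invalid"

# Two illustrative error-constraints: a single invalid value and an
# invalid value combination (a begin date must not be after the end date).
error_constraints = [
    lambda s: s.get("amount") != "abc",                   # value must be numeric
    lambda s: not (s.get("begin", 0) > s.get("end", 0)),  # begin not after end
]

print(classify({"amount": "100", "begin": 1, "end": 2}, error_constraints))  # valid
print(classify({"amount": "abc", "begin": 1, "end": 2}, error_constraints))  # strong invalid
```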
        <p>Test selection strategies like ROBUSTA [<xref ref-type="bibr" rid="ref3">7</xref>] not only consider exclusion-constraints to exclude irrelevant schemata, they also consider error-constraints and exclude invalid schemata from valid test inputs. Further on, strong invalid test inputs are selected such that each invalid value and invalid value combination that is modeled by error-constraints appears in strong invalid test inputs.</p>
        <p>Valid test inputs are selected to satisfy t-wise valid coverage. The t-wise valid coverage criterion is an extension of the t-wise relevant coverage criterion. It is satisfied if all valid schemata with a degree of d = t that are described by a RIPM are covered at least once by a valid test input.</p>
        <p>Strong invalid test inputs are selected to satisfy r-wise strong invalid coverage, where r denotes the robustness interaction degree. Without robustness interaction (r = 0), the coverage criterion is called single error coverage (cf. [<xref ref-type="bibr" rid="ref3">11, 7</xref>]). It is satisfied if each invalid schema that is described by an error-constraint appears in a strong invalid test input. With robustness interaction (r ≥ 1), each described invalid schema is combined with all valid schemata of degree d = r. The coverage criterion is satisfied if all combinations of invalid schemata and r-sized valid schemata are covered by strong invalid test inputs.</p>
        <p>Following these brief introductions of CT and CRT, the conceptual difference between the two approaches should become clear. CT and CRT use the same parameters and values. But CT does not distinguish between valid and invalid schemata. Instead, both types of schemata are mixed and the FDE purely relies on the combinatorics, i.e. different testing strengths t. In contrast, CRT distinguishes valid and invalid schemata to avoid the effect of input masking. Here too the FDE relies on combinatorics, but the avoidance of input masking has an additional influence.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <p>In the following, CT and CRT are briefly introduced. For more information, please refer to [11, 5, 7].</p>
        <sec id="sec-2-1-1">
          <title>2.1. Combinatorial Testing</title>
          <p>CT is a black-box test method [5]. It is based on an input parameter model (IPM) which declares n parameters, and each parameter is associated with a non-empty set of values. A schema is a set of parameter-value pairs for d distinct parameters [12]. A schema with d = n parameter-value pairs is a test input. A schema A covers another schema B if and only if schema A includes all parameter-value pairs of schema B.</p>
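<p>The covering relation can be illustrated with a small sketch; schemata are encoded as dictionaries of parameter-value pairs, and all names are invented for illustration.</p>

```python
# Sketch: schemata as dicts of parameter-value pairs.
# A schema A covers a schema B iff A includes all pairs of B.

def covers(a, b):
    """True iff schema `a` includes every parameter-value pair of schema `b`."""
    return all(a.get(p) == v for p, v in b.items())

# A test input is a schema that assigns a value to all n parameters.
test_input = {"p1": "x", "p2": "y", "p3": "z"}  # n = 3 parameters
pair = {"p1": "x", "p3": "z"}                   # a schema of degree 2

print(covers(test_input, pair))         # True: the pair appears in the test input
print(covers(test_input, {"p2": "q"}))  # False
```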
          <p>Real-world systems are often constrained and certain values should not be combined into schemata and test inputs [5]. These schemata are irrelevant because they are not of any interest for the test. Test inputs that cover irrelevant schemata are irrelevant as well and their test results have no informative value. Hence, they should be excluded from testing.</p>
          <p>Constraint handling is often used to exclude irrelevant
schemata [13]. Therefore, irrelevant schemata are
explicitly modeled by a set of logical expressions (called
exclusion-constraints). A schema is relevant if it
satisfies all exclusion-constraints. A schema is irrelevant
if at least one exclusion-constraint remains unsatisfied.</p>
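<p>This relevance check can be sketched in the same style; the exclusion-constraint shown is a made-up example and not one of the constraints of the studied system.</p>

```python
# Sketch: a schema is relevant iff it satisfies all exclusion-constraints.
# Exclusion-constraints are modeled as predicates over parameter-value pairs.

def is_relevant(schema, exclusion_constraints):
    return all(c(schema) for c in exclusion_constraints)

# Hypothetical constraint: the combination (os=Linux, browser=Edge) is excluded.
exclusion_constraints = [
    lambda s: not (s.get("os") == "Linux" and s.get("browser") == "Edge"),
]

print(is_relevant({"os": "Linux", "browser": "Firefox"}, exclusion_constraints))  # True
print(is_relevant({"os": "Linux", "browser": "Edge"}, exclusion_constraints))     # False
```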
          <p>A coverage criterion is a condition that must be satisfied by a test suite. A test selection strategy describes how values are combined to test inputs such that a given coverage criterion is satisfied [11]. Test suites resulting from a test selection strategy that supports constraint handling, e.g. IPOG-C [13], satisfy the t-wise relevant coverage criterion. This criterion is satisfied if the relevant test inputs of a test suite cover all relevant schemata of degree d = t that are described by an IPM [11, 5].</p>
          <p>CRT requires the effort to model error-constraints. Test selection strategies that consider error-constraints also become more complex. This raises the question whether the avoidance of input masking outweighs the additional effort and complexity of CRT. Until now, only artificial test scenarios are used to compare CT with CRT (cf. [<xref ref-type="bibr" rid="ref3">7</xref>]) and it remains unclear if the indicated advantages of CRT can be transferred to real-world scenarios. Therefore, this case study was conducted.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3. Related Work</title>
        <p>To the best of our knowledge, Sherwood [6] first mentioned invalid values in the context of CATS, which is a test selection strategy and tool for CT. Cohen et al. [14] and Czerwonka [15] also acknowledged the necessity to separate valid and strong invalid test inputs. They also published test selection strategies and tools, and their IPMs contain semantic information to distinguish relevant from irrelevant schemata and to distinguish valid from invalid values. However, invalid value combinations are not directly supported. Therefore, we proposed ROBUSTA and the structure of RIPMs with error-constraints [<xref ref-type="bibr" rid="ref3">7</xref>].</p>
        <p>Many studies exist that demonstrate the usefulness and effectiveness of CT (cf. [16, 17, 18]). But most studies do not distinguish between relevance and validness and focus on testing the normal behavior.</p>
        <p>One case study by Wojciak &amp; Tzoref-Brill [19] reports on applying CT and also considers testing with invalid inputs. They report that single error coverage was not sufficient because EH depended on interactions between invalid and valid values. In particular, “the same [exception] would often be handled differently depending on the firmware in control [...] or depending on the configuration of the system”. A further remark is concerned with the ratio of valid versus invalid test inputs: “Since a lot of attention was given to [robustness] testing [...] where full recovery in the presence of [exceptions] was expected, the [test suite] contained a ratio of up to 2:1 [invalid test inputs vs. valid test inputs].”</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Case Study Design</title>
      <sec id="sec-3-1">
        <p>In this section, the case under analysis and the data collection procedure are introduced.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Case Under Analysis</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>The case is a development project conducted by an IT service provider of an insurance company, where a new software was developed to manage the life-cycle of life insurance contracts. One subsystem of the software is concerned with the validation of insurance application data according to a set of validation rules and with forwarding the data when it satisfies the validation rules. It is the same project which we analyzed in a previous case study (cf. [18]).</p>
        <p>Altogether, 31 validation rules are defined to check insurance application data. The order of the validation rules is predefined and all validation rules are traversed for each insurance application. Whenever a validation rule is not satisfied by an insurance application, a corresponding error code is returned and the remaining validation rules are skipped. If all validation rules are satisfied, the subsystem returns SUCCESS and the insurance application data is further processed. However, the further processing is out of scope for this case study.</p>
        <p>Each validation rule is built as an implication consisting of two parts: isApplicable(application) ⇒ isValid(application). The first part determines whether a given validation rule is applicable to the insurance application data or not. If a rule is applicable, the insurance application must not violate the rule, i.e. isValid(application). Otherwise, the validation rule is ignored.</p>
        <p>Because details of the case are confidential, a generic example is given to provide further illustration of validation rules. The example depicts two validation rules that define maximum sums that can be insured depending on the permissions of the insurance agents. The first validation rule is applicable to all applications created by insurance agents with the highest level of permission. The second validation rule is applicable to all applications that are created by insurance agents with a lower permission level.</p>
        <p>The distinction between the two validation rules is made by the first part of the implication:</p>
        <p>Rule 1: isApplicable(application) : application.agent.permission = highest_level
Rule 2: isApplicable(application) : application.agent.permission ≠ highest_level</p>
        <p>The second part of the implication is used to enforce the maximum insured sum. As an application may consist of several partial contracts, the individual insured sums of all partial contracts are collected first. Afterwards, it is checked whether the total sum exceeds the threshold. While the structure of both rules’ isValid() parts is the same, different values for the maximum_insured_sum constant are used:</p>
        <p>isValid(application) : total_sum = ∑<sub>partial ∈ application</sub> partial.insured_sum ∧ total_sum ≤ maximum_insured_sum</p>
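<p>The generic example can be sketched as follows. The permission levels, maximum insured sums, and error codes are illustrative stand-ins, since the real rules are confidential; validity is expressed via a violates predicate (the negation of isValid).</p>

```python
# Sketch of the generic validation-rule example. Each rule is an implication
# isApplicable(application) => isValid(application); rules are traversed in a
# fixed order, the first violated rule returns its error code, and otherwise
# the subsystem returns SUCCESS. All concrete limits and codes are invented.

def total_sum(application):
    # Intermediate calculation reused by both rules.
    return sum(p["insured_sum"] for p in application["partials"])

RULES = [
    # (is_applicable, violates, error_code) -- violates(a) means "not isValid(a)"
    (lambda a: a["agent_permission"] == "highest_level",
     lambda a: total_sum(a) > 1_000_000,  # maximum_insured_sum for rule 1
     "ERR_SUM_HIGHEST"),
    (lambda a: a["agent_permission"] != "highest_level",
     lambda a: total_sum(a) > 100_000,    # maximum_insured_sum for rule 2
     "ERR_SUM_LOWER"),
]

def validate(application):
    for is_applicable, violates, error_code in RULES:
        if is_applicable(application) and violates(application):
            return error_code  # remaining rules are skipped
    return "SUCCESS"

application = {"agent_permission": "lower_level",
               "partials": [{"insured_sum": 60_000}, {"insured_sum": 70_000}]}
print(validate(application))  # ERR_SUM_LOWER (total 130,000 exceeds 100,000)
```

<p>The skip-on-first-violation loop mirrors the described traversal: a lower-level agent whose partial contracts sum to 130,000 violates the second rule and receives its error code, while applications satisfying all applicable rules yield SUCCESS.</p>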
      </sec>
      <sec id="sec-3-3">
        <p>This example shows that many parameters may be involved in a validation rule, that intermediate calculations may be required, and that intermediate calculations may be reused in different validation rules. Therefore, all validation rules should be tested thoroughly.</p>
        <p>For this case study, we consider the current set of validation rules as correct and treat them as our specification. By browsing the source code repository, we have identified 13 changes that have been made to the validation rules in order to correct them. Each change documents a fault that existed previously but was fixed prior to release. Based on these 13 changes, we reconstructed 13 implementation versions, each of which contains one fault.</p>
        <p>
          The 13 faults can also be classified according to our
robustness fault classification (cf. [
          <xref ref-type="bibr" rid="ref3">7</xref>
          ]). Five faults can only
be detected by invalid test inputs, while eight faults can
be detected by both valid and invalid test inputs. Two of
these five faults can be classified as faults in
error-signaling. To reveal them, invalid test inputs must trigger EH
which responds with an incorrect error code. The other
three faults can be classified as faults in error-detection
conditions. The conditions are too weak and do not detect
invalid test inputs. Hence, the SUT incorrectly continues
with its normal behavior.
        </p>
        <p>The remaining eight faults can be detected by both valid and invalid test inputs. They are faults in error-detection conditions. Four of these faults have conditions that are too strong and therefore incorrectly detect exception occurrences for valid test inputs. The other four faults have characteristics of being too weak and too strong at the same time because wrong parameters with similar characteristics are used in the exception condition. As a consequence, an invalid test input may not violate the condition (too weak) while a valid test input may not satisfy the condition (too strong).</p>
        <sec id="sec-3-3-1">
          <title>4.2. Data Collection Procedure</title>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <p>Data collection refers to the measurement and calculation of metric values from test execution. Therefore, metrics are defined in this section. Furthermore, the modeling of the IPM and RIPM as well as the selection and execution of test inputs are described.</p>
        <sec id="sec-3-4-1">
          <title>4.2.1. Metrics</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <p>The resources available from the software development project are not directly analyzed and compared. Instead, they are used to reconstruct the implementation versions for test execution and to create a RIPM and an IPM that represent variations of insurance application data.</p>
        <p>Based on the RIPM and IPM, test inputs are selected using a CT and a CRT test selection strategy. Then, the test inputs are executed on the 13 reconstructed implementations to assess the effectiveness.</p>
        <p>A common metric to assess the effectiveness is fault detection effectiveness (FDE) [11, 16]. A test suite T is denoted as failing for a test scenario s if at least one of the test inputs t ∈ T detects the fault in s.</p>
        <p>failing(T, s) = 1 if ∃ t ∈ T that fails for s, and 0 otherwise</p>
        <p>Using the failing function, FDE is defined as the ratio between the number of test suites T of a test suite family T* that fail for a test scenario s and the number of all test suites in the family T*. In this case study, the family of test suites contains 20 different variants. In other words, the FDE is based on 20 randomized test suites that all satisfy the same coverage criterion for the same IPM or RIPM. They all test the same test scenario.</p>
        <p>FDE(T*, s) = (∑<sub>T ∈ T*</sub> failing(T, s)) / |T*|</p>
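<p>The failing and FDE definitions, together with the AFDE average introduced next, can be sketched as follows; the suites, the fault, and the detection predicate are purely illustrative.</p>

```python
# Sketch: computing failing(T, s), FDE over a family of suites, and AFDE as the
# mean FDE over several test scenarios. A scenario is represented here by a
# "detects" predicate standing in for executing test inputs against a faulty
# implementation; all data below is illustrative.

def failing(test_suite, detects):
    """1 if at least one test input in the suite detects the fault, else 0."""
    return 1 if any(detects(t) for t in test_suite) else 0

def fde(suite_family, detects):
    """Share of suites in the family that fail for one test scenario."""
    return sum(failing(suite, detects) for suite in suite_family) / len(suite_family)

def afde(suite_family, scenarios):
    """Average FDE over a family of test scenarios (faulty implementations)."""
    return sum(fde(suite_family, detects) for detects in scenarios) / len(scenarios)

# Three randomized suites of numeric "test inputs"; this fault is detected by
# any test input greater than 7.
family = [[1, 8, 3], [2, 4, 6], [9, 5, 1]]
print(fde(family, lambda t: t > 7))  # two of the three suites detect the fault
```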
        <p>Further on, the average fault detection effectiveness (AFDE) denotes the average FDE over a family of test scenarios S*. In our case study, the family of test scenarios S* consists of the 13 reconstructed implementations. The AFDE represents the average effectiveness of CRT and CT equally distributed over the 13 faults.</p>
        <p>AFDE(T*, S*) = (∑<sub>s ∈ S*</sub> FDE(T*, s)) / |S*|</p>
        <sec id="sec-3-5-1">
          <title>4.2.2. Modeling of IPM and RIPM</title>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <p>Since the FDE and AFDE metrics highly depend on the quality of the RIPM and IPM, a systematic modeling approach is necessary. We model the IPM first and later extend it with error-constraints to get a RIPM.</p>
        <p>The IPM is modeled iteratively, one validation rule at a time. In each iteration, parameters and values are added to ensure that test inputs with the following three characteristics can be selected: (1) test inputs that are not applicable; (2) test inputs that are applicable and valid; (3) test inputs that are applicable but not valid. In addition, some exclusion-constraints are introduced to ensure syntactic correctness of selected test inputs. The IPM is considered complete once it contains all parameters and values necessary to satisfy branch coverage of each validation rule.</p>
        <p>For the RIPM, the modeling of additional
error-constraints is required. The error-constraints are modeled
iteratively and we add new or update existing ones until
the separation of valid and strong invalid test inputs
conforms to the responses of the SUT, i.e. the SUT returns
S U C C E S S for each valid test input and the SUT returns an
error code for each strong invalid test input.</p>
        <p>In total, the IPM and RIPM consist of 32 parameters and 106 values. Most parameters have two, three, or four values each. But two parameters have six values each and one parameter even has nine values. Three exclusion-constraints, each of which restricts combinations of two parameters, are required to ensure syntactical correctness of the insurance applications. Furthermore, the RIPM contains 31 error-constraints. 15 error-constraints annotate single values as invalid. The remaining 16 error-constraints annotate schemata with 2, 3, or 5 values.</p>
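<p>These counts can be cross-checked mechanically. The sketch below encodes the grouping of parameters by their number of values, as stated above, and recomputes the totals; it is a verification aid, not part of the study's tooling.</p>

```python
# Sketch: expanding the exponential notation, where v^p denotes p parameters
# with v values each, and recomputing the totals stated in the text.
parameter_groups = {9: 1, 6: 2, 5: 1, 4: 8, 3: 8, 2: 12}  # values -> parameters

num_parameters = sum(parameter_groups.values())
num_values = sum(v * p for v, p in parameter_groups.items())

print(num_parameters)  # 32
print(num_values)      # 106
```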
        <p>The complete IPM and RIPM are described below in exponential notation. For parameters and values, v<sup>p</sup> refers to p parameters with v values each. For exclusion- and error-constraints, v<sup>c</sup> refers to c constraints over v parameters.</p>
        <p>Parameters &amp; Values: 9<sup>1</sup> 6<sup>2</sup> 5<sup>1</sup> 4<sup>8</sup> 3<sup>8</sup> 2<sup>12</sup>
Exclusion-Constraints: 2<sup>3</sup></p>
        <p>Error-Constraints: 5<sup>2</sup> 3<sup>6</sup> 2<sup>8</sup> 1<sup>15</sup></p>
        <sec id="sec-3-6-1">
          <title>4.2.3. Selecting and Executing Test Inputs</title>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>After creating the IPM and RIPM, both models are used to</title>
        <p>select sets of test inputs. Since we compare CRT with CT,
two diferent test selection strategies are used. R O B U S T A one fault are tested to determine which test suite is able
is used to select test inputs for the RIPM and I P O G - C is to detect which fault. The results are discussed in the
used to select test inputs for the IPM. following section.</p>
        <p>To compare the FDE and AFDE of CRT with CT, test
suites that satisfy diferent coverage criteria are used. 5. Results &amp; Discussion
We apply I P O G - C to select test suites that satisfy  -wise
relevant coverage for  ∈ {1, ..., 5} . Furthermore, we ap- In this section, the case study results regarding the
comply R O B U S T A to select test suites that satisfy  -wise valid puted FDE and AFDE values are reported and discussed.
coverage with  ∈ {1, ..., 3} and that satisfy  -wise strong
invalid coverage with  ∈ {0, 1}.</p>
        <p>To reduce the efect of accidental fault detection caused 5.1. Fault Detection Efectiveness
by ordering, the order of parameters and values of the Table 2 lists the FDE values of all test suites families
input parameter models is randomly reordered and 20 applied to all 13 implementations. For better readability,
diferent model variants are used to select test suites for + is used to indicate an FDE value of 1.00. The faults nos.
each coverage criteria. 1 to 8 can all be detected by both valid and invalid test</p>
        <p>Table 1 depicts the average sizes of test suites that inputs, while the faults nos. 9 to 13 can only be detected
satisfy the diferent coverage criteria. Since R O B U S T A en- by invalid test inputs. Again, the shown FDE value is an
compasses two coverage criteria ( -wise valid coverage average value for one test suite family with 20 diferent
and  -wise strong invalid coverage), the test suites are test suites that are created by randomizing the order of
considered both, separately and combined. parameters and values before selecting test inputs. As</p>
        <p>The largest test suite is selected by I P O G - C which is an example, in the first row for fault no. 3, an FDE value
required to satisfy  -wise relevant coverage with  = 5 of 0.05 means that one out of 20 test suites detected the
(15023.70 test inputs). The second-largest test suite is also fault at least once per test suite.
selected by I P O G - C to satisfy  -wise relevant coverage with As can be observed,  -wise relevant coverage is not
 = 4 (2813.45 test inputs). The third-largest test suite able to detect all faults reliably. The FDE values increase
is selected by R O B U S T A and satisfies  -wise valid coverage when testing strength  grows. But even with  = 5
with  = 3 and  -wise strong invalid coverage with  = 1 (15023.70 test inputs), only 7 faults are detected reliably
(2224.30 test inputs). (FDE value of 1.00). Further on, fault no. 10 remains</p>
        <p>When comparing the test suite sizes of  -wise relevant undetected (FDE value of 0) and faults nos. 9 and 13 are
coverage of I P O G - C with  -wise valid coverage of R O B U S T A , only detected by one out of 20 test suites (FDE value of
it can be seen that the error-constraints drastically reduce 0.05).
the number of valid test inputs. The CRT coverage criteria are characterized by
avoid</p>
        <p>After test input selection, the test suites are used to ing the invalid input masking efect. Since all invalid
stimulate the SUT in 13 diferent versions. Therefore, the schemata are excluded by  -wise valid coverage, the faults
13 reconstructed implementations of which each contains
nos. 9 to 13 cannot be detected. But for all other faults, In order to detect all faults reliably, the  -wise strong
 -wise valid coverage has higher FDE values for the same invalid coverage must be selected because faults nos. 9
testing strength  when compared to  -wise relevant cov- to 13 remain undetected otherwise. Either robustness
erage. Because invalid input masking is avoided, a testing interaction ( &gt; 0 ) or the combination of  -wise strong
strength of  = 2 is suficient to detect faults nos. 1 to 8 invalid coverage with  -wise valid coverage is required
reliably (FDE values of 1.00). to reliably detect faults nos. 1 to 8. Even though  = 1</p>
        <p>Using  -wise strong invalid coverage with  = 0 , 11 is only suficient to detect three of the first eight faults
out of 13 faults can already be detected reliably and the reliably, the combination with  -wise strong invalid
two remaining faults have high FDE values of 0.90 and
0.80. The effectiveness of robustness interactions is even
higher, and all faults can be detected reliably with a robustness
interaction strength of 1.</p>
        <p>Four faults that have too strong error detection conditions
and that actually require valid test inputs to be
detected are also reliably detected by t-wise strong
invalid coverage. We could observe that a strong invalid
test input that is expected to violate the error detection
condition of the i-th validation rule is also expected to
satisfy all prior validation rules from 1 to i − 1. Therefore,
strong invalid test inputs can be considered as “partially-valid”
test inputs that are able to accidentally detect faults
that require valid test inputs. This effect is strengthened
by robustness interactions because more test inputs are
selected and more interactions are covered by them.</p>
        <p>ROBUSTA combines t-wise valid coverage and t-wise
strong invalid coverage, and the FDE values show that test
suites for both coverage criteria complement each other.
Since valid and strong invalid test inputs are able to detect
faults nos. 1 to 8, the FDE values are complemented by
the combination of both test suites. For faults nos. 9 to 13,
the FDE values are not complemented by the combination
of both test suites. This is because test suites that only
satisfy t-wise valid coverage cannot detect these faults.
Therefore, the FDE values of the combined test suites are
the same as the FDE values of the test suites that satisfy
t-wise strong invalid coverage. Increasing the coverage
improves the FDE, and all faults can be detected reliably.</p>
        <p>The discussion of the FDE shows which coverage
criteria are appropriate to reliably detect different types of
faults. Next, we discuss the AFDE over all 13 faults.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Average Fault Detection Effectiveness</title>
        <p>Because AFDE values are average values over a set of
faults, AFDE allows making general statements about
both the effectiveness and the efficiency of coverage criteria.
First, we discuss the effectiveness in terms of AFDE
values of different coverage criteria. Therefore, Table 2
lists the AFDE values for test suites that satisfy different
coverage criteria. Afterwards, we discuss the efficiency
in terms of AFDE values in relation to test suite sizes
(listed in Table 1).</p>
        <p>The AFDE values reflect what we discussed before
since they aggregate FDE values. Because of the invalid
input masking effect, test suites that satisfy t-wise relevant
coverage only reach an AFDE value of 0.62.</p>
        <p>In direct comparison, test suites that satisfy t-wise
valid coverage reach a maximum AFDE value of 0.62 as
well. The same AFDE value can be reached because they
prevent invalid input masking. However, the AFDE value
cannot be further improved by increasing the testing
strength because faults nos. 1 to 8 are already detected
reliably and faults nos. 9 to 13 cannot be detected by valid
test inputs. Comparing the two coverage criteria for each
testing strength individually shows that the AFDE value
of t-wise valid coverage is always higher than the AFDE
value of t-wise relevant coverage.</p>
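        <p>The invalid input masking effect discussed above can be made concrete with a small sketch. The two validation rules, parameter names, and error codes below are illustrative assumptions, not taken from the confidential SUT; only the "first error wins" evaluation order matters:</p>
        <preformat>
```python
# Hypothetical validation rules, checked in order ("first error wins").
RULES = [
    ("ERR_QUANTITY", lambda o: o["quantity"] >= 1),  # rule 1
    ("ERR_PRICE",    lambda o: o["price"] >= 0),     # rule 2
]

def validate(order):
    for code, holds in RULES:
        if not holds(order):
            return code
    return "OK"

# Two invalid values at once: rule 1 fires first and masks rule 2,
# so a fault in the handling of rule 2 stays unobserved.
masked = validate({"quantity": 0, "price": -1})    # "ERR_QUANTITY"

# A strong invalid test input targets rule 2 while satisfying rule 1,
# so rule 2's error handling is actually exercised.
isolated = validate({"quantity": 3, "price": -1})  # "ERR_PRICE"
```
        </preformat>
        <p>This is also why a strong invalid test input for the i-th rule must satisfy the rules 1 to i − 1: otherwise an earlier rule masks the rule under test.</p>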
        <p>For  -wise strong invalid coverage, the lowest AFDE of the validation rules and detailed descriptions of the
value is 0.98 (no robustness interactions) which is always faults, are confidential. To improve transparency and
higher than the AFDE values of  -wise relevant and valid reproducibility, we describe the faults and make the
charcoverage. Furthermore,  -wise strong invalid coverage acteristics of the IPM and RIPM explicit.
with robustness interactions has an AFDE value of 1 and To avoid any bias, both the IPM and RIPM are modeled
therefore detects all faults reliably. systematically and share the same set of parameters and</p>
        <p>Overall, the combination of  -wise valid coverage and values. To prevent falsified results due to accidental fault
 -wise strong invalid coverage performs the best and triggering, the orders of parameters and values are
ranalways detects all faults reliably. domized and 20 diferent variants are used in test input</p>
        <p>When putting the AFDE values in relation to test suite selection. All presented FDE values are average values.
sizes, it can be noted that  -wise relevant coverage has Since this is a case study with only one case, it is
difithe worst eficiency as it requires 15023.70 test inputs for cult to generalize the findings [ 10]. Further on, it has to
an AFDE value of 0.62. In contrast,  -wise valid coverage be noted that the archival data of this case study is only a
only requires 48.30 test inputs for an AFDE value of 0.62. snapshot and the ground truth, i.e. the existing and
pre</p>
        <p>The best eficiency is ofered by the combination of viously existing faults, is unknown. Hence, the data can
 -wise valid coverage with  = 1 and  -wise strong invalid be biased towards simpler faults that are easier to detect.
coverage with  = 0 which requires 308.00 test inputs To prevent too far-reaching conclusions, we describe the
for an AFDE value of 1.00. When using an AFDE value characteristics of the SUT and also limit our conclusions
of 0.92 as a lower boundary (12 out of 13 faults),  -wise to similar systems with many validation rules.
strong invalid coverage with  = 0 is suficient and only
requires 301.00 test inputs for an AFDE value of 0.98.</p>
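        <p>The relation between FDE, AFDE, and test suite size can be sketched as follows. The per-fault FDE vector is an assumption that merely reproduces the reported aggregate for t-wise valid coverage (faults nos. 1 to 8 detected reliably, nos. 9 to 13 never); the suite sizes are the reported averages:</p>
        <preformat>
```python
# AFDE as the mean of FDE values over the set of faults (simplified
# reading of the metric defined earlier in the paper).
def afde(fde_values):
    return sum(fde_values) / len(fde_values)

# Assumed per-fault FDEs for t-wise valid coverage.
value = afde([1.0] * 8 + [0.0] * 5)   # 8/13 = 0.615... -> reported 0.62

# Efficiency: reported average test suite size per reached AFDE value.
cost_per_afde = {
    "relevant":       15023.70 / 0.62,
    "valid":             48.30 / 0.62,
    "strong invalid":   301.00 / 0.98,
    "combination":      308.00 / 1.00,
}
```
        </preformat>
        <p>On these numbers, t-wise valid coverage is by far the cheapest per AFDE point, while only the combination reaches an AFDE value of 1.00.</p>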
        <p>This discussion about efficiency is, of course, influenced
by the characteristics of the 13 faults and cannot
be generalized. As a more general statement, however, it can be
observed that t-wise relevant coverage requires more test
inputs to reach a similar AFDE value than t-wise valid
coverage, t-wise strong invalid coverage, or the
combination of both. At the same time, the combination of
t-wise valid coverage and t-wise strong invalid coverage
always has an AFDE value of 1.00 while at most 2224.30
test inputs are used. This finding is also consistent with
our prior experimental evaluation (cf. [7]).</p>
        <p>Therefore, we draw the conclusion that t-wise valid
coverage, t-wise strong invalid coverage, and the
combination of both perform as well as or better than t-wise
relevant coverage in terms of effectiveness and efficiency.</p>
        <p>However, these findings are derived from only one
particular case. Therefore, we do not consider them to hold
for all SUTs, but for SUTs with many validation rules.</p>
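        <p>The t-wise coverage notion underlying these criteria can be illustrated for t = 2 with a toy model of three binary parameters (an assumption for illustration; the industrial IPM is far larger). Relevant, valid, and strong invalid coverage differ only in which value domains such combinations are required over:</p>
        <preformat>
```python
import itertools

# Toy model: three parameters with two values each.
model = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}

def required_pairs(model):
    # All parameter-value pairs that 2-wise (pairwise) coverage must hit.
    pairs = set()
    for (p, vs), (q, ws) in itertools.combinations(model.items(), 2):
        pairs.update((p, v, q, w) for v in vs for w in ws)
    return pairs

def covered_pairs(suite):
    # Pairs actually covered by a given test suite.
    hit = set()
    for test in suite:
        for p, q in itertools.combinations(sorted(test), 2):
            hit.add((p, test[p], q, test[q]))
    return hit

# 6 of the 8 possible test inputs suffice to cover all 12 required
# pairs here (a t-wise selection strategy would aim for even fewer).
suite = [{"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 1, "c": 1},
         {"a": 0, "b": 1, "c": 1}, {"a": 1, "b": 0, "c": 0},
         {"a": 1, "b": 1, "c": 0}, {"a": 0, "b": 0, "c": 1}]
```
        </preformat>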
      </sec>
      <sec id="sec-6">
        <title>6. Threats to Validity</title>
        <p>We compare the effectiveness of CRT using an implementation
of the ROBUSTA test selection strategy with CT
using an implementation of the IPOG-C test selection strategy.
To ensure an unbiased implementation, both implementations
follow the guidelines of Kleine &amp; Simos [20].
Further on, the source code of the test selection strategies
is published as part of the coffee4j open-source test
automation framework1.</p>
        <p>The effectiveness of CRT and CT highly depends on the
IPM and RIPM. Furthermore, the effectiveness depends
on the faults that are considered in this case study.
Unfortunately, details of the case, i.e. the source code
of the validation rules and detailed descriptions of the
faults, are confidential. To improve transparency and
reproducibility, we describe the faults and make the
characteristics of the IPM and RIPM explicit.</p>
        <p>To avoid any bias, both the IPM and RIPM are modeled
systematically and share the same set of parameters and
values. To prevent falsified results due to accidental fault
triggering, the orders of parameters and values are
randomized and 20 different variants are used in test input
selection. All presented FDE values are average values.</p>
        <p>Since this is a case study with only one case, it is difficult
to generalize the findings [10]. Further on, it has to
be noted that the archival data of this case study is only a
snapshot and the ground truth, i.e. the existing and previously
existing faults, is unknown. Hence, the data can
be biased towards simpler faults that are easier to detect.
To prevent too far-reaching conclusions, we describe the
characteristics of the SUT and also limit our conclusions
to similar systems with many validation rules.</p>
      </sec>
      <sec id="sec-7">
        <title>7. Conclusion</title>
        <p>CRT extends CT to generate separate test suites with
valid and strong invalid test inputs in order to avoid input
masking that is caused by EH. Therefore, CRT requires
additional effort to model error-constraints and
introduces additional complexity to test selection strategies
because error-constraints must be considered. This raises
the question about the usefulness of CRT and whether
the avoidance of input masking outweighs the additional
effort and complexity. Until now, only artificial test
scenarios were used to compare CT with CRT, and it remained
unclear if the indicated advantages of CRT can be transferred
to real-world scenarios.</p>
        <p>In this paper, we therefore present the results of a case
study based on a real-world system with 31 validation
rules and 13 previously existing faults. To compare CT
with CRT, we construct an IPM and a RIPM, select test
inputs, and stimulate 13 implementations of the real-world
system, of which each implementation contains one
of the 13 previously existing faults. For the subsequent
discussion, we introduce the FDE and AFDE metrics.</p>
        <p>To summarize the findings of this case study, we
discuss both research questions individually.</p>
        <p>Research Question 1: Our results indicate that the
CRT test method is applicable in real-world test scenarios.
This case study demonstrated that RIPMs with 32
parameters and 31 error-constraints can be constructed.
Further on, the ROBUSTA test selection strategy is capable
of selecting test suites for RIPMs with 32 parameters and
31 error-constraints.</p>
        <p>Research Question 2: The comparison of CRT with
CT is consistent with the findings of our previously
conducted controlled experiment with artificial test scenarios
(cf. [7]). Since the case under analysis has much EH, CRT
performs better than CT in terms of FDE. Further on, it
requires fewer test inputs to achieve better AFDE values
than CT. Therefore, we draw the conclusion that t-wise
valid coverage, t-wise strong invalid coverage, and the
combination of both perform as well as or better than
t-wise relevant coverage in terms of effectiveness and
efficiency. Although the FDE and AFDE values are influenced
by the characteristics of the 13 faults and cannot be
generalized, we do not consider this to hold for all SUTs,
but for SUTs with much EH.</p>
        <p>In future work, we plan to conduct further case studies
to learn more about the FDE of CRT and CT.</p>
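        <p>The separation of valid and strong invalid test suites that CRT introduces can be sketched as follows. The parameter names, valid values, and error values are illustrative assumptions (the industrial IPM and RIPM are confidential):</p>
        <preformat>
```python
import itertools

# Toy model: valid values and error values per parameter (illustrative).
valid = {"quantity": [1, 5], "price": [10, 99], "currency": ["EUR", "USD"]}
invalid = {"quantity": [0], "price": [-1], "currency": ["XXX"]}

def valid_suite():
    # CT-style test inputs over valid values only (exhaustive here for
    # brevity; a t-wise strategy such as IPOG-C would select a subset).
    names = list(valid)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(valid[n] for n in names))]

def strong_invalid_suite():
    # One error value per test input, every other parameter valid, so the
    # targeted error-constraint is the only one that can fire.
    suite = []
    for name, bad_values in invalid.items():
        for bad in bad_values:
            test_input = {n: vs[0] for n, vs in valid.items()}
            test_input[name] = bad
            suite.append(test_input)
    return suite
```
        </preformat>
        <p>Keeping the two suites separate is what avoids the input masking caused by EH: valid test inputs can reach the logic behind the validation rules, and each strong invalid test input exercises exactly one error-constraint.</p>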
      </sec>
      <sec id="sec-3-8">
        <title>1 See https://coffee4j.github.io for more information.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] IEEE, IEEE Standard Glossary of Software Engineering Terminology, IEEE Std 610.12-1990 (1990).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[8] K. Fögen, H. Lichter, An experiment to compare combinatorial testing in the presence of invalid values, in: Proceedings of the Workshop on Quantitative Approaches to Software Quality co-located with 26th Asia-Pacific Software Engineering Conference (APSEC 2019), Putrajaya, Malaysia, December 2, 2019, 2019, pp. 27-36.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[9] B. A. Kitchenham, L. Pickard, S. L. Pfleeger, Case studies for method and tool evaluation, IEEE Softw. 12 (1995) 52-62.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[10] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical Software Engineering 14 (2009) 131-164.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[11] M. Grindal, J. Offutt, S. F. Andler, Combination testing strategies: a survey, Softw. Test., Verif. Reliab. 15 (2005) 167-199.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[12] C. Nie, H. Leung, The minimal failure-causing schema of combinatorial testing, ACM Trans. Softw. Eng. Methodol. 20 (2011) 15:1-15:38.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[13] L. Yu, Y. Lei, M. N. Borazjany, R. Kacker, D. R. Kuhn, An efficient algorithm for constraint handling in combinatorial test generation, in: Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013, Luxembourg, Luxembourg, March 18-22, 2013, 2013, pp. 242-251.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[14] D. M. Cohen, S. R. Dalal, M. L. Fredman, G. C. Patton, The AETG system: An approach to testing based on combinatorial design, IEEE Trans. Software Eng. 23 (1997) 437-444.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[20] K. Kleine, D. E. Simos, An efficient design and implementation of the in-parameter-order algorithm, Mathematics in Computer Science 12 (2018) 51-67.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>