8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020)

An Industrial Case Study on Fault Detection Effectiveness of Combinatorial Robustness Testing

Konrad Fögen, Horst Lichter
Research Group Software Construction, RWTH Aachen University, Aachen, Germany
https://www.swc.rwth-aachen.de
foegen@swc.rwth-aachen.de (K. Fögen), lichter@swc.rwth-aachen.de (H. Lichter), ORCID 0000-0002-3440-1238 (H. Lichter)

QuASoQ 2020: 8th International Workshop on Quantitative Approaches to Software Quality, December 01, 2020, Singapore. © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract

Combinatorial robustness testing (CRT) is an extension of combinatorial testing (CT) that separates test suites with valid and strong invalid test inputs. Until now, only one controlled experiment using artificial test scenarios was conducted to compare CRT with CT. Its results indicate advantages of CRT when much exception handling is involved, but it is unclear whether these advantages also hold in the real world. In this paper, we present the results of a case study conducted to compare the fault detection effectiveness of CRT and CT by testing an industrial system with 31 validation rules and 13 injected faults.

Keywords

Software Testing, Combinatorial Testing, Robustness Testing

1. Introduction

Robustness is an important property of software. It describes "the degree to which a system [...] can function correctly in the presence of [invalid inputs]" [1]. Invalid inputs are caused by external faults, i.e. faults in other systems or faults made by users interacting with a system. Examples are inputs to the system under test (SUT) that contain invalid values, like a string value when a numerical value is expected, or invalid value combinations, like a begin date which is after the end date. When invalid inputs remain undetected, they can propagate to failures in the SUT, resulting in abnormal behavior or crashes [2].

Developers attempt to improve the robustness of systems by implementing exception handling (EH) to detect and recover from invalid inputs. Unfortunately, EH is itself a significant source of faults (cf. [3, 4]). Therefore, it is important to test the exceptional behavior as well.

Combinatorial testing (CT) is a black-box test method that is based on an input parameter model (IPM) [5]. When considering the exceptional behavior, an IPM must describe invalid values and invalid value combinations that trigger EH. Unfortunately, invalid values and invalid value combinations can cause input masking (cf. [6, 7, 8]). When a SUT is stimulated with an invalid input, the EH is expected to detect it, to respond with an error message, and to terminate the SUT without resuming the normal behavior. Consequently, the remaining values and value combinations of the test input remain untested as they are masked.

To avoid input masking, combinatorial robustness testing (CRT) was developed as an extension to CT. It uses a robustness input parameter model (RIPM), an extension of an IPM with additional semantic information that annotates values and value combinations as invalid [7]. With this semantic information, valid test inputs can be selected which do not cover any invalid value or invalid value combination. Further on, strong invalid test inputs can be selected which contain exactly one invalid value or one invalid value combination.

Due to the separation of valid and strong invalid test inputs, the input masking effect can be avoided when testing the normal behavior and the exceptional behavior. However, in comparison to CT, which does not separate valid and strong invalid test inputs, CRT requires effort to model the additional semantic information.

Despite the presence of input masking, CT can still be effective in detecting faults, as a previous controlled experiment indicates [8]. Nevertheless, the fault detection effectiveness (FDE) of CT decreases for systems with much EH. Even for high testing strengths and large test suites, the FDE of CT deteriorates. For systems with much EH, CRT is a promising approach that can achieve a higher FDE while requiring fewer test inputs than CT [7]. For systems with little EH, CRT is at least as effective as CT.

However, the current assessment is solely based on one controlled experiment with artificial test scenarios (cf. [7]). Therefore, our objective is to further compare CRT with CT, guided by the following two research questions.

RQ 1: Is the CRT test method applicable in real-world test scenarios?

RQ 2: How does the CRT test method compare with CT in real-world test scenarios?
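The input masking effect described above can be illustrated with a small sketch. The SUT, its parameter names, and its error messages are hypothetical, not taken from the studied system: once the first invalid value triggers the exception handling, the remaining values of the same test input are never exercised.

```python
def sut(amount, currency, date):
    """Toy SUT: exception handling terminates processing on the
    first invalid value, so later parameters go unchecked."""
    checked = []
    if amount < 0:
        return "ERROR: invalid amount"   # EH fires here and masks the rest
    checked.append("amount")
    if currency not in {"EUR", "USD"}:
        return "ERROR: invalid currency"
    checked.append("currency")
    checked.append("date")               # normal behavior reached
    return f"OK, checked: {checked}"

# A test input with two invalid values: the invalid currency is masked
# because the invalid amount already triggered the exception handling.
print(sut(-5, "XXX", "2020-12-01"))  # only the amount is ever inspected
```

Any value combination involving the currency or date of such a test input therefore contributes nothing to coverage of the actually executed behavior.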
To answer these research questions, we conducted a case study. According to Kitchenham et al. [9], a case study helps to evaluate the benefits of methods and tools in industrial settings. When applied to compare methods and tools, a case study is of explanatory nature, "seeking an explanation of a situation or a problem" [10]. As Runeson & Höst state, a case study "will never provide conclusions with statistical significance" [10], but it can provide sufficient information "to help you judge if specific technologies will benefit your own organization or project" [9]. Since a case study has, by definition, a higher degree of realism than a controlled experiment [10], a case study that compares CRT with CT can provide additional insights that complement and extend the findings of the previously conducted controlled experiment.

The paper is structured as follows. Section 2 introduces basic concepts of CT and CRT. Related work is discussed in Section 3. Next, the design of the case study is introduced (Section 4) and its results are presented (Section 5). Afterwards, threats to validity are discussed (Section 6) before the paper is concluded in Section 7.

2. Background

In the following, CT and CRT are briefly introduced. For more information, please refer to [11, 5, 7].

2.1. Combinatorial Testing

CT is a black-box test method [5]. It is based on an input parameter model (IPM) which declares 𝑛 parameters, and each parameter is associated with a non-empty set of values. A schema is a set of parameter-value pairs for 𝑑 distinct parameters [12]. A schema with 𝑑 = 𝑛 parameter-value pairs is a test input. A schema 𝑎 covers another schema 𝑏 if and only if schema 𝑎 includes all parameter-value pairs of schema 𝑏.

Real-world systems are often constrained, and certain values should not be combined to schemata and test inputs [5]. These schemata are irrelevant because they are not of any interest for the test. Test inputs that cover irrelevant schemata are irrelevant as well, and their test results have no informative value. Hence, they should be excluded from testing.

Constraint handling is often used to exclude irrelevant schemata [13]. Therefore, irrelevant schemata are explicitly modeled by a set of logical expressions (called exclusion-constraints). A schema is relevant if it satisfies all exclusion-constraints. A schema is irrelevant if at least one exclusion-constraint remains unsatisfied.

A coverage criterion is a condition that must be satisfied by a test suite. A test selection strategy describes how values are combined to test inputs such that a given coverage criterion is satisfied [11]. Test suites resulting from a test selection strategy that supports constraint handling, e.g. IPOG-C [13], satisfy the 𝑡-wise relevant coverage criterion. This criterion is satisfied if the relevant test inputs of a test suite cover all relevant schemata of degree 𝑑 = 𝑡 that are described by an IPM [11, 5].

2.2. Combinatorial Robustness Testing

To avoid input masking, CRT was developed as an extension to CT that separates valid and invalid test inputs [7]. To better separate the concepts, we say that CT relies on IPMs while CRT relies on robustness input parameter models (RIPMs). A RIPM contains additional error-constraints, another set of constraints that annotate relevant schemata as invalid. A relevant schema is a valid schema if it satisfies all error-constraints. A relevant schema is an invalid schema if at least one error-constraint remains unsatisfied. Further on, an invalid schema is a strong invalid schema if exactly one error-constraint remains unsatisfied.

Test selection strategies like ROBUSTA [7] not only consider exclusion-constraints to exclude irrelevant schemata, they also consider error-constraints and exclude invalid schemata from valid test inputs. Further on, strong invalid test inputs are selected such that each invalid value and invalid value combination that is modeled by error-constraints appears in strong invalid test inputs.

Valid test inputs are selected to satisfy 𝑡-wise valid coverage. The 𝑡-wise valid coverage criterion is an extension of the 𝑡-wise relevant coverage criterion. It is satisfied if all valid schemata with a degree of 𝑑 = 𝑡 that are described by a RIPM are covered at least once by a valid test input.

Strong invalid test inputs are selected to satisfy 𝑏-wise strong invalid coverage, where 𝑏 denotes the robustness interaction degree. Without robustness interaction (𝑏 = 0), the coverage criterion is called single error coverage (cf. [11, 7]). It is satisfied if each invalid schema that is described by an error-constraint appears in a strong invalid test input. With robustness interaction (𝑏 ≥ 1), each described invalid schema is combined with all valid schemata of degree 𝑑 = 𝑏. The coverage criterion is satisfied if all combinations of invalid schemata and 𝑏-sized valid schemata are covered by strong invalid test inputs.

Following these brief introductions of CT and CRT, the conceptual difference between the two approaches should become clear. CT and CRT use the same parameters and values. But CT does not distinguish between valid and invalid schemata. Instead, both types of schemata are mixed, and the FDE purely relies on the combinatorics, i.e. different testing strengths 𝑡. In contrast, CRT distinguishes valid and invalid schemata to avoid the effect of input masking. Here too, the FDE relies on combinatorics, but the avoidance of input masking has an additional influence.

CRT requires the effort to model error-constraints. Test selection strategies that consider error-constraints also become more complex. This raises the question whether the avoidance of input masking outweighs the additional effort and complexity of CRT. Until now, only artificial test scenarios were used to compare CT with CRT (cf. [7]), and it remains unclear whether the indicated advantages of CRT can be transferred to real-world scenarios. Therefore, this case study was conducted.
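The RIPM semantics just introduced can be sketched in a few lines. This is an illustrative encoding with made-up parameters and error-constraints, not the paper's tooling: a test input is valid if it violates no error-constraint, invalid if it violates at least one, and strong invalid if it violates exactly one.

```python
# Error-constraints as predicates over a test input (a dict mapping
# parameter -> value); a constraint is violated when it returns True.
error_constraints = [
    lambda ti: ti["amount"] == "negative",                      # invalid single value
    lambda ti: ti["begin"] == "late" and ti["end"] == "early",  # invalid value pair
]

def classify(test_input):
    """Classify a relevant test input according to the RIPM semantics."""
    violated = sum(1 for c in error_constraints if c(test_input))
    if violated == 0:
        return "valid"
    return "strong invalid" if violated == 1 else "invalid"

print(classify({"amount": "positive", "begin": "early", "end": "late"}))  # valid
print(classify({"amount": "negative", "begin": "early", "end": "late"}))  # strong invalid
print(classify({"amount": "negative", "begin": "late", "end": "early"}))  # invalid
```

The "exactly one violation" rule is what lets strong invalid test inputs probe one piece of exception handling at a time without masking.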
3. Related Work

To the best of our knowledge, Sherwood [6] first mentioned invalid values in the context of CATS, a test selection strategy and tool for CT. Cohen et al. [14] and Czerwonka [15] also acknowledged the necessity to separate valid and strong invalid test inputs. They also published test selection strategies and tools, and their IPMs contain semantic information to distinguish relevant from irrelevant schemata and to distinguish valid from invalid values. However, invalid value combinations are not directly supported. Therefore, we proposed ROBUSTA and the structure of RIPMs with error-constraints [7].

Many studies exist that demonstrate the usefulness and effectiveness of CT (cf. [16, 17, 18]). But most studies do not distinguish between relevance and validness and focus on testing the normal behavior.

One case study by Wojciak & Tzoref-Brill [19] reports on applying CT and also considers testing with invalid inputs. They report that single error coverage was not sufficient because EH depended on interactions between invalid and valid values. In particular, "the same [exception] would often be handled differently depending on the firmware in control [...] or depending on the configuration of the system". A further remark is concerned with the ratio of valid versus invalid test inputs: "Since a lot of attention was given to [robustness] testing [...] where full recovery in the presence of [exceptions] was expected, the [test suite] contained a ratio of up to 2:1 [invalid test inputs vs. valid test inputs]."

4. Case Study Design

In this section, the case under analysis and the data collection procedure are introduced.

4.1. Case Under Analysis

The case is a development project conducted by an IT service provider of an insurance company, where a new software was developed to manage the life-cycle of life insurance contracts. One subsystem of the software is concerned with the validation of insurance application data according to a set of validation rules and with forwarding the data when it satisfies the validation rules. It is the same project which we analyzed in a previous case study (cf. [18]).

Altogether, 31 validation rules are defined to check insurance application data. The order of the validation rules is predefined, and all validation rules are traversed for each insurance application. Whenever a validation rule is not satisfied by an insurance application, a corresponding error code is returned and the remaining validation rules are skipped. If all validation rules are satisfied, the subsystem returns SUCCESS and the insurance application data is further processed. However, the further processing is out of scope for this case study.

Each validation rule is built as an implication consisting of two parts:

    isApplicable(application) ⇒ isValid(application)

The first part determines whether a given validation rule is applicable to the insurance application data or not. If a rule is applicable, the insurance application must not violate the rule, i.e. isValid(application). Otherwise, the validation rule is ignored.

Because details of the case are confidential, a generic example is given to further illustrate validation rules. The example depicts two validation rules that define maximum sums which can be insured depending on the permissions of the insurance agents. The first validation rule is applicable to all applications created by insurance agents with the highest level of permission. The second validation rule is applicable to all applications that are created by insurance agents with a lower permission level. The distinction between the two validation rules is made by the first part of the implication:

    Rule 1: isApplicable(application): application.agent.permission = highest_level
    Rule 2: isApplicable(application): application.agent.permission ≠ highest_level

The second part of the implication is used to enforce the maximum insured sum. As an application may consist of several partial contracts, the individual insured sums of all partial contracts are collected first. Afterwards, it is checked whether the total sum exceeds the threshold. While the structure of both rules' isValid() parts is the same, different values for the maximum_insured_sum constant are used:

    isValid(application): total_sum = ∑ partial.insured_sum  (over all partial ∈ application)
                          total_sum ≤ maximum_insured_sum

This example shows that many parameters may be involved in a validation rule, that intermediate calculations may be required, and that intermediate calculations may be reused in different validation rules. Therefore, all validation rules should be tested thoroughly.

For this case study, we consider the current set of validation rules as correct and treat them as our specification. By browsing the source code repository, we identified 13 changes that have been made to the validation rules in order to correct them. Each change documents a fault that existed previously but was fixed prior to release. Based on these 13 changes, we reconstructed 13 implementation versions of which each contains one fault.
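The implication structure of the two example rules can be sketched as follows. Permission labels, thresholds, and field names are invented for illustration; the real constants are confidential.

```python
from dataclasses import dataclass, field

@dataclass
class Application:
    agent_permission: str                 # e.g. "highest" or "standard" (assumed labels)
    partial_sums: list = field(default_factory=list)

def make_rule(applicable_when, maximum_insured_sum):
    """Build a rule of the form isApplicable(app) => isValid(app):
    a non-applicable rule is trivially satisfied."""
    def check(app):
        if not applicable_when(app):
            return True                   # rule ignored for this application
        total = sum(app.partial_sums)     # collect sums of all partial contracts
        return total <= maximum_insured_sum
    return check

# Thresholds are illustrative only.
rule1 = make_rule(lambda a: a.agent_permission == "highest", 1_000_000)
rule2 = make_rule(lambda a: a.agent_permission != "highest", 250_000)

app = Application("standard", [100_000, 200_000])
print(rule1(app), rule2(app))  # True (not applicable), False (300000 > 250000)
```

The shared isValid() shape with differing constants mirrors how intermediate calculations (here, the total sum) are reused across rules.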
The 13 faults can be classified according to our robustness fault classification (cf. [7]). Five faults can only be detected by invalid test inputs, while eight faults can be detected by both valid and invalid test inputs. Two of these five faults can be classified as faults in error-signaling. To reveal them, invalid test inputs must trigger EH which responds with an incorrect error code. The other three faults can be classified as faults in error-detection conditions. The conditions are too weak and do not detect invalid test inputs. Hence, the SUT incorrectly continues with its normal behavior.

The remaining eight faults can be detected by both valid and invalid test inputs. They are faults in error-detection conditions. Four of these faults have conditions that are too strong and therefore incorrectly detect exception occurrences for valid test inputs. The other four faults have characteristics of being too weak and too strict at the same time because wrong parameters with similar characteristics are used in the exception condition. As a consequence, an invalid test input may not violate the condition (too weak) while a valid test input may not satisfy the condition (too strong).

4.2. Data Collection Procedure

Data collection refers to the measurement and calculation of metric values from test execution. Therefore, metrics are defined in this section. Furthermore, the modeling of the IPM and RIPM as well as the selection and execution of test inputs is described.

4.2.1. Metrics

The resources available from the software development project are not directly analyzed and compared. Instead, they are used to reconstruct the implementation versions for test execution and to create a RIPM and an IPM that represent variations of insurance application data. Based on the RIPM and IPM, test inputs are selected using a CT and a CRT test selection strategy. Then, the test inputs are executed on the 13 reconstructed implementations to assess the effectiveness.

A common metric to assess the effectiveness is fault detection effectiveness (FDE) [11, 16]. A test suite 𝑇 is denoted as failing for a test scenario 𝑆𝐶 if at least one of the test inputs 𝜏 ∈ 𝑇 detects the fault in 𝑆𝐶.

    failing(𝑇, 𝑆𝐶) = 1 if ∃𝜏 ∈ 𝑇 that fails for 𝑆𝐶, and 0 otherwise

Using the failing function, FDE is defined as the ratio between the number of test suites 𝑇 of a test suite family 𝑇* that fail for a test scenario 𝑆𝐶 and the number of all test suites in the family 𝑇*. In this case study, each family of test suites contains 20 different variants. In other words, the FDE is based on 20 randomized test suites that all satisfy the same coverage criterion for the same IPM or RIPM, and they all test the same test scenario.

    FDE(𝑇*, 𝑆𝐶) = ( ∑_{𝑇 ∈ 𝑇*} failing(𝑇, 𝑆𝐶) ) / |𝑇*|

Further on, the average fault detection effectiveness (AFDE) denotes the average FDE over a family of test scenarios 𝑆𝐶*. In our case study, the family of test scenarios 𝑆𝐶* consists of the 13 reconstructed implementations. The AFDE represents the average effectiveness of CRT and CT equally distributed over the 13 faults.

    AFDE(𝑇*, 𝑆𝐶*) = ( ∑_{𝑆𝐶 ∈ 𝑆𝐶*} FDE(𝑇*, 𝑆𝐶) ) / |𝑆𝐶*|

4.2.2. Modeling of IPM and RIPM

Since the FDE and AFDE metrics highly depend on the quality of the RIPM and IPM, a systematic modeling approach is necessary. We model the IPM first and later extend it with error-constraints to get a RIPM.

The IPM is modeled iteratively for one validation rule at a time. In each iteration, parameters and values are added to ensure that test inputs with the following three characteristics can be selected: (1) test inputs that are not applicable; (2) test inputs that are applicable and valid; (3) test inputs that are applicable but not valid. In addition, some exclusion-constraints are introduced to ensure syntactic correctness of selected test inputs. The IPM is considered complete once it contains all parameters and values necessary to satisfy branch coverage of each validation rule.

For the RIPM, the modeling of additional error-constraints is required. The error-constraints are modeled iteratively, and we add new or update existing ones until the separation of valid and strong invalid test inputs conforms to the responses of the SUT, i.e. the SUT returns SUCCESS for each valid test input and an error code for each strong invalid test input.

In total, the IPM and RIPM consist of 32 parameters and 106 values. Most parameters have two, three, or four values each. But two parameters have six values each, and one parameter even has nine values. Three exclusion-constraints, of which each restricts combinations of two parameters, are required to ensure syntactical correctness of the insurance applications. Furthermore, the RIPM contains 31 error-constraints. 15 error-constraints annotate single values as invalid. The remaining 16 error-constraints annotate schemata with 2, 3, or 5 values. The complete IPM and RIPM are described below in exponential notation. For parameters and values, x^y refers to y parameters with x values. For exclusion- and error-constraints, x^y refers to y constraints over x parameters.

    Parameters & Values:   9^1 6^2 5^1 4^8 3^8 2^12
    Exclusion-Constraints: 2^3
    Error-Constraints:     5^2 3^6 2^8 1^15

Table 1
Test suite sizes of test suites for different coverage criteria

    Coverage Criteria                 t    b        Size
    t-wise relevant coverage          1    -        9.00
                                      2    -       68.10
                                      3    -      480.10
                                      4    -     2813.45
                                      5    -    15023.70
    t-wise valid coverage             1    -        7.00
                                      2    -       48.30
                                      3    -      267.95
    b-wise strong invalid coverage    -    0      301.00
                                      -    1     1956.35
    t-wise valid coverage and         1    0      308.00
    b-wise strong invalid coverage    1    1     1963.35
                                      2    0      349.30
                                      2    1     2004.65
                                      3    0      568.95
                                      3    1     2224.30
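The failing, FDE, and AFDE definitions from Section 4.2.1 translate directly into code. The following sketch uses invented toy data (four suites, two faulty versions) only to exercise the formulas:

```python
def fde(suite_family, scenario, fails):
    """FDE: fraction of suites in the family that fail for (i.e. detect
    the fault in) the scenario; fails(suite, scenario) -> bool."""
    return sum(1 for T in suite_family if fails(T, scenario)) / len(suite_family)

def afde(suite_family, scenarios, fails):
    """AFDE: mean FDE over all test scenarios (faulty versions)."""
    return sum(fde(suite_family, sc, fails) for sc in scenarios) / len(scenarios)

# Illustrative data: a suite "fails" for a scenario if it contains at
# least one of the test inputs that detect that scenario's fault.
detecting = {"f1": {1, 2}, "f2": {2}}
suites = [{1}, {2}, {3}, {1, 2}]
fails = lambda T, sc: bool(T & detecting[sc])

print(fde(suites, "f1", fails))           # 3 of 4 suites -> 0.75
print(afde(suites, ["f1", "f2"], fails))  # (0.75 + 0.5) / 2 = 0.625
```

In the study, suite_family corresponds to the 20 randomized suites per coverage criterion and scenarios to the 13 reconstructed implementations.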
4.2.3. Selecting and Executing Test Inputs

After creating the IPM and RIPM, both models are used to select sets of test inputs. Since we compare CRT with CT, two different test selection strategies are used: ROBUSTA is used to select test inputs for the RIPM, and IPOG-C is used to select test inputs for the IPM.

To compare the FDE and AFDE of CRT with CT, test suites that satisfy different coverage criteria are used. We apply IPOG-C to select test suites that satisfy 𝑡-wise relevant coverage for 𝑡 ∈ {1, ..., 5}. Furthermore, we apply ROBUSTA to select test suites that satisfy 𝑡-wise valid coverage with 𝑡 ∈ {1, ..., 3} and that satisfy 𝑏-wise strong invalid coverage with 𝑏 ∈ {0, 1}.

To reduce the effect of accidental fault detection caused by ordering, the order of parameters and values of the input parameter models is randomly reordered, and 20 different model variants are used to select test suites for each coverage criterion.

Table 1 depicts the average sizes of test suites that satisfy the different coverage criteria. Since ROBUSTA encompasses two coverage criteria (𝑡-wise valid coverage and 𝑏-wise strong invalid coverage), the test suites are considered both separately and combined.

The largest test suite is selected by IPOG-C, which is required to satisfy 𝑡-wise relevant coverage with 𝑡 = 5 (15023.70 test inputs). The second-largest test suite is also selected by IPOG-C to satisfy 𝑡-wise relevant coverage with 𝑡 = 4 (2813.45 test inputs). The third-largest test suite is selected by ROBUSTA and satisfies 𝑡-wise valid coverage with 𝑡 = 3 and 𝑏-wise strong invalid coverage with 𝑏 = 1 (2224.30 test inputs). When comparing the test suite sizes of 𝑡-wise relevant coverage of IPOG-C with 𝑡-wise valid coverage of ROBUSTA, it can be seen that the error-constraints drastically reduce the number of valid test inputs.

After test input selection, the test suites are used to stimulate the SUT in 13 different versions. Therefore, the 13 reconstructed implementations, of which each contains one fault, are tested to determine which test suite is able to detect which fault. The results are discussed in the following section.

5. Results & Discussion

In this section, the case study results regarding the computed FDE and AFDE values are reported and discussed.

5.1. Fault Detection Effectiveness

Table 2 lists the FDE values of all test suite families applied to all 13 implementations. For better readability, + is used to indicate an FDE value of 1.00. The faults nos. 1 to 8 can all be detected by both valid and invalid test inputs, while the faults nos. 9 to 13 can only be detected by invalid test inputs. Again, the shown FDE value is an average value for one test suite family with 20 different test suites that are created by randomizing the order of parameters and values before selecting test inputs. As an example, in the first row for fault no. 3, an FDE value of 0.05 means that one out of 20 test suites detected the fault at least once.

Table 2
FDE values for different coverage criteria (+ indicates an FDE value of 1.00)

    Coverage Criteria   t  b |    1    2    3    4    5    6    7    8    9   10   11   12   13 | AFDE
    t-wise relevant     1  - |    0    0 0.05 0.05    0    0    0    0    0    0 0.25 0.05    0 | 0.03
    coverage            2  - | 0.10 0.10 0.45 0.20 0.10    0    0    0    0    0 0.65 0.20    0 | 0.14
                        3  - | 0.75 0.75    +    + 0.65 0.05 0.10 0.05 0.05    0    + 0.65    0 | 0.47
                        4  - |    +    +    +    +    + 0.15 0.10 0.05    0    0    +    +    0 | 0.56
                        5  - |    +    +    +    +    + 0.50 0.35 0.15 0.05    0    +    + 0.05 | 0.62
    t-wise valid        1  - | 0.75 0.75    +    + 0.50 0.50    + 0.80    0    0    0    0    0 | 0.48
    coverage            2  - |    +    +    +    +    +    +    +    +    0    0    0    0    0 | 0.62
                        3  - |    +    +    +    +    +    +    +    +    0    0    0    0    0 | 0.62
    b-wise strong       -  0 |    +    +    +    +    +    + 0.90 0.80    +    +    +    +    + | 0.98
    invalid coverage    -  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
    t-wise valid and    1  0 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
    b-wise strong       1  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
    invalid coverage    2  0 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
                        2  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
                        3  0 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +
                        3  1 |    +    +    +    +    +    +    +    +    +    +    +    +    + |    +

As can be observed, 𝑡-wise relevant coverage is not able to detect all faults reliably. The FDE values increase when testing strength 𝑡 grows. But even with 𝑡 = 5 (15023.70 test inputs), only 7 faults are detected reliably (FDE value of 1.00). Further on, fault no. 10 remains undetected (FDE value of 0), and faults nos. 9 and 13 are only detected by one out of 20 test suites (FDE value of 0.05).

The CRT coverage criteria are characterized by avoiding the invalid input masking effect. Since all invalid schemata are excluded by 𝑡-wise valid coverage, the faults nos. 9 to 13 cannot be detected. But for all other faults, 𝑡-wise valid coverage has higher FDE values for the same testing strength 𝑡 when compared to 𝑡-wise relevant coverage. Because invalid input masking is avoided, a testing strength of 𝑡 = 2 is sufficient to detect faults nos. 1 to 8 reliably (FDE values of 1.00).

Using 𝑏-wise strong invalid coverage with 𝑏 = 0, 11 out of 13 faults can already be detected reliably, and the two remaining faults have high FDE values of 0.90 and 0.80. The effectiveness of robustness interactions is even higher, and all faults can be detected reliably with 𝑏 = 1.
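The coverage requirements behind 𝑏-wise strong invalid coverage can be sketched by enumeration. This is a simplified illustration over a made-up model, not the ROBUSTA selection algorithm: each invalid value must appear together with every valid schema of size 𝑏 over the remaining parameters, and 𝑏 = 0 degenerates to single error coverage.

```python
from itertools import combinations, product

# Hypothetical model: parameter -> (valid values, invalid values).
model = {"p1": (["a", "b"], ["X"]), "p2": (["c", "d"], []), "p3": (["e"], ["Y"])}

def strong_invalid_requirements(b):
    """Combinations that strong invalid test inputs must cover: one
    invalid value plus a size-b valid schema over other parameters."""
    reqs = []
    for p, (_, invalid) in model.items():
        for iv in invalid:
            others = [q for q in model if q != p]
            for combo in combinations(others, b):
                valid_lists = [[(q, v) for v in model[q][0]] for q in combo]
                for valid_schema in product(*valid_lists):
                    reqs.append({(p, iv), *valid_schema})
    return reqs

print(len(strong_invalid_requirements(0)))  # 2 invalid values -> 2 requirements
print(len(strong_invalid_requirements(1)))  # 7 for this model
```

This also shows why 𝑏 = 1 suites grow so much larger than 𝑏 = 0 suites in Table 1: every invalid value is multiplied by the valid values of the other parameters.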
Four faults that have too strong error-detection conditions and that actually require valid test inputs to be detected are also reliably detected by 𝑏-wise strong invalid coverage. We could observe that a strong invalid test input that is expected to violate the error-detection condition of the 𝑙-th validation rule is also expected to satisfy all prior validation rules from 1 to 𝑙 − 1. Therefore, strong invalid test inputs can be considered as "partially-valid" test inputs that are able to accidentally detect faults that require valid test inputs. This effect is strengthened by robustness interactions because more test inputs are selected and more interactions are covered by them.

ROBUSTA combines 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage, and the FDE values show that test suites for both coverage criteria complement each other. Since valid and strong invalid test inputs are able to detect faults nos. 1 to 8, the FDE values are complemented by the combination of both test suites. For faults nos. 9 to 13, the FDE values are not complemented by the combination of both test suites. This is because test suites that only satisfy 𝑡-wise valid coverage cannot detect these faults. Therefore, the FDE values of the combined test suites are the same as the FDE values of the test suites that satisfy 𝑏-wise strong invalid coverage.

In order to detect all faults reliably, 𝑏-wise strong invalid coverage must be selected because faults nos. 9 to 13 remain undetected otherwise. Either robustness interaction (𝑏 > 0) or the combination of 𝑏-wise strong invalid coverage with 𝑡-wise valid coverage is required to reliably detect faults nos. 1 to 8. Even though 𝑡 = 1 is only sufficient to detect three of the first eight faults reliably, the combination with 𝑏-wise strong invalid coverage improves the FDE, and all faults can be detected reliably.

The discussion of the FDE shows which coverage criteria are appropriate to reliably detect different types of faults. Next, we discuss the AFDE over all 13 faults.

5.2. Average Fault Detection Effectiveness

Because AFDE values are average values over a set of faults, AFDE allows making general statements about both the effectiveness and the efficiency of coverage criteria. First, we discuss the effectiveness in terms of the AFDE values of different coverage criteria, which are listed in Table 2. Afterwards, we discuss the efficiency in terms of AFDE values in relation to test suite sizes (listed in Table 1).

The AFDE values reflect what we discussed before since they aggregate FDE values. Because of the invalid input masking effect, test suites that satisfy 𝑡-wise relevant coverage only reach an AFDE value of 0.62.

In direct comparison, test suites that satisfy 𝑡-wise valid coverage reach a maximum AFDE value of 0.62 as well. The same AFDE value can be reached because they prevent invalid input masking. However, the AFDE value cannot be further improved by increasing the testing strength because faults nos. 1 to 8 are already detected reliably and faults nos. 9 to 13 cannot be detected by valid test inputs. Comparing the two coverage criteria for each testing strength individually shows that the AFDE value of 𝑡-wise valid coverage is always higher than the AFDE value of 𝑡-wise relevant coverage.

For 𝑏-wise strong invalid coverage, the lowest AFDE value is 0.98 (no robustness interactions), which is always higher than the AFDE values of 𝑡-wise relevant and valid coverage. Furthermore, 𝑏-wise strong invalid coverage with robustness interactions has an AFDE value of 1 and therefore detects all faults reliably. Overall, the combination of 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage performs best and always detects all faults reliably.

When putting the AFDE values in relation to test suite sizes, it can be noted that 𝑡-wise relevant coverage has the worst efficiency, as it requires 15023.70 test inputs for an AFDE value of 0.62. In contrast, 𝑡-wise valid coverage only requires 48.30 test inputs for an AFDE value of 0.62. The best efficiency is offered by the combination of 𝑡-wise valid coverage with 𝑡 = 1 and 𝑏-wise strong invalid coverage with 𝑏 = 0, which requires 308.00 test inputs for an AFDE value of 1.00. When using an AFDE value of 0.92 as a lower boundary (12 out of 13 faults), 𝑏-wise strong invalid coverage with 𝑏 = 0 is sufficient and only requires 301.00 test inputs for an AFDE value of 0.98.

This discussion about efficiency is, of course, influenced by the characteristics of the 13 faults and cannot be generalized. But as more general statements, it can be observed that 𝑡-wise relevant coverage requires more test inputs to reach an AFDE value similar to that of 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, or the combination of both. At the same time, the combination of 𝑡-wise valid coverage and 𝑏-wise strong invalid coverage always has an AFDE value of 1.00 while at most 2224.30 test inputs are used. This finding is also consistent with our prior experimental evaluation (cf. [7]).

Therefore, we draw the conclusion that 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, and the combination of both perform as well as or better than 𝑡-wise relevant coverage in terms of effectiveness and efficiency. However, the findings are only derived from one particular case. Therefore, we do not consider this to be true for all SUTs but for SUTs with many validation rules.

6. Threats to Validity

We compare the effectiveness of CRT using an implementation of the ROBUSTA test selection strategy with CT using an implementation of the IPOG-C test selection strategy. To ensure unbiased implementations, both follow the guidelines of Kleine & Simos [20]. Further on, the source code of the test selection strategies is published as part of the coffee4j open-source test automation framework¹.

The effectiveness of CRT and CT highly depends on the IPM and RIPM. Furthermore, the effectiveness depends on the faults that are considered in this case study. Unfortunately, details of the case, i.e. the source code of the validation rules and detailed descriptions of the faults, are confidential. To improve transparency and reproducibility, we describe the faults and make the characteristics of the IPM and RIPM explicit.

To avoid any bias, both the IPM and RIPM are modeled systematically and share the same set of parameters and values. To prevent falsified results due to accidental fault triggering, the orders of parameters and values are randomized, and 20 different variants are used in test input selection. All presented FDE values are average values.

Since this is a case study with only one case, it is difficult to generalize the findings [10]. Further on, it has to be noted that the archival data of this case study is only a snapshot, and the ground truth, i.e. the existing and previously existing faults, is unknown. Hence, the data can be biased towards simpler faults that are easier to detect. To prevent too far-reaching conclusions, we describe the characteristics of the SUT and also limit our conclusions to similar systems with many validation rules.

7. Conclusion

CRT extends CT to generate separate test suites with valid and strong invalid test inputs in order to avoid input masking that is caused by EH. Therefore, CRT requires additional effort to model error-constraints and introduces additional complexity to test selection strategies because error-constraints must be considered. This raises the question about the usefulness of CRT and whether the avoidance of input masking outweighs the additional effort and complexity. Until now, only artificial test scenarios were used to compare CT with CRT, and it remained unclear whether the indicated advantages of CRT can be transferred to real-world scenarios.

In this paper, we therefore present the results of a case study based on a real-world system with 31 validation rules and 13 previously existing faults. To compare CT with CRT, we construct an IPM and a RIPM, select test inputs, and stimulate 13 implementations of the real-world system, of which each implementation contains one of the 13 previously existing faults. For the subsequent discussion, we introduce the FDE and AFDE metrics.

To summarize the findings of this case study, we discuss both research questions individually.

Research Question 1: Our results indicate that the CRT test method is applicable in real-world test scenarios. This case study demonstrated that RIPMs with 32 parameters and 31 error-constraints can be constructed. Further on, the ROBUSTA test selection strategy is capable of selecting test suites for RIPMs with 32 parameters and 31 error-constraints.

Research Question 2: The comparison of CRT with CT is consistent with the findings of our previously conducted controlled experiment with artificial test scenarios (cf. [7]). Since the case under analysis has much EH, CRT performs better than CT in terms of FDE. Further on, it requires fewer test inputs to achieve better AFDE values than CT.

Therefore, we draw the conclusion that 𝑡-wise valid coverage, 𝑏-wise strong invalid coverage, and the combination of both perform as well as or better than 𝑡-wise relevant coverage in terms of effectiveness and efficiency.

¹ See https://coffee4j.github.io for more information.

References

… Workshop on Quantitative Approaches to Software Quality co-located with 26th Asia-Pacific Software Engineering Conference (APSEC 2019), Putrajaya, Malaysia, December 2, 2019, pp. 27–36.
[9] B. A. Kitchenham, L. Pickard, S. L. Pfleeger, Case studies for method and tool evaluation, IEEE Softw. 12 (1995) 52–62.
[10] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical Software Engineering 14 (2009) 131–164.
[11] M. Grindal, J. Offutt, S. F. Andler, Combination testing strategies: a survey, Softw. Test., Verif. Reliab. 15 (2005) 167–199.
[12] C. Nie, H.
Leung, The minimal failure-causing Although, the FDE and AFDE values are influenced by schema of combinatorial testing, ACM Trans. Softw. the characteristics of the 13 faults and cannot be general- Eng. Methodol. 20 (2011) 15:1–15:38. ized. Therefore, we do not consider this to be true for all [13] L. Yu, Y. Lei, M. N. Borazjany, R. Kacker, D. R. Kuhn, SUTs but for SUTs with much EH. An efficient algorithm for constraint handling in In future work, we plan to conduct further case studies combinatorial test generation, in: Sixth IEEE In- to learn more about the FDE of CRT and CT. ternational Conference on Software Testing, Ver- ification and Validation, ICST 2013, Luxembourg, Luxembourg, March 18-22, 2013, 2013, pp. 242–251. References [14] D. M. Cohen, S. R. Dalal, M. L. Fredman, G. C. Patton, The AETG system: An approach to testing based on [1] IEEE, IEEE Standard Glossary of Software Engi- combinatiorial design, IEEE Trans. Software Eng. neering Terminology, IEEE Std 610.12-1990 (1990). 23 (1997) 437–444. [2] A. Avižienis, J. Laprie, B. Randell, C. E. Landwehr, [15] J. Czerwonka, Pairwise testing in real world, in: Basic concepts and taxonomy of dependable and 24th Pacific Northwest Software Quality Confer- secure computing, IEEE Trans. Dependable Sec. ence, volume 200, Citeseer, 2006. Comput. 1 (2004) 11–33. [16] J. Petke, M. B. Cohen, M. Harman, S. Yoo, Practical [3] C. Marinescu, Are the classes that use exceptions combinatorial interaction testing: Empirical find- defect prone?, in: Proceedings of the 12th Interna- ings on efficiency and early fault detection, IEEE tional Workshop on Principles of Software Evolu- Trans. Software Eng. 41 (2015) 901–924. tion and the 7th annual ERCIM Workshop on Soft- [17] H. Wu, n. changhai, J. Petke, Y. Jia, M. Harman, ware Evolution, EVOL/IWPSE 2011, Szeged, Hun- An empirical comparison of combinatorial testing, gary, September 5-6, 2011., 2011, pp. 56–60. random testing and adaptive random testing, IEEE [4] P. Sawadpong, E. 
B. Allen, B. J. Williams, Exception Transactions on Software Engineering (2018) 1–1. handling defects: An empirical study, in: 2012 IEEE [18] K. Fögen, H. Lichter, A case study on robust- 14th International Symposium on High-Assurance ness fault characteristics for combinatorial test- Systems Engineering, 2012, pp. 90–97. ing - results and challenges, in: Proceedings of [5] C. Nie, H. Leung, A survey of combinatorial testing, the 6th International Workshop on Quantitative ACM Comput. Surv. 43 (2011) 11:1–11:29. Approaches to Software Quality co-located with [6] G. B. Sherwood, Effective testing of factor combi- 25th Asia-Pacific Software Engineering Conference nations, in: Proceedings of the Third International (APSEC 2018), Nara, Japan, December 4, 2018., 2018, Conference on Software Testing, Analysis and Re- pp. 22–29. view, Washington, DC, 1994, pp. 151–166. [19] P. Wojciak, R. Tzoref-Brill, System level combina- [7] K. Fögen, H. Lichter, Combinatorial robustness torial testing in practice - the concurrent mainte- testing with negative test cases, in: Proceedings of nance case study, in: Seventh IEEE International the 19th IEEE International Conference on Software Conference on Software Testing, Verification and Quality, Reliability and Security, QRS 2019, Sofia, Validation, ICST 2014, March 31 2014-April 4, 2014, Bulgaria, July 22-26, 2019, 2019, pp. 34–45. Cleveland, Ohio, USA, 2014, pp. 103–112. [8] K. Fögen, H. Lichter, An experiment to compare [20] K. Kleine, D. E. Simos, An efficient design and im- combinatorial testing in the presence of invalid plementation of the in-parameter-order algorithm, values, in: Proceedings of the 7th International Mathematics in Computer Science 12 (2018) 51–67. 36
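As a supplementary illustration of the FDE and AFDE metrics discussed above: the paper averages detection outcomes over randomized test-suite variants (20 in the study) and then over the faults under test. The following minimal Python sketch mirrors that computation; the function and variable names are ours and the outcome data is hypothetical, not taken from the case study.

```python
from statistics import mean

def fde(detections):
    """FDE of one fault: the fraction of randomized test-suite
    variants in which the fault is detected."""
    return mean(1.0 if detected else 0.0 for detected in detections)

def afde(detections_per_fault):
    """AFDE: the mean FDE over all faults under test."""
    return mean(fde(d) for d in detections_per_fault)

# Hypothetical outcomes for 3 faults across 4 randomized variants:
outcomes = [
    [True, True, True, True],      # detected in every variant -> FDE 1.0
    [True, False, True, True],     # detected in 3 of 4 variants -> FDE 0.75
    [False, False, False, False],  # never detected -> FDE 0.0
]
print(afde(outcomes))  # prints ~0.583 (the mean of 1.0, 0.75, and 0.0)
```

Under this reading, a fault that valid test inputs cannot reach (such as faults nos. 9 to 13 above) contributes an FDE of 0 to a valid-only test suite, which is why the AFDE of 𝑡-wise valid coverage plateaus at 0.62 regardless of testing strength.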