<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Industrial Case Study on Fault Detection Effectiveness of Combinatorial Robustness Testing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konrad Fögen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Lichter</string-name>
          <email>lichter@swc.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <kwd-group>
          <kwd>Software Testing</kwd>
          <kwd>Combinatorial Testing</kwd>
          <kwd>Robustness Testing</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group Software Construction, RWTH Aachen University</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proceedings</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proceedings</institution>
          ,
          <addr-line>CEUR-WS.org</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>Combinatorial robustness testing (CRT) is an extension of combinatorial testing (CT) to separate test suites with valid and strong invalid test inputs. Until now, only one controlled experiment using artificial test scenarios was conducted to compare CRT with CT. The results indicate advantages of CRT when much exception handling is involved. But it is unclear if these advantages also hold in the real world. In this paper, we present the results of a case study conducted to compare the fault detection effectiveness of CRT and CT by testing an industrial system with 31 validation rules and 13 injected faults.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Robustness is an important property of software. It describes “the degree to which a system [...] can function correctly in the presence of [invalid inputs]” [<xref ref-type="bibr" rid="ref9">1</xref>]. Invalid inputs are caused by external faults, i.e. faults in other systems or components. Examples are inputs to the system under test (SUT) that contain invalid values, like a string value when a numerical value is expected, or invalid value combinations, like a begin date which is after the end date. When invalid inputs remain undetected, they can propagate to failures in the SUT, resulting in abnormal behavior or crashes [2].
      </p>
      <p>Developers attempt to improve the robustness of systems by implementing exception handling (EH) to detect and recover from invalid inputs. Unfortunately, EH is itself a significant source of faults (cf. [3, 4]). Therefore, it is important to test the exceptional behavior as well.</p>
      <p>Combinatorial testing (CT) is a black-box test method that is based on an input parameter model (IPM) [5]. When considering the exceptional behavior, an IPM must describe invalid values and invalid value combinations that trigger EH. Unfortunately, invalid values and invalid value combinations can cause input masking (cf. [<xref ref-type="bibr" rid="ref10 ref3">6, 7, 8</xref>]).</p>
      <p>When a SUT is stimulated with an invalid input, the EH is expected to detect it, to respond with an error message, and to terminate the SUT without resuming the normal behavior. Consequently, the remaining values and value combinations of the test input remain untested as they are masked.</p>
      <p>To avoid input masking, combinatorial robustness testing (CRT) is developed as an extension to CT using a robustness input parameter model (RIPM), which extends an IPM with additional semantic information to annotate values and value combinations as invalid [<xref ref-type="bibr" rid="ref3">7</xref>].</p>
      <p>With this semantic information, valid test inputs can be selected which contain no invalid value or invalid value combination. Further on, strong invalid test inputs can be selected which contain exactly one invalid value or one invalid value combination.</p>
      <p>Due to the separation of valid and strong invalid test inputs, the input masking effect can be avoided when testing the normal behavior and the exceptional behavior.</p>
      <p>However, in comparison to CT, which does not separate valid and strong invalid test inputs, CRT requires effort to model the additional semantic information.</p>
      <p>Despite the presence of input masking, CT can still be effective in detecting faults, as a previous controlled experiment indicates [<xref ref-type="bibr" rid="ref10">8</xref>]. Nevertheless, the fault detection effectiveness (FDE) of CT decreases for systems with much EH. Even for high testing strengths and large test suites, the FDE of CT deteriorates. For systems with much EH, CRT is a promising approach that can achieve a higher FDE while requiring fewer test inputs than CT [<xref ref-type="bibr" rid="ref3">7</xref>]. For systems with little EH, CRT is at least as effective as CT.</p>
      <p>However, the current assessment is solely based on one controlled experiment with artificial test scenarios (cf. [7]). Therefore, our objective is to further compare CRT with CT, guided by the following two research questions.</p>
      <p>RQ 1: Is the CRT test method applicable in real-world test scenarios?</p>
      <p>RQ 2: How does the CRT test method compare with CT in real-world test scenarios?</p>
      <sec id="sec-1-11">
        <p>To answer these research questions, we conducted a case study. According to Kitchenham et al. [9], a case study helps to evaluate the benefits of methods and tools in industrial settings. When applied to compare methods and tools, a case study is of explanatory nature, “seeking an explanation of a situation or a problem” [10]. As Runeson &amp; Höst state, a case study “will never provide conclusions with statistical significance” [10]. But it can “provide sufficient information to help you judge if specific technologies will benefit your own organization or project” [9]. Since a case study has, by definition, a higher degree of realism than a controlled experiment [10], a case study that compares CRT with CT can provide additional insights that complement and extend the findings of the previously conducted controlled experiment.</p>
        <p>The paper is structured as follows. Section 2
introduces basic concepts of CT and CRT. Related work is
discussed in Section 3. Next, the design of the case study
is introduced (Section 4) and its results are presented
(Section 5). Afterwards, threats to validity are discussed
(Section 6) before the paper is concluded in Section 7.</p>
      </sec>
      <sec id="sec-1-12">
        <p>To avoid input masking, CRT is developed as an extension to CT that separates valid and invalid test inputs [7]. To better separate the concepts, we say that CT relies on IPMs while CRT relies on robustness input parameter models (RIPM). A RIPM contains additional error-constraints, which is another set of constraints to annotate relevant schemata as invalid. A relevant schema is a valid schema if it satisfies all error-constraints. A relevant schema is an invalid schema if at least one error-constraint remains unsatisfied. Further on, an invalid schema is a strong invalid schema if exactly one error-constraint remains unsatisfied.</p>
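<p>The classification above can be sketched in code. The following Python fragment is an illustrative sketch only: the parameter names and the two error-constraints are invented, not taken from the studied system.</p>

```python
# Sketch: classifying relevant schemata as valid, invalid, or strong invalid.
# A schema is a dict of parameter-value pairs; error-constraints are predicates
# that a schema must satisfy to be valid. All names here are illustrative.

def classify(schema, error_constraints):
    """Return 'valid', 'strong invalid', or 'invalid' for a relevant schema."""
    unsatisfied = sum(1 for c in error_constraints if not c(schema))
    if unsatisfied == 0:
        return "valid"
    if unsatisfied == 1:
        return "strong invalid"  # exactly one error-constraint violated
    return "invalid"

# Two illustrative error-constraints: a single invalid value and an
# invalid value combination (a begin date must not be after the end date).
error_constraints = [
    lambda s: s.get("amount") != "abc",                   # value must be numeric
    lambda s: not (s.get("begin", 0) > s.get("end", 0)),  # begin not after end
]

print(classify({"amount": "100", "begin": 1, "end": 2}, error_constraints))  # valid
print(classify({"amount": "abc", "begin": 1, "end": 2}, error_constraints))  # strong invalid
```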
        <p>Test selection strategies like ROBUSTA [<xref ref-type="bibr" rid="ref3">7</xref>] not only consider exclusion-constraints to exclude irrelevant schemata, they also consider error-constraints and exclude invalid schemata from valid test inputs. Further on, strong invalid test inputs are selected such that each invalid value and invalid value combination that is modeled by error-constraints appears in strong invalid test inputs.</p>
        <p>Valid test inputs are selected to satisfy t-wise valid coverage. The t-wise valid coverage criterion is an extension of the t-wise relevant coverage criterion. It is satisfied if all valid schemata with a degree of d = t that are described by a RIPM are covered at least once by a valid test input.</p>
        <p>Strong invalid test inputs are selected to satisfy r-wise strong invalid coverage, where r denotes the robustness interaction degree. Without robustness interaction (r = 0), the coverage criterion is called single error coverage (cf. [<xref ref-type="bibr" rid="ref3">11, 7</xref>]). It is satisfied if each invalid schema that is described by an error-constraint appears in a strong invalid test input. With robustness interaction (r ≥ 1), each described invalid schema is combined with all valid schemata of degree d = r. The coverage criterion is satisfied if all combinations of invalid schemata and r-sized valid schemata are covered by strong invalid test inputs.</p>
        <p>Following these brief introductions of CT and CRT, the conceptual difference between the two approaches should become clear. CT and CRT use the same parameters and values. But CT does not distinguish between valid and invalid schemata. Instead, both types of schemata are mixed and the FDE purely relies on the combinatorics, i.e. different testing strengths t. In contrast, CRT distinguishes valid and invalid schemata to avoid the effect of input masking. Here too the FDE relies on combinatorics, but the avoidance of input masking has an additional influence.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <p>In the following, CT and CRT are briefly introduced. For more information, please refer to [11, 5, 7].</p>
        <sec id="sec-2-1-1">
          <title>2.1. Combinatorial Testing</title>
          <p>CT is a black-box test method [5]. It is based on an input parameter model (IPM) which declares n parameters, and each parameter is associated with a non-empty set of values. A schema is a set of parameter-value pairs for d distinct parameters [12]. A schema with d = n parameter-value pairs is a test input. A schema A covers another schema B if and only if schema A includes all parameter-value pairs of schema B.</p>
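<p>The covering relation can be illustrated with a small sketch; schemata are encoded as dictionaries of parameter-value pairs, and all names are invented for illustration.</p>

```python
# Sketch: schemata as dicts of parameter-value pairs.
# A schema A covers a schema B iff A includes all pairs of B.

def covers(a, b):
    """True iff schema `a` includes every parameter-value pair of schema `b`."""
    return all(a.get(p) == v for p, v in b.items())

# A test input is a schema that assigns a value to all n parameters.
test_input = {"p1": "x", "p2": "y", "p3": "z"}  # n = 3 parameters
pair = {"p1": "x", "p3": "z"}                   # a schema of degree 2

print(covers(test_input, pair))         # True: the pair appears in the test input
print(covers(test_input, {"p2": "q"}))  # False
```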
          <p>Real-world systems are often constrained and certain values should not be combined into schemata and test inputs [5]. These schemata are irrelevant because they are not of any interest for the test. Test inputs that cover irrelevant schemata are irrelevant as well and their test results have no informative value. Hence, they should be excluded from testing.</p>
          <p>Constraint handling is often used to exclude irrelevant
schemata [13]. Therefore, irrelevant schemata are
explicitly modeled by a set of logical expressions (called
exclusion-constraints). A schema is relevant if it
satisfies all exclusion-constraints. A schema is irrelevant
if at least one exclusion-constraint remains unsatisfied.</p>
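<p>This relevance check can be sketched in the same style; the exclusion-constraint shown is a made-up example and not one of the constraints of the studied system.</p>

```python
# Sketch: a schema is relevant iff it satisfies all exclusion-constraints.
# Exclusion-constraints are modeled as predicates over parameter-value pairs.

def is_relevant(schema, exclusion_constraints):
    return all(c(schema) for c in exclusion_constraints)

# Hypothetical constraint: the combination (os=Linux, browser=Edge) is excluded.
exclusion_constraints = [
    lambda s: not (s.get("os") == "Linux" and s.get("browser") == "Edge"),
]

print(is_relevant({"os": "Linux", "browser": "Firefox"}, exclusion_constraints))  # True
print(is_relevant({"os": "Linux", "browser": "Edge"}, exclusion_constraints))     # False
```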
          <p>A coverage criterion is a condition that must be satisfied by a test suite. A test selection strategy describes how values are combined to test inputs such that a given coverage criterion is satisfied [11]. Test suites resulting from a test selection strategy that supports constraint handling, e.g. IPOG-C [13], satisfy the t-wise relevant coverage criterion. This criterion is satisfied if the relevant test inputs of a test suite cover all relevant schemata of degree d = t that are described by an IPM [11, 5].</p>
          <p>CRT requires the effort to model error-constraints. Test selection strategies that consider error-constraints also become more complex. This raises the question whether the avoidance of input masking outweighs the additional effort and complexity of CRT. Until now, only artificial test scenarios are used to compare CT with CRT (cf. [<xref ref-type="bibr" rid="ref3">7</xref>]) and it remains unclear if the indicated advantages of CRT can be transferred to real-world scenarios. Therefore, this case study was conducted.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3. Related Work</title>
        <p>To the best of our knowledge, Sherwood [6] first mentioned invalid values in the context of CATS, which is a test selection strategy and tool for CT. Cohen et al. [14] and Czerwonka [15] also acknowledged the necessity to separate valid and strong invalid test inputs. They also published test selection strategies and tools, and their IPMs contain semantic information to distinguish relevant from irrelevant schemata and to distinguish valid from invalid values. However, invalid value combinations are not directly supported. Therefore, we proposed ROBUSTA and the structure of RIPMs with error-constraints [<xref ref-type="bibr" rid="ref3">7</xref>].</p>
        <p>Many studies exist that demonstrate the usefulness and effectiveness of CT (cf. [16, 17, 18]). But most studies do not distinguish between relevance and validness and focus on testing the normal behavior.</p>
        <p>One case study by Wojciak &amp; Tzoref-Brill [19] reports on applying CT and also considers testing with invalid inputs. They report that single error coverage was not sufficient because EH depended on interactions between invalid and valid values. In particular, “the same [exception] would often be handled differently depending on the firmware in control [...] or depending on the configuration of the system”. A further remark is concerned with the ratio of valid versus invalid test inputs: “Since a lot of attention was given to [robustness] testing [...] where full recovery in the presence of [exceptions] was expected, the [test suite] contained a ratio of up to 2:1 [invalid test inputs vs. valid test inputs].”</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Case Study Design</title>
      <sec id="sec-3-1">
        <p>In this section, the case under analysis and the data collection procedure are introduced.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Case Under Analysis</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>The case is a development project conducted by an IT service provider of an insurance company, where a new software was developed to manage the life-cycle of life insurance contracts. One subsystem of the software is concerned with the validation of insurance application data according to a set of validation rules and with forwarding the data when it satisfies the validation rules. It is the same project which we analyzed in a previous case study (cf. [18]).</p>
        <p>Altogether, 31 validation rules are defined to check insurance application data. The order of the validation rules is predefined and all validation rules are traversed for each insurance application. Whenever a validation rule is not satisfied by an insurance application, a corresponding error code is returned and the remaining validation rules are skipped. If all validation rules are satisfied, the subsystem returns SUCCESS and the insurance application data is further processed. However, the further processing is out of scope for this case study.</p>
        <p>Each validation rule is built as an implication consisting of two parts: isApplicable(application) ⇒ isValid(application). The first part determines whether a given validation rule is applicable to the insurance application data or not. If a rule is applicable, the insurance application must not violate the rule, i.e. isValid(application). Otherwise, the validation rule is ignored.</p>
        <p>Because details of the case are confidential, a generic example is given to provide further illustration of validation rules. The example depicts two validation rules that define maximum sums that can be insured depending on the permissions of the insurance agents. The first validation rule is applicable to all applications created by insurance agents with the highest level of permission. The second validation rule is applicable to all applications that are created by insurance agents with a lower permission level.</p>
        <p>The distinction between the two validation rules is made by the first part of the implication:</p>
        <p>Rule 1: isApplicable(application) : application.agent.permission = highest_level
Rule 2: isApplicable(application) : application.agent.permission ≠ highest_level</p>
        <p>The second part of the implication is used to enforce the maximum insured sum. As an application may consist of several partial contracts, the individual insured sums of all partial contracts are collected first. Afterwards, it is checked whether the total sum exceeds the threshold. While the structure of both rules’ isValid() parts is the same, different values for the maximum_insured_sum constant are used:</p>
        <p>isValid(application) : total_sum = ∑<sub>partial ∈ application</sub> partial.insured_sum ∧ total_sum ≤ maximum_insured_sum</p>
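<p>The generic example can be sketched as follows. The permission levels, maximum insured sums, and error codes are illustrative stand-ins, since the real rules are confidential; validity is expressed via a violates predicate (the negation of isValid).</p>

```python
# Sketch of the generic validation-rule example. Each rule is an implication
# isApplicable(application) => isValid(application); rules are traversed in a
# fixed order, the first violated rule returns its error code, and otherwise
# the subsystem returns SUCCESS. All concrete limits and codes are invented.

def total_sum(application):
    # Intermediate calculation reused by both rules.
    return sum(p["insured_sum"] for p in application["partials"])

RULES = [
    # (is_applicable, violates, error_code) -- violates(a) means "not isValid(a)"
    (lambda a: a["agent_permission"] == "highest_level",
     lambda a: total_sum(a) > 1_000_000,  # maximum_insured_sum for rule 1
     "ERR_SUM_HIGHEST"),
    (lambda a: a["agent_permission"] != "highest_level",
     lambda a: total_sum(a) > 100_000,    # maximum_insured_sum for rule 2
     "ERR_SUM_LOWER"),
]

def validate(application):
    for is_applicable, violates, error_code in RULES:
        if is_applicable(application) and violates(application):
            return error_code  # remaining rules are skipped
    return "SUCCESS"

application = {"agent_permission": "lower_level",
               "partials": [{"insured_sum": 60_000}, {"insured_sum": 70_000}]}
print(validate(application))  # ERR_SUM_LOWER (total 130,000 exceeds 100,000)
```

<p>The skip-on-first-violation loop mirrors the described traversal: a lower-level agent whose partial contracts sum to 130,000 violates the second rule and receives its error code, while applications satisfying all applicable rules yield SUCCESS.</p>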
      </sec>
      <sec id="sec-3-3">
        <p>This example shows that many parameters may be involved in a validation rule, that intermediate calculations may be required, and that intermediate calculations may be reused in different validation rules. Therefore, all validation rules should be tested thoroughly.</p>
        <p>For this case study, we consider the current set of validation rules as correct and treat them as our specification. By browsing the source code repository, we have identified 13 changes that have been made to the validation rules in order to correct them. Each change documents a fault that existed previously but was fixed prior to release. Based on these 13 changes, we reconstructed 13 implementation versions, each of which contains one fault.</p>
        <p>
          The 13 faults can also be classified according to our
robustness fault classification (cf. [
          <xref ref-type="bibr" rid="ref3">7</xref>
          ]). Five faults can only
be detected by invalid test inputs, while eight faults can
be detected by both valid and invalid test inputs. Two of
these five faults can be classified as faults in
error-signaling. To reveal them, invalid test inputs must trigger EH
which responds with an incorrect error code. The other
three faults can be classified as faults in error-detection
conditions. The conditions are too weak and do not detect
invalid test inputs. Hence, the SUT incorrectly continues
with its normal behavior.
        </p>
        <p>The remaining eight faults can be detected by both valid and invalid test inputs. They are faults in error-detection conditions. Four of these faults have conditions that are too strong and therefore incorrectly detect exception occurrences for valid test inputs. The other four faults have characteristics of being too weak and too strong at the same time because wrong parameters with similar characteristics are used in the exception condition. As a consequence, an invalid test input may not violate the condition (too weak) while a valid test input may not satisfy the condition (too strong).</p>
        <sec id="sec-3-3-1">
          <title>4.2. Data Collection Procedure</title>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <p>Data collection refers to the measurement and calculation of metric values from test execution. Therefore, metrics are defined in this section. Furthermore, the modeling of the IPM and RIPM as well as the selection and execution of test inputs are described.</p>
        <sec id="sec-3-4-1">
          <title>4.2.1. Metrics</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <p>The resources available from the software development project are not directly analyzed and compared. Instead, they are used to reconstruct the implementation versions for test execution and to create a RIPM and an IPM that represent variations of insurance application data.</p>
        <p>Based on the RIPM and IPM, test inputs are selected using a CT and a CRT test selection strategy. Then, the test inputs are executed on the 13 reconstructed implementations to assess the effectiveness.</p>
        <p>A common metric to assess the effectiveness is fault detection effectiveness (FDE) [11, 16]. A test suite T is denoted as failing for a test scenario s if at least one of the test inputs t ∈ T detects the fault in s.</p>
        <p>failing(T, s) = 1 if ∃ t ∈ T that fails for s, and 0 otherwise</p>
        <p>Using the failing function, FDE is defined as the ratio between the number of test suites T of a test suite family T* that fail for a test scenario s and the number of all test suites in the family T*. In this case study, the family of test suites contains 20 different variants. In other words, the FDE is based on 20 randomized test suites that all satisfy the same coverage criterion for the same IPM or RIPM. They all test the same test scenario.</p>
        <p>FDE(T*, s) = (∑<sub>T ∈ T*</sub> failing(T, s)) / |T*|</p>
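<p>The failing and FDE definitions, together with the AFDE average introduced next, can be sketched as follows; the suites, the fault, and the detection predicate are purely illustrative.</p>

```python
# Sketch: computing failing(T, s), FDE over a family of suites, and AFDE as the
# mean FDE over several test scenarios. A scenario is represented here by a
# "detects" predicate standing in for executing test inputs against a faulty
# implementation; all data below is illustrative.

def failing(test_suite, detects):
    """1 if at least one test input in the suite detects the fault, else 0."""
    return 1 if any(detects(t) for t in test_suite) else 0

def fde(suite_family, detects):
    """Share of suites in the family that fail for one test scenario."""
    return sum(failing(suite, detects) for suite in suite_family) / len(suite_family)

def afde(suite_family, scenarios):
    """Average FDE over a family of test scenarios (faulty implementations)."""
    return sum(fde(suite_family, detects) for detects in scenarios) / len(scenarios)

# Three randomized suites of numeric "test inputs"; this fault is detected by
# any test input greater than 7.
family = [[1, 8, 3], [2, 4, 6], [9, 5, 1]]
print(fde(family, lambda t: t > 7))  # two of the three suites detect the fault
```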
        <p>Further on, the average fault detection effectiveness (AFDE) denotes the average FDE over a family of test scenarios S*. In our case study, the family of test scenarios S* consists of the 13 reconstructed implementations. The AFDE represents the average effectiveness of CRT and CT equally distributed over the 13 faults.</p>
        <p>AFDE(T*, S*) = (∑<sub>s ∈ S*</sub> FDE(T*, s)) / |S*|</p>
        <sec id="sec-3-5-1">
          <title>4.2.2. Modeling of IPM and RIPM</title>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <p>Since the FDE and AFDE metrics highly depend on the quality of the RIPM and IPM, a systematic modeling approach is necessary. We model the IPM first and later extend it with error-constraints to get a RIPM.</p>
        <p>The IPM is modeled iteratively, one validation rule at a time. In each iteration, parameters and values are added to ensure that test inputs with the following three characteristics can be selected: (1) test inputs that are not applicable; (2) test inputs that are applicable and valid; (3) test inputs that are applicable but not valid. In addition, some exclusion-constraints are introduced to ensure syntactic correctness of selected test inputs. The IPM is considered complete once it contains all parameters and values necessary to satisfy branch coverage of each validation rule.</p>
        <p>For the RIPM, the modeling of additional
error-constraints is required. The error-constraints are modeled
iteratively and we add new or update existing ones until
the separation of valid and strong invalid test inputs
conforms to the responses of the SUT, i.e. the SUT returns
S U C C E S S for each valid test input and the SUT returns an
error code for each strong invalid test input.</p>
        <p>In total, the IPM and RIPM consist of 32 parameters and 106 values. Most parameters have two, three, or four values each. But two parameters have six values each and one parameter even has nine values. Three exclusion-constraints, each of which restricts combinations of two parameters, are required to ensure syntactical correctness of the insurance applications. Furthermore, the RIPM contains 31 error-constraints. 15 error-constraints annotate single values as invalid. The remaining 16 error-constraints annotate schemata with 2, 3, or 5 values.</p>
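<p>These counts can be cross-checked mechanically. The sketch below encodes the grouping of parameters by their number of values, as stated above, and recomputes the totals; it is a verification aid, not part of the study's tooling.</p>

```python
# Sketch: expanding the exponential notation, where v^p denotes p parameters
# with v values each, and recomputing the totals stated in the text.
parameter_groups = {9: 1, 6: 2, 5: 1, 4: 8, 3: 8, 2: 12}  # values -> parameters

num_parameters = sum(parameter_groups.values())
num_values = sum(v * p for v, p in parameter_groups.items())

print(num_parameters)  # 32
print(num_values)      # 106
```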
        <p>The complete IPM and RIPM are described below in exponential notation. For parameters and values, v<sup>p</sup> refers to p parameters with v values each. For exclusion- and error-constraints, v<sup>c</sup> refers to c constraints over v parameters.</p>
        <p>Parameters &amp; Values: 9<sup>1</sup> 6<sup>2</sup> 5<sup>1</sup> 4<sup>8</sup> 3<sup>8</sup> 2<sup>12</sup>
Exclusion-Constraints: 2<sup>3</sup></p>
        <p>Error-Constraints: 5<sup>2</sup> 3<sup>6</sup> 2<sup>8</sup> 1<sup>15</sup></p>
        <sec id="sec-3-6-1">
          <title>4.2.3. Selecting and Executing Test Inputs</title>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>After creating the IPM and RIPM, both models are used to</title>
        <p>select sets of test inputs. Since we compare CRT with CT,
two diferent test selection strategies are used. R O B U S T A one fault are tested to determine which test suite is able
is used to select test inputs for the RIPM and I P O G - C is to detect which fault. The results are discussed in the
used to select test inputs for the IPM. following section.</p>
        <p>To compare the FDE and AFDE of CRT with CT, test
suites that satisfy diferent coverage criteria are used. 5. Results &amp; Discussion
We apply I P O G - C to select test suites that satisfy  -wise
relevant coverage for  ∈ {1, ..., 5} . Furthermore, we ap- In this section, the case study results regarding the
comply R O B U S T A to select test suites that satisfy  -wise valid puted FDE and AFDE values are reported and discussed.
coverage with  ∈ {1, ..., 3} and that satisfy  -wise strong
invalid coverage with  ∈ {0, 1}.</p>
        <p>To reduce the efect of accidental fault detection caused 5.1. Fault Detection Efectiveness
by ordering, the order of parameters and values of the Table 2 lists the FDE values of all test suites families
input parameter models is randomly reordered and 20 applied to all 13 implementations. For better readability,
diferent model variants are used to select test suites for + is used to indicate an FDE value of 1.00. The faults nos.
each coverage criteria. 1 to 8 can all be detected by both valid and invalid test</p>
        <p>Table 1 depicts the average sizes of test suites that inputs, while the faults nos. 9 to 13 can only be detected
satisfy the diferent coverage criteria. Since R O B U S T A en- by invalid test inputs. Again, the shown FDE value is an
compasses two coverage criteria ( -wise valid coverage average value for one test suite family with 20 diferent
and  -wise strong invalid coverage), the test suites are test suites that are created by randomizing the order of
considered both, separately and combined. parameters and values before selecting test inputs. As</p>
        <p>The largest test suite is selected by I P O G - C which is an example, in the first row for fault no. 3, an FDE value
required to satisfy  -wise relevant coverage with  = 5 of 0.05 means that one out of 20 test suites detected the
(15023.70 test inputs). The second-largest test suite is also fault at least once per test suite.
selected by I P O G - C to satisfy  -wise relevant coverage with As can be observed,  -wise relevant coverage is not
 = 4 (2813.45 test inputs). The third-largest test suite able to detect all faults reliably. The FDE values increase
is selected by R O B U S T A and satisfies  -wise valid coverage when testing strength  grows. But even with  = 5
with  = 3 and  -wise strong invalid coverage with  = 1 (15023.70 test inputs), only 7 faults are detected reliably
(2224.30 test inputs). (FDE value of 1.00). Further on, fault no. 10 remains</p>
        <p>When comparing the test suite sizes of  -wise relevant undetected (FDE value of 0) and faults nos. 9 and 13 are
coverage of I P O G - C with  -wise valid coverage of R O B U S T A , only detected by one out of 20 test suites (FDE value of
it can be seen that the error-constraints drastically reduce 0.05).
the number of valid test inputs. The CRT coverage criteria are characterized by
avoid</p>
        <p>After test input selection, the test suites are used to ing the invalid input masking efect. Since all invalid
stimulate the SUT in 13 diferent versions. Therefore, the schemata are excluded by  -wise valid coverage, the faults
13 reconstructed implementations of which each contains
nos. 9 to 13 cannot be detected. But for all other faults, In order to detect all faults reliably, the  -wise strong
 -wise valid coverage has higher FDE values for the same invalid coverage must be selected because faults nos. 9
testing strength  when compared to  -wise relevant cov- to 13 remain undetected otherwise. Either robustness
erage. Because invalid input masking is avoided, a testing interaction ( &gt; 0 ) or the combination of  -wise strong
strength of  = 2 is suficient to detect faults nos. 1 to 8 invalid coverage with  -wise valid coverage is required
reliably (FDE values of 1.00). to reliably detect faults nos. 1 to 8. Even though  = 1</p>
        <p>Using  -wise strong invalid coverage with  = 0 , 11 is only suficient to detect three of the first eight faults
out of 13 faults can already be detected reliably and the reliably, the combination with  -wise strong invalid
two remaining faults have high FDE values of 0.90 and
0.80. The effectiveness of robustness interactions is even
higher, and all faults can be detected reliably with a robustness
interaction strength of 1.</p>
        <p>Four faults that have too strong error detection conditions
and that actually require valid test inputs to be
detected are also reliably detected by t-wise strong
invalid coverage. We could observe that a strong invalid
test input that is expected to violate the error detection
condition of the i-th validation rule is also expected to
satisfy all prior validation rules from 1 to i − 1. Therefore,
strong invalid test inputs can be considered as “partially-valid”
test inputs that are able to accidentally detect faults
that require valid test inputs. This effect is strengthened
by robustness interactions because more test inputs are
selected and more interactions are covered by them.</p>
        <p>ROBUSTA combines t-wise valid coverage and t-wise
strong invalid coverage, and the FDE values show that test
suites for both coverage criteria complement each other.
Since valid and strong invalid test inputs are able to detect
faults nos. 1 to 8, the FDE values are complemented by
the combination of both test suites. For faults nos. 9 to 13,
the FDE values are not complemented by the combination
of both test suites. This is because test suites that only
satisfy t-wise valid coverage cannot detect these faults.
Therefore, the FDE values of the combined test suites are
the same as the FDE values of the test suites that satisfy
t-wise strong invalid coverage. Increasing the coverage
improves the FDE, and all faults can be detected reliably.</p>
        <p>The discussion of the FDE shows which coverage
criteria are appropriate to reliably detect different types of
faults. Next, we discuss the AFDE over all 13 faults.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Average Fault Detection Effectiveness</title>
        <p>Because AFDE values are average values over a set of
faults, AFDE allows making general statements about
both the effectiveness and the efficiency of coverage criteria.
First, we discuss the effectiveness in terms of AFDE
values of different coverage criteria. Therefore, Table 2
lists the AFDE values for test suites that satisfy different
coverage criteria. Afterwards, we discuss the efficiency
in terms of AFDE values in relation to test suite sizes
(listed in Table 1).</p>
        <p>The AFDE values reflect what we discussed before
since they aggregate FDE values. Because of the invalid
input masking effect, test suites that satisfy t-wise relevant
coverage only reach an AFDE value of 0.62.</p>
        <p>In direct comparison, test suites that satisfy t-wise
valid coverage reach a maximum AFDE value of 0.62 as
well. The same AFDE value can be reached because they
prevent invalid input masking. However, the AFDE value
cannot be further improved by increasing the testing
strength because faults nos. 1 to 8 are already detected
reliably and faults nos. 9 to 13 cannot be detected by valid
test inputs. Comparing the two coverage criteria for each
testing strength individually shows that the AFDE value
of t-wise valid coverage is always higher than the AFDE
value of t-wise relevant coverage.</p>
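        <p>The invalid input masking effect discussed above can be made concrete with a small sketch. The two validation rules, parameter names, and error codes below are illustrative assumptions, not taken from the confidential SUT; only the "first error wins" evaluation order matters:</p>
        <preformat>
```python
# Hypothetical validation rules, checked in order ("first error wins").
RULES = [
    ("ERR_QUANTITY", lambda o: o["quantity"] >= 1),  # rule 1
    ("ERR_PRICE",    lambda o: o["price"] >= 0),     # rule 2
]

def validate(order):
    for code, holds in RULES:
        if not holds(order):
            return code
    return "OK"

# Two invalid values at once: rule 1 fires first and masks rule 2,
# so a fault in the handling of rule 2 stays unobserved.
masked = validate({"quantity": 0, "price": -1})    # "ERR_QUANTITY"

# A strong invalid test input targets rule 2 while satisfying rule 1,
# so rule 2's error handling is actually exercised.
isolated = validate({"quantity": 3, "price": -1})  # "ERR_PRICE"
```
        </preformat>
        <p>This is also why a strong invalid test input for the i-th rule must satisfy the rules 1 to i − 1: otherwise an earlier rule masks the rule under test.</p>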
        <p>For  -wise strong invalid coverage, the lowest AFDE of the validation rules and detailed descriptions of the
value is 0.98 (no robustness interactions) which is always faults, are confidential. To improve transparency and
higher than the AFDE values of  -wise relevant and valid reproducibility, we describe the faults and make the
charcoverage. Furthermore,  -wise strong invalid coverage acteristics of the IPM and RIPM explicit.
with robustness interactions has an AFDE value of 1 and To avoid any bias, both the IPM and RIPM are modeled
therefore detects all faults reliably. systematically and share the same set of parameters and</p>
        <p>Overall, the combination of  -wise valid coverage and values. To prevent falsified results due to accidental fault
 -wise strong invalid coverage performs the best and triggering, the orders of parameters and values are
ranalways detects all faults reliably. domized and 20 diferent variants are used in test input</p>
        <p>When putting the AFDE values in relation to test suite selection. All presented FDE values are average values.
sizes, it can be noted that  -wise relevant coverage has Since this is a case study with only one case, it is
difithe worst eficiency as it requires 15023.70 test inputs for cult to generalize the findings [ 10]. Further on, it has to
an AFDE value of 0.62. In contrast,  -wise valid coverage be noted that the archival data of this case study is only a
only requires 48.30 test inputs for an AFDE value of 0.62. snapshot and the ground truth, i.e. the existing and
pre</p>
        <p>The best eficiency is ofered by the combination of viously existing faults, is unknown. Hence, the data can
 -wise valid coverage with  = 1 and  -wise strong invalid be biased towards simpler faults that are easier to detect.
coverage with  = 0 which requires 308.00 test inputs To prevent too far-reaching conclusions, we describe the
for an AFDE value of 1.00. When using an AFDE value characteristics of the SUT and also limit our conclusions
of 0.92 as a lower boundary (12 out of 13 faults),  -wise to similar systems with many validation rules.
strong invalid coverage with  = 0 is suficient and only
requires 301.00 test inputs for an AFDE value of 0.98.</p>
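        <p>The relation between FDE, AFDE, and test suite size can be sketched as follows. The per-fault FDE vector is an assumption that merely reproduces the reported aggregate for t-wise valid coverage (faults nos. 1 to 8 detected reliably, nos. 9 to 13 never); the suite sizes are the reported averages:</p>
        <preformat>
```python
# AFDE as the mean of FDE values over the set of faults (simplified
# reading of the metric defined earlier in the paper).
def afde(fde_values):
    return sum(fde_values) / len(fde_values)

# Assumed per-fault FDEs for t-wise valid coverage.
value = afde([1.0] * 8 + [0.0] * 5)   # 8/13 = 0.615... -> reported 0.62

# Efficiency: reported average test suite size per reached AFDE value.
cost_per_afde = {
    "relevant":       15023.70 / 0.62,
    "valid":             48.30 / 0.62,
    "strong invalid":   301.00 / 0.98,
    "combination":      308.00 / 1.00,
}
```
        </preformat>
        <p>On these numbers, t-wise valid coverage is by far the cheapest per AFDE point, while only the combination reaches an AFDE value of 1.00.</p>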
        <p>This discussion about efficiency is, of course, influenced
by the characteristics of the 13 faults and cannot
be generalized. As a more general statement, however, it can be
observed that t-wise relevant coverage requires more test
inputs to reach a similar AFDE value than t-wise valid
coverage, t-wise strong invalid coverage, or the
combination of both. At the same time, the combination of
t-wise valid coverage and t-wise strong invalid coverage
always has an AFDE value of 1.00 while at most 2224.30
test inputs are used. This finding is also consistent with
our prior experimental evaluation (cf. [7]).</p>
        <p>Therefore, we draw the conclusion that t-wise valid
coverage, t-wise strong invalid coverage, and the
combination of both perform as well as or better than t-wise
relevant coverage in terms of effectiveness and efficiency.</p>
        <p>However, these findings are derived from only one
particular case. Therefore, we do not consider them to hold
for all SUTs, but for SUTs with many validation rules.</p>
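        <p>The t-wise coverage notion underlying these criteria can be illustrated for t = 2 with a toy model of three binary parameters (an assumption for illustration; the industrial IPM is far larger). Relevant, valid, and strong invalid coverage differ only in which value domains such combinations are required over:</p>
        <preformat>
```python
import itertools

# Toy model: three parameters with two values each.
model = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}

def required_pairs(model):
    # All parameter-value pairs that 2-wise (pairwise) coverage must hit.
    pairs = set()
    for (p, vs), (q, ws) in itertools.combinations(model.items(), 2):
        pairs.update((p, v, q, w) for v in vs for w in ws)
    return pairs

def covered_pairs(suite):
    # Pairs actually covered by a given test suite.
    hit = set()
    for test in suite:
        for p, q in itertools.combinations(sorted(test), 2):
            hit.add((p, test[p], q, test[q]))
    return hit

# 6 of the 8 possible test inputs suffice to cover all 12 required
# pairs here (a t-wise selection strategy would aim for even fewer).
suite = [{"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 1, "c": 1},
         {"a": 0, "b": 1, "c": 1}, {"a": 1, "b": 0, "c": 0},
         {"a": 1, "b": 1, "c": 0}, {"a": 0, "b": 0, "c": 1}]
```
        </preformat>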
      </sec>
      <sec id="sec-6">
        <title>6. Threats to Validity</title>
        <p>We compare the effectiveness of CRT using an implementation
of the ROBUSTA test selection strategy with CT
using an implementation of the IPOG-C test selection strategy.
To ensure an unbiased implementation, both implementations
follow the guidelines of Kleine &amp; Simos [20].
Further on, the source code of the test selection strategies
is published as part of the coffee4j open-source test
automation framework1.</p>
        <p>The effectiveness of CRT and CT highly depends on the
IPM and RIPM. Furthermore, the effectiveness depends
on the faults that are considered in this case study.
Unfortunately, details of the case, i.e. the source code
of the validation rules and detailed descriptions of the
faults, are confidential. To improve transparency and
reproducibility, we describe the faults and make the
characteristics of the IPM and RIPM explicit.</p>
        <p>To avoid any bias, both the IPM and RIPM are modeled
systematically and share the same set of parameters and
values. To prevent falsified results due to accidental fault
triggering, the orders of parameters and values are
randomized and 20 different variants are used in test input
selection. All presented FDE values are average values.</p>
        <p>Since this is a case study with only one case, it is difficult
to generalize the findings [10]. Further on, it has to
be noted that the archival data of this case study is only a
snapshot and the ground truth, i.e. the existing and previously
existing faults, is unknown. Hence, the data can
be biased towards simpler faults that are easier to detect.
To prevent too far-reaching conclusions, we describe the
characteristics of the SUT and also limit our conclusions
to similar systems with many validation rules.</p>
      </sec>
      <sec id="sec-7">
        <title>7. Conclusion</title>
        <p>CRT extends CT to generate separate test suites with
valid and strong invalid test inputs in order to avoid input
masking that is caused by EH. Therefore, CRT requires
additional effort to model error-constraints and
introduces additional complexity to test selection strategies
because error-constraints must be considered. This raises
the question about the usefulness of CRT and whether
the avoidance of input masking outweighs the additional
effort and complexity. Until now, only artificial test
scenarios were used to compare CT with CRT, and it remained
unclear if the indicated advantages of CRT can be transferred
to real-world scenarios.</p>
        <p>In this paper, we therefore present the results of a case
study based on a real-world system with 31 validation
rules and 13 previously existing faults. To compare CT
with CRT, we construct an IPM and a RIPM, select test
inputs, and stimulate 13 implementations of the real-world
system, of which each implementation contains one
of the 13 previously existing faults. For the subsequent
discussion, we introduce the FDE and AFDE metrics.</p>
        <p>To summarize the findings of this case study, we
discuss both research questions individually.</p>
        <p>Research Question 1: Our results indicate that the
CRT test method is applicable in real-world test scenarios.
This case study demonstrated that RIPMs with 32
parameters and 31 error-constraints can be constructed.
Further on, the ROBUSTA test selection strategy is capable
of selecting test suites for RIPMs with 32 parameters and
31 error-constraints.</p>
        <p>Research Question 2: The comparison of CRT with
CT is consistent with the findings of our previously
conducted controlled experiment with artificial test scenarios
(cf. [7]). Since the case under analysis has much EH, CRT
performs better than CT in terms of FDE. Further on, it
requires fewer test inputs to achieve better AFDE values
than CT. Therefore, we draw the conclusion that t-wise
valid coverage, t-wise strong invalid coverage, and the
combination of both perform as well as or better than
t-wise relevant coverage in terms of effectiveness and
efficiency. Although the FDE and AFDE values are influenced
by the characteristics of the 13 faults and cannot be
generalized, we do not consider this to hold for all SUTs,
but for SUTs with much EH.</p>
        <p>In future work, we plan to conduct further case studies
to learn more about the FDE of CRT and CT.</p>
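        <p>The separation of valid and strong invalid test suites that CRT introduces can be sketched as follows. The parameter names, valid values, and error values are illustrative assumptions (the industrial IPM and RIPM are confidential):</p>
        <preformat>
```python
import itertools

# Toy model: valid values and error values per parameter (illustrative).
valid = {"quantity": [1, 5], "price": [10, 99], "currency": ["EUR", "USD"]}
invalid = {"quantity": [0], "price": [-1], "currency": ["XXX"]}

def valid_suite():
    # CT-style test inputs over valid values only (exhaustive here for
    # brevity; a t-wise strategy such as IPOG-C would select a subset).
    names = list(valid)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(valid[n] for n in names))]

def strong_invalid_suite():
    # One error value per test input, every other parameter valid, so the
    # targeted error-constraint is the only one that can fire.
    suite = []
    for name, bad_values in invalid.items():
        for bad in bad_values:
            test_input = {n: vs[0] for n, vs in valid.items()}
            test_input[name] = bad
            suite.append(test_input)
    return suite
```
        </preformat>
        <p>Keeping the two suites separate is what avoids the input masking caused by EH: valid test inputs can reach the logic behind the validation rules, and each strong invalid test input exercises exactly one error-constraint.</p>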
      </sec>
      <sec id="sec-3-8">
        <title>1 See https://coffee4j.github.io for more information.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] IEEE, IEEE Standard Glossary of Software Engineering Terminology, IEEE Std 610.12-1990 (1990).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[8] K. Fögen, H. Lichter, An experiment to compare combinatorial testing in the presence of invalid values, in: Proceedings of the Workshop on Quantitative Approaches to Software Quality co-located with 26th Asia-Pacific Software Engineering Conference (APSEC 2019), Putrajaya, Malaysia, December 2, 2019, 2019, pp. 27-36.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[9] B. A. Kitchenham, L. Pickard, S. L. Pfleeger, Case studies for method and tool evaluation, IEEE Softw. 12 (1995) 52-62.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[10] P. Runeson, M. Höst, Guidelines for conducting and reporting case study research in software engineering, Empirical Software Engineering 14 (2009) 131-164.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[11] M. Grindal, J. Offutt, S. F. Andler, Combination testing strategies: a survey, Softw. Test., Verif. Reliab. 15 (2005) 167-199.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[12] C. Nie, H. Leung, The minimal failure-causing schema of combinatorial testing, ACM Trans. Softw. Eng. Methodol. 20 (2011) 15:1-15:38.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[13] L. Yu, Y. Lei, M. N. Borazjany, R. Kacker, D. R. Kuhn, An efficient algorithm for constraint handling in combinatorial test generation, in: Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013, Luxembourg, Luxembourg, March 18-22, 2013, 2013, pp. 242-251.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[14] D. M. Cohen, S. R. Dalal, M. L. Fredman, G. C. Patton, The AETG system: An approach to testing based on combinatorial design, IEEE Trans. Software Eng. 23 (1997) 437-444.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[20] K. Kleine, D. E. Simos, An efficient design and implementation of the in-parameter-order algorithm, Mathematics in Computer Science 12 (2018) 51-67.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>