<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Experiment to Compare Combinatorial Testing in the Presence of Invalid Values</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konrad Fögen</string-name>
          <email>foegen@swc.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Lichter</string-name>
          <email>lichter@swc.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group Software Construction, RWTH Aachen University</institution>
          ,
          <addr-line>Aachen, NRW</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>Robustness is an important property of software that should be thoroughly tested. Combinatorial testing (CT) is an effective black-box test approach. When using it for robustness testing, input masking can prevent faults from being detected. However, the impact is not yet clear. Therefore, we conducted a controlled experiment to understand how input masking affects the fault detection effectiveness of CT and how effective CT is in the presence of error-handling and invalid values.</p>
      </abstract>
      <kwd-group>
        <kwd>Software Testing</kwd>
        <kwd>Combinatorial Testing</kwd>
        <kwd>Robustness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Robustness is an important property of software systems
that describes “the degree to which a system or component
can function correctly” in the presence of external faults like
invalid inputs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. External faults can have a severe impact on
the system’s robustness because they can propagate to system
failures resulting in abnormal behavior or system crashes
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To improve robustness, systems implement error-handling
to appropriately react to external faults [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Oftentimes, the
external fault cannot be resolved by the system internally
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Then, the system is terminated by the error-handling
procedure that returns an error-message to the client without
executing the normal procedure. This is also referred to as the
error-propagation strategy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Unfortunately, error-handling
procedures have a fault density that is up to three times higher
compared with normal procedures [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Therefore, testing is
important to check error-handling.
      </p>
      <p>
        The purpose of testing is to reveal failures by stimulating
a system under test (SUT) with test inputs and observing the
results via test oracles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To reveal a failure, a fault must
be triggered to produce an error and the error must propagate
to a failure of the SUT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Assuming that test oracles reveal
all propagated failures, the important factor in testing is the
selection of test inputs such that the faults are triggered.
      </p>
      <p>
        Combinatorial testing (CT) is a black-box approach for test
input selection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A test model describes the SUT via input
parameters and input values. Using a combination strategy,
test inputs are generated such that they satisfy a combinatorial
coverage criterion like t-wise that is satisfied if all value
combinations of t parameters appear in at least one test input.
      </p>
      <p>
        Combinatorial robustness testing (CRT) is an extension to
CT that incorporates robustness testing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is argued that
CRT is necessary because of the input masking effect [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]–
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: The first invalid value that is evaluated by the SUT
triggers error-handling and the normal control-flow is left.
When the error-propagation strategy is used, the SUT returns
with an error-message and the normal control-flow is not
resumed. Then, the other values and value combinations of
the test input remain untested because they are masked.
      </p>
      <p>
        CRT avoids input masking by separating testing with
valid test inputs, which do not contain any invalid value, from
testing with strong invalid test inputs, which contain exactly one
invalid value. To this end, the test model must be enriched with
additional semantic information about invalid values. Previous
experiments [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have shown that CRT is an effective approach
in the presence of error-handling.
      </p>
      <p>Despite the presence of error-handling, CT can also reveal
failures without requiring additional semantic information.
However, the extent to which the input masking effect
impacts the fault detection effectiveness is as yet unknown.
Therefore, our aim is to answer the following research
question: How effective is CT in triggering faults when
error-handling and invalid values are present? To answer the
research question, we apply the t-factor fault model, derive
influencing factors and conduct a controlled experiment.</p>
      <p>The paper is structured as follows. First, an example is
introduced. Then, Sections III and IV summarize background and
related work. In Section V, the t-factor fault model is applied
to the context of error-handling and factors that influence fault
triggering are identified. Afterwards, the experiment design
is discussed in Section VI and the results are presented in
Section VII. Finally, threats to validity are discussed and
we conclude with a summary of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>II. EXAMPLE</title>
      <p>To illustrate the impact of error-handling, we use a customer
registration service as an example. The example consists of
three validity checks to ensure that the entered data is
valid. Since the service cannot correct the data itself, an
error code is returned to the client asking to correct the data.</p>
      <p>A test model for CT is depicted in Figure 1 with 123
representing some invalid value. Each test input that contains
an invalid value like [Title:123] should yield an error code.</p>
      <p>p<sub>1</sub>: Title, V<sub>1</sub> = {Mr, Mrs, 123}</p>
      <p>p<sub>2</sub>: FamilyName, V<sub>2</sub> = {Miller, Davis, 123}</p>
      <p>p<sub>3</sub>: Address, V<sub>3</sub> = {UK, US, 123}</p>
      <p>Further, assume that the implementation of the service
contains a fault in the validity check for family names that returns
a wrong error code (title instead of name error). It is triggered
whenever an invalid family name, e.g. [FamilyName:123],
is evaluated. An implementation is illustrated in Listing 1 with
INV_xxx string literals representing specific error codes.</p>
      <p>String register(String title, String family, String addr) {
  if (isInvTitle(title)) return INV_TITLE;
  // fault: returns the title error code instead of a name error code
  if (isInvFamilyName(family)) return INV_TITLE;
  if (isInvAddress(addr)) return INV_ADDRESS;
  ...
}</p>
      <p>Listing 1. Example of an Input Validity Check</p>
      <p>A test input like [Title:Mrs, FamilyName:123,
Address:UK] would trigger the fault. In contrast, a test
input like [Title:123, FamilyName:123, Address:UK]
would yield INV_TITLE because of the invalid title, but
it would not trigger the name check fault because of input
masking.</p>
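      <p>The masking effect can be reproduced with a minimal, runnable sketch of Listing 1. The class name RegistrationSketch, the OK result, the helper triggersFault, and the rule that only the value 123 is invalid are hypothetical choices made for illustration; they are not taken from the original implementation.</p>
      <preformat><![CDATA[
```java
// Hypothetical sketch of Listing 1; names and the validity rule are assumptions.
final class RegistrationSketch {
    static final String INV_TITLE = "INV_TITLE";
    static final String INV_ADDRESS = "INV_ADDRESS";
    static final String OK = "OK"; // assumed result of the normal procedure

    // Assumption: "123" is the only invalid value, as in the example test model.
    static boolean isInvalid(String value) {
        return "123".equals(value);
    }

    static String register(String title, String family, String addr) {
        if (isInvalid(title)) return INV_TITLE;
        if (isInvalid(family)) return INV_TITLE; // seeded fault: wrong error code
        if (isInvalid(addr)) return INV_ADDRESS;
        return OK;
    }

    // True if the faulty family-name check is reached and fires,
    // i.e. the title is valid and the family name is invalid.
    static boolean triggersFault(String title, String family) {
        return !isInvalid(title) && isInvalid(family);
    }
}
```
]]></preformat>
      <p>Both [Title:Mrs, FamilyName:123, Address:UK] and [Title:123, FamilyName:123, Address:UK] yield INV_TITLE in this sketch, but only the first one reaches the faulty check; for the second, the result is correct and the fault stays hidden.</p>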
      <p>To satisfy 1-wise coverage, a minimal set of three test
inputs is generated with exactly one test input that
contains [FamilyName:123]. In total, nine test inputs with
[FamilyName:123] exist and the combination strategy must
select one. Of the nine test inputs, three contain [Title:123]
causing input masking. Therefore, the probability of triggering
the fault is 6/9 ≈ 66%.</p>
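      <p>The count behind this probability can be checked by enumeration. The following sketch lists all test inputs that contain [FamilyName:123] for the example test model and counts those with a valid title; the class and method names are hypothetical, and the sketch assumes that the combination strategy picks one of the nine candidates uniformly.</p>
      <preformat><![CDATA[
```java
import java.util.List;

// Hypothetical helper that enumerates the example test model.
final class MaskingProbability {
    static final List<String> TITLES = List.of("Mr", "Mrs", "123");
    static final List<String> ADDRESSES = List.of("UK", "US", "123");
    static final String INVALID = "123";

    // Returns {triggering, total} over all test inputs with FamilyName:123.
    static int[] countTriggering() {
        int total = 0;
        int triggering = 0;
        for (String title : TITLES) {
            for (String address : ADDRESSES) {
                total++;
                // The faulty check is reached only if the title is valid;
                // the address is checked afterwards and does not matter.
                if (!INVALID.equals(title)) {
                    triggering++;
                }
            }
        }
        return new int[] {triggering, total};
    }
}
```
]]></preformat>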
      <p>
        Increasing the testing strength to t &gt; 1 will also increase the
fault triggering probability. However, finding a minimal set of
test inputs that satisfies t-wise coverage for t ≥ 2 is in general
NP-hard [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Heuristics are used as combination strategies
which produce small but not always minimal sets of test inputs.
Then, some t-sized value combinations appear more than once
and the probability cannot be simply calculated.
      </p>
      <p>III. BACKGROUND</p>
      <sec id="sec-2-1">
        <title>A. Combinatorial Testing</title>
        <p>
          CT is a well-known approach to black-box testing where
test inputs are selected based on a test model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The test
model TM describes the input space of a program as a set of n
parameters P = {p<sub>1</sub>, ..., p<sub>n</sub>} and the domain of each parameter
p<sub>i</sub> is a finite nonempty set of m<sub>i</sub> values V<sub>i</sub> = {v<sub>1</sub>, ..., v<sub>m<sub>i</sub></sub>}. A
combination is a set of 0 &lt; d ≤ n parameter-value pairs
(p<sub>i</sub>, v<sub>j</sub>) for d distinct parameters p<sub>i</sub> with v<sub>j</sub> ∈ V<sub>i</sub>. A test input
is a combination of size d = n. Combination a covers another
combination b if every parameter-value pair of b is included
in a, which we denote as b ⊆ a.
        </p>
        <p>
          In CT, coverage criteria and combination strategies depend
on a specific test model TM [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. A coverage criterion C
describes requirements of TM that must be met by a set of
test inputs T and a combination strategy describes how to
select test inputs T such that C is satisfied.
        </p>
        <p>
          The t-wise coverage criterion is a common criterion that is
satisfied if all value combinations of t parameters appear in at
least one test input [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In addition, all smaller combinations
(d &lt; t) are covered and also some larger combinations of size
t′ = (t + k) with k &gt; 0 and t′ ≤ n are covered [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This
so-called collateral coverage can potentially help to trigger
additional faults [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
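        <p>To make the t-wise criterion concrete, the following sketch checks whether a given set of test inputs satisfies t-wise coverage for an unconstrained test model. It is a brute-force illustration under our own assumptions (values as strings, parameters identified by their list index, hypothetical class name TWiseCoverage) and not a combination strategy from the literature.</p>
        <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical brute-force checker for t-wise coverage.
final class TWiseCoverage {
    // values.get(i) holds the values of parameter i; each test input is a
    // list with exactly one value per parameter, in parameter order.
    static boolean satisfies(int t, List<List<String>> values, List<List<String>> tests) {
        return check(t, values, tests, 0, new ArrayList<>(), new ArrayList<>());
    }

    // Picks t distinct parameters and one value for each of them, then
    // requires that some test input covers the resulting combination.
    private static boolean check(int t, List<List<String>> values, List<List<String>> tests,
                                 int from, List<Integer> params, List<String> chosen) {
        if (params.size() == t) {
            return tests.stream().anyMatch(test -> covers(test, params, chosen));
        }
        for (int p = from; p < values.size(); p++) {
            for (String v : values.get(p)) {
                params.add(p);
                chosen.add(v);
                boolean ok = check(t, values, tests, p + 1, params, chosen);
                params.remove(params.size() - 1);
                chosen.remove(chosen.size() - 1);
                if (!ok) {
                    return false;
                }
            }
        }
        return true;
    }

    private static boolean covers(List<String> test, List<Integer> params, List<String> chosen) {
        for (int i = 0; i < params.size(); i++) {
            if (!test.get(params.get(i)).equals(chosen.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```
]]></preformat>
        <p>For the example test model, the three test inputs [Mr, Miller, UK], [Mrs, Davis, US] and [123, 123, 123] satisfy 1-wise coverage but not 2-wise coverage, since a pair such as [Title:Mr, FamilyName:Davis] is not covered.</p>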
        <p>
          The input domains of real-world systems are typically
restricted. As a consequence, certain values or value
combinations are not of any interest or may prevent a test from
being executed. Exclusion-constraints are commonly used to
exclude irrelevant value combinations from test input selection
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Every test input that satisfies the exclusion-constraints is
a relevant test input.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>B. Combinatorial Robustness Testing</title>
        <p>
          CRT is an extension to CT that incorporates robustness
testing [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. It explicitly considers the input masking effect
that is caused by error-handling. To avoid input masking, CRT
separates the generation of valid and invalid test inputs.
        </p>
        <p>
          Therefore, additional semantic information is required to
model invalid values [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The additional information can be
modelled via error-constraints which denote a second set of
constraints [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Then, relevant test inputs are further partitioned
as follows. A relevant test input is a valid test input if it
satisfies all exclusion- and all error-constraints. A relevant test
input is invalid if it satisfies all exclusion-constraints but at
least one error-constraint remains unsatisfied. An invalid test
input is called a strong invalid test input if exactly one
error-constraint is unsatisfied.
        </p>
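        <p>The partitioning of relevant test inputs can be sketched as follows. The sketch assumes that the test input already satisfies all exclusion-constraints and models each error-constraint as a predicate that holds when the constraint is satisfied; the class name, the enum and this predicate encoding are hypothetical and only serve as an illustration.</p>
        <preformat><![CDATA[
```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical classifier for relevant test inputs.
final class InputClassifier {
    enum Kind { VALID, STRONG_INVALID, INVALID }

    // Assumption: testInput is relevant, i.e. it satisfies all
    // exclusion-constraints. Each error-constraint is a predicate that
    // returns true when the test input satisfies it.
    static Kind classify(List<String> testInput,
                         List<Predicate<List<String>>> errorConstraints) {
        long unsatisfied = errorConstraints.stream()
                .filter(constraint -> !constraint.test(testInput))
                .count();
        if (unsatisfied == 0) {
            return Kind.VALID;
        }
        // Exactly one unsatisfied error-constraint makes the input strong invalid.
        return unsatisfied == 1 ? Kind.STRONG_INVALID : Kind.INVALID;
    }
}
```
]]></preformat>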
        <p>
          Then, valid test inputs and invalid test inputs are generated
separately such that they satisfy different coverage criteria.
The valid t-wise coverage criterion is satisfied if each valid
parameter value combination of size t appears in at least one
test input of which all other values and value combinations
are also valid [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In addition, the single error coverage
criterion is satisfied if each invalid value appears in at least
one strong invalid test input of which all other values and
value combinations are valid [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
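        <p>A naive way to satisfy the single error coverage criterion for invalid values is to create one strong invalid test input per invalid value and to fill all other parameters with a valid value. The sketch below does this by always choosing the first valid value; this choice, the class name and the list-based encoding are assumptions for illustration and do not describe how existing CRT tools generate test inputs.</p>
        <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator for strong invalid test inputs.
final class SingleErrorCoverage {
    // valid.get(i) and invalid.get(i) hold the valid and invalid values of
    // parameter i. Assumption: every parameter has at least one valid value
    // and invalidity depends on single values only.
    static List<List<String>> generate(List<List<String>> valid, List<List<String>> invalid) {
        List<List<String>> tests = new ArrayList<>();
        for (int p = 0; p < invalid.size(); p++) {
            for (String bad : invalid.get(p)) {
                List<String> test = new ArrayList<>();
                for (int q = 0; q < valid.size(); q++) {
                    // Exactly one invalid value; all other values are valid.
                    test.add(q == p ? bad : valid.get(q).get(0));
                }
                tests.add(test);
            }
        }
        return tests;
    }
}
```
]]></preformat>
        <p>For the example test model, this yields the three strong invalid test inputs [123, Miller, UK], [Mr, 123, UK] and [Mr, Miller, 123], none of which is subject to input masking.</p>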
        <p>When satisfying the aforementioned coverage criteria, both
normal procedures and error-handling are tested without input
masking. However, in comparison to CT, additional effort is
required to model the error-constraints.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>IV. RELATED WORK</title>
      <p>
        Cohen et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] first described the input masking effect
caused by error-handling [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They also noted the need to
separate valid and invalid test inputs to avoid input masking.
An evaluation of combination strategies by Grindal et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
discussed an example where input masking prevented a fault
from being triggered. A case study by Wojciak and
Tzoref-Brill [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] reported on CT including testing with invalid inputs.
      </p>
      <p>
        The CT tools AETG [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ACTS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and PICT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] allow
individual values to be marked as invalid and also generate separate
sets of test inputs. However, invalid value combinations are
not directly supported.
      </p>
      <p>
        In previous work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], we introduced error-constraints
as a modelling technique that allows both
invalid values and invalid value combinations to be modelled directly. We also
conducted experiments that compared CRT with CT. However, they
focused on configuration-dependent faults where the
error-handling depends on a certain configuration of valid parameter
combinations. Further, we discussed techniques to identify
and explain over-constrained test models [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and discussed
a technique to semi-automatically repair over-constrained test
models [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        In a case study [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], we analyzed bug reports of a software system
for life insurance. Of the 212 analyzed bug reports, 51 describe
robustness faults. Many of them were triggered by invalid
value combinations and we concluded that it is not sufficient
for a CT tool to only consider invalid values.
      </p>
      <p>Despite the conclusion, we only consider invalid values in
this experiment because it allows a clearer separation between
valid and invalid values when extending a given test scenario.</p>
      <p>
        Other empirical studies also compared the efficiency of
CT. A recent study summarizes previous comparisons [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
However, none of these studies focused on error-handling.
      </p>
    </sec>
    <sec id="sec-4">
      <title>V. APPLYING THE T-FACTOR FAULT MODEL IN THE PRESENCE OF ERROR-HANDLING</title>
      <sec id="sec-4-1">
        <title>A. Overview</title>
        <p>
          The idea of the t-wise coverage criterion is based on
the corresponding t-factor fault model which is formally
introduced by Dalal and Mallows [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. In general, a fault
model is a description of hypothesized faults. The t-factor
fault model relies on a transformational model of the SUT
where the output is defined in terms of its input [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. The
input is modelled as a set of parameters p<sub>1</sub>, ..., p<sub>n</sub> with each
parameter p<sub>i</sub> having a domain D<sub>i</sub> consisting of a potentially
infinite number of values.
        </p>
        <p>The faults are defined in terms of the SUT’s input. It
is assumed that faults are caused by the interaction of t
parameters and a t-factor fault is triggered by a combination
of t parameter values. A t-factor fault can be described by
a condition over t parameters which must be satisfied by an
input to the SUT in order to trigger the fault. Each input that
satisfies the condition triggers the t-factor fault.</p>
        <p>
          The t-factor fault model has been studied empirically [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ],
[
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]–[
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] where bug reports for different types of software
are analyzed. An interaction rule is derived from the empirical
findings which states that “only a few factors are involved in
failure-inducing faults in software. Most failures are induced
by single factor faults or by the interaction of two factors;
progressively fewer failures are induced by interactions
between three, four, or more factors. The maximum degree of
interaction in actual faults so far observed is six” [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>Since the t-wise coverage criterion is defined relative to the
test model, the SUT model and test model share the same set
of input parameters. While the domain D<sub>i</sub> of a SUT input
parameter p<sub>i</sub> is potentially infinite, the domain V<sub>i</sub> ⊆ D<sub>i</sub> of a
test model parameter p<sub>i</sub> is a finite nonempty subset of values
that contains all values a tester is interested in.</p>
        <p>
          When testing with a test suite that satisfies the t-wise
coverage criterion, all parameter value combinations of size
t appear in some test input. Testing should fail for each SUT
that contains t′-factor faults with t′ ≤ t if the values of the
test model are selected properly such that the condition of the
t′-factor faults can be satisfied. Therefore, CT is also called
pseudo-exhaustive testing implying that t-wise testing is as
good as exhaustive testing for a particular class of software
with faults of factor t or smaller [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
        </p>
        <p>
          A test input that triggers a t-factor fault contains a
combination c that is a failure-inducing combination (FIC).
Each test input that covers c triggers a fault [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. A
FIC c is minimal (MFIC) if no proper subset c′ ⊂ c triggers
a fault. The size of a MFIC is predetermined by the t-factor
fault and its condition.
        </p>
        <p>When applying the t-factor fault model to faults in the
presence of error-handling, characteristics that affect the capability
of triggering faults can be derived. The characteristics either
affect the t-factor faults or the FICs that trigger them. They
are discussed in the following two subsections.</p>
        <p>Recall the fault of the example implementation (Listing 1)
that is triggered whenever an invalid family name is evaluated.
To describe it as a t-factor fault, the error-handling must be
taken into account. It is not a 1-factor fault because not every
input that satisfies isInvFamilyName(family) triggers the
fault. Consequently, [FamilyName:123] is not a FIC.
Triggering the fault requires a valid title because the error-handling
for isInvTitle(title) terminates the SUT otherwise. Therefore, the
fault can be modelled as a 2-factor fault using a
conjunction over title and family: ¬(isInvTitle(title)) ∧
(isInvFamilyName(family)).</p>
        <p>For the given test model, the combinations [Title:Mr,
FamilyName:123] and [Title:Mrs, FamilyName:123]
are the minimal failure-inducing combinations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>B. Characteristics affecting Size of t-Factor Faults</title>
      </sec>
      <sec id="sec-4-3">
        <title>1) Number of Parameters involved in Error-Handling:</title>
        <p>
          In the presence of error-handling, the condition to trigger
a fault can be formulated as a conjunction of two
sub-conditions. First, the location of an incorrect error-handler
must be reached [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] by ensuring that no prior error-handler
terminates the SUT. We denote this as the prevention
sub-condition. Second, an invalid value must cause an
error-handler to produce an incorrect program state (infection [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ])
that can propagate to a failure. We denote this as the infection
sub-condition. For the example, ¬(isInvTitle(title)) is
used for prevention and (isInvFamilyName(family)) is
used for infection.
        </p>
        <p>The size of a t-factor fault increases with the number of
parameters involved in prevention and infection sub-conditions.
To guarantee that a t-factor fault is triggered, the test input set
must satisfy t-wise coverage of the same size.</p>
      </sec>
      <sec id="sec-4-4">
        <title>2) Priority of Error-Handlers:</title>
        <p>Each error-handler with an
earlier position in the control-flow has the potential to
terminate the SUT before the incorrect error-handler is reached.
Therefore, each prior error-handler extends the prevention
sub-condition.</p>
        <p>A fault in the first error-handler of the example would be
modelled by an empty prevention sub-condition. In contrast, a
fault in the third error-handler would be modelled by a
prevention sub-condition that includes both prior error-handlers, e.g.
¬(isInvTitle(title) ∨ isInvFamilyName(family)).</p>
        <p>However, the modelled fault depends on a specific
implementation whereas CT is a black-box approach which depends
on the SUT’s specification instead. While testing with 2-wise
coverage is sufficient to detect a fault in error-handling that
can be modelled as a 2-factor fault, it requires the location of
the incorrect error-handler to be known beforehand in order
to determine the appropriate testing strength of t = 2.</p>
        <p>The specification typically does not impose a specific order
of error-handling. For instance, an implementation that checks
the validity of the address first is as correct as the
implementation shown in Listing 1 where the address is checked
last. Then, the fault in the validity check of the family name
becomes a 3-factor fault when both the address check and the
title check precede it.</p>
        <p>To make the determination of testing strength t independent
from the location of an incorrect validity check within the
control-flow, all error-handlers in every possible order must
be taken into account. Then, t would grow with the number
of parameters checked by error-handlers.</p>
        <p>Using a testing strength of t = 3 ensures that the fault
is triggered for all possible orderings of the error-handlers.
Thereby, the prior knowledge about the incorrect error-handler
is also abandoned. By deriving the testing strength from all
parameters that are involved in any error-detection condition,
it is ensured that each error-handler is reached and potential
faults are triggered.</p>
        <p>While this testing strength denotes the lower limit to ensure
that potential faults are triggered independently from the
ordering of error-handlers, testing is still conducted for a
specific implementation with a specific ordering. Therefore,
we distinguish the effective prevention sub-condition which
is sufficient for a specific implementation from the general
prevention sub-condition that is sufficient for all orderings. On
average, the effective prevention sub-condition considers fewer
error-handlers which improves the likelihood of triggering a
fault when using a testing strength that is not sufficient for the
general prevention sub-condition.</p>
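        <p>The difference between the effective and the general prevention sub-condition can be expressed as a small calculation. The sketch below assumes one error-handler per parameter and that each error-handler checks exactly one parameter; the class name FaultSize and the representation of an implementation as an ordered list of parameter names are hypothetical.</p>
        <preformat><![CDATA[
```java
import java.util.List;

// Hypothetical helper that derives the size t of a fault in error-handling.
final class FaultSize {
    // Effective size for one specific implementation: the parameter of the
    // faulty error-handler (infection) plus every parameter that is checked
    // before it (prevention).
    static int effectiveSize(List<String> handlerOrder, String faultyParameter) {
        int index = handlerOrder.indexOf(faultyParameter);
        if (index < 0) {
            throw new IllegalArgumentException("no error-handler for " + faultyParameter);
        }
        return index + 1;
    }

    // General size: sufficient for every possible ordering of the
    // error-handlers, so all checked parameters are involved.
    static int generalSize(List<String> handlerOrder) {
        return handlerOrder.size();
    }
}
```
]]></preformat>
        <p>For the ordering of Listing 1 (Title, FamilyName, Address), the fault in the family name check has an effective size of 2. For an ordering that checks Address and Title before FamilyName, the effective size is 3, which equals the general size.</p>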
        <p>C. Characteristics affecting the Number of Minimal
Failure-inducing Combinations</p>
        <p>1) Number of Valid Values: Given a test model and a
t-factor fault f, a parameter p is involved if the
condition that describes f includes p. Otherwise, p is not
involved. For the example, parameters Title and FamilyName
are involved in the condition ¬(isInvTitle(title)) ∧
(isInvFamilyName(family)) while parameter Address is
not involved.</p>
        <p>In the example, two MFICs of size t = 2 exist that trigger
the same fault. A test suite that satisfies 2-wise coverage
includes each MFIC at least once and guarantees that the
fault is triggered. Since two MFICs trigger the same fault,
the probability of selecting one of them is increased when
testing with (t′ &lt; 2)-wise collateral coverage. A set of test
inputs that only satisfies 1-wise coverage has a probability
of at least 2/3 ≈ 66% to trigger the fault, i.e. at least one test
input covers [FamilyName:123] and there is a 2/3 chance that
[Title:123] is not covered by the same test input.</p>
        <p>The effect of values can be distinguished depending on
whether or not the value’s parameter is involved in the
infection or prevention sub-condition.</p>
        <p>A valid value or valid value combination that is involved
in the infection sub-condition does not affect the
failure-inducing combinations because satisfying the infection
sub-condition and being valid are mutually exclusive. For instance,
adding another valid family name [FamilyName:Smith] to
the example test model does not affect the FICs because it
cannot satisfy (isInvFamilyName(family)).</p>
        <p>However, valid values and valid value combinations of
parameters that are involved in the prevention sub-condition increase
the number of MFICs. For instance, adding another valid
value [Title:Sir] to the example results in another MFIC
[Title:Sir, FamilyName:123]. For 1-wise testing, the
probability to trigger the fault increases to 3/4 = 75%.</p>
        <p>Valid values and valid value combinations of parameters that
are not involved in the prevention and infection sub-condition
of a t-factor fault do not directly affect the MFICs. However, they
contribute to the set of t-sized parameter value combinations
that must be covered by some test input to satisfy t-wise
coverage.</p>
        <p>For our example, nine test inputs can be created that cover
[FamilyName:123], i.e. {Mr, Mrs, 123} × {UK, US, 123}.
Since three of them cover [Title:123] and do not
satisfy the effective prevention sub-condition, the probability of
selecting a test input that covers a MFIC is 6/9 ≈ 66%.
When another valid address is added, 12 test inputs that cover
[FamilyName:123] can be created and four of them do not
satisfy the effective prevention sub-condition. The probability
of triggering the fault remains the same (8/12 ≈ 66%).</p>
        <p>It has to be noted that additional values may affect the
overall selection of test inputs. For instance, a combination strategy
may not find a small test suite for the given values, or another
reason that is inherent to the combination strategy may increase
or decrease redundancy and thus affect the fault triggering
probability. These effects are beyond the scope of this paper.</p>
        <p>2) Number of Invalid Values: Invalid values of parameters
that are involved in the condition of a t-factor fault f can
increase or decrease the probability of triggering f. It depends
on whether the parameters are involved in the prevention or
infection sub-condition.</p>
        <p>When the parameter of the invalid value is involved in
the prevention sub-condition, the probability of input masking
is increased. For the example, adding another invalid value
[Title:456] decreases the probability of selecting a test
input that covers one of the two MFICs to 2/4 = 50%.</p>
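        <p>The probabilities stated so far for 1-wise testing (2/3, 3/4 and 2/4) follow one simple ratio, which the following sketch makes explicit. It assumes that exactly one test input covers the fault-triggering invalid family name, that the title is the only parameter in the effective prevention sub-condition, and that the title value of this test input is chosen uniformly; the class and method names are hypothetical.</p>
        <preformat><![CDATA[
```java
// Hypothetical helper for the 1-wise fault triggering probability.
final class TriggerProbability {
    // Probability that the single test input covering the invalid family
    // name carries a valid title, assuming a uniform choice among all
    // title values of the test model.
    static double oneWise(int validTitles, int invalidTitles) {
        return (double) validTitles / (validTitles + invalidTitles);
    }
}
```
]]></preformat>
        <p>Under these assumptions, the original example gives oneWise(2, 1), adding [Title:Sir] gives oneWise(3, 1), and adding [Title:456] instead gives oneWise(2, 2).</p>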
        <p>When the parameter of the invalid value is involved
in the infection sub-condition, the probability of input
masking is decreased. For instance, adding another invalid
value [FamilyName:456] that triggers the same fault as
[FamilyName:123] adds two more MFICs [Title:Mr,
FamilyName:456] and [Title:Mrs, FamilyName:456]
to the example.</p>
        <p>Invalid values of parameters that are not involved in
t-factor fault f only exist for effective prevention sub-conditions
because general prevention sub-conditions consider all
error-handlers. As an example, consider an additional invalid address
[Address:456]. From the perspective of general prevention
sub-conditions, it can be treated like the invalid values above that
increase the probability of input masking. From the perspective
of effective prevention sub-conditions, the invalid value does
not affect the MFICs because its evaluation would happen
after the evaluation of the incorrect error-handler.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>VI. EXPERIMENT DESIGN</title>
      <sec id="sec-5-1">
        <title>A. Test Scenarios</title>
        <p>The objective of our experiment is to evaluate the
effectiveness of CT when generating test inputs in the presence
of error-handling. In particular, we want to determine in which
cases CT is sufficient so that CRT and its additional
effort can be avoided. Therefore, we generated test inputs with
different characteristics and executed them in test scenarios.
The source code and the results of the experiment are available
at our companion website<sup>1</sup>.</p>
        <p>For an experiment, it is important to control the factors that
can influence the results. Therefore, we designed artificial test
scenarios and changed different characteristics in a controlled
and traceable way. Each test scenario contains exactly one
faulty error-handler which always propagates to a failure when
triggered and the failure is always revealed by the test oracle.
Thus, the selection of test inputs is the only factor that
influences the results of test execution.</p>
        <p>The implementation of a test scenario is illustrated in
Listing 2. A test scenario accepts input values for a number of
parameters and includes a sequence of error-handlers, with one
error-handler per parameter implemented by an if
statement. The result of a test scenario execution is INV_INPUT
if an error-handler correctly identifies an invalid value and
terminates the execution. If an invalid value is identified by
the faulty error-handler, NULL is returned instead.
VAL_INPUT is returned if all error-handlers are passed because
all values are valid.</p>
        <p>String checkInput(Object a, Object b, Object c) {
  if (isInvalid(a)) return INV_INPUT;
  else if (isInvalid(b)) return NULL; // faulty error-handler
  else if (isInvalid(c)) return INV_INPUT;
  else return VAL_INPUT;
}</p>
        <p>Listing 2. Illustration of a Test Scenario Implementation</p>
        <p><sup>1</sup> https://github.com/coffee4j/quasoq-2019</p>
        <p>Based on the application of the t-factor fault model, we
describe each test scenario S in terms of (1) the number of
parameters, (2) the number of parameters that are involved
in error-handling, (3) the number of valid values, (4) the
number of invalid values per parameter, and (5) the index of
the position of the incorrect error-handler that contains a fault.</p>
        <p>The illustrated test scenario uses 3 parameters where the
second error-handler (index i = 1) is incorrect. The number of
valid and invalid values per parameter is implicitly encoded
by the test model that is used to generate test inputs.</p>
        <p>By varying the index of the incorrect error-handler, three
different test scenarios for our example can be created, where
either the first, second or third error-handler is incorrect. A set
of test scenarios that share the same parameters and values
but differ in the index of the incorrect error-handler is called
a test scenario family $\mathcal{S}$.</p>
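        <p>As a minimal sketch of how such a family could be modelled, a test scenario can be parameterized by the index of its incorrect error-handler. The class TestScenario, its method names and the encoding of a test input as an array of valid/invalid flags are illustrative assumptions and are not taken from the companion repository.</p>
        <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a test scenario: a chain of error-handlers, exactly one of which is incorrect. */
final class TestScenario {
    static final String INV_INPUT = "INV_INPUT";
    static final String VAL_INPUT = "VAL_INPUT";

    private final int handlers;     // number of parameters checked by an error-handler
    private final int faultyIndex;  // zero-based index of the incorrect error-handler

    TestScenario(int handlers, int faultyIndex) {
        if (faultyIndex < 0 || faultyIndex >= handlers) {
            throw new IllegalArgumentException("faultyIndex out of range");
        }
        this.handlers = handlers;
        this.faultyIndex = faultyIndex;
    }

    int faultyIndex() {
        return faultyIndex;
    }

    /**
     * invalid[i] is true if the value chosen for parameter i is invalid.
     * Returns null (the observable failure) when the incorrect handler fires.
     */
    String checkInput(boolean[] invalid) {
        for (int i = 0; i < handlers; i++) {
            if (invalid[i]) {
                // The first handler that sees an invalid value terminates the execution.
                return i == faultyIndex ? null : INV_INPUT;
            }
        }
        return VAL_INPUT;
    }

    /** Builds a family: same parameters, one scenario per possible index of the incorrect handler. */
    static List<TestScenario> family(int handlers) {
        List<TestScenario> family = new ArrayList<>();
        for (int i = 0; i < handlers; i++) {
            family.add(new TestScenario(handlers, i));
        }
        return family;
    }
}
```
]]></preformat>
        <p>Under these assumptions, TestScenario.family(3) yields the three scenarios of the running example, and the scenario with index 1 behaves like the method in Listing 2.</p>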
        <p>As a notation, we use P-V-I-E where P refers to the
number of parameters, V refers to
the number of valid values per parameter, I refers to the
number of invalid values per parameter, and E refers to the
number of parameters that are involved in error-handling.</p>
        <p>The experiment starts with a root test scenario family
6-1-1-6 representing a simple application of CT. The root
test scenario family is then extended as follows. The number
of parameters is increased in steps of six up to 30 parameters. The
number of error-handlers in a test scenario family either
remains at six or is equal to the number of parameters. The
total number of values per test scenario family is
extended up to six with one to five valid and invalid values.</p>
      </sec>
      <sec id="sec-5-2">
        <title>B. Test Input Generation</title>
        <p>
          The test models used in this experiment share the same set
of parameters with the test scenario and define the number of
valid and invalid values per parameter. To avoid any bias, we
use test suites from the NIST Covering Array Tables<sup>2</sup>. They
are publicly available and contain many of the smallest known
test suites [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. Table I depicts the sizes of the test suites.
Column P refers to the number of parameters and column
V refers to the total number of values per parameter. The test
suites are reused for different ratios of valid and invalid values.
        </p>
        <p><sup>2</sup> https://math.nist.gov/coveringarrays/</p>
        <p>\[ \mathrm{AFDE}(\mathcal{T}, \mathcal{S}) = \frac{\sum_{S \in \mathcal{S}} \mathrm{FDE}(\mathcal{T}, S)}{|\mathcal{S}|} \tag{3} \]</p>
        <p>The order of parameters and values in a test model has
no impact on whether or not a generated test suite satisfies
t-wise coverage. However, it has an impact on which t-wise
parameter value combinations are combined in a single test
input. To reduce the effect of accidental fault triggering that is
caused by ordering, the parameters and values of a test suite
are randomly reordered and 100 different variants of each test
suite are generated. The set of all test suite variants is called
a test suite family $\mathcal{T}$.</p>
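        <p>The reordering step can be sketched as follows. This is only an illustration under stated assumptions: a test suite is represented as an integer matrix with one row per test input and one column per parameter, every parameter has the same number of values, and the class and method names are hypothetical. Permuting columns and relabelling the values within each column does not change which testing strength the suite satisfies.</p>
        <preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that derives randomized variants of a covering-array style test suite. */
final class TestSuiteVariants {

    /** Returns a random permutation of 0..n-1. */
    static int[] randomPermutation(int n, Random rnd) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rnd);
        int[] permutation = new int[n];
        for (int i = 0; i < n; i++) {
            permutation[i] = indices.get(i);
        }
        return permutation;
    }

    /** suite[row][col] holds a value index in 0..valuesPerParameter-1. */
    static int[][] variant(int[][] suite, int valuesPerParameter, Random rnd) {
        int columns = suite[0].length;
        int[] columnPermutation = randomPermutation(columns, rnd);
        int[][] valuePermutations = new int[columns][];
        for (int c = 0; c < columns; c++) {
            valuePermutations[c] = randomPermutation(valuesPerParameter, rnd);
        }
        int[][] result = new int[suite.length][columns];
        for (int r = 0; r < suite.length; r++) {
            for (int c = 0; c < columns; c++) {
                int sourceColumn = columnPermutation[c];               // reorder parameters
                int originalValue = suite[r][sourceColumn];
                result[r][c] = valuePermutations[c][originalValue];    // reorder values
            }
        }
        return result;
    }

    /** Generates a family of the given size; the seed makes the sketch reproducible. */
    static List<int[][]> family(int[][] suite, int valuesPerParameter, int size, long seed) {
        Random rnd = new Random(seed);
        List<int[][]> family = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            family.add(variant(suite, valuesPerParameter, rnd));
        }
        return family;
    }
}
```
]]></preformat>
        <p>For example, family(suite, 2, 100, seed) would produce 100 variants of a suite over binary parameters, mirroring the family size used in the experiment.</p>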
        <p>The testing strengths used in the experiment range from
t = 1 to t = 5 because, according to the interaction rule,
most failures are induced by interactions of strengths within this range.</p>
      </sec>
      <sec id="sec-5-3">
        <title>C. Evaluation Metrics</title>
        <p>
          A common metric to evaluate combination strategies is
called Fault Detection Effectiveness (FDE) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>A test suite T is denoted as failing for a test scenario S if
at least one of the test inputs $\tau \in T$ triggers the t-factor fault
$f \in S$ and the test suite consequently fails.</p>
        <p>\[ \mathit{failing}(T, S) = \begin{cases} 1 &amp; \text{if } \exists\, \tau \in T \text{ that fails for } S \\ 0 &amp; \text{otherwise} \end{cases} \tag{1} \]</p>
        <p>Using the failing function, FDE is defined as the ratio
between the number of test suites $T \in \mathcal{T}$ of a family that
fail for a test scenario S and the number of all test suites in the
family $\mathcal{T}$ that is used to test S.</p>
        <p>\[ \mathrm{FDE}(\mathcal{T}, S) = \frac{\sum_{T \in \mathcal{T}} \mathit{failing}(T, S)}{|\mathcal{T}|} \tag{2} \]</p>
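        <p>The two definitions can be sketched in code as follows. The sketch assumes that a test input is encoded as an array of flags (true for an invalid value) and that a test scenario is identified by the zero-based index of its incorrect error-handler; the class FaultDetection and its methods are hypothetical names introduced for illustration only.</p>
        <preformat><![CDATA[
```java
import java.util.List;

/** Hypothetical computation of the failing function and the FDE for one test scenario. */
final class FaultDetection {

    /** True if the input reaches the incorrect handler (all earlier parameters valid) and triggers it. */
    static boolean fails(boolean[] invalid, int faultyIndex) {
        for (int i = 0; i < faultyIndex; i++) {
            if (invalid[i]) {
                return false; // masked by an earlier, correct error-handler
            }
        }
        return invalid[faultyIndex];
    }

    /** failing(T, S): 1 if at least one test input of the suite fails for the scenario, otherwise 0. */
    static int failing(List<boolean[]> suite, int faultyIndex) {
        for (boolean[] input : suite) {
            if (fails(input, faultyIndex)) {
                return 1;
            }
        }
        return 0;
    }

    /** FDE: fraction of the test suites in the family that fail for the scenario. */
    static double fde(List<List<boolean[]>> family, int faultyIndex) {
        int failingSuites = 0;
        for (List<boolean[]> suite : family) {
            failingSuites += failing(suite, faultyIndex);
        }
        return (double) failingSuites / family.size();
    }
}
```
]]></preformat>
        <p>The early return in fails reflects input masking: an invalid value at a lower index terminates the execution before the incorrect error-handler is evaluated.</p>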
        <p>In other words, the FDE is based on randomized variants
of a test suite that all satisfy the same testing strength. They
all are used to test the same test scenario S which has a
fixed incorrect error-handler. While this metric can be used
to identify characteristics that may influence the FDE, the
information cannot be used in practice because one must know
which error-handler is incorrect.</p>
        <p>Therefore, we introduce the average fault detection
effectiveness (AFDE), which is the average FDE over a family of
test scenarios $\mathcal{S}$. Thus, AFDE represents the effectiveness for
a test scenario family when knowing that one error-handler is
incorrect but without knowing its index.</p>
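        <p>Read as code, the AFDE is the arithmetic mean of the FDE values obtained for each possible index of the incorrect error-handler. The following sketch assumes that these per-index FDE values have already been computed; the class name and the sample values mentioned below are illustrative and are not results from the experiment.</p>
        <preformat><![CDATA[
```java
/** Hypothetical computation of the AFDE from per-scenario FDE values. */
final class AverageFaultDetection {

    /** fdePerIndex[i] is the FDE for the scenario whose incorrect error-handler has index i. */
    static double afde(double[] fdePerIndex) {
        if (fdePerIndex.length == 0) {
            throw new IllegalArgumentException("empty test scenario family");
        }
        double sum = 0.0;
        for (double fde : fdePerIndex) {
            sum += fde;
        }
        return sum / fdePerIndex.length;
    }
}
```
]]></preformat>
        <p>With made-up FDE values of 1.0, 0.5 and 0.0 for a family of three scenarios, afde returns 0.5.</p>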
      </sec>
    </sec>
    <sec id="sec-6">
      <title>VII. RESULTS &amp; DISCUSSION</title>
      <sec id="sec-6-1">
        <title>A. Overview</title>
        <p>The overall results of the experiment are consistent with
the application of the t-factor fault model. Whenever the
first error-handler is incorrect, the prevention sub-condition
is empty and one parameter is involved in the infection
sub-condition. The resulting 1-factor fault is triggered in each test
scenario by all test suites that satisfy the testing strength t = 1.</p>
        <p>Whenever an error-handler at a higher index is incorrect,
the prevention sub-condition includes the parameters checked
by all error-handlers with lower indices. Since one parameter
is involved in the infection sub-condition, the faults can be
described as (index + 1)-factor faults, where the index of the
first error-handler is 0. For all considered testing strengths
(1 ≤ t ≤ 5), all (index + 1)-factor faults are triggered by all test
suites that satisfy the corresponding testing strength.</p>
        <p>Beyond that, collateral coverage causes higher (index + 1)-factor
faults to be repeatedly triggered by test suites that satisfy
lower testing strengths.</p>
      </sec>
      <sec id="sec-6-2">
        <title>B. Fault Detection Effectiveness</title>
        <p>Table II depicts an excerpt of FDEs computed for test
scenario families with six to 30 parameters consisting of two
valid values and one invalid value. Each test scenario family
contains six error-handlers. The I column denotes the index
of the incorrect error-handler, t denotes the testing strength
that is satisfied by the family of test suites and the remaining
columns denote the computed FDE values.</p>
        <p>According to the t-factor fault model, a testing strength of
t = 2 is required to guarantee that an incorrect error-handler
with index = 1 is detected. Among all depicted test scenario
families, the fault is on average also triggered by 67.6% of the test
suites in families that only satisfy t = 1. The exact numbers are
depicted in the second row of Table II.</p>
        <p>Testing with higher testing strengths is even more effective.
On average, test suite families that satisfy a testing strength
of t = 2 detect incorrect error-handlers at index = 2 in 99.2%
of all cases. As expected, a family of test suites that satisfies
t = 3 always detects the incorrect error-handlers at index = 2.
However, incorrect error-handlers at indices 3 and 4 are always
detected as well. The incorrect error-handlers at index 5 are
detected by 98.6% of all test suite families.</p>
        <p>When increasing the number of parameters while the
number of error-handlers remains at six, the data indicates no
general trend for the FDE. In many cases, the FDE improves
slightly, although there are cases where the FDE deteriorates.
For instance, the FDE for detecting an incorrect error-handler
at index 1 with a test suite family that satisfies only testing
strength t = 1 improves from 64% for the test scenario family
with six parameters to 74% for the test scenario family with
12 parameters, but it deteriorates to 65% for the test scenario
family with 18 parameters.</p>
        <p>Table III depicts the FDE for test scenario families with
six to 30 parameters and an equal number of error-handlers.
Increasing the number of error-handlers such that one
error-handler exists for each parameter has no direct impact on the
FDE. Over all indices, testing strengths and parameter sizes,
the difference between the FDE for six error-handlers and
the FDE for |P| error-handlers is 1.25 percentage points on
average. For instance, the difference for I = 1, t = 1 and 12
parameters is 11 percentage points with an FDE of 74% for
6 error-handlers (Table II) and an FDE of 63% for 12
error-handlers (Table III).</p>
        <p>The biggest deviations are observed for test scenario families
with 30 parameters. For I = 1 and t = 1, the difference
between the FDE for six error-handlers (57%) and the FDE
for 30 error-handlers (42%) is 15 percentage points. For I = 5
and t = 2, the difference between the FDE for six
error-handlers (40%) and the FDE for 30 error-handlers (60%) is
-20 percentage points.</p>
        <p>Analysing higher indices of incorrect error-handlers (I &gt; 5)
shows that the testing strength required to detect a fault
increases as well, although it grows more slowly
than for lower indices (I ≤ 5). For instance, t = 3 is
sufficient to detect incorrect error-handlers at index I = 4 in
100% of all cases and it is almost sufficient (98.6% on average)
to detect all incorrect error-handlers at index I = 5. For I = 7,
the average FDE decreases to 85.5%. For I = 17, only 2% of
all incorrect error-handlers are detected. In comparison, t = 4
is sufficient to detect an incorrect error-handler at index I = 7
in 100% of all cases. Likewise, t = 5 is sufficient to detect a fault for
index I = 10 in 100% of all cases and even a fault for index
I = 11 is detected in 99.75% of all cases. Afterwards, the FDE
for t = 5 decreases to an average of 39.3% for index I = 17
and to 4% for I = 23.</p>
        <p>Overall, the findings are consistent with the application of
the t-factor fault model. Neither changing the number of parameters of
a test scenario family nor changing the total number of error-handlers
has a clear effect on the FDE. In contrast,
changing the index of the incorrect
error-handler has the biggest effect on the FDE.</p>
        <p>Following the t-factor fault model, a testing strength t
guarantees the detection of incorrect error-handlers up to index I =
t − 1 because one parameter is involved in the infection
sub-condition. Beyond that, the collateral coverage effect of CT
causes a reliable detection of incorrect error-handlers with
higher indices. However, for index I = 11 and higher, even
t = 5 is not sufficient to detect all incorrect error-handlers for the
test scenario families depicted in Table III.</p>
        <p>To use the knowledge acquired from the FDE metric,
the index of the incorrect error-handler must be known,
which makes it hard to use the information in practice. In
contrast, the AFDE values allow conclusions to be drawn regarding
the effectiveness of CT assuming that one error-handler is
incorrect when only the number of checked parameters as well
as the number of valid and invalid values is known. Therefore,
all further analyses are based on the AFDE metric.</p>
      </sec>
      <sec id="sec-6-3">
        <title>C. Average Fault Detection Effectiveness</title>
        <p>Table IV and Table V list the AFDE values for all test suite
families and the testing strengths from t = 1 to t = 5. Column
P denotes the number of parameters, column E denotes the
number of error-handlers, and column t denotes the testing
strength. Avg. depicts the average AFDE among all testing
strengths. The remaining columns follow the pattern V-I with
V describing the number of valid values and I describing the
number of invalid values. Table IV contains the AFDE values
for test scenario families with six error-handlers. In Table V,
the number of error-handlers equals the number of parameters.</p>
        <p>As discussed in the prior subsection, increasing only the number
of parameters has no clear effect on the FDE. The AFDE
metric reflects this as depicted by Table IV.</p>
        <p>Increasing the total number of error-handlers has no impact
on the FDE of existing error-handling indices. However, the FDE
for the additional indices is worse because more parameters
belong to the prevention sub-condition. Since AFDE
represents the average FDE among all indices of incorrect
error-handlers, the worse FDE of additional error-handlers decreases
the AFDE value. This is depicted in Table V.</p>
        <p>Changing the number of values per parameter has a great
effect on the AFDE. Depending on whether the additional
values are valid or invalid, the AFDE increases or decreases.
The AFDE always improves when adding only valid values.
For the largest test scenario P = 30 and E = 30 with four and
five valid values, the testing strengths t = 4 and t = 5 are
sufficient to detect an incorrect error-handler almost always.
However, testing with t = 5 can be perceived as impractical
as it would require the execution of 23369 (4-1) and 58468
(5-1) test cases (see Table I).</p>
        <p>This result is also consistent with the characteristics of
valid values when applying the t-factor fault model. The FDE
is improved because additional MFICs are created by the
additional valid values.</p>
        <p>Apart from the two favorable cases with (4-1) and (5-1), t = 5
is not sufficient to detect all incorrect error-handlers reliably.
For 2 valid values (2-1) and E = 18, the AFDE is 90.17%.
A higher testing strength would be necessary, which further
increases the time for test input generation and the time for
test execution due to the larger test suite size.</p>
        <p>There is also a trend indicating that the AFDE decreases
when adding only invalid values. For instance, the average
AFDE for P = 30 and E = 6 decreases from 76.1% for one
valid and one invalid value (1-1) to 70.9% for five invalid
values (1-5). However, this trend is not as clear as for valid
values. One exception exists for P = 6 and E = 6 where
the average AFDE improves by 0.57 (1-3) and 0.1 (1-5)
percentage points.</p>
        <p>Nevertheless, the trend is also consistent with the characteristics of
invalid values when applying the t-factor fault model. The
parameter of one invalid value belongs to the infection
sub-condition and improves the probability of triggering the fault.
All other invalid values either reduce the probability
because their respective parameter belongs to the effective
prevention sub-condition or have no effect.</p>
        <p>When adding valid and invalid values equally (2-2 and
3-3), the AFDE always increases in comparison to 1-1. The
AFDE is also always higher when comparing it to test scenario
families with more invalid than valid values. However, the AFDE
is always lower compared to test scenario families with more
valid than invalid values.</p>
        <p>This finding is still consistent with the t-factor fault model,
although it cannot be directly derived from it. The numbers
indicate that the additional MFICs introduced by additional
valid values have a stronger effect on the AFDE than the
prevention sub-conditions enlarged by additional invalid values.</p>
        <p>To summarize the findings, CT triggers all t-factor faults
as guaranteed by the t-wise coverage criterion. Furthermore,
even more faults are triggered via collateral coverage. Test
suites that satisfy higher testing strengths t ≥ 3 trigger many
incorrect error-handlers with higher indices.</p>
        <p>Adding valid values to the test suite increases the AFDE,
but the required test suite size increases as well. Conversely,
adding invalid values decreases the AFDE. Additional
parameters with error-handling have no impact on FDE for existing
error-handlers, but the faults in the additional error-handlers
are harder to detect. Therefore, AFDE deteriorates with an
increasing number of parameters involved in error-handling.</p>
        <p>Overall, the experiment shows that CT can be an effective
approach to detect incorrect error-handlers, although not all
incorrect error-handlers can be detected reliably. Depending
on the number of error-handlers and the distribution of valid
and invalid values, a high testing strength is necessary which
requires the execution of large test suites.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>VIII. THREATS TO VALIDITY</title>
      <p>We compared the effectiveness of CT in the presence of
error-handling and invalid values. To this end, test inputs were
generated and executed on test scenarios. Publicly available
test suites were used to avoid bias in the test input generation.</p>
      <p>The used test scenarios are artificial and do not necessarily
represent realistic scenarios. In addition, it is possible that
we unconsciously designed the test scenarios in a way that
their pre-established characteristics are supported. However,
the considered characteristics are explicit and all information
is available online<sup>1</sup> so that it is comprehensible and repeatable.
In addition, the AFDE metric is designed to derive knowledge
and to apply it to real-world scenarios.</p>
      <p>To prevent falsified results due to accidental fault triggering,
the parameters and values of each test suite are randomized
and 100 variants of each test suite are combined into a test suite
family. All presented numbers are average numbers.</p>
    </sec>
    <sec id="sec-8">
      <title>IX. CONCLUSION</title>
      <p>CT is a generally effective approach to black-box test input
generation. When considering invalid values to also test for
robustness, the input masking effect can prevent faults from
being triggered. CRT is an extension to CT for robustness
testing that avoids the input masking effect. However, CRT imposes
extra costs since it requires additional semantic information
in the test model. The implications of input masking on
the effectiveness of CT are unclear beyond the general idea.
Therefore, it is unclear when CT can be used and when CRT
should be used despite the extra costs.</p>
      <p>In this paper, we designed and conducted a controlled
experiment to measure the effectiveness of CT in different
test scenarios. To this end, we applied the t-factor fault model
and discussed characteristics that are specific to error-handling
and invalid values. Based on these characteristics, artificial
test scenarios were designed and tested using publicly available
small test suites.</p>
      <p>The results of the experiment show that CT triggers all
t-factor faults as guaranteed by the t-wise coverage criterion.
Even more faults are triggered via collateral coverage. Test
suites that satisfy higher testing strengths t ≥ 3 trigger many
incorrect error-handlers with higher indices.</p>
      <p>Valid values increase FDE and AFDE and invalid values
decrease them. Additional parameters with error-handling have
the highest impact and deteriorate AFDE the most. While
t = 4 is sufficient in favourable cases with many valid values
and few parameters involved in error-handling, not even t = 5
is sufficient in unfavourable cases with many invalid values
and many parameters involved in error-handling.</p>
      <p>Overall, the experiment shows that CT can be effective
in the presence of error-handling and invalid values. But in
many cases, an advantageous distribution of valid and invalid
values, a high testing strength and thus very large test suites
are required.</p>
      <p>Despite the good performance of CT in many cases, CRT
is a promising approach that requires potentially fewer test
executions. However, a direct comparison is necessary to
decide if the execution of very large test suites required by
CT outweighs the additional modelling effort.</p>
      <p>In future work, we will extend the experiment to consider
invalid value combinations and configuration-dependent faults
as well, as they potentially increase the number of
parameters involved in the prevention and infection sub-conditions.
Further, we will compare not only the effectiveness but also
the efficiency of CT with CRT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] IEEE, “
          <article-title>IEEE Standard Glossary of Software Engineering Terminology</article-title>
          ,” IEEE Std 610.12-1990,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Avizienis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Laprie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Randell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Landwehr</surname>
          </string-name>
          , “
          <article-title>Basic concepts and taxonomy of dependable and secure computing,”</article-title>
          <source>IEEE Trans. Dependable Sec. Comput.</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Young</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pezze</surname>
          </string-name>
          ,
          <source>Software Testing and Analysis: Process, Principles and Techniques</source>
          . USA: John Wiley &amp; Sons, Inc.,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Xuan</surname>
          </string-name>
          , “
          <article-title>Eh-recommender: Recommending exception handling strategies based on program context</article-title>
          ,” in
          <source>23rd International Conference on Engineering of Complex Computer Systems, ICECCS 2018</source>
          , Melbourne, Australia, December 12-14,
          <year>2018</year>
          , pp.
          <fpage>104</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fögen</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lichter</surname>
          </string-name>
          , “
          <article-title>Combinatorial robustness testing with negative test cases</article-title>
          ,” in
          <source>2019 IEEE International Conference on Software Quality, Reliability and Security, QRS 2019</source>
          , Sofia, Bulgaria, July 22-26,
          <year>2019</year>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sawadpong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Allen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          , “
          <article-title>Exception handling defects: An empirical study</article-title>
          ,” in
          <source>14th International IEEE Symposium on High-Assurance Systems Engineering, HASE 2012</source>
          , Omaha, NE, USA, October 25-27,
          <year>2012</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Offutt</surname>
          </string-name>
          , “
          <article-title>Test oracle strategies for model-based testing</article-title>
          ,”
          <source>IEEE Trans. Software Eng.</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>395</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grindal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Offutt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Andler</surname>
          </string-name>
          , “
          <article-title>Combination testing strategies: a survey</article-title>
          ,”
          <source>Softw. Test., Verif. Reliab.</source>
          , vol.
          <volume>15</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>167</fpage>
          -
          <lpage>199</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Fredman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Patton</surname>
          </string-name>
          , “
          <article-title>The AETG system: An approach to testing based on combinatorial design</article-title>
          ,”
          <source>IEEE Trans. Software Eng.</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Czerwonka</surname>
          </string-name>
          , “
          <article-title>Pairwise testing in real world: Practical extensions to test case generators</article-title>
          ,” in
          <source>24th Pacific Northwest Software Quality Conference</source>
          , vol.
          <volume>200</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grindal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lindström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Offutt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Andler</surname>
          </string-name>
          , “
          <article-title>An evaluation of combination strategies for test case selection</article-title>
          ,” Department of Computer Science, University of Skövde,
          <source>Tech. Rep. HS-IDA-TR-03-001</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Tai</surname>
          </string-name>
          , “
          <article-title>In-parameter-order: A test generation strategy for pairwise testing</article-title>
          ,” in
          <source>3rd IEEE International Symposium on High-Assurance Systems Engineering (HASE '98)</source>
          , 13-14 November 1998, Washington, D.C., USA, Proceedings,
          <year>1998</year>
          , pp.
          <fpage>254</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , “
          <article-title>Tuple density: a new metric for combinatorial test suites</article-title>
          ,” in
          <source>Proceedings of the 33rd International Conference on Software Engineering, ICSE 2011</source>
          , Waikiki, Honolulu, HI, USA, May 21-28,
          <year>2011</year>
          , pp.
          <fpage>876</fpage>
          -
          <lpage>879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Petke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoo</surname>
          </string-name>
          , “
          <article-title>Practical combinatorial interaction testing: Empirical findings on efficiency and early fault detection</article-title>
          ,”
          <source>IEEE Trans. Software Eng.</source>
          , vol.
          <volume>41</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>901</fpage>
          -
          <lpage>924</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wojciak</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tzoref-Brill</surname>
          </string-name>
          , “
          <article-title>System level combinatorial testing in practice - the concurrent maintenance case study</article-title>
          ,” in
          <source>Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014</source>
          , March 31 2014-April 4, 2014, Cleveland, Ohio, USA,
          <year>2014</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kacker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          , “
          <article-title>ACTS: A combinatorial test generation tool</article-title>
          ,” in
          <source>Sixth IEEE International Conference on Software Testing, Verification and Validation, ICST 2013</source>
          , Luxembourg, Luxembourg, March 18-22, 2013,
          <year>2013</year>
          , pp.
          <fpage>370</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fögen</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lichter</surname>
          </string-name>
          , “
          <article-title>Combinatorial testing with constraints for negative test cases</article-title>
          ,” in
          <source>2018 IEEE International Conference on Software Testing, Verification and Validation Workshops</source>
          , ICST Workshops, Västerås, Sweden, April 9-13, 2018,
          <year>2018</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] --, “
          <article-title>Repairing over-constrained models for combinatorial robustness testing</article-title>
          ,” in
          <source>2019 IEEE International Conference on Software Quality, Reliability and Security Companion, QRS Companion 2019</source>
          , Sofia, Bulgaria, July 22-26, 2019,
          <year>2019</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] --, “
          <article-title>Semi-automatic repair of over-constrained models for combinatorial robustness testing</article-title>
          ,” in
          <source>26th Asia-Pacific Software Engineering Conference (APSEC)</source>
          , Putrajaya, Malaysia, Dec 2-5, 2019 (to be published),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] --, “
          <article-title>A case study on robustness fault characteristics for combinatorial testing - results and challenges</article-title>
          ,” in
          <source>Proceedings of the 6th International Workshop on Quantitative Approaches to Software Quality co-located with 25th Asia-Pacific Software Engineering Conference (APSEC 2018)</source>
          , Nara, Japan, December 4, 2018. CEUR-WS.org,
          <year>2018</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Petke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          , “
          <article-title>An empirical comparison of combinatorial testing, random testing and adaptive random testing</article-title>
          ,”
          <source>IEEE Transactions on Software Engineering</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Mallows</surname>
          </string-name>
          , “
          <article-title>Factor-covering designs for testing software</article-title>
          ,”
          <source>Technometrics</source>
          , vol.
          <volume>40</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Harel</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pnueli</surname>
          </string-name>
          , “
          <article-title>On the development of reactive systems</article-title>
          ,” in
          <source>Logics and Models of Concurrent Systems</source>
          , K. R. Apt, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg,
          <year>1985</year>
          , pp.
          <fpage>477</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Wallace</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          , “
          <article-title>Failure modes in medical device software: An analysis of 15 years of recall data</article-title>
          ,”
          <source>International Journal of Reliability, Quality and Safety Engineering</source>
          , vol.
          <volume>08</volume>
          , no.
          <issue>04</issue>
          , pp.
          <fpage>351</fpage>
          -
          <lpage>371</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Reilly</surname>
          </string-name>
          , “
          <article-title>An investigation of the applicability of design of experiments to software testing</article-title>
          ,” in
          <source>27th Annual NASA Goddard/IEEE Software Engineering Workshop, 2002. Proceedings.</source>
          , Dec
          <year>2002</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Wallace</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Gallo</surname>
          </string-name>
          , “
          <article-title>Software fault interactions and implications for software testing</article-title>
          ,”
          <source>IEEE Trans. Software Eng.</source>
          , vol.
          <volume>30</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>418</fpage>
          -
          <lpage>421</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>K. Z.</given-names>
            <surname>Bell</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Vouk</surname>
          </string-name>
          , “
          <article-title>On effectiveness of pairwise methodology for testing network-centric software</article-title>
          ,” in
          <source>2005 International Conference on Information and Communication Technology</source>
          , Dec
          <year>2005</year>
          , pp.
          <fpage>221</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Okun</surname>
          </string-name>
          , “
          <article-title>Pseudo-exhaustive testing for software</article-title>
          ,” in
          <source>30th Annual IEEE/NASA Software Engineering Workshop (SEW-30 2006)</source>
          , 25-28 April 2006, Loyola College Graduate Center, Columbia, MD, USA. IEEE Computer Society,
          <year>2006</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Kacker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          , “
          <article-title>Estimating t-way fault profile evolution during testing</article-title>
          ,” in
          <source>2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC)</source>
          , June
          <year>2016</year>
          , pp.
          <fpage>596</fpage>
          -
          <lpage>597</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Kacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          , “
          <article-title>Combinatorial testing for software: An adaptation of design of experiments</article-title>
          ,”
          <source>Measurement</source>
          , vol.
          <volume>46</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>3745</fpage>
          -
          <lpage>3752</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>Arcaini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gargantini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Radavelli</surname>
          </string-name>
          , “
          <article-title>Efficient and guaranteed detection of t-way failure-inducing combinations</article-title>
          ,” in
          <source>2019 IEEE International Conference on Software Testing, Verification and Validation Workshops, ICST Workshops 2019</source>
          , Xi'an, China, April 22-23, 2019,
          <year>2019</year>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Kacker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          , “
          <article-title>Refining the in-parameter-order strategy for constructing covering arrays</article-title>
          ,”
          <source>Journal of Research of the National Institute of Standards and Technology</source>
          , vol.
          <volume>113</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>297</lpage>
          ,
          Sep.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>