A Case Study on Robustness Fault Characteristics for Combinatorial Testing - Results and Challenges
https://ceur-ws.org/Vol-2273/QuASoQ-03.pdf
https://dblp.org/rec/conf/apsec/FogenL18
                          6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018)




A Case Study on Robustness Fault Characteristics
for Combinatorial Testing - Results and Challenges

Konrad Fögen
Research Group Software Construction
RWTH Aachen University
Aachen, NRW, Germany
foegen@swc.rwth-aachen.de

Horst Lichter
Research Group Software Construction
RWTH Aachen University
Aachen, NRW, Germany
lichter@swc.rwth-aachen.de



Abstract—Combinatorial testing is a well-known black-box testing approach. Empirical studies suggest the effectiveness of combinatorial coverage criteria. So far, the research focuses on positive test scenarios. But robustness is an important characteristic of software systems and testing negative scenarios is crucial. Combinatorial strategies are extended to generate invalid test inputs but the effectiveness of negative test scenarios is yet unclear. Therefore, we conduct a case study and analyze 434 failures reported as bugs of a financial enterprise application. As a result, 51 robustness failures are identified, including failures triggered by invalid value combinations and failures triggered by interactions of valid and invalid values. Based on the findings, four challenges for combinatorial robustness testing are derived.

Keywords—Software Testing, Combinatorial Testing, Robustness Testing, Test Design

I. INTRODUCTION

Combinatorial testing (CT) is a black-box approach to reveal conformance faults between the system under test (SUT) and its specification. An input parameter model (IPM) with input parameters and interesting values is derived from the specification. Test inputs are generated where each input parameter has a value assigned. The generation is usually automated and a combination strategy defines how values are selected [1].

CT can help to detect interaction failures, e.g. failures triggered by the interaction of two or more specific values. For instance, a bug report analyzed by Wallace and Kuhn [2] describes that "the ventilator could fail when the altitude adjustment feature was set on 0 meters and the total flow volume was set at a delivery rate of less than 2.2 liter per minute". The failure is triggered by the interaction of altitude=0 and delivery-rate<2.2. This is called a failure-triggering fault interaction (FTFI) and its dimension is d = 2 because the interaction of two input parameter values is required.

Testing each value only once is not sufficient to detect interaction faults and exhaustively testing all interactions among all input parameters is almost never feasible in practice. Therefore, other combinatorial coverage criteria like t-wise, where 1 ≤ t < n denotes the testing strength, are proposed [1].

The effectiveness of combinatorial coverage criteria is also researched in empirical studies [2]–[7]. Collected bug reports are analyzed and FTFI dimensions are determined for different types of software [4]. If all failures of a SUT are triggered by an interaction of d or fewer parameter values, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4]. No analyzed failure required an interaction of more than six parameter values to be triggered [7], [8]. The results indicate that 2-wise (pairwise) testing should trigger most failures and 4- to 6-wise testing should trigger all failures of a SUT.

However, so far research focuses on positive test scenarios, i.e. test inputs with valid values to test the implemented operations based on their specification. Since robustness is an important characteristic of software systems [9], testing of negative test scenarios is crucial. Invalid test inputs contain invalid values, e.g. a string value when a numerical value is expected, or invalid combinations of otherwise valid values, e.g. a begin date which is after the end date. They are used to check proper error-handling to avoid abnormal behavior and system crashes. Error-handling is usually separated from normal program execution. It is triggered by an invalid value or an invalid value combination and all other values of the test input remain untested. Therefore, a strict separation of valid and invalid test inputs is suggested and combination strategies are extended to support generation of invalid test inputs [1], [10]–[13].

But the effectiveness of negative test scenarios is unclear as it is not yet empirically researched. To the best of our knowledge, it is only Pan et al. [14], [15] who characterize data of faults from robustness testing. Their results obtained from testing the robustness of operating system APIs indicate that most robustness failures are caused by single invalid values. Though, there is no more information on failures caused by invalid value combinations. Because only one type of software is analyzed, more empirical studies are required to confirm (or reject) the distribution and upper limit of FTFIs for other software types (Kuhn and Wallace [4]).

To gather more information on failures triggered by invalid value combinations, we conducted a case study to analyze bug reports of a newly developed distributed enterprise application for financial services. In total, 683 bug reports are examined and 434 of them describe failures which are further analyzed.

The paper is structured as follows. Sections II and III summarize foundations and related work. In Section IV, the design of the case study is explained. The results are discussed


      Copyright © 2018 for this paper by its authors.                   22


Listing 1: Exemplary IPM for a Checkout Service

p1: PaymentType    V1 = {CreditCard, Bill}
p2: DeliveryType   V2 = {Standard, Express}
p3: TotalAmount    V3 = {1, 500}

Table I: Pairwise Test Suite

PaymentType   DeliveryType   TotalAmount
Bill          Express        500
Bill          Standard       1
CreditCard    Standard       500
CreditCard    Express        1
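The pairwise criterion that Table I satisfies can also be checked mechanically. The following sketch (an illustration for this paper's example, not part of the original study's tooling) encodes the IPM of Listing 1 and the suite of Table I and reports every pairwise value combination that is not covered by at least one test input:

```python
from itertools import combinations, product

# Exemplary IPM from Listing 1: three parameters with two values each.
ipm = {
    "PaymentType": ["CreditCard", "Bill"],
    "DeliveryType": ["Standard", "Express"],
    "TotalAmount": [1, 500],
}

# Pairwise test suite from Table I.
suite = [
    {"PaymentType": "Bill", "DeliveryType": "Express", "TotalAmount": 500},
    {"PaymentType": "Bill", "DeliveryType": "Standard", "TotalAmount": 1},
    {"PaymentType": "CreditCard", "DeliveryType": "Standard", "TotalAmount": 500},
    {"PaymentType": "CreditCard", "DeliveryType": "Express", "TotalAmount": 1},
]

def uncovered_pairs(ipm, suite):
    """Return all 2-wise value combinations not covered by any test input."""
    missing = []
    for p1, p2 in combinations(ipm, 2):           # all parameter pairs
        for v1, v2 in product(ipm[p1], ipm[p2]):  # all value combinations
            if not any(t[p1] == v1 and t[p2] == v2 for t in suite):
                missing.append(((p1, v1), (p2, v2)))
    return missing

print(uncovered_pairs(ipm, suite))      # [] -> pairwise coverage satisfied
print(uncovered_pairs(ipm, suite[1:]))  # pairs left uncovered without the first test input
```

Removing the first test input leaves exactly three pairs uncovered: (Bill, Express), (Bill, 500) and (Express, 500), which illustrates why every row of Table I is needed.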


in Section V and challenges for combinatorial testing are discussed in Section VII. Afterwards, potential threats to validity are discussed and we conclude with a summary of our work.

II. BACKGROUND

A. Robustness Testing

Testing is the activity of stimulating a system under test (SUT) and observing its response [16]. System testing (also called functional testing) is concerned with the behavior of the entire system and usually corresponds to business processes, use cases or user stories [17]. Both the stimulus and response consist of values. They are called test input and test output, respectively. In this context, input comprises anything explicable that is used to change the observable behaviour of the SUT. Output comprises anything explicable that can be observed after test execution.

A test case covers a certain scenario to check whether the SUT satisfies a particular requirement [18]. It consists of a test input and a test oracle [19]. The test input is necessary to induce the desired behavior. The test oracle provides the expected results which can be observed after test execution if and only if the SUT behaves as intended by its specification. Finally, the expected result and the actual result are compared to determine whether the test passes or fails.

Since robustness is an important software quality [9], testing should not only cover positive but also negative scenarios to evaluate a SUT. Robustness is defined as "the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions" [18]. Positive scenarios focus on valid intended operations of the SUT using valid test inputs that are within the specified boundaries. Negative scenarios focus on the error-handling using invalid test inputs that are outside of the specified boundaries. For instance, input that is malformed, e.g. a string input when numerical input is expected, or input that violates business rules, e.g. a begin date which is after the end date.

B. Combinatorial Testing

Combinatorial testing (CT) is a black-box approach to reveal interaction failures, i.e. failures triggered by the interaction of two or more specific values, because the SUT is tested with varying test inputs. A generic test script describes a sequence of steps to exercise the SUT with placeholders (variables) that represent variation points [17]. The variation points can be used to vary different inputs to the system, configuration variables or internal system states [8]. With CT, varying test inputs are created to instantiate the generic test script.

An input parameter model (IPM) is created for which input parameters and interesting values are derived from the specification. The IPM is represented as a set of n input parameters IPM = {p1, ..., pn} and each input parameter pi is represented as a non-empty set of values Vi = {v1, ..., vmi}.

Test inputs are composed from the IPM such that every test input contains a value for each input parameter. Formally, a test input is a set of parameter-value pairs for all n distinct parameters and a parameter-value pair (pi, vj) denotes a selection of value vj ∈ Vi for parameter pi.

Listing 1 depicts an exemplary IPM to test the checkout service of an e-commerce system with three input parameters and two values for each input parameter. One possible test input for this IPM is [PaymentType:Bill, DeliveryType:Standard, TotalAmount:1]. Formally, a test input τ = {(pi1, vj1), ..., (pin, vjn)} is denoted as a set of pairs. In this paper, we use the aforementioned notation with brackets which is equal to τ = {(p1, v2), (p2, v1), (p3, v1)}.

The composition of test inputs is usually automated and a combination strategy defines how values are selected [1]. Since testing each value only once is not sufficient to detect interaction faults and exhaustively testing all interactions among all input parameters is almost never feasible in practice, other coverage criteria like t-wise are proposed.

For illustration, Table I depicts a test suite for the e-commerce example that satisfies the pairwise coverage criterion. For more information on the different coverage criteria, please refer to Grindal et al. [1]. To satisfy the coverage criterion, all pairwise value combinations of PaymentType × DeliveryType, PaymentType × TotalAmount and DeliveryType × TotalAmount must be included in at least one test input. If the first test input was not executed, pairwise coverage would not be satisfied because the combinations [PaymentType:Bill, DeliveryType:Express], [PaymentType:Bill, TotalAmount:500] and [DeliveryType:Express, TotalAmount:500] would be untested.

In comparison to exhaustive testing, fewer test inputs are required to satisfy the other coverage criteria. But as the example illustrates, problems with only one test input might lead to combinations being not covered and failures that are triggered by these combinations remain undetected.

If we suppose that the checkout service requires a total amount of at least 25 dollars, then two test inputs of the example (Table I) with [TotalAmount:1] are expected to abort with a message to buy more products. In those cases, the SUT deviates from the normal control-flow and an error-handling procedure is triggered. The value [TotalAmount:1] that is responsible for triggering the error-handling is called an invalid value. If we also suppose that the checkout service rejects payment by bill for total amounts greater than 300 dollars, then [PaymentType:Bill, TotalAmount:500] would trigger





error-handling as well. Even though both values are valid, the combination of them denotes an invalid value combination.

Valid test inputs do not contain any invalid values or invalid value combinations. In contrast, an invalid test input contains at least one invalid value or invalid value combination. If an invalid test input contains exactly one invalid value or one invalid value combination, it is called a strong invalid test input.

Once the SUT evaluates an invalid value or invalid value combination, error-handling is triggered. The normal control-flow is left and all other values and value combinations of the test input remain untested. They are masked by the invalid value or invalid value combination [13]. This phenomenon is called the input masking effect which we adapt from Yilmaz et al. [20]: "The input masking effect is an effect that prevents a test case from testing all combinations of input values, which the test case is normally expected to test".

To prevent input masking, a strict separation of valid and invalid test inputs is suggested [1], [10]–[13]. Combination strategies are extended to support t-wise generation of invalid test inputs. Values can be marked as invalid to exclude them from valid test inputs and to include them in invalid test inputs. The invalid value is then combined with all (t−1)-wise combinations of valid values. An extension that we proposed also allows to explicitly mark and generate t-wise invalid test inputs based on invalid value combinations [13].

C. Fault Characteristics

According to IEEE [18], an error is "the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition." It is the result of a mistake made by a human and is manifested as a fault. In turn, a fault is statically present in the source code and is the identified or hypothesized cause of a failure. A failure is an external behavior of the SUT, i.e. a behavior observable or perceivable by the user, which is incorrect with regard to the specified or expected behavior.

In CT, we assume that the execution path through the SUT is determined by the values and value combinations of the test input. If an executed statement contains a fault that causes an observable failure and if a certain value or a certain value combination is required for executing the statement, then the value or value combination is called a failure-triggering fault interaction (FTFI). The number of parameters involved in a FTFI is its dimension denoted as d with 0 ≤ d ≤ n. For instance, if the checkout service contains a fault and accepts a total amount of one but only if express is chosen as the delivery type, then [TotalAmount:1, DeliveryType:Express] is a FTFI with a dimension of two.

In general, different types of triggers exist to expose failures [21]. A trigger is a set of conditions that exposes a failure if the conditions are satisfied. We focus on FTFIs, i.e. failures triggered by test input variations, rather than on failures triggered by ordering or timing of stimulation.

In addition, we introduce the following terms for robustness fault characteristics. A failure is a robustness failure if the FTFI contains an invalid value or an invalid value combination. Then, the number of parameters that constitute the invalid value or invalid value combination is denoted as the robustness size. In case of an invalid value, the robustness size is one.

The extension to support t-wise generation of invalid test inputs is based on the assumption that failures are triggered by an interaction of an invalid value (or invalid value combination) and a (t − 1)-wise combination of valid values of the other parameters. This is a robustness interaction and its robustness interaction dimension can be computed by subtracting the robustness size from the FTFI dimension. There is no robustness interaction if the robustness size and FTFI dimension are equal, i.e. the robustness interaction dimension is zero. For instance, there is no robustness interaction if [TotalAmount:1] or [PaymentType:Bill, TotalAmount:500] triggers a failure. In contrast, the robustness interaction dimension of the aforementioned example [TotalAmount:1, DeliveryType:Express] is one because the invalid value interacts with one valid value.

III. RELATED WORK

If the highest dimension of parameters involved in FTFIs is known before testing, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4]. However, d cannot be determined for a SUT a-priori because the faults are not known before testing. Hence, the motivation of fault characterization in black-box testing is to empirically derive fault characteristics to guide future test activities.

Existing research on the effectiveness of black-box testing derives the distribution and maximum of d among different types of software based on bug reports. Wallace and Kuhn [2] review 15 years of recall data from medical devices, i.e. software written for embedded systems. Kuhn and Reilly [3] analyze bug reports from two large open-source software projects, namely the Apache web server and the Mozilla web browser. Kuhn and Wallace [4] report findings from analyzing 329 bug reports of a large distributed data management system developed at NASA Goddard Space Flight Center. Bell and Vouk [5] analyze the effectiveness of pairwise testing for network-centric software. They derive their fault characteristics from a public database of security flaws and create simulations based on that data. Kuhn and Okum [6] apply combinatorial testing with different strengths to a module of a traffic collision avoidance system which is written in the C programming language. Though, the experiments use manually seeded "realistic" faults rather than a specific bug database. Cotroneo et al. [21] again analyze bug reports from Apache and the MySQL database system and Ratliff et al. [22] report the FTFI dimensions of 242 bug reports from the MySQL database system.

As concluded by Kuhn et al. [7], the studies show that most failures in the investigated domains are triggered by single parameter values and parameter value pairs. Progressively fewer failures are triggered by an interaction of three or more parameter values. In addition to the distribution of FTFIs, a maximum interaction of four to six parameter values is




identified. No reported failure required an interaction of more than six parameter values to be triggered. Thus, pairwise testing should trigger most failures and 4- to 6-wise testing should trigger all failures of a SUT [8].

Several tools include the concept of invalid values to support the generation of invalid test inputs [10]–[12]. An algorithm that we proposed [13] extends the concept to invalid value combinations. In their case study, Wojciak and Tzoref-Brill [23] report on system level combinatorial testing that includes testing of negative scenarios. There, t-wise coverage of negative test inputs is required because error-handling depends on a robustness interaction between invalid and valid values. Another case study by Offutt and Alluri [24] reports on the first application of CT for financial calculation engines but robustness is not further discussed.

From an empirical point of view, the effectiveness of negative test scenarios is yet unclear. To the best of our knowledge, it is only Pan et al. [14], [15] who characterize data on faults from robustness testing. The results of testing the robustness of operating system APIs indicate that most robustness failures are caused by single invalid values. Though, there is no more information on failures triggered by invalid value combinations. Also, as Kuhn and Wallace [4] state, more empirical studies are required to confirm (or reject) the distribution and upper limit of FTFIs for other software types. Therefore, we conducted another case study which is described in the subsequent sections.

IV. CASE STUDY DESIGN

A. Research Method

We follow the guidelines for conducting and reporting case study research in software engineering as suggested by Runeson and Höst [25]. As they state, a case study "investigates a contemporary phenomenon within its real life context, especially when the boundaries between phenomenon and context are not clearly evident". Case study research is typically used for exploratory purposes, e.g. seeking new insights and generating hypotheses for new research.

The guidelines suggest conducting a case study in five steps. First, the objectives are defined and the case study is planned. As a second step, the data collection is prepared before the data is collected in a third step. Afterwards, the collected data is analyzed and finally, the results of the analysis are reported.

B. Research Objective

The overall objective of this case study is to gather information on the effectiveness of combinatorial testing with invalid test inputs, and to compare the obtained results with the ones of other published case studies. For example, the work by Pan et al. [14], [15] indicates that most robustness failures in operating system APIs are triggered by single values rather than value combinations, i.e. a FTFI dimension and robustness size of one. Hence, our aim is to either confirm or reject this indication for enterprise applications. This leads to the following two concrete research objectives:

RO1: Identify the different FTFI dimensions of robustness failures that can be observed in our case study.

The generation of t-wise invalid test inputs is based on the assumption that failures are triggered by a robustness interaction between the invalid value (or invalid value combination) and a (t − 1)-wise combination of the valid values of the other parameters. If the highest robustness interaction dimension is known before testing, then testing all t-wise invalid test inputs of that dimension should be as effective as exhaustive robustness testing.

RO2: Identify different robustness sizes and derive the robustness interaction dimensions that can be observed in our case study.

C. Case and Unit of Analysis

The case is a software development project of an IT service provider for an insurance company. A new system is developed to manage the lifecycle of life insurances. It is based on an off-the-shelf framework which is customized and extended to meet the company's requirements. In total, the new system consists of 2.5 MLOC and an estimated workload of 5000 person days. The core is an inventory sub-system with a central database to store information on customers' life insurance contracts. In addition, complex financial calculation engines and business processes like capturing and creating new customer insurances are implemented. The business processes also integrate with a variety of already existing systems which are, for instance, responsible for managing information about the contract partners and about claims and damages, and for supporting insurance agents.

Since life insurance contracts have decade-long lifespans and rely on complex financial models, the correctness of the system is business critical. Mistakes can have severe effects which can even amplify over the long-lasting lifespans and cause enormous damage to the company. Therefore, thorough testing is important.

Even though the business processes are managed by the new system, they rely on other systems of which each again relies on other systems. This makes it hard to test the system or its parts in isolation. It is also difficult to control the state of the systems and to observe the complete behavior which makes testing even more complicated.

Therefore, most testing is conducted on a system level within an integrated test environment in which all required systems are deployed. The test design is often based on experience and error-guessing. Tests are executed mostly manually because of the low controllability and observability.

D. Data Collection Procedure

To address the research objectives, the case study relies on archival data from the aforementioned software development project. A project-wide issue management system contains all bug reports from the project start in 2015 to the productive deployment at the beginning of 2018. In general, a bug report is a specifically categorized issue which coexists with other project management- and development-related issues.


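Both research objectives build on the terms introduced in Section II-C. As a minimal sketch of the underlying arithmetic (an illustration, not the study's actual analysis tooling), the robustness interaction dimension of a classified robustness failure is its FTFI dimension minus its robustness size:

```python
# Illustrative helper for the classification arithmetic of Section II-C.
# A robustness failure is characterized by its FTFI dimension (number of
# parameters in the failure-triggering interaction) and its robustness size
# (number of parameters forming the invalid value or invalid value combination).

def robustness_interaction_dimension(ftfi_dimension: int, robustness_size: int) -> int:
    """FTFI dimension minus robustness size; zero means no robustness interaction."""
    if not 1 <= robustness_size <= ftfi_dimension:
        raise ValueError("robustness size must be between 1 and the FTFI dimension")
    return ftfi_dimension - robustness_size

# [TotalAmount:1] alone: robustness size 1, FTFI dimension 1 -> no interaction.
print(robustness_interaction_dimension(1, 1))  # 0
# [TotalAmount:1, DeliveryType:Express]: invalid value interacting with one valid value.
print(robustness_interaction_dimension(2, 1))  # 1
# [PaymentType:Bill, TotalAmount:500]: invalid value combination of size 2.
print(robustness_interaction_dimension(2, 2))  # 0
```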


For our case study, we analyzed the issue's title, its category, its initial description, additional information in the comment section and its status. Further on, some bug reports are also connected to a central source code management system. If a bug report does not contain sufficient information, the corresponding source code modifications can be analyzed as well.

The issues are filtered to restrict the analysis to only reasonable bug reports. Therefore, issues created automatically by static analysis tools are excluded. Further on, only issues categorized as bug reports whose status is set to complete are considered because we expect only them to contain a correct description of how to systematically reproduce the failure.

E. Data Analysis Procedure

Once the bug reports are exported from the issue management system, each bug report is analyzed one at a time. First, it is checked whether the bug report describes a failure in the sense that an incorrect behaviour is observable by the user. Otherwise, the bug report is rejected.

Afterwards, the trigger type of the reported failure is determined. The bug report is not further analyzed if no systematically reproducible trigger is found. It is also rejected if the failure is not triggered by a test input variation but rather by unlikely ordering or timing. If a specific value or value combination is identified to trigger the failure, the dimension of the FTFI is determined in the next step.

Then, the bug report is classified as either positive or negative depending on whether any invalid values or invalid value combinations are contained. If it is classified as negative, i.e. if it is a robustness failure, the robustness size of the invalid value or invalid value combination is determined as well.

A robustness size which is lower than the FTFI dimension indicates a robustness interaction between the invalid value (combination) and valid combinations of the other parameters. If possible, the robustness interaction dimension is also extracted from the bug report.

V. RESULTS AND DISCUSSION

A. Analyzed Data

In total, 683 bug reports are analyzed. All reported bugs were revealed and fixed during the development phase of the system. Even though filters are applied to export the bug reports, 249 bug reports are classified as unrelated because the issue management system is also used as a communication and task management tool. For instance, problems with configurations of test environments, refactorings or build problems are categorized as bug reports as well.

The remaining 434 bug reports describe failures; they are classified as follows. Eight bug reports do not provide enough information for further analysis and classification. 38 reported bugs require specific timing and ordering of sequences to be triggered. For instance, one sequence to trigger a failure is to search for a customer, open its details, edit the birthday, press cancel and edit the birthday again. Three reported bugs are related to robustness testing. They are triggered by other systems that time out and do not respond to requests. All these bug reports are excluded from further analysis because CT is about varying test inputs rather than varying sequences and timing.

The remaining 388 bug reports describe failures triggered by some test input. A subset of 176 bug reports describes integration failures with other systems where values are not correctly mapped from one data structure to another. They can be triggered by any test input. There are so many reported integration issues (45% of the 388 bug reports triggered by some test input) because the system consists of several independently developed components which are early and often integrated with the other components and other systems using one of the test environments.

Finally, 212 bug reports are considered to be suitable for CT, which is 49% of all 434 bug reports that describe failures and 55% of all 388 bug reports that are triggered by some test input. To reproduce one of the reported bugs, the test input requires at least one specific value.

B. Observed FTFI Dimensions

The observed FTFI dimensions for the 212 bug reports are depicted in Table II. Most failures are triggered by single parameter values and parameter value pairs, and progressively fewer failures are triggered by 3- and 4-wise interactions. In our case, no reported bug requires an interaction of more than 4 parameters in order to trigger the failure.

Table III presents the cumulative percentages of the FTFI dimensions. The last three columns refer to our case study and show that 76% of all reported failures require 1-wise (each choice) coverage to be reliably triggered. This adds up to 96% when testing with pairwise coverage, and 100% are covered when all 4-wise parameter value combinations are used for testing.

To compare our results, the first columns of the table show the results of previous case studies, briefly introduced in the related work section. The numbers and also the average percentage values are taken from Kuhn et al. [7]. The distribution of FTFIs obtained in our case study is not in contradiction to the other cases. However, the distribution is most similar to cases [2] and [4]. While there are no obvious similarities with embedded systems for medical devices [2], the large data management system [4] is probably quite similar to our case in terms of requirements and used technologies. Similar to our case, the bug reports are also from a development project whereas the other studies analyze fielded products [7]. For all three cases, most failures are triggered by single parameter values and almost all failures are triggered by the combination of single parameter values and pairwise parameter value combinations. All failures should be triggered by 4-wise parameter value combinations.

So far, only the dimension of failure-triggering fault interactions is considered but differences between positive and negative scenarios are not discussed.

All in all, 51 robustness failures are identified which are classified as follows. 22 failures are caused by incorrect error-detection of abnormal situations because conditions to detect



Table II: Observed FTFI Dimensions

    d    All   Positive   Negative
    1    162     121         41
    2     40      31          9
    3      6       5          1
    4      4       4          -
    5      -       -          -
    6      -       -          -

Table III: Cumulative Percentage of FTFI Dimensions

             --------- Previous Studies ---------              -- Our Study --
    d    [2]   [3]a   [3]b    [4]    [5]   [21]   [22]   Avg.    All   Pos.   Neg.
    1     66     28     41     67     18      9     49    39.7    76     75     80
    2     97     76     70     93     62     47     86    75.9    96     94     98
    3     99     95     89     98     87     75     97    91.4    98     98    100
    4    100     97     96    100     97     97     99    98.0   100    100
    5            99     96           100    100    100    99.0
    6           100    100                               100.0

abnormal situations are either wrong or missing. Consequently, these abnormal situations are not discovered. For instance, a bank transfer is accepted as a payment option even though incorrect or no bank account information is provided.

In 19 cases, reported failures are caused by incorrect error-signaling. Errors are signaled if an abnormal situation is detected but the error should be handled somewhere else. For instance, a misspelled first name is detected by a user registration service but the error message complains about a misspelled last name.

For three reported failures, the abnormal situation is correctly detected and the error is correctly signaled. However, the system performs incorrect error-recovery because the instructions to recover from the abnormal situation contain faults. For instance, the user is asked to correct wrong input, e.g. a misspelled first name. After the input is corrected, the system does not recover and the corrected input cannot be processed.

In seven cases, failures are triggered by the system's runtime environment. For instance, a NullPointerException is signaled when the runtime environment detects unexpected and illegal access of NULL values. Since developers did not expect NULL values, no respective error-handlers are implemented and the processes terminate. These failures denote incorrect flows from error-signaling to error-recovery.

Table II depicts the observed FTFI dimensions and their distribution divided into positive and negative test scenarios. As can be seen, the maximum dimension of robustness interaction is three. Compared to positive test scenarios, the negative scenarios discover fewer failures and the FTFI dimensions are also lower. For single parameter values and parameter value pairs, the ratio of valid vs. invalid test inputs is 3:1, and no invalid test inputs are identified for higher dimensions.

While these numbers indicate that most failures are triggered by valid test inputs, we emphasize that the test design is based on experience and error-guessing; robustness testing was not in the focus. Hence, the ratio can also result from a general bias towards testing of positive scenarios which has been identified in research [26]–[28].

Nevertheless, these findings underpin the results of Pan et al. [14], [15] who observe that most robustness failures in operating system APIs are triggered by single invalid values. In their study, 82% of robustness failures are triggered by single invalid values. We observe a similar ratio in our case.

The bug reports also demonstrate the importance of strong invalid test inputs, i.e. test inputs with exactly one invalid value or exactly one invalid value combination. For instance, the component that manages contracting parties ensures data quality by checking that, e.g., the title of a person matches the gender of the first name and that the first name and family name are correct and not confused with each other. However, when using an unknown invalid title, the system responds with a wrong error message saying that the family name was wrong. If an invalid family name had been combined with the unknown title, the failure would not have been discovered.

To address the second research objective, the 10 invalid test inputs with an FTFI dimension greater than one are further analyzed. As a result, two reported bugs that describe failures with robustness interactions are discovered. Even though a combination of two and three specific parameter values is required to trigger the robustness failures, the robustness size is only one and two, respectively.

Furthermore, two reported bugs require an interaction of an invalid value (combination) and a valid value of another parameter. One reported bug is related to the communication between two systems. The response of the second system contains one parameter value with error information to indicate whether the requested operation succeeded or failed. Another parameter provides details about the internal processing of the request, and a certain value indicates an internal resolution of the error. In that case, the calling system is expected to handle the error in a different way.

Another reported bug belongs to the storage of details on contracting parties where one contracting party must be responsible for paying the insurance premiums. This responsibility is stored as a role called contributor. If direct debit is chosen as the payment method but an invalid bank account, i.e. an invalid IBAN number, is provided, the resulting error message remains even after the invalid IBAN is replaced by a valid IBAN. While the combination [payment-method:direct-debit, account-number:invalid] is required as an invalid combination, the bug report states that this phenomenon could only be observed for [role:contributor].

VI. THREATS TO VALIDITY

The biggest threat to validity is that case studies are difficult to generalize from [25], especially because only one particular type of software of one company is analyzed. The archival data of the case study is only a snapshot and the ground truth, i.e. the set of all failures that can be triggered, is unknown. Hence, the data set can be biased, for instance, towards positive scenarios, which has been observed in research [26]–[28]. Since the bug reports result from tests based on experience


and error-guessing, it may apply here as well.

The data can also be biased towards certain fault characteristics. Relevant and reasonable bug reports may be excluded by our filtering because the bug reports are incorrectly categorized. Maybe not all triggered failures are reported. For instance, a developer who finds a fault might just fix it without creating a bug report.

VII. CHALLENGES FOR COMBINATORIAL TESTING

One challenge in combinatorial testing is to find an effective coverage criterion. Based on the aforementioned empirical studies, a recommendation for positive scenarios is to use pairwise coverage to trigger most failures and 4- to 6-wise coverage to trigger all failures. For the application in practice, one major challenge is to generate test suites of minimal or small size with 4- to 6-wise coverage.

For testing negative scenarios, different challenges in combinatorial testing can be observed.

In our case study, four classes of incorrect error-handling are identified. First, incorrect error-detection is caused by conditions which are either too strict or too loose. Second, incorrect error-signaling results in a wrong type of error being signaled. Third, incorrect recovery from a signaled error is caused by a fault in the appropriate recovery instructions. Fourth, incorrect flow from error-signaling to error-recovery is caused by a signaled error for which no appropriate recovery instructions are implemented.

Challenge 1 - Avoid the Input Masking Effect: Incorrect error-detection that is caused by a too strict condition can be revealed by positive test input that mistakenly triggers error-recovery. But revealing a condition that is too loose requires invalid test input that mistakenly does not trigger error-recovery. To ensure that too strict and too loose conditions can be detected, the generation of valid and invalid test inputs must be separated and both sets of test inputs must satisfy separate coverage criteria.

Challenge 2 - Generate Strong Invalid Test Inputs: Another challenge is the generation of strong invalid test inputs such that one invalid value or invalid value combination cannot mask another. Incorrect error-detection and incorrect error-recovery may remain undetected if the signal that would result from an incorrect condition is masked by the computation of another invalid value or invalid value combination.

Challenge 3 - Consider Invalid Value Combinations: Since the error-detection conditions may depend on an arbitrary number of input values, it is not sufficient to only consider invalid values as most combinatorial testing tools do. As our case study and Pan et al. [14], [15] show, 80% of the robustness failures are triggered by single invalid values, i.e. a robustness size of one, but the other 20% of the robustness failures require invalid value combinations to be triggered. Error-detection with more complex conditions must be tested as well. Invalid value combinations should be excluded when generating positive test inputs but included when generating invalid test inputs [13]. Therefore, appropriate modeling facilities and algorithms that consider invalid value combinations are another challenge.

Challenge 4 - Support Alternative Coverage Criteria: To reveal incorrect handling and incorrect recovery, the SUT must be stimulated by the failure-triggering invalid test input. The majority of analyzed robustness failures does not indicate any robustness interaction between valid values and invalid values or invalid value combinations. Then, the failure is triggered by the invalid value or invalid value combination alone. To satisfy the coverage criterion, it is sufficient to have a separate test input for each invalid value or invalid value combination.

However, robustness failures where invalid values or invalid value combinations interact with valid values and valid value combinations could also be observed. Here, the failure is triggered by a t-wise interaction of one or more valid values and the invalid value or invalid value combination. For instance, suppose the valid role=contributor is responsible for selecting the strategy which is used to process the bank data of a customer. If the invalid combination of [payment-method:direct-debit] and [account-number:invalid] is handled incorrectly by the selected strategy, then the interaction of all three values is required to trigger the failure.

The observed failures of our case study are in line with a case study by Wojciak and Tzoref-Brill [23] who faced error-handling that differed depending on the firmware in control and on system configurations. Different configuration options can also be modelled as input parameters; a robustness interaction of configuration options with invalid values and valid value combinations is also reasonable.

Since only low dimensions of robustness interaction are observed, we believe it is unlikely that the generation of 4- to 6-wise test suites is a challenge here as well. Instead, alternative coverage criteria that, for instance, allow a variable-strength interaction with some other input parameters can become relevant to reduce the number of test inputs.

VIII. CONCLUSION

The effectiveness of negative test scenarios is unclear from an empirical point of view. We conducted a case study to get information on failures triggered by invalid test inputs. The motivation for our and other studies was that if all failures are triggered by an interaction of d or fewer parameter values, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4].

In our case study we analyzed bug reports which originate from a development project that manages life insurances. In total, 683 bug reports are analyzed. 434 bug reports describe actual failures and 212 of them are triggered by an interaction of one or more specific parameter values.

In general, the distribution of FTFI dimensions conforms to the pattern of previous empirical studies. But in contrast to positive test scenarios, fewer robustness failures with lower FTFI dimensions are identified. Overall, the robustness failures are grouped into four classes: incorrect error-detection, incorrect error-signaling, incorrect recovery from a signaled error and incorrect flow from error-signaling to error-recovery. Most robustness failures (80%) are triggered by single invalid values.


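The 80% figure, and the cumulative percentages reported in the "Neg." column of Table III, can be checked directly against the "Negative" counts of Table II; a minimal sketch:

```python
# Recomputing the robustness figures from the Table II "Negative" column:
# 41 robustness failures at FTFI dimension 1, 9 at dimension 2 and
# 1 at dimension 3 (51 in total).
negative_by_dimension = {1: 41, 2: 9, 3: 1}
total = sum(negative_by_dimension.values())  # 51 robustness failures

# Share of robustness failures triggered by a single invalid value.
single_invalid_share = round(100 * negative_by_dimension[1] / total)  # 80

# Cumulative percentages, i.e. the "Neg." column of Table III.
cumulative = []
covered = 0
for d in sorted(negative_by_dimension):
    covered += negative_by_dimension[d]
    cumulative.append(round(100 * covered / total))  # [80, 98, 100]
```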
The remaining robustness failures require an interaction of two and three input parameter values. Two reported bugs require an interaction of valid values with an invalid value or invalid value combination to trigger the robustness failure.

Based on the findings of this case study, we derive challenges for combinatorial robustness testing. To ensure that failures do not remain hidden, possible masking should be reduced. Valid and invalid test inputs should be separated, and invalid test inputs should be strong, i.e. should only contain one invalid value or invalid value combination.

Further on, it is not sufficient to only consider invalid values as most combinatorial testing tools do. Invalid value combinations should be excluded when generating valid test inputs but considered for invalid test inputs. Therefore, appropriate modeling facilities and algorithms are required.

Since only low robustness interactions are observed, the generation of test inputs with 4- to 6-wise coverage is not that important for negative scenarios. But the support of variable-strength generation for invalid inputs is another challenge.

Most robustness failures do not involve any robustness interaction. But there are situations where robustness interactions can be observed since different input values, configuration options or internal states are modelled as input parameter values. Depending on the expected costs of failure, t-wise testing of invalid test inputs is an option.

In the future, we will work on facilities that support the modelling of invalid value combinations and we will integrate variable strength into a combinatorial algorithm for invalid input generation. To reduce the number of invalid test inputs, we will conduct experiments to investigate the efficiency of different coverage criteria.

REFERENCES

[1] M. Grindal, J. Offutt, and S. F. Andler, "Combination testing strategies: A survey," Software Testing, Verification and Reliability, vol. 15, no. 3, 2005.
[2] D. R. Wallace and D. R. Kuhn, "Failure modes in medical device software: An analysis of 15 years of recall data," International Journal of Reliability, Quality and Safety Engineering, vol. 8, no. 4, pp. 351–371, 2001.
[3] D. R. Kuhn and M. J. Reilly, "An investigation of the applicability of design of experiments to software testing," in 27th Annual NASA Goddard/IEEE Software Engineering Workshop, Dec 2002, pp. 91–95.
[4] D. R. Kuhn, D. R. Wallace, and A. M. Gallo, "Software fault interactions and implications for software testing," IEEE Transactions on Software Engineering, vol. 30, no. 6, pp. 418–421, June 2004.
[5] K. Z. Bell and M. A. Vouk, "On effectiveness of pairwise methodology for testing network-centric software," in Enabling Technologies for the New Knowledge Society: ITI 3rd International Conference on Information and Communications Technology. IEEE, 2005, pp. 221–235.
[6] D. R. Kuhn and V. Okum, "Pseudo-exhaustive testing for software," in 2006 30th Annual IEEE/NASA Software Engineering Workshop, April 2006, pp. 153–158.
[7] D. R. Kuhn, R. N. Kacker, and Y. Lei, "Estimating t-way fault profile evolution during testing," in 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), vol. 2. IEEE, 2016, pp. 596–597.
[8] R. Tzoref-Brill, "Advances in combinatorial testing," ser. Advances in Computers. Elsevier, 2018.
[9] M. M. Hassan, W. Afzal, M. Blom, B. Lindstrom, S. F. Andler, and S. Eldh, "Testability and software robustness: A systematic literature review," in 2015 41st Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2015.
[10] D. M. Cohen, S. R. Dalal, M. L. Fredman, and G. C. Patton, "The AETG system: An approach to testing based on combinatorial design," IEEE Transactions on Software Engineering, vol. 23, no. 7, 1997.
[11] J. Czerwonka, "Pairwise testing in real world," in 24th Pacific Northwest Software Quality Conference, 2006.
[12] L. Yu, Y. Lei, R. N. Kacker, and D. R. Kuhn, "ACTS: A combinatorial test generation tool," in 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation (ICST). IEEE, 2013, pp. 370–375.
[13] K. Fögen and H. Lichter, "Combinatorial testing with constraints for negative test cases," in 2018 IEEE Eleventh International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 7th International Workshop on Combinatorial Testing (IWCT), 2018.
[14] J. Pan, "The dimensionality of failures - a fault model for characterizing software robustness," in Proc. FTCS '99, June 1999.
[15] J. Pan, P. Koopman, and D. Siewiorek, "A dimensionality model approach to testing and improving software robustness," in AUTOTESTCON '99, IEEE Systems Readiness Technology Conference. IEEE, 1999, pp. 493–501.
[16] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, "The oracle problem in software testing: A survey," IEEE Transactions on Software Engineering, vol. 41, no. 5, 2015.
[17] G. Meszaros, xUnit Test Patterns: Refactoring Test Code. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2007.
[18] IEEE, "IEEE standard glossary of software engineering terminology," IEEE Std 610.12-1990, 1990.
[19] N. Li and J. Offutt, "Test oracle strategies for model-based testing," IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 372–395, April 2017.
[20] C. Yilmaz, E. Dumlu, M. B. Cohen, and A. Porter, "Reducing masking effects in combinatorial interaction testing: A feedback driven adaptive approach," IEEE Transactions on Software Engineering, vol. 40, no. 1, 2014.
[21] D. Cotroneo, R. Pietrantuono, S. Russo, and K. Trivedi, "How do bugs surface? A comprehensive study on the characteristics of software bugs manifestation," Journal of Systems and Software, vol. 113, pp. 27–43, 2016.
[22] Z. B. Ratliff, D. R. Kuhn, R. N. Kacker, Y. Lei, and K. S. Trivedi, "The relationship between software bug type and number of factors involved in failures," in 2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Oct 2016, pp. 119–124.
[23] P. Wojciak and R. Tzoref-Brill, "System level combinatorial testing in practice - the concurrent maintenance case study," in 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation (ICST), 2014.
[24] J. Offutt and C. Alluri, "An industrial study of applying input space partitioning to test financial calculation engines," Empirical Software Engineering, vol. 19, no. 3, pp. 558–581, Jun 2014.
[25] P. Runeson and M. Höst, "Guidelines for conducting and reporting case study research in software engineering," Empirical Software Engineering, vol. 14, no. 2, p. 131, Dec 2008.
[26] L. M. Leventhal, B. M. Teasley, D. S. Rohlman, and K. Instone, "Positive test bias in software testing among professionals: A review," in Human-Computer Interaction, L. J. Bass, J. Gornostaev, and C. Unger, Eds. Berlin, Heidelberg: Springer, 1993, pp. 210–218.
[27] B. E. Teasley, L. M. Leventhal, C. R. Mynatt, and D. S. Rohlman, "Why software testing is sometimes ineffective: Two applied studies of positive test strategy," Journal of Applied Psychology, vol. 79, no. 1, p. 142, 1994.
[28] A. Causevic, R. Shukla, S. Punnekkat, and D. Sundmark, "Effects of negative testing on TDD: An industrial experiment," in Agile Processes in Software Engineering and Extreme Programming, H. Baumeister and B. Weber, Eds. Berlin, Heidelberg: Springer, 2013, pp. 91–105.