6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018)

A Case Study on Robustness Fault Characteristics for Combinatorial Testing - Results and Challenges

Konrad Fögen, Horst Lichter
Research Group Software Construction, RWTH Aachen University, Aachen, NRW, Germany
foegen@swc.rwth-aachen.de, lichter@swc.rwth-aachen.de

Abstract—Combinatorial testing is a well-known black-box testing approach. Empirical studies suggest the effectiveness of combinatorial coverage criteria. So far, the research focuses on positive test scenarios. But robustness is an important characteristic of software systems, and testing negative scenarios is crucial. Combinatorial strategies are extended to generate invalid test inputs, but the effectiveness of negative test scenarios is yet unclear. Therefore, we conduct a case study and analyze 434 failures reported as bugs of a financial enterprise application. As a result, 51 robustness failures are identified, including failures triggered by invalid value combinations and failures triggered by interactions of valid and invalid values. Based on the findings, four challenges for combinatorial robustness testing are derived.

Keywords—Software Testing, Combinatorial Testing, Robustness Testing, Test Design

I. INTRODUCTION

Combinatorial testing (CT) is a black-box approach to reveal conformance faults between the system under test (SUT) and its specification. An input parameter model (IPM) with input parameters and interesting values is derived from the specification. Test inputs are generated where each input parameter has a value assigned. The generation is usually automated and a combination strategy defines how values are selected [1].

CT can help detecting interaction failures, e.g. failures triggered by the interaction of two or more specific values. For instance, a bug report analyzed by Wallace and Kuhn [2] describes that "the ventilator could fail when the altitude adjustment feature was set on 0 meters and the total flow volume was set at a delivery rate of less than 2.2 liter per minute". The failure is triggered by the interaction of altitude=0 and delivery-rate<2.2. This is called a failure-triggering fault interaction (FTFI) and its dimension is d = 2 because the interaction of two input parameter values is required.

Testing each value only once is not sufficient to detect interaction faults, and exhaustively testing all interactions among all input parameters is almost never feasible in practice. Therefore, other combinatorial coverage criteria like t-wise, where 1 <= t < n denotes the testing strength, are proposed [1].

The effectiveness of combinatorial coverage criteria is also researched in empirical studies [2]-[7]. Collected bug reports are analyzed and FTFI dimensions are determined for different types of software [4]. If all failures of a SUT are triggered by an interaction of d or fewer parameter values, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4]. No analyzed failure required an interaction of more than six parameter values to be triggered [7], [8]. The results indicate that 2-wise (pairwise) testing should trigger most failures and 4- to 6-wise testing should trigger all failures of a SUT.

However, so far research focuses on positive test scenarios, i.e. test inputs with valid values to test the implemented operations based on their specification. Since robustness is an important characteristic of software systems [9], testing of negative test scenarios is crucial. Invalid test inputs contain invalid values, e.g. a string value when a numerical value is expected, or invalid combinations of otherwise valid values, e.g. a begin date which is after the end date. They are used to check proper error-handling to avoid abnormal behavior and system crashes. Error-handling is usually separated from normal program execution. It is triggered by an invalid value or an invalid value combination, and all other values of the test input remain untested. Therefore, a strict separation of valid and invalid test inputs is suggested, and combination strategies are extended to support the generation of invalid test inputs [1], [10]-[13].

But the effectiveness of negative test scenarios is unclear as it is not yet empirically researched. To the best of our knowledge, it is only Pan et al. [14], [15] who characterize data of faults from robustness testing. Their results, obtained from testing the robustness of operating system APIs, indicate that most robustness failures are caused by single invalid values. Though, there is no more information on failures caused by invalid value combinations. Because only one type of software is analyzed, more empirical studies are required to confirm (or reject) the distribution and upper limit of FTFIs for other software types (Kuhn and Wallace [4]).

To gather more information on failures triggered by invalid value combinations, we conducted a case study to analyze bug reports of a newly developed distributed enterprise application for financial services. In total, 683 bug reports are examined and 434 of them describe failures which are further analyzed.

The paper is structured as follows. Sections II and III summarize foundations and related work. In Section IV, the design of the case study is explained. The results are discussed in Section V and challenges for combinatorial testing are discussed in Section VII. Afterwards, potential threats to validity are discussed and we conclude with a summary of our work.

Copyright © 2018 for this paper by its authors.
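To make the FTFI notion concrete, the ventilator report can be mimicked with a toy fault seeded by us (an illustration, not the device's actual logic): the failure surfaces only when two specific values interact, so each-choice (1-wise) testing can miss it, while testing all value combinations of the two parameters necessarily hits the faulty one.

```python
# Toy sketch of a 2-wise interaction fault, modeled after the ventilator
# report by Wallace and Kuhn [2]: the failure is triggered only when
# altitude == 0 AND delivery_rate < 2.2, i.e. the FTFI dimension is d = 2.

def ventilator_ok(altitude: int, delivery_rate: float) -> bool:
    """Toy SUT: fails only for the specific 2-wise interaction."""
    if altitude == 0 and delivery_rate < 2.2:
        return False  # seeded interaction fault
    return True

# Each-choice (1-wise) testing uses every value at least once,
# but these two test inputs never exercise the failing combination.
each_choice = [(0, 5.0), (3000, 1.0)]
# Covering all four combinations of altitude x delivery_rate (pairwise
# for these two parameters) necessarily includes the faulty one.
pairwise = [(0, 5.0), (0, 1.0), (3000, 5.0), (3000, 1.0)]

each_choice_detects = any(not ventilator_ok(a, r) for a, r in each_choice)
pairwise_detects = any(not ventilator_ok(a, r) for a, r in pairwise)
print(each_choice_detects, pairwise_detects)  # False True
```

This is exactly the argument behind t-wise coverage criteria: a suite that covers all d-wise combinations is guaranteed to contain every FTFI of dimension d or less.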
II. BACKGROUND

A. Robustness Testing

Testing is the activity of stimulating a system under test (SUT) and observing its response [16]. System testing (also called functional testing) is concerned with the behavior of the entire system and usually corresponds to business processes, use cases or user stories [17]. Both the stimulus and the response consist of values. They are called test input and test output, respectively. In this context, input comprises anything explicable that is used to change the observable behaviour of the SUT. Output comprises anything explicable that can be observed after test execution.

A test case covers a certain scenario to check whether the SUT satisfies a particular requirement [18]. It consists of a test input and a test oracle [19]. The test input is necessary to induce the desired behavior. The test oracle provides the expected results which can be observed after test execution if and only if the SUT behaves as intended by its specification. Finally, the expected result and the actual result are compared to determine whether the test passes or fails.

Since robustness is an important software quality [9], testing should not only cover positive but also negative scenarios to evaluate a SUT. Robustness is defined as "the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions" [18]. Positive scenarios focus on valid intended operations of the SUT using valid test inputs that are within the specified boundaries. Negative scenarios focus on the error-handling using invalid test inputs that are outside of the specified boundaries, for instance, input that is malformed, e.g. a string input when numerical input is expected, or input that violates business rules, e.g. a begin date which is after the end date.

B. Combinatorial Testing

Combinatorial testing (CT) is a black-box approach to reveal interaction failures, i.e. failures triggered by the interaction of two or more specific values, because the SUT is tested with varying test inputs. A generic test script describes a sequence of steps to exercise the SUT with placeholders (variables) that represent variation points [17]. The variation points can be used to vary different inputs to the system, configuration variables or internal system states [8]. With CT, varying test inputs are created to instantiate the generic test script.

An input parameter model (IPM) is created for which input parameters and interesting values are derived from the specification. The IPM is represented as a set of n input parameters IPM = {p1, ..., pn} and each input parameter pi is represented as a non-empty set of values Vi = {v1, ..., vmi}. Test inputs are composed from the IPM such that every test input contains a value for each input parameter. Formally, a test input is a set of parameter-value pairs for all n distinct parameters, and a parameter-value pair (pi, vj) denotes a selection of value vj ∈ Vi for parameter pi.

p1: PaymentType     V1 = {CreditCard, Bill}
p2: DeliveryType    V2 = {Standard, Express}
p3: TotalAmount     V3 = {1, 500}

Listing 1: Exemplary IPM for a Checkout Service

Listing 1 depicts an exemplary IPM to test the checkout service of an e-commerce system with three input parameters and two values for each input parameter. One possible test input for this IPM is [PaymentType:Bill, DeliveryType:Standard, TotalAmount:1]. Formally, a test input τ = {(pi1, vj1), ..., (pin, vjn)} is denoted as a set of pairs. In this paper, we use the aforementioned notation with brackets, which is equal to τ = {(p1, v2), (p2, v1), (p3, v1)}.

The composition of test inputs is usually automated and a combination strategy defines how values are selected [1]. Since testing each value only once is not sufficient to detect interaction faults and exhaustively testing all interactions among all input parameters is almost never feasible in practice, other coverage criteria like t-wise are proposed.

Table I: Pairwise Test Suite

PaymentType   DeliveryType   TotalAmount
Bill          Express        500
Bill          Standard       1
CreditCard    Standard       500
CreditCard    Express        1

For illustration, Table I depicts a test suite for the e-commerce example that satisfies the pairwise coverage criterion. For more information on the different coverage criteria, please refer to Grindal et al. [1]. To satisfy the coverage criterion, all pairwise value combinations of PaymentType x DeliveryType, PaymentType x TotalAmount and DeliveryType x TotalAmount must be included in at least one test input. If the first test input was not executed, pairwise coverage would not be satisfied because the combinations [PaymentType:Bill, DeliveryType:Express], [PaymentType:Bill, TotalAmount:500] and [DeliveryType:Express, TotalAmount:500] would be untested.

In comparison to exhaustive testing, fewer test inputs are required to satisfy the other coverage criteria. But as the example illustrates, problems with only one test input might lead to combinations being not covered, and failures that are triggered by these combinations remain undetected.

If we suppose that the checkout service requires a total amount of at least 25 dollar, then two test inputs of the example (Table I) with [TotalAmount:1] are expected to abort with a message to buy more products. In those cases, the SUT deviates from the normal control-flow and an error-handling procedure is triggered. The value [TotalAmount:1] that is responsible for triggering the error-handling is called an invalid value.
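The coverage argument can be checked mechanically. The following sketch (our helper, not a tool from the paper) enumerates all required pairs of the checkout IPM and verifies the Table I suite against them:

```python
from itertools import combinations

# The checkout-service IPM from Listing 1 and the pairwise suite of Table I.
ipm = {
    "PaymentType": ["CreditCard", "Bill"],
    "DeliveryType": ["Standard", "Express"],
    "TotalAmount": [1, 500],
}
suite = [
    {"PaymentType": "Bill", "DeliveryType": "Express", "TotalAmount": 500},
    {"PaymentType": "Bill", "DeliveryType": "Standard", "TotalAmount": 1},
    {"PaymentType": "CreditCard", "DeliveryType": "Standard", "TotalAmount": 500},
    {"PaymentType": "CreditCard", "DeliveryType": "Express", "TotalAmount": 1},
]

def uncovered_pairs(ipm, suite):
    """All pairwise (parameter, value) combinations not hit by any test input."""
    required = {
        ((p1, v1), (p2, v2))
        for p1, p2 in combinations(sorted(ipm), 2)
        for v1 in ipm[p1] for v2 in ipm[p2]
    }
    covered = {
        ((p1, t[p1]), (p2, t[p2]))
        for t in suite for p1, p2 in combinations(sorted(ipm), 2)
    }
    return required - covered

print(len(uncovered_pairs(ipm, suite)))      # 0: Table I is pairwise-complete
print(len(uncovered_pairs(ipm, suite[1:])))  # 3: dropping the first test input
                                             # leaves three pairs untested
```

With three parameters of two values each, 12 pairs are required; each of the four test inputs covers three distinct pairs, so removing any one of them leaves exactly three pairs uncovered.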
If we also suppose that the checkout service rejects payment by bill for total amounts greater than 300 dollar, then [PaymentType:Bill, TotalAmount:500] would trigger error-handling as well. Even though both values are valid, the combination of them denotes an invalid value combination.

Valid test inputs do not contain any invalid values or invalid value combinations. In contrast, an invalid test input contains at least one invalid value or invalid value combination. If an invalid test input contains exactly one invalid value or one invalid value combination, it is called a strong invalid test input.

Once the SUT evaluates an invalid value or invalid value combination, error-handling is triggered. The normal control-flow is left and all other values and value combinations of the test input remain untested. They are masked by the invalid value or invalid value combination [13]. This phenomenon is called the input masking effect, which we adapt from Yilmaz et al. [20]: "The input masking effect is an effect that prevents a test case from testing all combinations of input values, which the test case is normally expected to test".

To prevent input masking, a strict separation of valid and invalid test inputs is suggested [1], [10]-[13]. Combination strategies are extended to support t-wise generation of invalid test inputs. Values can be marked as invalid to exclude them from valid test inputs and to include them in invalid test inputs. The invalid value is then combined with all (t-1)-wise combinations of valid values. An extension that we proposed also allows to explicitly mark invalid value combinations and to generate t-wise invalid test inputs based on them [13].

C. Fault Characteristics

According to IEEE [18], an error is "the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition." It is the result of a mistake made by a human and is manifested as a fault. In turn, a fault is statically present in the source code and is the identified or hypothesized cause of a failure. A failure is an external behavior of the SUT, i.e. a behavior observable or perceivable by the user, which is incorrect with regards to the specified or expected behavior.

In CT, we assume that the execution path through the SUT is determined by the values and value combinations of the test input. If an executed statement contains a fault that causes an observable failure, and if a certain value or a certain value combination is required for executing the statement, then the value or value combination is called a failure-triggering fault interaction (FTFI). The number of parameters involved in a FTFI is its dimension, denoted as d with 0 <= d <= n. For instance, if the checkout service contains a fault and accepts a total amount of one but only if express is chosen as the delivery type, then [TotalAmount:1, DeliveryType:Express] is a FTFI with a dimension of two.

In general, different types of triggers exist to expose failures [21]. A trigger is a set of conditions that exposes a failure if the conditions are satisfied. We focus on FTFIs, i.e. failures triggered by test input variations, rather than on failures triggered by ordering or timing of stimulation.

In addition, we introduce the following terms for robustness fault characteristics. A failure is a robustness failure if the FTFI contains an invalid value or an invalid value combination. Then, the number of parameters that constitute the invalid value or invalid value combination is denoted as the robustness size. In case of an invalid value, the robustness size is one.

The extension to support t-wise generation of invalid test inputs is based on the assumption that failures are triggered by an interaction of an invalid value (or invalid value combination) and a (t-1)-wise combination of valid values of the other parameters. This is a robustness interaction, and its robustness interaction dimension can be computed by subtracting the robustness size from the FTFI dimension. There is no robustness interaction if the robustness size and the FTFI dimension are equal, i.e. the robustness interaction dimension is zero. For instance, there is no robustness interaction if [TotalAmount:1] or [PaymentType:Bill, TotalAmount:500] trigger a failure. In contrast, the robustness interaction dimension of the aforementioned example [TotalAmount:1, DeliveryType:Express] is one because the invalid value interacts with one valid value.

III. RELATED WORK

If the highest dimension of parameters involved in FTFIs is known before testing, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4]. However, d cannot be determined for a SUT a-priori because the faults are not known before testing. Hence, the motivation of fault characterization in black-box testing is to empirically derive fault characteristics to guide future test activities.

Existing research on the effectiveness of black-box testing derives the distribution and maximum of d among different types of software based on bug reports. Wallace and Kuhn [2] review 15 years of recall data from medical devices, i.e. software written for embedded systems. Kuhn and Reilly [3] analyze bug reports from two large open-source software projects, namely the Apache web server and the Mozilla web browser. Kuhn and Wallace [4] report findings from analyzing 329 bug reports of a large distributed data management system developed at NASA Goddard Space Flight Center. Bell and Vouk [5] analyze the effectiveness of pairwise testing of network-centric software. They derive their fault characteristics from a public database of security flaws and create simulations based on that data. Kuhn and Okum [6] apply combinatorial testing with different strengths to a module of a traffic collision avoidance system which is written in the C programming language. Though, the experiments use manually seeded "realistic" faults rather than a specific bug database. Cotroneo et al. [21] again analyze bug reports from Apache and the MySQL database system, and Ratliff et al. [22] report the FTFI dimensions of 242 bug reports from the MySQL database system.
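The t-wise generation of invalid test inputs described in Section II, where each invalid value is combined with all (t-1)-wise combinations of valid values of the other parameters, can be sketched as follows. This is a simplified illustration under our own naming, not the algorithm from [13]; remaining parameters are filled with an arbitrary valid value:

```python
from itertools import combinations, product

def invalid_test_inputs(valid, invalid, t):
    """Strong invalid test inputs: one invalid value each, combined with all
    (t-1)-wise combinations of valid values of the other parameters."""
    seen, tests = set(), []
    for p, bad_values in invalid.items():
        others = [q for q in valid if q != p]
        for bad in bad_values:
            for chosen in combinations(others, min(t - 1, len(others))):
                for vals in product(*(valid[q] for q in chosen)):
                    test = {q: valid[q][0] for q in others}  # default fill
                    test.update(dict(zip(chosen, vals)))
                    test[p] = bad
                    key = frozenset(test.items())
                    if key not in seen:  # drop duplicate test inputs
                        seen.add(key)
                        tests.append(test)
    return tests

valid = {"PaymentType": ["CreditCard", "Bill"],
         "DeliveryType": ["Standard", "Express"],
         "TotalAmount": [500]}
invalid = {"TotalAmount": [1]}  # below the assumed 25-dollar minimum

suite = invalid_test_inputs(valid, invalid, t=2)
for test in suite:
    print(test)
```

Each generated input is a strong invalid test input, so the single invalid value cannot be masked by another one, and every valid value of the other parameters meets the invalid value in at least one test input.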
As concluded by Kuhn et al. [7], the studies show that most failures in the investigated domains are triggered by single parameter values and parameter value pairs. Progressively fewer failures are triggered by an interaction of three or more parameter values. In addition to the distribution of FTFIs, a maximum interaction of four to six parameter values is identified. No reported failure required an interaction of more than six parameter values to be triggered. Thus, pairwise testing should trigger most failures and 4- to 6-wise testing should trigger all failures of a SUT [8].

Several tools include the concept of invalid values to support the generation of invalid test inputs [10]-[12]. An algorithm that we proposed [13] extends the concept to invalid value combinations. In their case study, Wojciak and Tzoref-Brill [23] report on system-level combinatorial testing that includes testing of negative scenarios. There, t-wise coverage of negative test inputs is required because error-handling depends on a robustness interaction between invalid and valid values. Another case study by Offut and Alluri [24] reports on the first application of CT for financial calculation engines, but robustness is not further discussed.

From an empirical point of view, the effectiveness of negative test scenarios is yet unclear. To the best of our knowledge, it is only Pan et al. [14], [15] who characterize data on faults from robustness testing. The results of testing the robustness of operating system APIs indicate that most robustness failures are caused by single invalid values. Though, there is no more information on failures triggered by invalid value combinations. Also, as Kuhn et al. [4] state, more empirical studies are required to confirm (or reject) the distribution and upper limit of FTFIs for other software types. Therefore, we conducted another case study which is described in the subsequent sections.

IV. CASE STUDY DESIGN

A. Research Method

We follow the guidelines for conducting and reporting case study research in software engineering as suggested by Runeson and Höst [25]. As they state, a case study "investigates a contemporary phenomenon within its real life context, especially when the boundaries between phenomenon and context are not clearly evident". Case study research is typically used for exploratory purposes, e.g. seeking new insights and generating hypotheses for new research.

The guidelines suggest to conduct a case study in five steps. First, the objectives are defined and the case study is planned. As a second step, the data collection is prepared before the data is collected in a third step. Afterwards, the collected data is analyzed and finally, the results of the analysis are reported.

B. Research Objective

The overall objective of this case study is to gather information on the effectiveness of combinatorial testing with invalid test inputs, and to compare the obtained results with the ones of other published case studies. For example, the work by Pan et al. [14], [15] indicates that most robustness failures in operating system APIs are triggered by single values rather than value combinations, i.e. a FTFI dimension and robustness size of one. Hence, our aim is to either confirm or reject this indication for enterprise applications. This leads to the following two concrete research objectives:

RO1: Identify the different FTFI dimensions of robustness failures that can be observed in our case study.

The generation of t-wise invalid test inputs is based on the assumption that failures are triggered by a robustness interaction between the invalid value (or invalid value combination) and a (t-1)-wise combination of the valid values of the other parameters. If the highest robustness interaction dimension is known before testing, then testing all t-wise invalid test inputs of that dimension should be as effective as exhaustive robustness testing.

RO2: Identify different robustness sizes and derive the robustness interaction dimensions that can be observed in our case study.
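The relation underlying RO2 can be stated as a one-line helper (ours, for illustration): the robustness interaction dimension is the FTFI dimension minus the robustness size.

```python
# Sketch of the robustness measures behind RO1/RO2 (our helper, not from the
# paper): the FTFI dimension counts all parameters of the failure-triggering
# interaction, the robustness size counts those forming the invalid part, and
# their difference is the robustness interaction dimension.

def robustness_interaction_dimension(ftfi_dimension: int, robustness_size: int) -> int:
    if not 0 < robustness_size <= ftfi_dimension:
        raise ValueError("robustness size must lie between 1 and the FTFI dimension")
    return ftfi_dimension - robustness_size

# [TotalAmount:1] alone: FTFI dimension 1, robustness size 1 -> no interaction.
print(robustness_interaction_dimension(1, 1))  # 0
# [PaymentType:Bill, TotalAmount:500]: an invalid *combination* of two values.
print(robustness_interaction_dimension(2, 2))  # 0
# [TotalAmount:1, DeliveryType:Express]: invalid value meets one valid value.
print(robustness_interaction_dimension(2, 1))  # 1
```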
C. Case and Unit of Analysis

The case is a software development project from an IT service provider for an insurance company. A new system is developed to manage the lifecycle of life insurances. It is based on an off-the-shelf framework which is customized and extended to meet the company's requirements. In total, the new system consists of 2.5 MLOC and an estimated workload of 5000 person days. The core is an inventory sub-system with a central database to store information on customers' life insurance contracts. In addition, complex financial calculation engines and business processes like capturing and creating new customer insurances are implemented. The business processes also integrate with a variety of different already existing systems which are, for instance, responsible for managing information about the contract partners and about claims and damages, and for supporting insurance agents.

Since life insurance contracts have decade-long lifespans and rely on complex financial models, the correctness of the system is business critical. Mistakes can have severe effects which can even amplify over the long-lasting lifespans and cause enormous damage to the company. Therefore, thorough testing is important.

Even though the business processes are managed by the new system, they rely on other systems of which each again relies on other systems. This makes it hard to test the system or its parts in isolation. It is also difficult to control the state of the systems and to observe the complete behavior, which makes testing even more complicated.

Therefore, most testing is conducted on a system level within an integrated test environment in which all required systems are deployed. The test design is often based on experience and error-guessing. Tests are executed mostly manually because of the low controllability and observability.

D. Data Collection Procedure

To achieve the research objectives, the case study relies on archival data from the aforementioned software development project. A project-wide issue management system contains all bug reports from the project start in 2015 to the productive deployment at the beginning of 2018. In general, a bug report is a specifically categorized issue which coexists with other project management- and development-related issues.

For our case study, we analyzed the issue's title, its category, its initial description, additional information in the comment section and its status. Further on, some bug reports are also connected to a central source code management system. If a bug report does not contain sufficient information, the corresponding source code modifications can be analyzed as well.

The issues are filtered to restrict the analysis to only reasonable bug reports. Therefore, issues created automatically by static analysis tools are excluded. Further on, only issues categorized as bug reports whose status is set to complete are considered, because we expect only them to contain a correct description on how to systematically reproduce the failure.

E. Data Analysis Procedure

Once the bug reports are exported from the issue management system, each bug report is analyzed one at a time. First, it is checked if the bug report describes a failure in the sense that an incorrect behaviour is observable by the user. Otherwise, the bug report is rejected.

Afterwards, the trigger type of the reported failure is determined. The bug report is not further analyzed if no systematically reproducible trigger is found. It is also rejected if the failure is not triggered by a test input variation but rather by unlikely ordering or timing. If a specific value or value combination is identified to trigger the failure, the dimension of the FTFI is determined in the next step.

Then, the bug report is classified as either positive or negative depending on whether any invalid values or invalid value combinations are contained. If it is classified as negative, i.e. if it is a robustness failure, the robustness size of the invalid value or invalid value combination is determined as well. A robustness size which is lower than the FTFI dimension indicates a robustness interaction between the invalid value (combination) and valid value combinations of the other parameters. If possible, the robustness interaction dimension is also extracted from the bug report.
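The analysis procedure above can be sketched as a small decision pipeline (field names and labels are ours, not the project's issue schema):

```python
# Sketch of the per-bug-report decision procedure (hypothetical field names).

def classify(report: dict) -> dict:
    if not report.get("observable_failure"):
        return {"verdict": "rejected"}            # no user-observable failure
    if report.get("trigger") != "test-input":     # ordering/timing triggers etc.
        return {"verdict": "excluded"}
    result = {
        "verdict": "analyzed",
        "ftfi_dimension": report["ftfi_dimension"],
        "scenario": "negative" if report.get("robustness_size") else "positive",
    }
    if result["scenario"] == "negative":
        # robustness interaction dimension = FTFI dimension - robustness size
        result["robustness_interaction_dimension"] = (
            report["ftfi_dimension"] - report["robustness_size"])
    return result

r = classify({"observable_failure": True, "trigger": "test-input",
              "ftfi_dimension": 2, "robustness_size": 1})
print(r["scenario"], r["robustness_interaction_dimension"])  # negative 1
```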
V. RESULTS AND DISCUSSION

A. Analyzed Data

In total, 683 bug reports are analyzed. All reported bugs are revealed and fixed during the development phase of the system. Even though filters are applied to export the bug reports, 249 bug reports are classified as unrelated because the issue management system is also used as a communication and task management tool. For instance, problems with configurations of test environments, refactorings or build problems are categorized as bug reports as well.

The remaining 434 bug reports describe failures; they are classified as follows. Eight bug reports do not provide enough information for further analysis and classification. 38 reported bugs require specific timing and ordering of sequences to be triggered. For instance, one sequence to trigger a failure is to search for a customer, open its details, edit the birthday, press cancel and edit the birthday again. Three reported bugs are related to robustness testing. They are triggered by other systems that time out and do not respond to requests. All these bug reports are excluded from further analysis because CT is about varying test inputs rather than varying sequences and timing.

The remaining 388 bug reports describe failures triggered by some test input. A subset of 176 bug reports describes integration failures with other systems where values are not correctly mapped from one data structure to another. They can be triggered by any test input. There are so many reported integration issues (45% out of 434 bug reports) because the system consists of several independently developed components which are early and often integrated with the other components and other systems using one of the test environments.

Finally, 212 bug reports are considered to be suitable for CT, which is 49% of all 434 bug reports that describe failures and 55% of all 388 bug reports that are triggered by some test input. To reproduce one of these reported bugs, the test input requires at least one specific value.

B. Observed FTFI Dimensions

The observed FTFI dimensions for the 212 bug reports are depicted in Table II. Most failures are triggered by single parameter values and parameter value pairs, and progressively fewer failures are triggered by 3- and 4-wise interactions. In our case, no reported bug requires an interaction of more than 4 parameters in order to trigger the failure.

Table II: Observed FTFI Dimensions

d    All   Positive   Negative
1    162      121         41
2     40       31          9
3      6        5          1
4      4        4          -
5      -        -          -
6      -        -          -

Table III presents the cumulative percentage of the FTFI dimensions. The last three columns refer to our case study and show that 76% of all reported failures require 1-wise (each choice) coverage to be reliably triggered. It adds up to 96% when testing with pairwise coverage, and 100% are covered when all 4-wise parameter value combinations are used for testing.

Table III: Cumulative Percentage of FTFI Dimensions

          Previous Studies                                   Our Study
d    [2]   [3]a   [3]b   [4]    [5]   [21]   [22]   Avg.    All   Pos.   Neg.
1     66    28     41     67     18     9     49    39.7     76    75     80
2     97    76     70     93     62    47     86    75.9     96    94     98
3     99    95     89     98     87    75     97    91.4     98    98    100
4    100    97     96    100     97    97     99    98.0    100   100
5           99     96           100   100    100    99.0
6          100    100                              100.0

To compare our results, the first columns of the table show the results of previous case studies, briefly introduced in the related work section. The numbers and also the average percentage values are taken from Kuhn et al. [7]. The distribution of FTFIs obtained in our case study is not in contradiction to the other cases. However, the distribution is mostly similar to cases [2] and [4]. While there are no obvious similarities with embedded systems for medical devices [2], the large data management system [4] is probably quite similar to our case in terms of requirements and used technologies. Similar to our case, the bug reports are also from a development project, whereas the other studies analyze fielded products [7]. For all three cases, most failures are triggered by single parameter values, and almost all failures are triggered by the combination of single parameter values and pairwise parameter value combinations. All failures should be triggered by 4-wise parameter value combinations.

So far, only the dimension of failure-triggering fault interactions is considered, but differences between positive and negative scenarios are not discussed.

All in all, 51 robustness failures are identified, which are classified as follows. 22 failures are caused by incorrect error-detection of abnormal situations because conditions to detect abnormal situations are either wrong or missing. Consequently, these abnormal situations are not discovered. For instance, bank transfer is accepted as a payment option even though incorrect or no bank account information is provided.

In 19 cases, reported failures are caused by incorrect error-signaling. Errors are signaled if an abnormal situation is detected, but the error should be handled somewhere else. For instance, a misspelled first name is detected by a user registration service but the error message complains about a misspelled last name.

For three reported failures, the abnormal situation is correctly detected and the error is correctly signaled. However, the system performs incorrect error-recovery because the instructions to recover from the abnormal situation contain faults. For instance, the user is asked to correct wrong input, e.g. a misspelled first name. After the input is corrected, the system does not recover and the corrected input cannot be processed.

In seven cases, failures are triggered by the system's runtime environment. For instance, a NullPointerException is signaled when the runtime environment detects unexpected and illegal access of NULL values. Since developers did not expect NULL values, no respective error-handlers are implemented and the processes terminate. These failures denote incorrect flows from error-signaling to error-recovery.

Table II depicts the observed FTFI dimensions and their distribution divided into positive and negative test scenarios. As can be seen, the maximum dimension of robustness interaction is three. Compared to positive test scenarios, the negative scenarios discover fewer failures and the FTFI dimensions are also lower. For single parameter values and parameter value pairs, the ratio of valid vs. invalid test inputs is 3:1, and no invalid test inputs are identified for higher dimensions.

While these numbers indicate that most failures are triggered by valid test inputs, we emphasize that the test design is based on experience and error-guessing; robustness testing was not in the focus. Hence, the ratio can also result from a general bias towards testing of positive scenarios which is identified in research [26]-[28].

Nevertheless, these findings underpin the results of Pan et al. [14], [15] who observe that most robustness failures in operating system APIs are triggered by single invalid values. In their study, 82% of robustness failures are triggered by single invalid values. We observe the same ratio in our case.

The bug reports also demonstrate the importance of strong invalid test inputs. For instance, the component that manages contracting parties ensures data quality by checking that, e.g., the title of a person matches the gender of the first name and that the first name and family name are correct and not confused with each other. However, when using an unknown invalid title, the system responds with a wrong error message saying that the family name was wrong. If an invalid family name was combined with the unknown title, the failure would not have been discovered.

To address the second research objective, the 10 invalid test inputs with a FTFI dimension greater than one are further analyzed. As a result, two reported bugs that describe failures with robustness interactions are discovered. Even though a combination of two and three specific parameter values is required to trigger the robustness failures, the robustness size is only one and two, respectively.

Furthermore, two reported bugs require an interaction of invalid value (combinations) and a valid value of another parameter. One reported bug is related to the communication between two systems. The response of the second system contains one parameter value with error information to indicate whether the requested operation succeeded or failed. Another parameter provides details about the internal processing of the request, and a certain value indicates an internal resolution of the error. In that case, the calling system is expected to handle the error in a different way.

Another reported bug belongs to the storage of details on contracting parties where one contracting party must be responsible for paying the insurance premiums. This responsibility is stored as a role called contributor. If direct debit is chosen as the payment method but an invalid bank account, i.e. an invalid IBAN number, is provided, the resulting error message remains even after the invalid IBAN is replaced by a valid IBAN. While the combination [payment-method:direct-debit, account-number:invalid] is required as an invalid combination, the bug report states that this phenomenon could only be observed for [role:contributor].

VI. THREATS TO VALIDITY

The biggest threat to validity is that case studies are difficult to generalize from [25], especially because only one particular type of software of one company is analyzed. The archival data of the case study is only a snapshot and the ground truth, i.e. the set of all failures that can be triggered, is unknown. Hence, the data set can be biased, for instance, towards positive
test inputs with exactly one invalid scenarios which has been observed in research [26]–[28]. value or exactly one invalid value combination. For instance, Since the bug reports result from tests based on experience Copyright © 2018 for this paper by its authors. 27 6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018) an error-guessing, it may apply here as well. Challenge 4 - Support Alternative Coverage Criteria: To The data can also be biased towards certain fault character- reveal incorrect handling and incorrect recovery, the SUT must istics. Relevant and reasonable bug reports may be excluded be stimulated by the failure-triggering invalid test input. The by our filtering because the bug reports are incorrectly cat- majority of analyzed robustness failures does not indicate any egorized. Maybe not all triggered failures are reported. For robustness interaction between valid values and invalid values instance, a developer who finds a fault might just fix it without or invalid value combinations. Then, the failure is triggered creating a bug report. by the invalid value or invalid value combination. To satisfy the coverage criterion, it is sufficient to have a separate test VII. C HALLENGES FOR C OMBINATORIAL T ESTING input for each invalid value or invalid value combination. One challenge in combinatorial testing is to find an effective However, robustness failures where invalid values or in- coverage criteria. Based on the aforementioned empirical valid value combinations interact with valid values and studies, a recommendation for positive scenarios is to use valid value combinations could also be observed. The fail- pairwise coverage to trigger most failures and 4- to 6-wise ure is triggered by a t-wise interation of one or more coverage that should trigger all failures. For the application valid values and the invalid value or invalid value combi- in practice, one major challenge is to generate test suites of nation. 
For instance, suppose the valid role=contributor minimal or small size with 4- or 6-wise coverage. is responsible for selecting the strategy which is used To test negative scenarios, different challenges in combina- to process the bank data of a customer. If the invalid torial testing can be observed. combination of [payment-method:direct-debit] and In our case study, four classes of incorrect error-handling [account-number:invalid] is handled incorrectly by the are identified. First, incorrect error-detection is caused by selected strategy, then the interaction of all three values is conditions which are either too strict or too loose. Second, required to trigger the failure. incorrect error-signaling results in a wrong type of error to The observed failures of our case study are in line with be signaled. Third, incorrect recovery of a signaled error is a case study by Wojciak and Tzorref-Brill [23] who faced caused by a fault in the appropriate recovery instructions. error-handling that would be different depending on firmware Fourth, incorrect flow from error-signaling to error-recovery is in control and system configurations. Different configuration caused by a signaled error for which no appropriate recovery options can also be modelled as input parameters, a robustness instructions are implemented. interaction of configuration options with invalid values and Challenge 1 - Avoid the Input Masking Effect: Incorrect valid value combinations is also reasonable. error-detection that is caused by a too strict condition can be Since only low dimensions of robustness interaction are revealed by positive test input that mistakenly triggers error- observed, we believe it is unlikely that the generation of 4- recovery. But, revealing a condition that is too loose requires to 6-wise test suites is a challenge here as well. Instead, invalid test input that mistakenly does not trigger error- alternative coverage criteria that, for instance, allow a variable recovery. 
To ensure that too strict and too loose conditions strength interaction with some other input parameters can can be detected, the generation of valid and invalid test inputs become a relevant to reduce the number of test inputs. must be separated and both sets of test input must satisfy separate coverage criteria. VIII. C ONCLUSION Challenge 2 - Generate Strong Invalid Test Inputs: Another The effectiveness of negative test scenarios is unclear from challenge is the generation of strong invalid test inputs such an empirical point of view. We conducted a case study to get that one invalid value or invalid value combination cannot information on failures triggered by invalid test inputs. The mask another. Incorrect error-detection and incorrect error- motivation for our and others studies was that if all failures recovery may remain undetected if the signal that would result are triggered by an interaction of d or fewer parameter values, from an incorrect condition is masked by the computation of then testing all d-wise parameter value combinations should another invalid value or invalid value combination. be as effective as exhaustive testing [4]. Challenge 3 - Consider Invalid Value Combinations: Since In our case study we analyzed bug reports which originate the error-detection conditions may depend on an arbitrary from a development project that manages life insurances. In number of input values, it is not sufficient to only consider in- total, 683 bug reports are analyzed. 434 bug reports describe valid values as most combinatorial testing tools do. As our case actual failures and 212 of them are failures triggered by a study and Pan et al. [14], [15] show, 80% of the robustness 2-wise or higher interaction of parameter values. failures are triggered by invalid values, i.e. a robustness size In general, the distribution of FTFI dimensions conforms of one, but also 20% of the robustness failures require invalid to the pattern of previous empirical studies. 
But in contrast value combinations to be triggered. Error-detection with more to positive test scenarios, fewer robustness failures with lower complex conditions must be tested as well. Invalid value FTFI dimensions are identified. Overall, the robustness failures combinations should be excluded when generating positive test are grouped in four classes: incorrect error-detection, incorrect inputs but included when generating invalid test inputs [13]. error-signaling, incorrect recovery from a signaled error and Therefore, appropriate modeling facilities and algorithms that incorrect flow from error-signaling to error-recovery. Most ro- consider invalid value combinations are another challenge. bustness failures (80%) are triggered by single invalid values. Copyright © 2018 for this paper by its authors. 28 6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018) The remaining robustness failures require an interaction of two [10] D. M. Cohen, S. R. Dalal, M. L. Fredman, and G. C. Patton, “The aetg and three input parameter values. Two reported bugs require system: An approach to testing based on combinatorial design,” IEEE Transactions on Software Engineering, vol. 23, no. 7, 1997. an interaction of valid values with an invalid value or invalid [11] J. Czerwonka, “Pairwise testing in real world,” in 24th Pacific Northwest value combinations to trigger the robustness failure. Software Quality Conference, 2006. Based on the findings of this case study, we derive chal- [12] L. Yu, Y. Lei, R. N. Kacker, and D. R. Kuhn, “Acts: A combinatorial test generation tool,” in Software Testing, Verification and Validation lenges for combinatorial robustness testing. To ensure that (ICST), 2013 IEEE Sixth International Conference on. IEEE, 2013, failures do not remain hidden, possible masking should be pp. 370–375. reduced. Valid and invalid test inputs should be separated and [13] K. Fögen and H. 
Lichter, “Combinatorial testing with constraints for negative test cases,” in 2018 IEEE Eleventh International Conference invalid test inputs should be strong, i.e. should only contain on Software Testing, Verification and Validation Workshops (ICSTW), one invalid value or invalid value combination. 7th International Workshop on Combinatorial Testing (IWCT), 2018. Further on, it is not sufficient to only consider invalid values [14] J. Pan, “The dimensionality of failures - a fault model for characterizing software robustness,” Proc. FTCS ’99, June, 1999. as most combinatorial testing tools do. Invalid value combi- [15] J. Pan, P. Koopman, and D. Siewiorek, “A dimensionality model nations should be excluded when generating valid test inputs approach to testing and improving software robustness,” in AUTOTEST- but considered for invalid test inputs. Therefore, appropriate CON’99. IEEE Systems Readiness Technology Conference, 1999. IEEE. IEEE, 1999, pp. 493–501. modeling facilities and algorithms are required. [16] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The Since only low robustness interactions are observed, the oracle problem in software testing: A survey,” IEEE Transactions on generation of test inputs with 4- to 6-wise coverage is not that Software Engineering, vol. 41, no. 5, 2015. [17] G. Meszaros, xUnit Test Patterns: Refactoring Test Code. Upper Saddle important for negative scenarios. But, the support of variable River, NJ, USA: Prentice Hall PTR, 2007. strength generation for invalid inputs is another challenge. [18] IEEE, “Ieee standard glossary of software engineering terminology,” Most robustness failures do not involve any robustness inter- IEEE Std, vol. 610.12-1990, 1990. [19] N. Li and J. Offutt, “Test oracle strategies for model-based testing,” action. But, there are situations where robustness interactions IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 
372– can be observed since different input values, configuration 395, April 2017. options or internal states are modelled as input parameter [20] C. Yilmaz, E. Dumlu, M. B. Cohen, and A. Porter, “Reducing masking effects in combinatorial interaction testing: A feedback driven adaptive values. Depending on expected costs of failure, t-wise testing approach,” IEEE Transactions on Software Engineering, vol. 40, no. 1, of invalid test inputs is an option. 2014. In the future, we will work on facilities that support the [21] D. Cotroneo, R. Pietrantuono, S. Russo, and K. Trivedi, “How do bugs surface? a comprehensive study on the characteristics of software bugs modelling of invalid value combinations and we will integrate manifestation,” Journal of Systems and Software, vol. 113, pp. 27 – 43, variable strength in a combinatorial algorithm for invalid input 2016. generation. To reduce the number of invalid test inputs, we will [22] Z. B. Ratliff, D. R. Kuhn, R. N. Kacker, Y. Lei, and K. S. Trivedi, “The relationship between software bug type and number of factors conduct experiments to investigate the efficiency of different involved in failures,” in 2016 IEEE International Symposium on Software coverage criteria. Reliability Engineering Workshops (ISSREW), Oct 2016, pp. 119–124. [23] P. Wojciak and R. Tzoref-Brill, “System level combinatorial testing R EFERENCES in practice - The concurrent maintenance case study,” Proceedings - [1] M. Grindal, J. Offutt, and S. F. Andler, “Combination testing strategies: IEEE 7th International Conference on Software Testing, Verification and A survey,” Software Testing, Verification and Reliability, vol. 15, no. 3, Validation, ICST 2014, 2014. 2005. [24] J. Offutt and C. Alluri, “An industrial study of applying input space [2] D. R. WALLACE and D. R. 
KUHN, “Failure modes in medical device partitioning to test financial calculation engines,” Empirical Software software: An analysis of 15 years of recall data,” International Journal Engineering, vol. 19, no. 3, pp. 558–581, Jun 2014. of Reliability, Quality and Safety Engineering, vol. 08, no. 04, pp. 351– [25] P. Runeson and M. Höst, “Guidelines for conducting and reporting case 371, 2001. study research in software engineering,” Empirical Software Engineer- [3] D. R. Kuhn and M. J. Reilly, “An investigation of the applicability ing, vol. 14, no. 2, p. 131, Dec 2008. of design of experiments to software testing,” in 27th Annual NASA [26] L. M. Leventhal, B. M. Teasley, D. S. Rohlman, and K. Instone, “Positive Goddard/IEEE Software Engineering Workshop, 2002. Proceedings., test bias in software testing among professionals: A review,” in Human- Dec 2002, pp. 91–95. Computer Interaction, L. J. Bass, J. Gornostaev, and C. Unger, Eds. [4] D. R. Kuhn, D. R. Wallace, and A. M. Gallo, “Software fault interactions Berlin, Heidelberg: Springer Berlin Heidelberg, 1993, pp. 210–218. and implications for software testing,” IEEE Transactions on Software [27] B. E. Teasley, L. M. Leventhal, C. R. Mynatt, and D. S. Rohlman, Engineering, vol. 30, no. 6, pp. 418–421, June 2004. “Why software testing is sometimes ineffective: Two applied studies of [5] K. Z. Bell and M. A. Vouk, “On effectiveness of pairwise methodology positive test strategy.” Journal of Applied Psychology, vol. 79, no. 1, p. for testing network-centric software,” in Information and Communica- 142, 1994. tions Technology, 2005. Enabling Technologies for the New Knowledge [28] A. Causevic, R. Shukla, S. Punnekkat, and D. Sundmark, “Effects of Society: ITI 3rd International Conference on. IEEE, 2005, pp. 221–235. negative testing on tdd: An industrial experiment,” in Agile Processes [6] D. R. Kuhn and V. Okum, “Pseudo-exhaustive testing for software,” in in Software Engineering and Extreme Programming, H. 
Baumeister and 2006 30th Annual IEEE/NASA Software Engineering Workshop, April B. Weber, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, 2006, pp. 153–158. pp. 91–105. [7] D. R. Kuhn, R. N. Kacker, and Y. Lei, “Estimating t-way fault profile evolution during testing,” in Computer Software and Applications Con- ference (COMPSAC), 2016 IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 596–597. [8] R. Tzoref-Brill, “Advances in combinatorial testing,” ser. Advances in Computers. Elsevier, 2018. [9] M. M. Hassan, W. Afzal, M. Blom, B. Lindstrom, S. F. Andler, and S. Eldh, “Testability and software robustness: A systematic literature review,” in 2015 41st Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2015. Copyright © 2018 for this paper by its authors. 29
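As a minimal illustration of the strong invalid test inputs discussed under Challenge 2, the following sketch generates one negative test input per invalid value, keeping every other parameter at a valid value so that one invalid value cannot mask the error-handling of another. This is not the authors' tooling; the parameter names and values are hypothetical, loosely borrowed from the insurance example (the IBAN-like string and the values "insured-person" and "cash" are invented for illustration).

```python
def strong_invalid_inputs(ipm, invalid_values):
    """One negative test input per invalid value: exactly one parameter
    takes an invalid value, all other parameters keep valid values.
    This keeps invalid values from masking each other's error-handling."""
    # Arbitrary valid assignment used as the baseline for every negative test.
    base = {param: values[0] for param, values in ipm.items()}
    tests = []
    for param, bad_values in invalid_values.items():
        for bad in bad_values:
            test = dict(base)
            test[param] = bad  # exactly one invalid value per test input
            tests.append(test)
    return tests

# Hypothetical input parameter model (valid values per parameter).
ipm = {
    "payment-method": ["direct-debit", "bank-transfer"],
    "account-number": ["DE02120300000000202051"],
    "role": ["contributor", "insured-person"],
}
# Hypothetical invalid values per parameter.
invalid_values = {
    "account-number": ["not-an-iban"],
    "payment-method": ["cash"],
}

for t in strong_invalid_inputs(ipm, invalid_values):
    print(t)
```

Covering invalid value combinations (Challenge 3) or t-wise robustness interactions with valid values (Challenge 4) would require extending this sketch with explicitly modelled invalid combinations and a covering-array generator such as the ones surveyed in [1].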