6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018)

A Case Study on Robustness Fault Characteristics for Combinatorial Testing - Results and Challenges

Konrad Fögen, Horst Lichter
Research Group Software Construction, RWTH Aachen University, Aachen, NRW, Germany
foegen@swc.rwth-aachen.de, lichter@swc.rwth-aachen.de

Abstract—Combinatorial testing is a well-known black-box testing approach. Empirical studies suggest the effectiveness of combinatorial coverage criteria. So far, the research focuses on positive test scenarios. But robustness is an important characteristic of software systems, and testing negative scenarios is crucial. Combinatorial strategies are extended to generate invalid test inputs, but the effectiveness of negative test scenarios is yet unclear. Therefore, we conduct a case study and analyze 434 failures reported as bugs of a financial enterprise application. As a result, 51 robustness failures are identified, including failures triggered by invalid value combinations and failures triggered by interactions of valid and invalid values. Based on the findings, four challenges for combinatorial robustness testing are derived.

Keywords—Software Testing, Combinatorial Testing, Robustness Testing, Test Design

I. INTRODUCTION

Combinatorial testing (CT) is a black-box approach to reveal conformance faults between the system under test (SUT) and its specification. An input parameter model (IPM) with input parameters and interesting values is derived from the specification. Test inputs are generated where each input parameter has a value assigned. The generation is usually automated and a combination strategy defines how values are selected [1].

CT can help detecting interaction failures, e.g. failures triggered by the interaction of two or more specific values. For instance, a bug report analyzed by Wallace and Kuhn [2] describes that "the ventilator could fail when the altitude adjustment feature was set on 0 meters and the total flow volume was set at a delivery rate of less than 2.2 liter per minute". The failure is triggered by the interaction of altitude=0 and delivery-rate<2.2. This is called a failure-triggering fault interaction (FTFI) and its dimension is d = 2 because the interaction of two input parameter values is required.

Testing each value only once is not sufficient to detect interaction faults, and exhaustively testing all interactions among all input parameters is almost never feasible in practice. Therefore, other combinatorial coverage criteria like t-wise, where 1 <= t < n denotes the testing strength, are proposed [1].

The effectiveness of combinatorial coverage criteria is also researched in empirical studies [2]-[7]. Collected bug reports are analyzed and FTFI dimensions are determined for different types of software [4]. If all failures of a SUT are triggered by an interaction of d or fewer parameter values, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4]. No analyzed failure required an interaction of more than six parameter values to be triggered [7], [8]. The results indicate that 2-wise (pairwise) testing should trigger most failures and 4- to 6-wise testing should trigger all failures of a SUT.

However, so far research focuses on positive test scenarios, i.e. test inputs with valid values to test the implemented operations based on their specification. Since robustness is an important characteristic of software systems [9], testing of negative test scenarios is crucial. Invalid test inputs contain invalid values, e.g. a string value when a numerical value is expected, or invalid combinations of otherwise valid values, e.g. a begin date which is after the end date. They are used to check proper error-handling to avoid abnormal behavior and system crashes. Error-handling is usually separated from normal program execution. It is triggered by an invalid value or an invalid value combination, and all other values of the test input remain untested. Therefore, a strict separation of valid and invalid test inputs is suggested, and combination strategies are extended to support the generation of invalid test inputs [1], [10]-[13].

But the effectiveness of negative test scenarios is unclear as it is not yet empirically researched. To the best of our knowledge, it is only Pan et al. [14], [15] who characterize data of faults from robustness testing. Their results, obtained from testing the robustness of operating system APIs, indicate that most robustness failures are caused by single invalid values. Though, there is no more information on failures caused by invalid value combinations. Because only one type of software is analyzed, more empirical studies are required to confirm (or reject) the distribution and upper limit of FTFIs for other software types (Kuhn and Wallace [4]).

To gather more information on failures triggered by invalid value combinations, we conducted a case study to analyze bug reports of a newly developed distributed enterprise application for financial services. In total, 683 bug reports are examined and 434 of them describe failures which are further analyzed.

The paper is structured as follows. Sections II and III summarize foundations and related work. In Section IV, the design of the case study is explained. The results are discussed in Section V and challenges for combinatorial testing are discussed in Section VII. Afterwards, potential threats to validity are discussed and we conclude with a summary of our work.

Copyright © 2018 for this paper by its authors.
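To make the FTFI notion concrete, the ventilator report can be mimicked with a toy fault seeded by us (an illustration, not the device's actual logic): the failure surfaces only when two specific values interact, so each-choice (1-wise) testing can miss it, while testing all value combinations of the two parameters necessarily hits the faulty one.

```python
# Toy sketch of a 2-wise interaction fault, modeled after the ventilator
# report by Wallace and Kuhn [2]: the failure is triggered only when
# altitude == 0 AND delivery_rate < 2.2, i.e. the FTFI dimension is d = 2.

def ventilator_ok(altitude: int, delivery_rate: float) -> bool:
    """Toy SUT: fails only for the specific 2-wise interaction."""
    if altitude == 0 and delivery_rate < 2.2:
        return False  # seeded interaction fault
    return True

# Each-choice (1-wise) testing uses every value at least once,
# but these two test inputs never exercise the failing combination.
each_choice = [(0, 5.0), (3000, 1.0)]
# Covering all four combinations of altitude x delivery_rate (pairwise
# for these two parameters) necessarily includes the faulty one.
pairwise = [(0, 5.0), (0, 1.0), (3000, 5.0), (3000, 1.0)]

each_choice_detects = any(not ventilator_ok(a, r) for a, r in each_choice)
pairwise_detects = any(not ventilator_ok(a, r) for a, r in pairwise)
print(each_choice_detects, pairwise_detects)  # False True
```

This is exactly the argument behind t-wise coverage criteria: a suite that covers all d-wise combinations is guaranteed to contain every FTFI of dimension d or less.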
II. BACKGROUND

A. Robustness Testing

Testing is the activity of stimulating a system under test (SUT) and observing its response [16]. System testing (also called functional testing) is concerned with the behavior of the entire system and usually corresponds to business processes, use cases or user stories [17]. Both the stimulus and the response consist of values. They are called test input and test output, respectively. In this context, input comprises anything explicable that is used to change the observable behaviour of the SUT. Output comprises anything explicable that can be observed after test execution.

A test case covers a certain scenario to check whether the SUT satisfies a particular requirement [18]. It consists of a test input and a test oracle [19]. The test input is necessary to induce the desired behavior. The test oracle provides the expected results which can be observed after test execution if and only if the SUT behaves as intended by its specification. Finally, the expected result and the actual result are compared to determine whether the test passes or fails.

Since robustness is an important software quality [9], testing should not only cover positive but also negative scenarios to evaluate a SUT. Robustness is defined as "the degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions" [18]. Positive scenarios focus on valid intended operations of the SUT using valid test inputs that are within the specified boundaries. Negative scenarios focus on the error-handling using invalid test inputs that are outside of the specified boundaries, for instance, input that is malformed, e.g. a string input when numerical input is expected, or input that violates business rules, e.g. a begin date which is after the end date.

B. Combinatorial Testing

Combinatorial testing (CT) is a black-box approach to reveal interaction failures, i.e. failures triggered by the interaction of two or more specific values, because the SUT is tested with varying test inputs. A generic test script describes a sequence of steps to exercise the SUT with placeholders (variables) that represent variation points [17]. The variation points can be used to vary different inputs to the system, configuration variables or internal system states [8]. With CT, varying test inputs are created to instantiate the generic test script.

An input parameter model (IPM) is created for which input parameters and interesting values are derived from the specification. The IPM is represented as a set of n input parameters IPM = {p1, ..., pn} and each input parameter pi is represented as a non-empty set of values Vi = {v1, ..., vmi}. Test inputs are composed from the IPM such that every test input contains a value for each input parameter. Formally, a test input is a set of parameter-value pairs for all n distinct parameters, and a parameter-value pair (pi, vj) denotes a selection of value vj ∈ Vi for parameter pi.

p1: PaymentType     V1 = {CreditCard, Bill}
p2: DeliveryType    V2 = {Standard, Express}
p3: TotalAmount     V3 = {1, 500}

Listing 1: Exemplary IPM for a Checkout Service

Listing 1 depicts an exemplary IPM to test the checkout service of an e-commerce system with three input parameters and two values for each input parameter. One possible test input for this IPM is [PaymentType:Bill, DeliveryType:Standard, TotalAmount:1]. Formally, a test input τ = {(pi1, vj1), ..., (pin, vjn)} is denoted as a set of pairs. In this paper, we use the aforementioned notation with brackets, which is equal to τ = {(p1, v2), (p2, v1), (p3, v1)}.

The composition of test inputs is usually automated and a combination strategy defines how values are selected [1]. Since testing each value only once is not sufficient to detect interaction faults and exhaustively testing all interactions among all input parameters is almost never feasible in practice, other coverage criteria like t-wise are proposed.

Table I: Pairwise Test Suite

PaymentType   DeliveryType   TotalAmount
Bill          Express        500
Bill          Standard       1
CreditCard    Standard       500
CreditCard    Express        1

For illustration, Table I depicts a test suite for the e-commerce example that satisfies the pairwise coverage criterion. For more information on the different coverage criteria, please refer to Grindal et al. [1]. To satisfy the coverage criterion, all pairwise value combinations of PaymentType x DeliveryType, PaymentType x TotalAmount and DeliveryType x TotalAmount must be included in at least one test input. If the first test input was not executed, pairwise coverage would not be satisfied because the combinations [PaymentType:Bill, DeliveryType:Express], [PaymentType:Bill, TotalAmount:500] and [DeliveryType:Express, TotalAmount:500] would be untested.

In comparison to exhaustive testing, fewer test inputs are required to satisfy the other coverage criteria. But as the example illustrates, problems with only one test input might lead to combinations being not covered, and failures that are triggered by these combinations remain undetected.

If we suppose that the checkout service requires a total amount of at least 25 dollar, then two test inputs of the example (Table I) with [TotalAmount:1] are expected to abort with a message to buy more products. In those cases, the SUT deviates from the normal control-flow and an error-handling procedure is triggered. The value [TotalAmount:1] that is responsible for triggering the error-handling is called an invalid value.
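The coverage argument can be checked mechanically. The following sketch (our helper, not a tool from the paper) enumerates all required pairs of the checkout IPM and verifies the Table I suite against them:

```python
from itertools import combinations

# The checkout-service IPM from Listing 1 and the pairwise suite of Table I.
ipm = {
    "PaymentType": ["CreditCard", "Bill"],
    "DeliveryType": ["Standard", "Express"],
    "TotalAmount": [1, 500],
}
suite = [
    {"PaymentType": "Bill", "DeliveryType": "Express", "TotalAmount": 500},
    {"PaymentType": "Bill", "DeliveryType": "Standard", "TotalAmount": 1},
    {"PaymentType": "CreditCard", "DeliveryType": "Standard", "TotalAmount": 500},
    {"PaymentType": "CreditCard", "DeliveryType": "Express", "TotalAmount": 1},
]

def uncovered_pairs(ipm, suite):
    """All pairwise (parameter, value) combinations not hit by any test input."""
    required = {
        ((p1, v1), (p2, v2))
        for p1, p2 in combinations(sorted(ipm), 2)
        for v1 in ipm[p1] for v2 in ipm[p2]
    }
    covered = {
        ((p1, t[p1]), (p2, t[p2]))
        for t in suite for p1, p2 in combinations(sorted(ipm), 2)
    }
    return required - covered

print(len(uncovered_pairs(ipm, suite)))      # 0: Table I is pairwise-complete
print(len(uncovered_pairs(ipm, suite[1:])))  # 3: dropping the first test input
                                             # leaves three pairs untested
```

With three parameters of two values each, 12 pairs are required; each of the four test inputs covers three distinct pairs, so removing any one of them leaves exactly three pairs uncovered.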
If we also suppose that the checkout service rejects payment by bill for total amounts greater than 300 dollar, then [PaymentType:Bill, TotalAmount:500] would trigger error-handling as well. Even though both values are valid, the combination of them denotes an invalid value combination.

Valid test inputs do not contain any invalid values or invalid value combinations. In contrast, an invalid test input contains at least one invalid value or invalid value combination. If an invalid test input contains exactly one invalid value or one invalid value combination, it is called a strong invalid test input.

Once the SUT evaluates an invalid value or invalid value combination, error-handling is triggered. The normal control-flow is left and all other values and value combinations of the test input remain untested. They are masked by the invalid value or invalid value combination [13]. This phenomenon is called the input masking effect, which we adapt from Yilmaz et al. [20]: "The input masking effect is an effect that prevents a test case from testing all combinations of input values, which the test case is normally expected to test".

To prevent input masking, a strict separation of valid and invalid test inputs is suggested [1], [10]-[13]. Combination strategies are extended to support t-wise generation of invalid test inputs. Values can be marked as invalid to exclude them from valid test inputs and to include them in invalid test inputs. The invalid value is then combined with all (t-1)-wise combinations of valid values. An extension that we proposed also allows to explicitly mark invalid value combinations and to generate t-wise invalid test inputs based on them [13].

C. Fault Characteristics

According to IEEE [18], an error is "the difference between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition." It is the result of a mistake made by a human and is manifested as a fault. In turn, a fault is statically present in the source code and is the identified or hypothesized cause of a failure. A failure is an external behavior of the SUT, i.e. a behavior observable or perceivable by the user, which is incorrect with regards to the specified or expected behavior.

In CT, we assume that the execution path through the SUT is determined by the values and value combinations of the test input. If an executed statement contains a fault that causes an observable failure, and if a certain value or a certain value combination is required for executing the statement, then the value or value combination is called a failure-triggering fault interaction (FTFI). The number of parameters involved in a FTFI is its dimension, denoted as d with 0 <= d <= n. For instance, if the checkout service contains a fault and accepts a total amount of one but only if express is chosen as the delivery type, then [TotalAmount:1, DeliveryType:Express] is a FTFI with a dimension of two.

In general, different types of triggers exist to expose failures [21]. A trigger is a set of conditions that exposes a failure if the conditions are satisfied. We focus on FTFIs, i.e. failures triggered by test input variations, rather than on failures triggered by ordering or timing of stimulation.

In addition, we introduce the following terms for robustness fault characteristics. A failure is a robustness failure if the FTFI contains an invalid value or an invalid value combination. Then, the number of parameters that constitute the invalid value or invalid value combination is denoted as the robustness size. In case of an invalid value, the robustness size is one.

The extension to support t-wise generation of invalid test inputs is based on the assumption that failures are triggered by an interaction of an invalid value (or invalid value combination) and a (t-1)-wise combination of valid values of the other parameters. This is a robustness interaction, and its robustness interaction dimension can be computed by subtracting the robustness size from the FTFI dimension. There is no robustness interaction if the robustness size and the FTFI dimension are equal, i.e. the robustness interaction dimension is zero. For instance, there is no robustness interaction if [TotalAmount:1] or [PaymentType:Bill, TotalAmount:500] trigger a failure. In contrast, the robustness interaction dimension of the aforementioned example [TotalAmount:1, DeliveryType:Express] is one because the invalid value interacts with one valid value.

III. RELATED WORK

If the highest dimension of parameters involved in FTFIs is known before testing, then testing all d-wise parameter value combinations should be as effective as exhaustive testing [4]. However, d cannot be determined for a SUT a-priori because the faults are not known before testing. Hence, the motivation of fault characterization in black-box testing is to empirically derive fault characteristics to guide future test activities.

Existing research on the effectiveness of black-box testing derives the distribution and maximum of d among different types of software based on bug reports. Wallace and Kuhn [2] review 15 years of recall data from medical devices, i.e. software written for embedded systems. Kuhn and Reilly [3] analyze bug reports from two large open-source software projects, namely the Apache web server and the Mozilla web browser. Kuhn and Wallace [4] report findings from analyzing 329 bug reports of a large distributed data management system developed at NASA Goddard Space Flight Center. Bell and Vouk [5] analyze the effectiveness of pairwise testing of network-centric software. They derive their fault characteristics from a public database of security flaws and create simulations based on that data. Kuhn and Okum [6] apply combinatorial testing with different strengths to a module of a traffic collision avoidance system which is written in the C programming language. Though, the experiments use manually seeded "realistic" faults rather than a specific bug database. Cotroneo et al. [21] again analyze bug reports from Apache and the MySQL database system, and Ratliff et al. [22] report the FTFI dimensions of 242 bug reports from the MySQL database system.
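The t-wise generation of invalid test inputs described in Section II, where each invalid value is combined with all (t-1)-wise combinations of valid values of the other parameters, can be sketched as follows. This is a simplified illustration under our own naming, not the algorithm from [13]; remaining parameters are filled with an arbitrary valid value:

```python
from itertools import combinations, product

def invalid_test_inputs(valid, invalid, t):
    """Strong invalid test inputs: one invalid value each, combined with all
    (t-1)-wise combinations of valid values of the other parameters."""
    seen, tests = set(), []
    for p, bad_values in invalid.items():
        others = [q for q in valid if q != p]
        for bad in bad_values:
            for chosen in combinations(others, min(t - 1, len(others))):
                for vals in product(*(valid[q] for q in chosen)):
                    test = {q: valid[q][0] for q in others}  # default fill
                    test.update(dict(zip(chosen, vals)))
                    test[p] = bad
                    key = frozenset(test.items())
                    if key not in seen:  # drop duplicate test inputs
                        seen.add(key)
                        tests.append(test)
    return tests

valid = {"PaymentType": ["CreditCard", "Bill"],
         "DeliveryType": ["Standard", "Express"],
         "TotalAmount": [500]}
invalid = {"TotalAmount": [1]}  # below the assumed 25-dollar minimum

suite = invalid_test_inputs(valid, invalid, t=2)
for test in suite:
    print(test)
```

Each generated input is a strong invalid test input, so the single invalid value cannot be masked by another one, and every valid value of the other parameters meets the invalid value in at least one test input.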
As concluded by Kuhn et al. [7], the studies show that most failures in the investigated domains are triggered by single parameter values and parameter value pairs. Progressively fewer failures are triggered by an interaction of three or more parameter values. In addition to the distribution of FTFIs, a maximum interaction of four to six parameter values is identified. No reported failure required an interaction of more than six parameter values to be triggered. Thus, pairwise testing should trigger most failures and 4- to 6-wise testing should trigger all failures of a SUT [8].

Several tools include the concept of invalid values to support the generation of invalid test inputs [10]-[12]. An algorithm that we proposed [13] extends the concept to invalid value combinations. In their case study, Wojciak and Tzoref-Brill [23] report on system-level combinatorial testing that includes testing of negative scenarios. There, t-wise coverage of negative test inputs is required because error-handling depends on a robustness interaction between invalid and valid values. Another case study by Offut and Alluri [24] reports on the first application of CT for financial calculation engines, but robustness is not further discussed.

From an empirical point of view, the effectiveness of negative test scenarios is yet unclear. To the best of our knowledge, it is only Pan et al. [14], [15] who characterize data on faults from robustness testing. The results of testing the robustness of operating system APIs indicate that most robustness failures are caused by single invalid values. Though, there is no more information on failures triggered by invalid value combinations. Also, as Kuhn et al. [4] state, more empirical studies are required to confirm (or reject) the distribution and upper limit of FTFIs for other software types. Therefore, we conducted another case study which is described in the subsequent sections.

IV. CASE STUDY DESIGN

A. Research Method

We follow the guidelines for conducting and reporting case study research in software engineering as suggested by Runeson and Höst [25]. As they state, a case study "investigates a contemporary phenomenon within its real life context, especially when the boundaries between phenomenon and context are not clearly evident". Case study research is typically used for exploratory purposes, e.g. seeking new insights and generating hypotheses for new research.

The guidelines suggest to conduct a case study in five steps. First, the objectives are defined and the case study is planned. As a second step, the data collection is prepared before the data is collected in a third step. Afterwards, the collected data is analyzed and finally, the results of the analysis are reported.

B. Research Objective

The overall objective of this case study is to gather information on the effectiveness of combinatorial testing with invalid test inputs, and to compare the obtained results with the ones of other published case studies. For example, the work by Pan et al. [14], [15] indicates that most robustness failures in operating system APIs are triggered by single values rather than value combinations, i.e. a FTFI dimension and robustness size of one. Hence, our aim is to either confirm or reject this indication for enterprise applications. This leads to the following two concrete research objectives:

RO1: Identify the different FTFI dimensions of robustness failures that can be observed in our case study.

The generation of t-wise invalid test inputs is based on the assumption that failures are triggered by a robustness interaction between the invalid value (or invalid value combination) and a (t-1)-wise combination of the valid values of the other parameters. If the highest robustness interaction dimension is known before testing, then testing all t-wise invalid test inputs of that dimension should be as effective as exhaustive robustness testing.

RO2: Identify different robustness sizes and derive the robustness interaction dimensions that can be observed in our case study.
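The relation underlying RO2 can be stated as a one-line helper (ours, for illustration): the robustness interaction dimension is the FTFI dimension minus the robustness size.

```python
# Sketch of the robustness measures behind RO1/RO2 (our helper, not from the
# paper): the FTFI dimension counts all parameters of the failure-triggering
# interaction, the robustness size counts those forming the invalid part, and
# their difference is the robustness interaction dimension.

def robustness_interaction_dimension(ftfi_dimension: int, robustness_size: int) -> int:
    if not 0 < robustness_size <= ftfi_dimension:
        raise ValueError("robustness size must lie between 1 and the FTFI dimension")
    return ftfi_dimension - robustness_size

# [TotalAmount:1] alone: FTFI dimension 1, robustness size 1 -> no interaction.
print(robustness_interaction_dimension(1, 1))  # 0
# [PaymentType:Bill, TotalAmount:500]: an invalid *combination* of two values.
print(robustness_interaction_dimension(2, 2))  # 0
# [TotalAmount:1, DeliveryType:Express]: invalid value meets one valid value.
print(robustness_interaction_dimension(2, 1))  # 1
```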
C. Case and Unit of Analysis

The case is a software development project from an IT service provider for an insurance company. A new system is developed to manage the lifecycle of life insurances. It is based on an off-the-shelf framework which is customized and extended to meet the company's requirements. In total, the new system consists of 2.5 MLOC and an estimated workload of 5000 person days. The core is an inventory sub-system with a central database to store information on customers' life insurance contracts. In addition, complex financial calculation engines and business processes like capturing and creating new customer insurances are implemented. The business processes also integrate with a variety of different already existing systems which are, for instance, responsible for managing information about the contract partners and about claims and damages, and for supporting insurance agents.

Since life insurance contracts have decade-long lifespans and rely on complex financial models, the correctness of the system is business critical. Mistakes can have severe effects which can even amplify over the long-lasting lifespans and cause enormous damage to the company. Therefore, thorough testing is important.

Even though the business processes are managed by the new system, they rely on other systems of which each again relies on other systems. This makes it hard to test the system or its parts in isolation. It is also difficult to control the state of the systems and to observe the complete behavior, which makes testing even more complicated.

Therefore, most testing is conducted on a system level within an integrated test environment in which all required systems are deployed. The test design is often based on experience and error-guessing. Tests are executed mostly manually because of the low controllability and observability.

D. Data Collection Procedure

To achieve the research objectives, the case study relies on archival data from the aforementioned software development project. A project-wide issue management system contains all bug reports from the project start in 2015 to the productive deployment at the beginning of 2018. In general, a bug report is a specifically categorized issue which coexists with other project management- and development-related issues.

For our case study, we analyzed the issue's title, its category, its initial description, additional information in the comment section and its status. Further on, some bug reports are also connected to a central source code management system. If a bug report does not contain sufficient information, the corresponding source code modifications can be analyzed as well.

The issues are filtered to restrict the analysis to only reasonable bug reports. Therefore, issues created automatically by static analysis tools are excluded. Further on, only issues categorized as bug reports whose status is set to complete are considered, because we expect only them to contain a correct description on how to systematically reproduce the failure.

E. Data Analysis Procedure

Once the bug reports are exported from the issue management system, each bug report is analyzed one at a time. First, it is checked if the bug report describes a failure in the sense that an incorrect behaviour is observable by the user. Otherwise, the bug report is rejected.

Afterwards, the trigger type of the reported failure is determined. The bug report is not further analyzed if no systematically reproducible trigger is found. It is also rejected if the failure is not triggered by a test input variation but rather by unlikely ordering or timing. If a specific value or value combination is identified to trigger the failure, the dimension of the FTFI is determined in the next step.

Then, the bug report is classified as either positive or negative depending on whether any invalid values or invalid value combinations are contained. If it is classified as negative, i.e. if it is a robustness failure, the robustness size of the invalid value or invalid value combination is determined as well. A robustness size which is lower than the FTFI dimension indicates a robustness interaction between the invalid value (combination) and valid value combinations of the other parameters. If possible, the robustness interaction dimension is also extracted from the bug report.
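The analysis procedure above can be sketched as a small decision pipeline (field names and labels are ours, not the project's issue schema):

```python
# Sketch of the per-bug-report decision procedure (hypothetical field names).

def classify(report: dict) -> dict:
    if not report.get("observable_failure"):
        return {"verdict": "rejected"}            # no user-observable failure
    if report.get("trigger") != "test-input":     # ordering/timing triggers etc.
        return {"verdict": "excluded"}
    result = {
        "verdict": "analyzed",
        "ftfi_dimension": report["ftfi_dimension"],
        "scenario": "negative" if report.get("robustness_size") else "positive",
    }
    if result["scenario"] == "negative":
        # robustness interaction dimension = FTFI dimension - robustness size
        result["robustness_interaction_dimension"] = (
            report["ftfi_dimension"] - report["robustness_size"])
    return result

r = classify({"observable_failure": True, "trigger": "test-input",
              "ftfi_dimension": 2, "robustness_size": 1})
print(r["scenario"], r["robustness_interaction_dimension"])  # negative 1
```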
V. RESULTS AND DISCUSSION

A. Analyzed Data

In total, 683 bug reports are analyzed. All reported bugs are revealed and fixed during the development phase of the system. Even though filters are applied to export the bug reports, 249 bug reports are classified as unrelated because the issue management system is also used as a communication and task management tool. For instance, problems with configurations of test environments, refactorings or build problems are categorized as bug reports as well.

The remaining 434 bug reports describe failures; they are classified as follows. Eight bug reports do not provide enough information for further analysis and classification. 38 reported bugs require specific timing and ordering of sequences to be triggered. For instance, one sequence to trigger a failure is to search for a customer, open its details, edit the birthday, press cancel and edit the birthday again. Three reported bugs are related to robustness testing. They are triggered by other systems that time out and do not respond to requests. All these bug reports are excluded from further analysis because CT is about varying test inputs rather than varying sequences and timing.

The remaining 388 bug reports describe failures triggered by some test input. A subset of 176 bug reports describes integration failures with other systems where values are not correctly mapped from one data structure to another. They can be triggered by any test input. There are so many reported integration issues (45% out of 434 bug reports) because the system consists of several independently developed components which are early and often integrated with the other components and other systems using one of the test environments.

Finally, 212 bug reports are considered to be suitable for CT, which is 49% of all 434 bug reports that describe failures and 55% of all 388 bug reports that are triggered by some test input. To reproduce one of these reported bugs, the test input requires at least one specific value.

B. Observed FTFI Dimensions

The observed FTFI dimensions for the 212 bug reports are depicted in Table II. Most failures are triggered by single parameter values and parameter value pairs, and progressively fewer failures are triggered by 3- and 4-wise interactions. In our case, no reported bug requires an interaction of more than 4 parameters in order to trigger the failure.

Table II: Observed FTFI Dimensions

d    All   Positive   Negative
1    162      121         41
2     40       31          9
3      6        5          1
4      4        4          -
5      -        -          -
6      -        -          -

Table III presents the cumulative percentage of the FTFI dimensions. The last three columns refer to our case study and show that 76% of all reported failures require 1-wise (each choice) coverage to be reliably triggered. It adds up to 96% when testing with pairwise coverage, and 100% are covered when all 4-wise parameter value combinations are used for testing.

Table III: Cumulative Percentage of FTFI Dimensions

          Previous Studies                                   Our Study
d    [2]   [3]a   [3]b   [4]    [5]   [21]   [22]   Avg.    All   Pos.   Neg.
1     66    28     41     67     18     9     49    39.7     76    75     80
2     97    76     70     93     62    47     86    75.9     96    94     98
3     99    95     89     98     87    75     97    91.4     98    98    100
4    100    97     96    100     97    97     99    98.0    100   100
5           99     96           100   100    100    99.0
6          100    100                              100.0

To compare our results, the first columns of the table show the results of previous case studies, briefly introduced in the related work section. The numbers and also the average percentage values are taken from Kuhn et al. [7]. The distribution of FTFIs obtained in our case study is not in contradiction to the other cases. However, the distribution is mostly similar to cases [2] and [4]. While there are no obvious similarities with embedded systems for medical devices [2], the large data management system [4] is probably quite similar to our case in terms of requirements and used technologies. Similar to our case, the bug reports are also from a development project, whereas the other studies analyze fielded products [7]. For all three cases, most failures are triggered by single parameter values, and almost all failures are triggered by the combination of single parameter values and pairwise parameter value combinations. All failures should be triggered by 4-wise parameter value combinations.

So far, only the dimension of failure-triggering fault interactions is considered, but differences between positive and negative scenarios are not discussed.

All in all, 51 robustness failures are identified, which are classified as follows. 22 failures are caused by incorrect error-detection of abnormal situations because conditions to detect abnormal situations are either wrong or missing. Consequently, these abnormal situations are not discovered. For instance, bank transfer is accepted as a payment option even though incorrect or no bank account information is provided.

In 19 cases, reported failures are caused by incorrect error-signaling. Errors are signaled if an abnormal situation is detected, but the error should be handled somewhere else. For instance, a misspelled first name is detected by a user registration service but the error message complains about a misspelled last name.

For three reported failures, the abnormal situation is correctly detected and the error is correctly signaled. However, the system performs incorrect error-recovery because the instructions to recover from the abnormal situation contain faults. For instance, the user is asked to correct wrong input, e.g. a misspelled first name. After the input is corrected, the system does not recover and the corrected input cannot be processed.

In seven cases, failures are triggered by the system's runtime environment. For instance, a NullPointerException is signaled when the runtime environment detects unexpected and illegal access of NULL values. Since developers did not expect NULL values, no respective error-handlers are implemented and the processes terminate. These failures denote incorrect flows from error-signaling to error-recovery.

Table II depicts the observed FTFI dimensions and their distribution divided into positive and negative test scenarios. As can be seen, the maximum dimension of robustness interaction is three. Compared to positive test scenarios, the negative scenarios discover fewer failures and the FTFI dimensions are also lower. For single parameter values and parameter value pairs, the ratio of valid vs. invalid test inputs is 3:1, and no invalid test inputs are identified for higher dimensions.

While these numbers indicate that most failures are triggered by valid test inputs, we emphasize that the test design is based on experience and error-guessing; robustness testing was not in the focus. Hence, the ratio can also result from a general bias towards testing of positive scenarios which is identified in research [26]-[28].

Nevertheless, these findings underpin the results of Pan et al. [14], [15] who observe that most robustness failures in operating system APIs are triggered by single invalid values. In their study, 82% of robustness failures are triggered by single invalid values. We observe the same ratio in our case.

The bug reports also demonstrate the importance of strong invalid test inputs. For instance, the component that manages contracting parties ensures data quality by checking that, e.g., the title of a person matches the gender of the first name and that the first name and family name are correct and not confused with each other. However, when using an unknown invalid title, the system responds with a wrong error message saying that the family name was wrong. If an invalid family name was combined with the unknown title, the failure would not have been discovered.

To address the second research objective, the 10 invalid test inputs with a FTFI dimension greater than one are further analyzed. As a result, two reported bugs that describe failures with robustness interactions are discovered. Even though a combination of two and three specific parameter values is required to trigger the robustness failures, the robustness size is only one and two, respectively.

Furthermore, two reported bugs require an interaction of invalid value (combinations) and a valid value of another parameter. One reported bug is related to the communication between two systems. The response of the second system contains one parameter value with error information to indicate whether the requested operation succeeded or failed. Another parameter provides details about the internal processing of the request, and a certain value indicates an internal resolution of the error. In that case, the calling system is expected to handle the error in a different way.

Another reported bug belongs to the storage of details on contracting parties where one contracting party must be responsible for paying the insurance premiums. This responsibility is stored as a role called contributor. If direct debit is chosen as the payment method but an invalid bank account, i.e. an invalid IBAN number, is provided, the resulting error message remains even after the invalid IBAN is replaced by a valid IBAN. While the combination [payment-method:direct-debit, account-number:invalid] is required as an invalid combination, the bug report states that this phenomenon could only be observed for [role:contributor].

VI. THREATS TO VALIDITY

The biggest threat to validity is that case studies are difficult to generalize from [25], especially because only one particular type of software of one company is analyzed. The archival data of the case study is only a snapshot and the ground truth, i.e. the set of all failures that can be triggered, is unknown. Hence, the data set can be biased, for instance, towards positive
test inputs with exactly one invalid scenarios which has been observed in research [26]–[28]. value or exactly one invalid value combination. For instance, Since the bug reports result from tests based on experience Copyright © 2018 for this paper by its authors. 27 6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018) an error-guessing, it may apply here as well. Challenge 4 - Support Alternative Coverage Criteria: To The data can also be biased towards certain fault character- reveal incorrect handling and incorrect recovery, the SUT must istics. Relevant and reasonable bug reports may be excluded be stimulated by the failure-triggering invalid test input. The by our filtering because the bug reports are incorrectly cat- majority of analyzed robustness failures does not indicate any egorized. Maybe not all triggered failures are reported. For robustness interaction between valid values and invalid values instance, a developer who finds a fault might just fix it without or invalid value combinations. Then, the failure is triggered creating a bug report. by the invalid value or invalid value combination. To satisfy the coverage criterion, it is sufficient to have a separate test VII. C HALLENGES FOR C OMBINATORIAL T ESTING input for each invalid value or invalid value combination. One challenge in combinatorial testing is to find an effective However, robustness failures where invalid values or in- coverage criteria. Based on the aforementioned empirical valid value combinations interact with valid values and studies, a recommendation for positive scenarios is to use valid value combinations could also be observed. The fail- pairwise coverage to trigger most failures and 4- to 6-wise ure is triggered by a t-wise interation of one or more coverage that should trigger all failures. For the application valid values and the invalid value or invalid value combi- in practice, one major challenge is to generate test suites of nation. 
For instance, suppose the valid role=contributor minimal or small size with 4- or 6-wise coverage. is responsible for selecting the strategy which is used To test negative scenarios, different challenges in combina- to process the bank data of a customer. If the invalid torial testing can be observed. combination of [payment-method:direct-debit] and In our case study, four classes of incorrect error-handling [account-number:invalid] is handled incorrectly by the are identified. First, incorrect error-detection is caused by selected strategy, then the interaction of all three values is conditions which are either too strict or too loose. Second, required to trigger the failure. incorrect error-signaling results in a wrong type of error to The observed failures of our case study are in line with be signaled. Third, incorrect recovery of a signaled error is a case study by Wojciak and Tzorref-Brill [23] who faced caused by a fault in the appropriate recovery instructions. error-handling that would be different depending on firmware Fourth, incorrect flow from error-signaling to error-recovery is in control and system configurations. Different configuration caused by a signaled error for which no appropriate recovery options can also be modelled as input parameters, a robustness instructions are implemented. interaction of configuration options with invalid values and Challenge 1 - Avoid the Input Masking Effect: Incorrect valid value combinations is also reasonable. error-detection that is caused by a too strict condition can be Since only low dimensions of robustness interaction are revealed by positive test input that mistakenly triggers error- observed, we believe it is unlikely that the generation of 4- recovery. But, revealing a condition that is too loose requires to 6-wise test suites is a challenge here as well. Instead, invalid test input that mistakenly does not trigger error- alternative coverage criteria that, for instance, allow a variable recovery. 
To ensure that too strict and too loose conditions strength interaction with some other input parameters can can be detected, the generation of valid and invalid test inputs become a relevant to reduce the number of test inputs. must be separated and both sets of test input must satisfy separate coverage criteria. VIII. C ONCLUSION Challenge 2 - Generate Strong Invalid Test Inputs: Another The effectiveness of negative test scenarios is unclear from challenge is the generation of strong invalid test inputs such an empirical point of view. We conducted a case study to get that one invalid value or invalid value combination cannot information on failures triggered by invalid test inputs. The mask another. Incorrect error-detection and incorrect error- motivation for our and others studies was that if all failures recovery may remain undetected if the signal that would result are triggered by an interaction of d or fewer parameter values, from an incorrect condition is masked by the computation of then testing all d-wise parameter value combinations should another invalid value or invalid value combination. be as effective as exhaustive testing [4]. Challenge 3 - Consider Invalid Value Combinations: Since In our case study we analyzed bug reports which originate the error-detection conditions may depend on an arbitrary from a development project that manages life insurances. In number of input values, it is not sufficient to only consider in- total, 683 bug reports are analyzed. 434 bug reports describe valid values as most combinatorial testing tools do. As our case actual failures and 212 of them are failures triggered by a study and Pan et al. [14], [15] show, 80% of the robustness 2-wise or higher interaction of parameter values. failures are triggered by invalid values, i.e. a robustness size In general, the distribution of FTFI dimensions conforms of one, but also 20% of the robustness failures require invalid to the pattern of previous empirical studies. 
But in contrast value combinations to be triggered. Error-detection with more to positive test scenarios, fewer robustness failures with lower complex conditions must be tested as well. Invalid value FTFI dimensions are identified. Overall, the robustness failures combinations should be excluded when generating positive test are grouped in four classes: incorrect error-detection, incorrect inputs but included when generating invalid test inputs [13]. error-signaling, incorrect recovery from a signaled error and Therefore, appropriate modeling facilities and algorithms that incorrect flow from error-signaling to error-recovery. Most ro- consider invalid value combinations are another challenge. bustness failures (80%) are triggered by single invalid values. Copyright © 2018 for this paper by its authors. 28 6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018) The remaining robustness failures require an interaction of two [10] D. M. Cohen, S. R. Dalal, M. L. Fredman, and G. C. Patton, “The aetg and three input parameter values. Two reported bugs require system: An approach to testing based on combinatorial design,” IEEE Transactions on Software Engineering, vol. 23, no. 7, 1997. an interaction of valid values with an invalid value or invalid [11] J. Czerwonka, “Pairwise testing in real world,” in 24th Pacific Northwest value combinations to trigger the robustness failure. Software Quality Conference, 2006. Based on the findings of this case study, we derive chal- [12] L. Yu, Y. Lei, R. N. Kacker, and D. R. Kuhn, “Acts: A combinatorial test generation tool,” in Software Testing, Verification and Validation lenges for combinatorial robustness testing. To ensure that (ICST), 2013 IEEE Sixth International Conference on. IEEE, 2013, failures do not remain hidden, possible masking should be pp. 370–375. reduced. Valid and invalid test inputs should be separated and [13] K. Fögen and H. 
Lichter, “Combinatorial testing with constraints for negative test cases,” in 2018 IEEE Eleventh International Conference invalid test inputs should be strong, i.e. should only contain on Software Testing, Verification and Validation Workshops (ICSTW), one invalid value or invalid value combination. 7th International Workshop on Combinatorial Testing (IWCT), 2018. Further on, it is not sufficient to only consider invalid values [14] J. Pan, “The dimensionality of failures - a fault model for characterizing software robustness,” Proc. FTCS ’99, June, 1999. as most combinatorial testing tools do. Invalid value combi- [15] J. Pan, P. Koopman, and D. Siewiorek, “A dimensionality model nations should be excluded when generating valid test inputs approach to testing and improving software robustness,” in AUTOTEST- but considered for invalid test inputs. Therefore, appropriate CON’99. IEEE Systems Readiness Technology Conference, 1999. IEEE. IEEE, 1999, pp. 493–501. modeling facilities and algorithms are required. [16] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The Since only low robustness interactions are observed, the oracle problem in software testing: A survey,” IEEE Transactions on generation of test inputs with 4- to 6-wise coverage is not that Software Engineering, vol. 41, no. 5, 2015. [17] G. Meszaros, xUnit Test Patterns: Refactoring Test Code. Upper Saddle important for negative scenarios. But, the support of variable River, NJ, USA: Prentice Hall PTR, 2007. strength generation for invalid inputs is another challenge. [18] IEEE, “Ieee standard glossary of software engineering terminology,” Most robustness failures do not involve any robustness inter- IEEE Std, vol. 610.12-1990, 1990. [19] N. Li and J. Offutt, “Test oracle strategies for model-based testing,” action. But, there are situations where robustness interactions IEEE Transactions on Software Engineering, vol. 43, no. 4, pp. 
372– can be observed since different input values, configuration 395, April 2017. options or internal states are modelled as input parameter [20] C. Yilmaz, E. Dumlu, M. B. Cohen, and A. Porter, “Reducing masking effects in combinatorial interaction testing: A feedback driven adaptive values. Depending on expected costs of failure, t-wise testing approach,” IEEE Transactions on Software Engineering, vol. 40, no. 1, of invalid test inputs is an option. 2014. In the future, we will work on facilities that support the [21] D. Cotroneo, R. Pietrantuono, S. Russo, and K. Trivedi, “How do bugs surface? a comprehensive study on the characteristics of software bugs modelling of invalid value combinations and we will integrate manifestation,” Journal of Systems and Software, vol. 113, pp. 27 – 43, variable strength in a combinatorial algorithm for invalid input 2016. generation. To reduce the number of invalid test inputs, we will [22] Z. B. Ratliff, D. R. Kuhn, R. N. Kacker, Y. Lei, and K. S. Trivedi, “The relationship between software bug type and number of factors conduct experiments to investigate the efficiency of different involved in failures,” in 2016 IEEE International Symposium on Software coverage criteria. Reliability Engineering Workshops (ISSREW), Oct 2016, pp. 119–124. [23] P. Wojciak and R. Tzoref-Brill, “System level combinatorial testing R EFERENCES in practice - The concurrent maintenance case study,” Proceedings - [1] M. Grindal, J. Offutt, and S. F. Andler, “Combination testing strategies: IEEE 7th International Conference on Software Testing, Verification and A survey,” Software Testing, Verification and Reliability, vol. 15, no. 3, Validation, ICST 2014, 2014. 2005. [24] J. Offutt and C. Alluri, “An industrial study of applying input space [2] D. R. WALLACE and D. R. 
KUHN, “Failure modes in medical device partitioning to test financial calculation engines,” Empirical Software software: An analysis of 15 years of recall data,” International Journal Engineering, vol. 19, no. 3, pp. 558–581, Jun 2014. of Reliability, Quality and Safety Engineering, vol. 08, no. 04, pp. 351– [25] P. Runeson and M. Höst, “Guidelines for conducting and reporting case 371, 2001. study research in software engineering,” Empirical Software Engineer- [3] D. R. Kuhn and M. J. Reilly, “An investigation of the applicability ing, vol. 14, no. 2, p. 131, Dec 2008. of design of experiments to software testing,” in 27th Annual NASA [26] L. M. Leventhal, B. M. Teasley, D. S. Rohlman, and K. Instone, “Positive Goddard/IEEE Software Engineering Workshop, 2002. Proceedings., test bias in software testing among professionals: A review,” in Human- Dec 2002, pp. 91–95. Computer Interaction, L. J. Bass, J. Gornostaev, and C. Unger, Eds. [4] D. R. Kuhn, D. R. Wallace, and A. M. Gallo, “Software fault interactions Berlin, Heidelberg: Springer Berlin Heidelberg, 1993, pp. 210–218. and implications for software testing,” IEEE Transactions on Software [27] B. E. Teasley, L. M. Leventhal, C. R. Mynatt, and D. S. Rohlman, Engineering, vol. 30, no. 6, pp. 418–421, June 2004. “Why software testing is sometimes ineffective: Two applied studies of [5] K. Z. Bell and M. A. Vouk, “On effectiveness of pairwise methodology positive test strategy.” Journal of Applied Psychology, vol. 79, no. 1, p. for testing network-centric software,” in Information and Communica- 142, 1994. tions Technology, 2005. Enabling Technologies for the New Knowledge [28] A. Causevic, R. Shukla, S. Punnekkat, and D. Sundmark, “Effects of Society: ITI 3rd International Conference on. IEEE, 2005, pp. 221–235. negative testing on tdd: An industrial experiment,” in Agile Processes [6] D. R. Kuhn and V. Okum, “Pseudo-exhaustive testing for software,” in in Software Engineering and Extreme Programming, H. 
Baumeister and 2006 30th Annual IEEE/NASA Software Engineering Workshop, April B. Weber, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, 2006, pp. 153–158. pp. 91–105. [7] D. R. Kuhn, R. N. Kacker, and Y. Lei, “Estimating t-way fault profile evolution during testing,” in Computer Software and Applications Con- ference (COMPSAC), 2016 IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 596–597. [8] R. Tzoref-Brill, “Advances in combinatorial testing,” ser. Advances in Computers. Elsevier, 2018. [9] M. M. Hassan, W. Afzal, M. Blom, B. Lindstrom, S. F. Andler, and S. Eldh, “Testability and software robustness: A systematic literature review,” in 2015 41st Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2015. Copyright © 2018 for this paper by its authors. 29
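As a minimal illustration of the strong invalid test inputs discussed under Challenge 2, the following sketch generates one negative test input per invalid value, keeping every other parameter at a valid value so that one invalid value cannot mask the error-handling of another. This is not the authors' tooling; the parameter names and values are hypothetical, loosely borrowed from the insurance example (the IBAN-like string and the values "insured-person" and "cash" are invented for illustration).

```python
def strong_invalid_inputs(ipm, invalid_values):
    """One negative test input per invalid value: exactly one parameter
    takes an invalid value, all other parameters keep valid values.
    This keeps invalid values from masking each other's error-handling."""
    # Arbitrary valid assignment used as the baseline for every negative test.
    base = {param: values[0] for param, values in ipm.items()}
    tests = []
    for param, bad_values in invalid_values.items():
        for bad in bad_values:
            test = dict(base)
            test[param] = bad  # exactly one invalid value per test input
            tests.append(test)
    return tests

# Hypothetical input parameter model (valid values per parameter).
ipm = {
    "payment-method": ["direct-debit", "bank-transfer"],
    "account-number": ["DE02120300000000202051"],
    "role": ["contributor", "insured-person"],
}
# Hypothetical invalid values per parameter.
invalid_values = {
    "account-number": ["not-an-iban"],
    "payment-method": ["cash"],
}

for t in strong_invalid_inputs(ipm, invalid_values):
    print(t)
```

Covering invalid value combinations (Challenge 3) or t-wise robustness interactions with valid values (Challenge 4) would require extending this sketch with explicitly modelled invalid combinations and a covering-array generator such as the ones surveyed in [1].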