8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020)

Using Rule Mining for Automatic Test Oracle Generation

Alejandra Duque-Torres (a), Anastasiia Shalygina (a), Dietmar Pfahl (a) and Rudolf Ramler (b)
(a) Institute of Computer Science, University of Tartu, Tartu, Estonia
(b) Software Competence Center Hagenberg GmbH, Hagenberg, Austria

Abstract
Software testing is essential for checking the quality of software, but it is also a costly and time-consuming activity. The mechanism that determines the correct output of the System Under Test (SUT) for a given input is called the test oracle. The test oracle problem is a known bottleneck in situations where tests are generated automatically and no model of the correct behaviour of the SUT exists. To overcome this bottleneck, we developed a method that generates test oracles by comparing information extracted from object state data created during the execution of two subsequent versions of the SUT. In our initial proof-of-concept, we derive the relevant information in the form of rules by using the Association Rule Mining (ARM) technique. As a proof-of-concept, we validate our method on the Stack class from a custom version of the Java Collection classes and discuss the lessons learned from our experiment. The test suite that we use in our experiment to execute the different SUT versions is automatically generated using Randoop; other approaches to generate object state data could be used instead. Our proof-of-concept demonstrates that our method is applicable and that we can detect the presence of failures that are missed by regression testing alone. Automatic analysis of the set of violated association rules provides valuable information for localizing faults in the SUT by directly pointing to the faulty method. This kind of information cannot be found in the execution traces of failing tests.
Keywords
Software testing, test oracle, association rule mining, test oracle automation, machine learning methods in software testing

QuASoQ'20: 8th International Workshop on Quantitative Approaches to Software Quality, December 1, 2020, Singapore
duquet@ut.ee (A. Duque-Torres); anastasiia.shalygina@gmail.com (A. Shalygina); dietmar.pfahl@ut.ee (D. Pfahl); rudolf.ramler@scch.at (R. Ramler)
ORCID: 0000-0002-1133-284X (A. Duque-Torres); 0000-0003-2400-501X (D. Pfahl); 0000-0001-9903-6107 (R. Ramler)
(c) 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Software testing is an essential activity for quality assurance in the software development process, as it helps ensure the correct operation of the final software [1]. However, software testing has historically been recognised as a time-consuming, tedious, and expensive activity, given the size and complexity of large-scale software systems [2]. The cost and time involved in testing can be managed through test automation. Test automation refers to writing special programs that are aimed at detecting defects in the System Under Test (SUT) and to using these programs together with standard software solutions to control the execution of test suites. Test automation can thus improve test efficiency and effectiveness.

Software testing, automated or not, has four major steps: generating test cases, predicting the outcomes of the test cases, executing the SUT with the test cases to obtain the actual outcome, and comparing the expected outcome against the actual outcome to obtain a verdict (pass/fail) [3]. These steps involve two major challenges: finding effective test inputs, i.e., inputs that can reveal faults in the SUT, and determining what the correct output should be after execution of the test cases. The second challenge refers to the test oracle problem. A test oracle is a mechanism that determines the correct output of the SUT for a given input [4]. Although substantial research has been conducted on providing test oracles automatically, apart from model-driven testing, the oracle problem is largely unsolved.

Motivated by the above, we developed a method to derive test oracles based on information contained in object state data produced during the execution of the SUT. Object state data is the set of the values of all defined attributes of an object at a certain point in time. We assume that most programs have objects with a mutable state and that the execution of methods can modify the state of the program. The idea of using state information is rooted in the assumption that relations contained in the state data should remain unchanged when testing a new version of the SUT as compared to a previous version.

Our proposed method employs Association Rule Mining (ARM). In our context, the purpose of ARM is to find interesting relations in the state data of the SUT. ARM is an unsupervised machine learning (ML) method [5]. The algorithms used in ARM attempt to find relationships or associations between categorical variables in large transactional data sets [6]. In particular, we were interested in understanding whether the information provided by the resulting model can help to verify the correct operation of new versions of the SUT to which we normally apply existing tests for regression testing. More specifically, we wonder whether we can detect and locate faults in new versions of the SUT.
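Object state data and the kind of relation our method relies on can be illustrated with a small sketch. The records and the rule below are hypothetical, not taken from our experiments; they only show the idea of a relation in state data that should survive a version change:

```python
# Hypothetical object state data: one record per method call on a Stack-like
# object, capturing the values of its observable attributes at that moment.
state_data = [
    {"calledMethod": "push",  "size": 1, "isEmpty": False},
    {"calledMethod": "push",  "size": 2, "isEmpty": False},
    {"calledMethod": "pop",   "size": 1, "isEmpty": False},
    {"calledMethod": "clear", "size": 0, "isEmpty": True},
]

def rule_holds(rows, lhs, rhs):
    """Check that every row matching all LHS items also matches all RHS items."""
    matching = [r for r in rows if all(r[k] == v for k, v in lhs.items())]
    return all(all(r[k] == v for k, v in rhs.items()) for r in matching)

# A relation we would expect to remain stable across SUT versions:
print(rule_holds(state_data, {"size": 0}, {"isEmpty": True}))  # True
```

If a new SUT version produces state data in which such a relation no longer holds, this is a signal that the version may be faulty.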
We tackle our goal by answering the following research questions:

RQ1: How effective is the rule mining approach? This research question investigates to what extent ARM is able to detect failures.

RQ2: What information regarding fault localisation can the method offer? This research question explores to what extent the information contained in association rules helps developers locate faults in the code.

In our experiments, we use the Stack class of the Java Collection framework as SUT. This class was chosen as its state behaviour is well known and easy to manipulate.

2. Association Rule Mining

ARM is a rule-based unsupervised ML method that allows discovering relations between variables or items in large databases. ARM has been used in other fields, such as business analysis, medical diagnosis, and census data, to find previously unknown patterns [6]. The ARM process consists of at least two major steps: finding all the frequent itemsets that satisfy a minimum support threshold, and generating strong association rules from the derived frequent itemsets by applying a minimum confidence threshold.

A large variety of ARM algorithms exist [7]. In our experiments, we use the Apriori algorithm from the Python3 Efficient-Apriori library [8]. It is well known that the Apriori algorithm is exhaustive, that is, it finds all the rules with the specified support and confidence. In addition, ARM does not require labelled data and is, thus, fully unsupervised. Below we define important terminology regarding ARM:

Itemset: Let I = {X, ..., Y, Z} be a set of different items in the dataset D. An itemset is a set of k different items.

Association rule: Consider a dataset D having n transactions, each containing a set of items. An association rule exposes the relationship between items.

Support: The support is the percentage of transactions in the dataset D that contain both itemsets X and Y. The support of an association rule X → Y is: support(X → Y) = support(X ∪ Y) = P(X ∪ Y).

Confidence: The confidence is the percentage of transactions in the dataset D with itemset X that also contain itemset Y. The confidence is calculated using the conditional probability, which can be expressed in terms of itemset support: confidence(X → Y) = P(Y | X) = support(X ∪ Y) / support(X).

Lift: Lift measures how frequently X and Y occur together relative to what would be expected if they were statistically independent. The lift of a rule X → Y is defined as lift(X → Y) = confidence(X → Y) / support(Y). A lift value of one indicates that X and Y appear together exactly as frequently as expected under conditional independence.

3. Method

Figure 1 presents an overview of the method for rule-mining-based test oracle generation. Overall, the proposed method comprises two phases. Phase I is responsible for the ruleset generation, i.e., the rule mining part; the output of this phase is the ruleset. Phase II is in charge of applying the ruleset to the new SUT versions; the output of this phase can be seen as a fault report for new SUT versions. Below we describe both phases in detail.

[Figure 1: Overview of the method for rule-mining based test oracle generation]

3.1. Phase I - Rule Set Generation

Phase I starts with the extraction of the state data. Then, feature selection and encoding are performed so that all features become appropriate for rule mining; features are encoded as categorical where needed. When all the required operations on the raw data have been performed, the state data from the first version of the SUT is available and the rule mining process starts, producing a set of rules. The whole process can be split into three main steps, detailed below:

Step 1.1 - State data acquisition: This step comprises two activities:

Activity 1.1.1 (Produce tests) is responsible for the generation of tests. To perform this activity, we use Randoop to generate unit tests automatically. Randoop is a popular random unit test generator for Java (https://randoop.github.io/randoop/). Randoop creates method sequences incrementally by randomly selecting a method call to apply and using arguments from previously constructed sequences. When a new sequence has been created, it is executed and then checked against contracts. Two important parameters that we use in our experiments are the test limit and the random seed. The test limit parameter limits the number of tests in the test suite. The random seed parameter allows us to produce multiple different test suites, since Randoop is deterministic by default. Together, these two parameters allow us to generate many test suites of different sizes containing various test cases.

Activity 1.1.2 (Execute the test suite) is responsible for state tracking and for saving the raw data. To track the states of the SUT while running the test suite and save them to a text file for later analysis, we built a special program, which we call the Test Driver. The idea behind the Test Driver is that the methods of the SUT can be logically divided into two categories: methods that change the state of the class instance of the SUT, and methods that reflect the state; the latter are so-called state methods. The Test Driver tracks and stores the information returned by the state methods, which are called immediately after each test case execution. The information is saved in a CSV file.

Step 1.2 - State data pre-processing: This data pre-processing step is made up of three activities:

Activity 1.2.1 (Sorting, augmenting and cleaning) is responsible for ensuring that the data is correct, consistent and usable. This activity has three main functions: sort, aug, and clean. The sort function sorts the dataset based on the TestId and InstanceId; this is done to find interesting sequences in the data and to be able to model those sequences. Once the dataset is ordered, it is possible to add more information, e.g., characteristics that indicate the previous state. This is done through the aug function. The clean function removes rows that are not needed or that are inconsistent.

Activity 1.2.2 (Encoding) is in charge of preparing the data according to the requirements of the rule mining algorithm. For example, Apriori [9], the algorithm used in this paper, works only with categorical features. Thus, Activity 1.2.2 categorises and generalises the numerical inputs into string representations.

Activity 1.2.3 (Feature selection) is an extra activity which allows exploring the performance of the method when different features are used. For instance, in this paper we used five different datasets which contain different numbers of features.

Step 1.3 - Rule mining: This step is responsible for generating the set of rules by using the Apriori ARM algorithm.

3.2. Phase II

Phase II is in charge of applying the ruleset to the new versions. Like Phase I, Phase II comprises three steps.

Step 2.1 - State data acquisition: Unlike Step 1.1, this step has only one activity, the test execution activity. In Phase I, we assume that the first version of the SUT is correct; then, we build test suites using Randoop and the Test Driver. In Step 2.1, the same tests generated in Activity 1.1.1 are used to test the new versions of the SUT.

Step 2.2 - State data pre-processing: This step performs the same activities as Step 1.2 to prepare the state dataset of the new SUT version.

Step 2.3 - Apply the ruleset to the new SUT version: This step comprises three activities. Activity 2.3.1 is in charge of comparing and selecting the rows that are exclusively different from the first version of the SUT. Activity 2.3.2 removes duplicate rows. This is done for optimisation: if two or more rows are the same, they will produce the same results.

Activity 2.3.3 (Apply the ruleset against the new unique rows) is responsible for applying the ruleset against the new unique rows. A rule has two sides, e.g., in the rule X → Y, X is the left-hand side (LHS) of the rule, and Y is the right-hand side (RHS). The procedure for applying the ruleset against the new unique rows works as follows: i) pick a rule from the set of rules, ii) take its LHS, iii) select the values that match the LHS, iv) check whether these values also match the RHS, v) save the values that do not match, and vi) repeat steps i-v for every rule. In the end, we know that whenever the state dataset contains values that violate rules, the new version of the SUT is not correct. Since only those rows in the state dataset of the modified SUT that differ from rows in the state dataset of the unmodified (correct) SUT have the potential to violate rules, it makes sense to only analyse the different rows.
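The rule-application procedure of Activity 2.3.3 (steps i-vi) can be sketched as follows. The rule and row encodings are a hypothetical simplification of the CSV-based representation used in our implementation:

```python
# Sketch of the rule-checking procedure: a rule is a pair of
# {feature: value} dicts for its LHS and RHS; a row is a {feature: value} dict.
def find_violations(rules, new_unique_rows):
    """Return (row, rule) pairs where a row matches a rule's LHS but not its RHS."""
    violations = []
    for lhs, rhs in rules:                          # i) pick a rule, ii) take its LHS
        for row in new_unique_rows:                 # iii) select rows matching the LHS
            if all(row.get(k) == v for k, v in lhs.items()):
                # iv) check the RHS, v) save rows that do not match
                if not all(row.get(k) == v for k, v in rhs.items()):
                    violations.append((row, (lhs, rhs)))
    return violations                               # vi) repeated for every rule

rules = [({"size": "0"}, {"isEmpty": "True"})]
rows = [{"size": "0", "isEmpty": "False"}]          # a faulty-looking state row
print(len(find_violations(rules, rows)))
```

Any non-empty result indicates that the new SUT version violates at least one mined rule and is therefore suspect.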
4. Results

The full set of data generated during our experiments, as well as all scripts, can be found in our GitHub repo (https://github.com/aduquet/Using-Rule-Mining-for-Automatic-Test-Oracle-Generation).

In our experiment, we use the Stack class of the Java Collection framework as SUT. This class was chosen as its state behaviour is well known and easy to manipulate. The Stack implementation contains seven public methods: push() puts an item on the top of the stack, pop() removes and returns an item from the top, clear() clears the stack, peek() returns an item from the top of the stack without deleting it, isEmpty() tests whether the stack is empty, size() returns the size of the stack and, finally, toString() returns a string representation of the object.

The methods push(), pop() and clear() modify instances of the Stack class when they are called. On the other hand, peek(), isEmpty(), size() and toString() provide information about the Stack object state when they are called and, thus, are state methods. The test driver for the Stack class creates an instance of the Stack class and contains public methods push(), pop() and clear(), which call the original push(), pop() and clear() on the instance of the Stack class. Additionally, the driver implements peek(), isEmpty(), size() and toString(), but these methods are private. Furthermore, the driver has a method for writing states to the CSV file. This method is used whenever the push(), pop() or clear() methods are called during test execution. This set-up allows us to run Randoop test suite generation not on the original Stack class but on the driver class, where the public methods are the ones that modify the Stack objects. Thus, only the push(), pop(), and clear() methods are called in the test sequences, and the state data captured by peek(), isEmpty(), size() and toString() is saved to a CSV file.

4.1. Phase I - Rule Set Generation

Step 1.1 - State data acquisition: Two different reports are the output of this step: the Test Report (pass/failed) and the State Report. The regression testing generates the Test Report; the Test Driver generates the State Report. The State Report provides seven main features: testID, instanceID, size, isEmpty, peek_obj_type, pushInput, and calledMethod. The features testID and instanceID identify the tests generated by Randoop. One test can have multiple instances; thus, instanceID identifies the instances belonging to the same test. The feature size tells the size of the Stack and is a numerical feature. The isEmpty feature contains the values "True" or "False": isEmpty is "True" when the Stack is empty, otherwise it is "False". peek_obj_type tells us the type of the element at the top of the Stack. calledMethod tells us which method was called, i.e., push, pop, or clear.

Table 1
Summary of the support and lift metrics of the rulesets extracted from the Stack class data using ARM with different numbers of features

| DS†   | NR∀  | Support Max / Mean / Min | Lift Max / Mean / Min |
| FS-3  | 14   | 0.403 / 0.381 / 0.283    | 2.539 / 2.391 / 1.946 |
| FS-4  | 36   | 0.337 / 0.273 / 0.225    | 2.539 / 2.475 / 2.243 |
| FS-8  | 439  | 0.269 / 0.227 / 0.205    | 3.808 / 3.404 / 2.545 |
| FS-9  | 676  | 0.305 / 0.231 / 0.201    | 4.392 / 3.237 / 2.169 |
| FS-10 | 1450 | 0.279 / 0.225 / 0.203    | 4.404 / 3.492 / 2.442 |
† Data Set, ∀ Number of rules

Step 1.2 - State data pre-processing: In this step, we sorted the dataset based on testID and instanceID; this is done to find the sequences of Stack sizes and to be able to model those sequences. Once the dataset is ordered, it is possible to add more information. For example, it is possible to add characteristics that indicate the previous state. To distinguish them from the original features, we add a _p (for "previous") at the end of the feature name. The previous-state features are thus named size_p, isEmpty_p, peek_obj_type_p, and calledMethod_p. Then, we removed the unnecessary rows, i.e., rows with no state information. For instance, the Test Driver writes rows with the value "Constructor" in the calledMethod feature, which indicates that a new Stack was created. This information is not related to the state and should therefore be dropped. Finally, we encoded features as categorical where needed. In the context of our data, we encoded the features size and size_p, since they are not categorical features.

We created five different datasets which contain different numbers of features. The created datasets are named with the prefix FS, which stands for Feature Selection. To distinguish the different datasets, they have been named after the number of features used, i.e., FS-X, where X is the number of features. The datasets created are the following: FS-3 comprises the features size, isEmpty, and peek_obj_type. FS-4 contains the features used in FS-3 plus calledMethod. FS-8 comprises the features used in FS-4 and their previous values, i.e., size_p, isEmpty_p, peek_obj_type_p, and calledMethod_p. FS-9 uses the FS-8 features plus the feature pushInput. Finally, FS-10 uses all the features.

Step 1.3 - Rule mining: We apply the Apriori algorithm with minimum support and maximum confidence thresholds of 0.2 and 1, respectively, to each dataset. Table 1 provides a comparison between the number of rules and the support and lift values for each dataset. From Table 1, we can observe that the average support ratio decreases as the number of features increases; the average gets closer to the threshold value that was set. Table 1 also shows that the number of rules that can be generated does not behave linearly, since it depends on the number of items belonging to each feature.
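The support, confidence, and lift values summarised in Table 1 follow the definitions given in Section 2. The following self-contained sketch shows how they are computed on toy transactions (feature values hypothetical; our experiments use the Efficient-Apriori library instead of this hand-rolled code):

```python
# Each encoded state row becomes a transaction: a set of (feature, value) items.
transactions = [
    {("size", "low"), ("isEmpty", "False"), ("calledMethod", "push")},
    {("size", "low"), ("isEmpty", "False"), ("calledMethod", "pop")},
    {("size", "zero"), ("isEmpty", "True"), ("calledMethod", "clear")},
    {("size", "low"), ("isEmpty", "False"), ("calledMethod", "push")},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(RHS | LHS) expressed via itemset supports."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence normalised by the RHS support (1.0 = independence)."""
    return confidence(lhs, rhs) / support(rhs)

lhs = {("calledMethod", "push")}
rhs = {("isEmpty", "False")}
print(support(lhs | rhs))    # 0.5
print(confidence(lhs, rhs))  # 1.0
print(lift(lhs, rhs))        # 1.0 / 0.75 = 1.333...
```

A rule survives mining only if its support and confidence reach the chosen thresholds (0.2 and 1 in our experiments).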
From the lift column of Table 1, we can observe that the lift ratio increases as the number of features increases, which is the opposite of the behaviour of the support ratio. A lower lift ratio means that the probability of occurrence of a given rule is weak, since its LHS and RHS are close to being independent of each other.

4.2. Phase II

Step 2.1 - State data acquisition: In this step, we created new versions of the Stack class by introducing defects into the class under test. In the Stack class we consider three state methods: peek(), size() and isEmpty(). We modify each of them individually and also make all possible combinations of these modifications. Table 2 summarises the modifications made to the SUT and gives the meaning of the terminology used to refer to the different versions of the SUT. These modifications are made on purpose and manually, as we want to understand the potential of ARM to detect and locate faults using the state information of the SUT.

Table 2
Summary of the modifications injected into the Stack class

| Modification | peek() | isEmpty() | size() |
| Mod1−p   | ✓ | – | – |
| Mod2−e   | – | ✓ | – |
| Mod3−s   | – | – | ✓ |
| Mod4−pe  | ✓ | ✓ | – |
| Mod5−ps  | ✓ | – | ✓ |
| Mod6−es  | – | ✓ | ✓ |
| Mod7−pes | ✓ | ✓ | ✓ |

Some of the modifications affect the state such that it is quite easy to detect that something is wrong. For example, the isEmpty() method is modified such that it returns isEmpty=True when the size of the Stack class instance is 2 or 0. Thus, we would get an obviously faulty state: size="2.0", is_empty="true". The modification of peek() does not return the object on the top of a Stack class instance but the previous one when the stack size is greater than or equal to 2. The modification of size() returns an incorrect size for Stack class instances that contain one or more objects. Thus, the states would look like correct ones, and the dataset would not contain faulty-looking rows.

Step 2.2 - State data pre-processing: In this step, the same process as in Step 1.2 was performed on the new data.

Step 2.3 - Apply the ruleset to the new SUT versions: We applied the ruleset against the state data of the new versions by following the activities described in Section 3.2.

4.3. RQ1: How effective is the rule mining approach?

Table 3 summarises the unmodified version and the seven modifications of the Stack class with regard to the results obtained by the regression test (columns "Regression test", pass/failed) and the information obtained during the test execution generated by the Test Driver, i.e., the state data. We notice from Table 3 that some datasets that correspond to modifications, e.g., Mod2−e, Mod4−pe, Mod6−es, and Mod7−pes, do not have the same number of rows as the No-Mod dataset. This is because some tests from the test suite fail during the test execution, and in these cases the state is not written to the CSV file. When no tests from the suite fail, all the states are written to the file, and the number of rows is therefore equal to that of the No-Mod data, since we execute the same test suite both for the No-Mod version and for the modified ones, e.g., Mod1−p, Mod3−s, and Mod5−ps.

As the "Regression test" column of Table 3 shows, only four of the seven modifications have failed tests (Mod2−e, Mod4−pe, Mod6−es, and Mod7−pes). We see that the number of failed tests is the same for the Mod2−e and Mod4−pe datasets. Also, the number of failed tests is the same for Mod6−es and Mod7−pes. The state data columns show the same behaviour with regard to the number of rows generated. The common aspect of all these datasets is the isEmpty method modification. In particular, Mod2−e and Mod4−pe have 1452 failed tests. The Mod4−pe modification is the combination of the modified methods peek and isEmpty. However, it seems that the regression test spots the fault related to isEmpty only. Furthermore, for Mod1−p, which is the modification of peek, no regression tests failed. This fact confirms that the regression test failures of Mod4−pe are caused by the isEmpty modification only.

As with Mod1−p, no regression test failed for Mod3−s, in which the size method is modified. The numbers of failed tests in Mod7−pes and Mod6−es differ from those of Mod2−e and Mod4−pe. Are the regression tests spotting faults in the other modified methods, i.e., peek and size, when combined in this way? The modification of size() returns an incorrect size for Stack class instances that contain one or more objects, i.e., when the size of the Stack is greater than 0, the modification returns the correct size plus one. For instance, if size = 1, the modified version returns size = 2. That is why Mod6−es increases the number of failed tests: the modified size method increases the number of cases with size = 2, which then triggers the modified isEmpty() when the size of the Stack class is 2 or 0. Looking at Table 3, we can ask ourselves why the regression tests fail only for the isEmpty() method. When analysing in detail the report provided by the regression test, we find that
during the execution of the tests, an exception is raised by a check of whether the Stack is empty or not. By introducing the modification in the isEmpty() method, we create the situation in which this exception is generated, causing the test to abort and be reported as a failure.

Table 3
Summary of the unmodified version and the seven modifications of the Stack class with regard to the results obtained by the regression test and the information obtained during the test execution generated by the test driver

| Dataset  | Pass | Failed | Total rows | Total new rows | Total # of UR† (FS-3/4/8/9/10) | Total # of new UR† (FS-3/4/8/9/10) | Total # of new UR† that violate rules (FS-3/4/8/9/10) |
| No-Mod   | 2037 | 0    | 71604 | 0     | 47/76/320/320/320  | 0/0/0/0/0         | 0/0/0/0/0    |
| Mod1−p   | 2037 | 0    | 71604 | 11882 | 45/73/232/308/398  | 28/43/97/166/189  | 0/0/0/38/38  |
| Mod2−e   | 585  | 1452 | 37786 | 6390  | 25/33/108/108/108  | 1/2/20/20/20      | 1/2/20/20/20 |
| Mod3−s   | 2037 | 0    | 71604 | 43405 | 47/76/320/320/320  | 46/74/237/237/237 | 0/0/19/19/19 |
| Mod4−pe  | 585  | 1452 | 37786 | 7792  | 26/33/106/119/133  | 14/20/46/59/66    | 1/2/31/42/49 |
| Mod5−ps  | 2037 | 0    | 71604 | 43405 | 44/73/232/308/398  | 44/71/156/232/279 | 0/0/19/50/57 |
| Mod6−es  | 25   | 2012 | 7160  | 5497  | 5/9/17/17/17       | 5/7/8/8/8         | 1/2/6/6/6    |
| Mod7−pes | 25   | 2012 | 7160  | 5497  | 3/7/14/16/17       | 3/5/6/8/8         | 1/2/4/7/7    |
† Unique rows

In Table 3, the column "Total # of unique rows" indicates that the number of unique rows increases when the number of features increases. This is because by adding more features, we increase the heterogeneity in the data. The number of new unique rows refers to those rows that are different from the unmodified SUT version. The idea of using state information is based on the assumption that the relationships in the state data when testing a new version of the SUT must remain unchanged or must not change significantly. Therefore, we can conclude that the rows of the modified versions that differ from the unmodified version are failures. Up to this point, our method is capable of detecting failures without the need to use ARM.

We are interested in understanding whether the information provided by the resulting ARM model would have confirmed the failure detection that is already given when identifying new rows in the state dataset of a modified SUT. Therefore, we measured the proportion of rules that are violated by new rows. In Table 3, the column "Total # of new unique rows that violate rules" shows the total number of new unique rows that violate at least one rule. It turns out that only for the isEmpty() modification do all new rows also trigger a rule violation; this is less often the case when other methods are modified (either individually or in combination). Only when the largest feature sets (FS-9 and FS-10) are used is there always at least one new row in the dataset that also violates at least one rule. This result is weaker than what we can already see by just looking at new rows but, recalling that none of the regression tests failed when only the methods size() and peek() were modified, rule violations seem to occur in a more balanced way. This lets us hope that we might be able to exploit this when answering research question RQ2.

Table 4
Percentage of new unique rows (among all new unique rows) that violate at least one rule

| Dataset  | FS-3  | FS-4  | FS-8  | FS-9  | FS-10 |
| No-Mod   | -     | -     | -     | -     | -     |
| Mod1−p   | 0     | 0     | 0     | 22.89 | 20.11 |
| Mod2−e   | 100   | 100   | 100   | 100   | 100   |
| Mod3−s   | 0     | 0     | 8.02  | 8.02  | 8.02  |
| Mod4−pe  | 7.14  | 10    | 67.39 | 71.19 | 74.24 |
| Mod5−ps  | 0     | 0     | 12.18 | 21.55 | 20.43 |
| Mod6−es  | 20    | 28.57 | 75    | 75    | 75    |
| Mod7−pes | 33.33 | 40    | 66.67 | 87.50 | 87.50 |

4.4. RQ2: What information regarding fault localisation can the method offer?

Regarding the first research question, we concluded that failure detection effectiveness improves by comparing state data even without using ARM. However, neither analysing the traces of failing tests (all of them failed when executing pop()) nor inspecting the information provided in the new unique rows provided any helpful information that would guide fault localisation. Therefore, we set our hopes on a systematic analysis of the rules that are violated by new unique rows.

As a starting point for fault localisation, it is necessary that at least one rule be violated by at least one new unique row. Table 5 summarises the total number of rules generated per dataset and the number of rules violated among all generated rules. As Table 5 shows, for FS-8 to FS-10, more than a hundred rules need to be analysed to be able to localise the fault. To reduce and optimise the number of rules, we construct a rule hierarchy. Let us consider the following set of rules: (i) A, B, C → D, (ii) A, B → D, (iii) C, B → D, and (iv) C → D. Rule (i) implicitly contains the same items as rules (ii), (iii), and (iv).
Therefore, having only rule (i) is sufficient for interpretation purposes.

Table 5
Total number of rules generated per dataset, and the number of rules violated among all generated rules (a rule is violated when a row matches its LHS but not its RHS)

| DS    | RG†  | Mod1−p RV⊥/OSR⋆ | Mod2−e | Mod3−s  | Mod4−pe | Mod5−ps | Mod6−es | Mod7−pes |
| FS-3  | 14   | 0/0   | 3/-    | 0/0     | 3/-     | 0/0     | 3/-     | 3/-     |
| FS-4  | 36   | 0/0   | 4/-    | 0/0     | 4/-     | 0/0     | 4/-     | 4/-     |
| FS-8  | 439  | 0/0   | 73/16  | 95/95   | 73/56   | 95/95   | 113/98  | 113/98  |
| FS-9  | 676  | 12/10 | 142/61 | 95/95   | 224/191 | 307/297 | 297/269 | 297/272 |
| FS-10 | 1450 | 12/10 | 224/65 | 285/285 | 236/191 | 297/397 | 314/269 | 325/272 |
† Rules generated, ⊥ Number of rules violated (RV), ⋆ Optimal set of rules (OSR)

Table 7
Fault localised using the ARM approach

| Modification | FS-3 | FS-4 | FS-8 | FS-9 | FS-10 |
| Mod1−p   | ✗ | ✗ | ✗ | p | p |
| Mod2−e   | - | e | e | pushInput | pushInput |
| Mod3−s   | ✗ | ✗ | s | s | s |
| Mod4−pe  | ✗ | e | method | p | p |
| Mod5−ps  | ✗ | ✗ | s | s | s |
| Mod6−es  | - | e | s | s | s |
| Mod7−pes | - | e | s | s | s |

Once the number of rules is reduced, we can analyse them more efficiently. The key to being able to analyse the set of violated rules is to answer the following question: Which item is violating the rule? A rule violation occurs when some items in a row match the LHS of the rule, but at least one item in the same row does not match the RHS of the same rule. Based on the above, the item that generates the violation of the rule must be present more frequently in the RHS than in the LHS of the set of violated rules. Then, to locate the fault, we quantify the occurrence of the main methods in the LHS with respect to their occurrence in the RHS.
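This quantification can be sketched as follows (the violated rules below are hypothetical; in our setting, each feature name stands for one of the state methods or input features):

```python
from collections import Counter

# Violated rules as (LHS, RHS) lists of feature names (values omitted for brevity).
violated = [
    (["calledMethod", "pushInput"], ["peek"]),
    (["pushInput"], ["calledMethod", "peek"]),
    (["isEmpty", "pushInput"], ["peek"]),
]

def localisation_ratios(violated_rules):
    """LHS/RHS occurrence ratio per feature over the set of violated rules."""
    lhs_counts, rhs_counts = Counter(), Counter()
    for lhs, rhs in violated_rules:
        lhs_counts.update(lhs)
        rhs_counts.update(rhs)
    # Only features appearing in some RHS can be the violating item.
    return {f: lhs_counts[f] / rhs_counts[f] for f in rhs_counts if rhs_counts[f]}

ratios = localisation_ratios(violated)
print(min(ratios, key=ratios.get))  # the suspected faulty method
```

The feature whose ratio is closest to zero, i.e., the one that occurs mostly on right-hand sides of violated rules, is reported as the suspected fault location.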
If the ways located, which is not the case when using more 𝐿𝐻𝑆/𝑅𝐻𝑆 ratio approaches zero, the fault has been features. located. Let’s consider Table 6 the set of rules violated Note that our method can only point out one fault by Mod1βˆ’π‘ of FS-9, the total number of rules violated per analysis. Therefore, one must apply the analysis are 10. Please note the following nomenclature: A: peek, repeatedly. Once a fault has been located, it must be B: isEmpty, C: size. D: calledMethod, and E: inputPush. removed, the tests must be run again, and if there are Let’s compute the relation 𝐿𝐻𝑆/𝑅𝐻𝑆 per each item, still new unique rows in the state dataset, we must try to i.e., A, B, D and E. 𝐴 = 3/7 = 0.428, 𝐡 = 2/2 = 1, spot the next fault. Thus, if we had started out with the 𝐷 = 6/2 = 3, and 𝐸 = 6/3 = 3. The value closest case where all three methods were modified, and we had to zero is 𝐴 = 3/7 = 0.428, which corresponds to the used models with 8, 9, and 10 features, we would have peek method. We can conclude that the fault has been first located the fault in (with all feature sets), then in located, and that it corresponds to the peek method. peek() (with FS-9 and FS-10), and then in isEmpty() (with Table 7 reports modifications located using our ap- FS-8). proach. According to the results shown in the table, six out of seven modifications could be located when using nine or ten features. An interesting aspect that stands 5. Threats to Validity out from the results is the localization of the modifica- tions where the isEmpty method is involved. When using In the context of our proof-of-concept validation, two types of threats to validity are most relevant: threats to internal and external validity. 
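The violation check and the LHS/RHS ratio heuristic described above can be sketched end to end as follows (an illustrative Python sketch, not the authors' code; the violated-rule set is our best-effort transcription of Table 6, so individual counts may differ slightly from the figures quoted above, and the sample row is hypothetical):

```python
from collections import Counter

# A rule is a pair (LHS, RHS) of sets of (attribute, value) items. A new
# state row violates a rule when it matches every LHS item but not every
# RHS item.
def violates(row, rule):
    lhs, rhs = rule
    return lhs <= row and not rhs <= row

# The suspect is the attribute whose LHS/RHS occurrence ratio over the
# set of violated rules is closest to zero.
def localize(violated_rules):
    in_lhs, in_rhs = Counter(), Counter()
    for lhs, rhs in violated_rules:
        for attr, _ in lhs:
            in_lhs[attr] += 1
        for attr, _ in rhs:
            in_rhs[attr] += 1
    ratios = {a: in_lhs[a] / in_rhs[a] for a in in_rhs}
    return min(ratios, key=ratios.get), ratios

# Violated rules of Mod1-p with FS-9, transcribed best-effort from Table 6
# (A: peek, B: isEmpty, D: calledMethod, E: inputPush).
R = lambda lhs, rhs: (frozenset(lhs), frozenset(rhs))
violated = [
    R({("D", "push"), ("E", "objType")}, {("A", "objType")}),
    R({("E", "objType")}, {("D", "push"), ("A", "objType")}),
    R({("B", "True"), ("E", "objType")}, {("A", "objType")}),
    R({("D", "push")}, {("B", "True"), ("A", "objType")}),
    R({("D", "push"), ("B", "False"), ("E", "objType")}, {("A", "objType")}),
    R({("E", "objType")}, {("D", "push"), ("B", "False"), ("A", "objType")}),
    R({("D", "push"), ("A", "objType")}, {("E", "objType")}),
    R({("D", "push"), ("B", "False"), ("A", "objType")}, {("E", "objType")}),
    R({("D", "push"), ("A", "objType")}, {("E", "objType"), ("B", "False")}),
    R({("E", "objType")}, {("A", "objType")}),
]

# A hypothetical new state row: the first rule's LHS matches, but A is
# not objType, so the rule is violated.
row = frozenset({("D", "push"), ("E", "objType"), ("A", "null"), ("B", "False")})
print(violates(row, violated[0]))   # True

suspect, ratios = localize(violated)
print(suspect, round(ratios[suspect], 3))   # 'A' 0.429 -> the peek method
```

The attribute with the smallest ratio, A, corresponds to peek, reproducing the localization discussed above.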
Table 6
Set of violated rules of Mod1−p with FS-9

LHS                                   RHS
D = push, E = objType                 A = objType
E = objType                           D = push, A = objType
B = True, E = objType                 A = objType
D = push                              B = True, A = objType
D = push, B = False, E = objType      A = objType
E = objType                           D = push, B = False, A = objType
D = push, A = objType                 E = objType
D = push, B = False, A = objType      E = objType
D = push, A = objType                 E = objType, B = False
E = objType                           A = objType

5. Threats to Validity

In the context of our proof-of-concept validation, two types of threats to validity are most relevant: threats to internal and external validity.

With regards to internal validity, it is not fully clear in which situations the iteratively applied heuristic that we propose to localise faults is successful, i.e., points to a rule element that carries useful information. It is clear that our method can only capture behaviour of the program that has not been changed and, thus, behaves equally at the level of granularity at which we capture object states. In addition, our positive results might depend on the number and type of faults injected. A more systematic and more comprehensive analysis is required to explore the limitations of our method with regards to both failure exposure and fault localization.

With regards to external validity, our study is rather limited since we only use one well-understood class in our experiments. Thus, the actual scope of effectiveness of our proposed method for object-oriented programs is yet to be determined.

6. Related Work

A good overview of methods proposed to automatically generate test oracles can be found in [2] and [4]. As far as we know, ARM has not yet been used for this purpose by other researchers. A field that has recently received attention in the context of test oracle generation is metamorphic testing, as metamorphic relations can be used as test oracles [10].

7. Conclusion

We presented a new ARM-based method for the detection and localization of faults using the state data from SUT executions. In our proof-of-concept, we used the Stack class from the Java collection framework as the SUT. To test our method, we generated seven faulty versions of the SUT. The results obtained have shown that our approach is capable of detecting failures and locating faults. The ARM approach mainly benefits fault localization. An advantage of our method is that we tested it not on a SUT with only integer inputs and outputs but on a class under test that can have inputs of any type, so that the state data is also of mixed data types. Thus, our method can be generalized to inputs of any type, not only integers. This removes some limitations on the type of SUT that can be analyzed.

One of the weaknesses of our method is the need for a test driver that we use to extract state data during the test suite execution. To generate the test driver for the SUT, we have to identify the state-extracting methods manually. For efficiency reasons, it would be better to have an automatic identification of state-extracting methods. Unfortunately, there is no simple way to do this. Also, in the case of manual identification, for some classes it may not be clear which methods should be marked as state-extracting methods.

Given the limitations of our study, more experiments have to be conducted to empirically test our proposed method for fault detection and localization. We are currently focusing on extending our experiments in two directions. First, we will add more kinds of fault injections to test the sensitivity of our method with regards to the type of faults in a program. We will systematize this by using mutation. Second, we will apply our proposed method to more classes in the Java collections framework and beyond.

Acknowledgments

This research was partly funded by the Estonian Center of Excellence in ICT research (EXCITE), the IT Academy Programme for ICT Research Development, the Austrian ministries BMVIT and BMDW, and the Province of Upper Austria under the COMET (Competence Centers for Excellent Technologies) Programme managed by FFG, and by the group grant PRG887 of the Estonian Research Council.

References

[1] T. M. Abdellatif, L. F. Capretz, D. Ho, Software analytics to software practice: A systematic literature review, in: 2015 IEEE/ACM 1st Int'l Workshop on Big Data Software Engineering, 2015, pp. 30–36.
[2] R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, M. Souza, A machine learning approach to generate test oracles, in: Proc. of the XXXII Brazilian Symp. on Softw. Eng., SBES '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 142–151.
[3] K. Patel, R. M. Hierons, A partial oracle for uniformity statistics, Softw. Quality Journal 27 (2019) 1419–1447.
[4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, S. Yoo, The oracle problem in software testing: A survey, IEEE Trans. on Softw. Eng. 41 (2015) 507–525.
[5] L. Zhou, S. Yau, Efficient association rule mining among both frequent and infrequent items, Computers and Mathematics with Applications 54 (2007) 737–749.
[6] S. K. Solanki, J. T. Patel, A survey on association rule mining, in: 2015 Fifth Int'l Conf. on Advanced Computing Communication Technologies, 2015, pp. 212–216.
[7] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. of the 20th Int'l Conf. on Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994, pp. 487–499.
[8] G. McCluskey, Efficient-apriori documentation, 2018.
[9] A. Bhandari, A. Gupta, D. Das, Improvised apriori algorithm using frequent pattern tree for real time applications in data mining, Procedia Computer Science 46 (2015) 644–651.
[10] B. Zhang, et al., Automatic discovery and cleansing of numerical metamorphic relations, in: Proc. 35th IEEE International Conference on Software Maintenance and Evolution (ICSME 2019), 2019, pp. 235–245.