<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Workshop on Quantitative Approaches to Software Quality</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using Rule Mining for Automatic Test Oracle Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alejandra Duque-Torres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Shalygina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietmar Pfahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rudolf Ramler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, University of Tartu</institution>
          ,
          <addr-line>Tartu</addr-line>
          ,
          <country country="EE">Estonia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Competence Center Hagenberg GmbH</institution>
          ,
          <addr-line>Hagenberg</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <issue>2020</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Software testing is essential for checking the quality of software, but it is also a costly and time-consuming activity. The mechanism that determines the correct output of the System Under Test (SUT) for a given input is called a test oracle. The test oracle problem is a known bottleneck in situations where tests are generated automatically and no model of the correct behaviour of the SUT exists. To overcome this bottleneck, we developed a method which generates test oracles by comparing information extracted from object state data created during the execution of two subsequent versions of the SUT. In our initial proof-of-concept, we derive the relevant information in the form of rules by using the Association Rule Mining (ARM) technique. We validate our method on the Stack class from a custom version of the Java Collection classes and discuss the lessons learned from our experiment. The test suite that we use in our experiment to execute the different SUT versions is automatically generated using Randoop; other approaches to generate object state data could be used instead. Our proof-of-concept demonstrates that our method is applicable and that we can detect the presence of failures that are missed by regression testing alone. Automatic analysis of the set of violated association rules provides valuable information for localising faults in the SUT by directly pointing to the faulty method. This kind of information cannot be found in the execution traces of failing tests.</p>
      </abstract>
      <kwd-group>
<kwd>Software testing</kwd>
        <kwd>test oracle</kwd>
        <kwd>association rule mining</kwd>
        <kwd>test oracle automation</kwd>
        <kwd>machine learning methods in software testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Software testing is an essential activity for quality assurance in the software development process, as it helps ensure the correct operation of the final software [1]. However, software testing has historically been recognised as a time-consuming, tedious, and expensive activity, given the size and complexity of large-scale software systems [2]. The cost and time involved in testing can be managed through test automation. Test automation refers to writing special programs that are aimed at detecting defects in the System Under Test (SUT) and to using these programs together with standard software solutions to control the execution of test suites. Test automation can thus improve test efficiency and effectiveness.</p>
      <p>Software testing, automated or not, has four major steps: test case generation, predicting the outcomes of the test cases, executing the SUT with the test cases to obtain the actual outcome, and comparing the expected outcome against the actual outcome to obtain a verdict (pass/fail) [3]. In these steps there are two major challenges: finding effective test inputs, i.e., inputs that can reveal faults in the SUT, and determining what the correct output after execution of the test cases should be. The second challenge refers to the test oracle problem. A test oracle is a mechanism that determines the correct output of the SUT for a given input [4]. Although substantial research has been conducted on providing test oracles automatically, apart from model-driven testing, the oracle problem is largely unsolved.</p>
      <p>Motivated by the above, we developed a method to derive test oracles based on information contained in object state data produced during the execution of the SUT. Object state data is the set of values of all defined attributes of an object at a certain point in time. We assume that most programs have objects with a mutable state and that the execution of methods can modify the state of the program. The idea of using the state information is rooted in the assumption that relations contained in the state data when testing a new version of the SUT should remain unchanged as compared to a previous version.</p>
      <p>Our proposed method employs Association Rule Mining (ARM). In our context, the purpose of ARM is to mine interesting relations in the state data of the SUT. ARM is an unsupervised machine learning (ML) method [5]. The algorithms used in ARM attempt to find relationships or associations between categorical variables in large transactional data sets [6]. In particular, we were interested in understanding whether the information provided by the resulting model can help to verify the correct operation of new versions of the SUT, to which we normally apply existing tests for regression testing. More specifically, we wonder if we can detect and locate faults in new versions of the SUT. We tackle our goal by answering the following research questions:</p>
      <p>RQ1: How efective is the rule mining approach? . This
research question investigates to what extent ARM is
able to detect failures.</p>
      <p>RQ2:What information regarding fault localisation can
the method ofer? This research question explores to what
extend the information contained in association rules
helps developers locate faults in the code.</p>
      <p>In our experiments, we use the Stack Class of the Java
Collection framework as SUT. This class was chosen as
its state behaviour is well known and easy to manipulate.</p>
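      <p>The four major testing steps mentioned above can be sketched with a toy example. The following snippet is ours, not from the paper, and uses a hypothetical expected-output oracle for a trivial SUT:</p>

```python
# Illustrative example (ours, not from the paper): the four testing steps
# with a trivial SUT and a hypothetical expected-output oracle.
def sut_abs(x):
    # the System Under Test: an absolute-value function
    return x if x >= 0 else -x

def generate_tests():
    # step 1: test case generation (here simply a fixed input list)
    return [-2, 0, 3]

def oracle(x):
    # step 2: predict the expected outcome for each input
    return abs(x)

def run_and_compare(inputs):
    # steps 3 and 4: execute the SUT and compare actual vs. expected outcome
    return {x: ("pass" if sut_abs(x) == oracle(x) else "fail") for x in inputs}

verdicts = run_and_compare(generate_tests())
```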
    </sec>
    <sec id="sec-2">
      <title>2. Association Rule Mining</title>
      <p>ARM is a rule-based unsupervised ML method that allows discovering relations between variables or items in large databases. ARM has been used in other fields, such as business analysis, medical diagnosis, and census data, to find out previously unknown patterns [6]. The ARM process consists of at least two major steps: finding all the frequent itemsets that satisfy a minimum support threshold, and generating strong association rules from the derived frequent itemsets by applying a minimum confidence threshold.</p>
      <p>A large variety of ARM algorithms exists [7]. In our experiments, we use the Apriori algorithm from the Python3 Efficient-Apriori library [8]. It is well known that the Apriori algorithm is exhaustive, that is, it finds all the rules with the specified support and confidence. In addition, ARM does not require labelled data and is, thus, fully unsupervised. Below we define important terminology regarding ARM:</p>
      <p>Itemset: Let I = {i1, . . . , in} be the set of different items in the dataset D. An itemset is a set of k different items.</p>
      <p>Association rule: Consider a dataset D with N transactions, each containing a set of items. An association rule X → Y exposes a relationship between the itemsets X and Y.</p>
      <p>Support: The support is the percentage of transactions in the dataset D that contain both itemsets X and Y: supp(X → Y) = P(X ∪ Y) = freq(X ∪ Y)/N.</p>
      <p>Confidence: The confidence is the percentage of transactions in the dataset D containing itemset X that also contain itemset Y. The confidence is calculated using the conditional probability, which is further expressed in terms of itemset support: conf(X → Y) = P(Y | X) = supp(X ∪ Y)/supp(X).</p>
      <p>Lift: The lift measures how frequently X and Y occur together relative to what would be expected if both were statistically independent of each other. The lift of a rule is defined as lift(X → Y) = conf(X → Y)/supp(Y). A lift value of one indicates that X and Y appear together exactly as frequently as expected under the assumption of conditional independence.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>3.1. Phase I - Rule Set Generation</p>
      <p>Phase I starts with the extraction of the state data. Then, feature selection and encoding are performed so that all the features become appropriate for rule mining. The features are encoded as categorical where needed. When all the required operations on the raw data have been performed, the state data from the first version of the SUT is available and the rule mining process starts, yielding a set of rules. The whole process can be split into three main steps, detailed below.</p>
      <p>Step 1.1 - State data acquisition: This step comprises two activities.</p>
      <p>Activity 1.1.1 (Produce tests) is responsible for the generation of tests. To perform this activity, we use Randoop to generate unit tests automatically. Randoop is a popular random unit test generator for Java. Randoop creates method sequences incrementally by randomly selecting a method call to apply and using arguments from previously constructed sequences. When a new sequence has been created, it is executed and then checked against the contracts. Two important parameters that we use in our experiments are the test limit and the random seed. The test limit parameter limits the number of tests in the test suite. The random seed parameter allows us to produce multiple different test suites, since Randoop is deterministic by default. Therefore, these two parameters allow us to generate many test suites of different sizes containing various test cases.</p>
      <p>Activity 1.1.2 (Execute the test suite) is responsible for state tracking and saving the raw data. To track the states of the SUT while running the test suite and save them to a text file for later analysis, we built a special program, which we call the Test Driver. The idea behind the Test Driver is that the methods of the SUT can be logically divided into two categories: methods that change the state of the class instance of the SUT, and methods that reflect the state. The latter are so-called state methods. The Test Driver tracks and stores the information returned by the state methods, which are called immediately after each test case execution. The information is saved in a CSV file.</p>
      <p>Step 1.2 - State data pre-processing: This data pre-processing step is made up of three activities.</p>
      <p>Activity 1.2.1 (Sorting, augmenting and cleaning) is responsible for ensuring that the data is correct, consistent and usable. This activity has three main functions: sort, aug, and clean. The sort function sorts the dataset based on the TestId and InstanceId; this is done to find interesting sequences in the data and to be able to model those sequences. When the dataset is ordered, it is possible to add more information, e.g., characteristics that indicate the previous state. This is done through the aug function. The clean function removes the rows that are not needed as well as inconsistent rows.</p>
      <p>Activity 1.2.2 (Encoding) is in charge of preparing the data according to the requirements of the rule mining algorithm. For example, Apriori [9], which is the algorithm used in this paper, works only with categorical features. Thus, Activity 1.2.2 categorises and generalises the numerical inputs into string representations.</p>
      <p>Activity 1.2.3 (Feature selection) is an extra activity which allows exploring the performance of the method when different features are used. For instance, in this paper we used five different datasets which contain different numbers of features.</p>
      <p>Step 1.3 - Rule mining: This step is responsible for generating the set of rules by using the Apriori ARM algorithm.</p>
      <p>3.2. Phase II</p>
      <p>Phase II is in charge of applying the rule set to the new versions of the SUT. Like Phase I, Phase II comprises three steps.</p>
      <p>Step 2.1 - State data acquisition: Unlike Step 1.1, this step has only one activity, the test execution activity. In Phase I, we assume that the first version of the SUT is correct; then, we build the test suites using Randoop and the Test Driver. In Step 2.1, the same tests generated in Activity 1.1.1 are used to test the new versions of the SUT.</p>
      <p>Step 2.2 - State data pre-processing: This step performs the same activities as Step 1.2 to prepare the state dataset of the new SUT version.</p>
      <p>Step 2.3 - Apply the rule set to the new SUT version: This step comprises three activities. Activity 2.3.1 is in charge of comparing and selecting the rows that are exclusively different from those of the first version of the SUT. Activity 2.3.2 removes duplicate rows; this is done for optimisation, since two or more identical rows will produce the same results. Activity 2.3.3 (Apply the rule set against the new unique rows) applies the rule set against the new unique rows. A rule has two sides: in the rule (X → Y), X is the left-hand side (LHS) and Y is the right-hand side (RHS). The procedure for applying the rule set against the new unique rows works as follows: (1) pick a rule from the set of rules, (2) take its LHS, (3) select the rows whose values match the LHS, (4) check whether these rows also match the RHS, (5) save the rows which do not match, and (6) repeat steps (1)-(5) for every rule. In the end, we know that whenever the state dataset contains values that violate rules, the new version of the SUT is not correct. Since only those rows in the state dataset of the modified SUT that differ from rows in the state dataset of the unmodified (correct) SUT have the potential to violate rules, it makes sense to only analyse the differing rows.</p>
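      <p>The core loop of the two phases can be sketched in Python. The snippet below is our minimal reconstruction, not the authors' implementation: it mines single-item rules by exhaustive counting (instead of the Efficient-Apriori library) with the thresholds used in this paper (minimum support 0.2, confidence 1), and then flags rows of a new version that match a rule's LHS but contradict its RHS:</p>

```python
# Minimal sketch of the method's core loop (our reconstruction, not the
# authors' implementation): mine single-item association rules from the
# version-1 state rows, then flag version-2 rows that match a rule's LHS
# but contradict its RHS.
from itertools import permutations

def mine_rules(rows, min_support=0.2, min_confidence=1.0):
    n = len(rows)
    items = sorted({(f, v) for row in rows for f, v in row.items()})
    rules = []
    for lhs, rhs in permutations(items, 2):
        if lhs[0] == rhs[0]:
            continue  # LHS and RHS must concern different features
        both = sum(1 for r in rows if r[lhs[0]] == lhs[1] and r[rhs[0]] == rhs[1])
        lhs_count = sum(1 for r in rows if r[lhs[0]] == lhs[1])
        if lhs_count and both / n >= min_support and both / lhs_count >= min_confidence:
            rules.append((lhs, rhs))
    return rules

def violations(rules, rows):
    out = []
    for row in rows:
        for lhs, rhs in rules:
            if row[lhs[0]] == lhs[1] and row[rhs[0]] != rhs[1]:
                out.append((row, (lhs, rhs)))
    return out

v1_states = [{"size": "0", "isEmpty": "True"},  {"size": "1", "isEmpty": "False"},
             {"size": "2", "isEmpty": "False"}, {"size": "0", "isEmpty": "True"}]
rules = mine_rules(v1_states)
# A faulty version reporting isEmpty=True for size 2 violates mined rules:
bad = violations(rules, [{"size": "2", "isEmpty": "True"}])
```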
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>In our experiment, we use the Stack class of the Java Collection framework as the SUT. This class was chosen as its state behaviour is well known and easy to manipulate. The Stack implementation contains seven public methods: push() puts an item on the top of the stack, pop() removes and returns an item from the top, clear() clears the stack, peek() returns an item from the top of the stack without deleting it, isEmpty() tests if the stack is empty, size() returns the size of the stack and, finally, toString() returns a string representation of the object.</p>
      <p>The methods push(), pop() and clear() modify instances of the Stack class when they are called. On the other hand, peek(), isEmpty(), size() and toString() provide information about the Stack object state when they are called and, thus, are state methods. The test driver for the Stack class creates an instance of the Stack class and contains public methods push(), pop() and clear() which call the original push(), pop() and clear() on the instance of the Stack class. Additionally, the driver implements peek(), isEmpty(), size() and toString(), but these methods are private. Furthermore, the driver has a method for writing states to the CSV file. This method is used whenever the push(), pop() or clear() methods are called during test execution. This set-up allows us to run Randoop test suite generation not on the original Stack class but on the driver class, where the public methods are the ones that modify the Stack objects. Thus, only the push(), pop(), and clear() methods are called in the test sequences, and the state data captured by peek(), isEmpty(), size() and toString() is saved to a CSV file.</p>
      <p>The full set of data generated during our experiments as well as all scripts can be found in our GitHub repo (https://github.com/aduquet/Using-Rule-Mining-forAutomatic-Test-Oracle-Generation).</p>
      <p>4.1. Phase I - Rule Set Generation</p>
      <p>Step 1.1 - State data acquisition: Two different reports are the output of this step, the Test Report (pass/failed) and the State Report. The regression testing generates the Test Report; the Test Driver generates the State Report. The State Report provides seven main features: testID, instanceID, size, isEmpty, peek_obj_type, pushInput, and calledMethod. The features testID and instanceID identify the tests generated by Randoop. One test can have multiple instances; thus, instanceID identifies the instances belonging to the same test. The feature size tells the size of the Stack and is a numerical feature. The isEmpty feature contains the values "True" or "False": isEmpty is "True" when the Stack is empty, otherwise it is "False". peek_obj_type tells us the element at the top of the Stack. calledMethod tells us which method was called, i.e., push, pop, or clear.</p>
      <p>Step 1.2 - State data pre-processing: In this step, we sorted the dataset based on testID and instanceID; this is done to find the sequences of Stack sizes and to be able to model those sequences. When the dataset is ordered, it is possible to add more information. For example, it is possible to add characteristics that indicate the previous state. To distinguish them from the original features, we add a _ at the end of the feature name, meaning "previous". The previous-state features are thus named size_, isEmpty_, peek_obj_type_, and calledMethod_. Then, we removed the unnecessary rows, that is, rows with no state information. For instance, the Test Driver writes a row with the value "Constructor" in the feature calledMethod, which indicates that a new Stack was created. This information is not related to the state and thus should be dropped. Finally, we encoded the features as categorical where needed. In the context of our data, we encoded the features size and size_ since they are not categorical features.</p>
      <p>We created five different datasets which contain different numbers of features. The created datasets are named with the prefix FS, which stands for Feature Selection. To distinguish the different datasets, they have been named with the number of characteristics that were used, i.e., FS-X where X is the number of features. The datasets created are the following: FS-3 comprises the features size, isEmpty, and peek_obj_type. FS-4 contains the features used in FS-3 plus calledMethod. FS-8 comprises the features used in FS-4 and their previous values, which are size_, isEmpty_, peek_obj_type_, and calledMethod_. FS-9 uses the FS-8 features and the feature pushInput. Finally, FS-10 uses all the features.</p>
      <p>Step 1.3 - Rule mining: We apply the Apriori algorithm with a minimum support threshold of 0.2 and a confidence threshold of 1 to each dataset. Table 1 provides a comparison between the number of rules and the support and lift values for each dataset. As per Table 1, we can observe that the average support ratio decreases when the number of features used is increased; the average moves closer to the threshold value that was set. Table 1 also shows that the number of rules that can be generated does not have linear behaviour, since it depends on the number of items belonging to each feature. From the lift columns of Table 1, we can observe that the lift ratio increases when the number of features used is increased, which is the opposite of the support ratio. A lower lift ratio means that the probability of occurrence of a given rule is weak, since its LHS and RHS are close to being independent of each other.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Summary of the support and lift metrics of the rule sets extracted from the Stack class data using ARM with different numbers of features (DS = dataset, NR = number of rules).</p></caption>
        <table>
          <thead>
            <tr><th>DS</th><th>NR</th><th>Support Max</th><th>Support Mean</th><th>Support Min</th><th>Lift Max</th><th>Lift Mean</th><th>Lift Min</th></tr>
          </thead>
          <tbody>
            <tr><td>FS-3</td><td>14</td><td>0.403</td><td>0.381</td><td>0.283</td><td>2.539</td><td>2.391</td><td>1.946</td></tr>
            <tr><td>FS-4</td><td>36</td><td>0.337</td><td>0.273</td><td>0.225</td><td>2.539</td><td>2.475</td><td>2.243</td></tr>
            <tr><td>FS-8</td><td>676</td><td>0.305</td><td>0.231</td><td>0.205</td><td>3.808</td><td>3.237</td><td>2.169</td></tr>
            <tr><td>FS-9</td><td>439</td><td>0.269</td><td>0.227</td><td>0.201</td><td>4.392</td><td>3.404</td><td>2.545</td></tr>
            <tr><td>FS-10</td><td>1450</td><td>0.279</td><td>0.225</td><td>0.203</td><td>4.404</td><td>3.492</td><td>2.442</td></tr>
          </tbody>
        </table>
      </table-wrap>
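      <p>The sorting and previous-state augmentation performed in Step 1.2 can be sketched as follows. This is our illustration; the record layout is assumed from the State Report description, not taken verbatim from the authors' scripts:</p>

```python
# Our illustration of the sorting and "aug" functions of Step 1.2: sort the
# raw log by testID and instanceID, then append previous-state features
# suffixed with "_". The record layout is assumed, not the authors' code.
def augment(rows):
    rows = sorted(rows, key=lambda r: (r["testID"], r["instanceID"]))
    out, prev = [], None
    for r in rows:
        cur = dict(r)
        same_instance = prev is not None and \
            (prev["testID"], prev["instanceID"]) == (r["testID"], r["instanceID"])
        for f in ("size", "isEmpty", "calledMethod"):
            # None marks the first record of an instance (no previous state)
            cur[f + "_"] = prev[f] if same_instance else None
        out.append(cur)
        prev = r
    return out

log = [{"testID": 1, "instanceID": 1, "size": 1, "isEmpty": "False", "calledMethod": "push"},
       {"testID": 1, "instanceID": 1, "size": 0, "isEmpty": "True", "calledMethod": "pop"}]
augmented = augment(log)
```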
<p>4.2. Phase II</p>
      <p>Step 2.1 - State data acquisition: In this step, we created new versions of the Stack class by introducing defects into the class under test. In the Stack class we consider three state methods: peek(), size() and isEmpty(). We modify each of them individually and also make all possible combinations of these modifications. Table 2 summarises the modifications made to the SUT and provides the meaning of the terminology used to refer to the different versions of the SUT. These modifications are done on purpose and manually, as we want to understand the potential of ARM to detect and locate faults using the state information of the SUT.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Summary of the modifications injected into the Stack class (a check mark indicates that the method was modified).</p></caption>
        <table>
          <thead>
            <tr><th>Modification</th><th>peek()</th><th>size()</th><th>isEmpty()</th></tr>
          </thead>
          <tbody>
            <tr><td>Mod1</td><td>✓</td><td>–</td><td>–</td></tr>
            <tr><td>Mod2</td><td>–</td><td>–</td><td>✓</td></tr>
            <tr><td>Mod3</td><td>–</td><td>✓</td><td>–</td></tr>
            <tr><td>Mod4</td><td>✓</td><td>–</td><td>✓</td></tr>
            <tr><td>Mod5</td><td>✓</td><td>✓</td><td>–</td></tr>
            <tr><td>Mod6</td><td>–</td><td>✓</td><td>✓</td></tr>
            <tr><td>Mod7</td><td>✓</td><td>✓</td><td>✓</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Some of the modifications affect the state such that it is quite easy to detect that something is wrong. For example, the isEmpty() method is modified such that it returns isEmpty=True when the size of the Stack class instance is 2 or 0. Thus, we would get an obviously faulty state size="2.0", is_empty="true". The modification of peek() does not return the object on the top of a Stack class instance but the previous one when the stack size is greater than or equal to 2. The modification of size() returns an incorrect size for Stack class instances that contain one or more objects. In the latter two cases, the states would look like correct ones, and the dataset would not contain faulty-looking rows.</p>
      <p>Step 2.2 - State data pre-processing: In this step, the same process as in Step 1.2 was performed on the new data.</p>
      <p>Step 2.3 - Apply the rule set to the new SUT versions: We applied the rule set against the state data of the new versions by following the activities described in Section 3.2.</p>
      <p>Table 3 summarises the unmodified version and the seven modifications of the Stack class regarding the results obtained by the regression tests (column Regression test, pass/failed) and the information obtained during the test execution generated by the Test Driver, i.e., the State Data. We notice from Table 3 that some datasets that correspond to modifications, e.g., Mod2, Mod4, Mod6, and Mod7, do not have the same number of rows as the No-Mod dataset. This is because some tests from the test suite fail during the test execution; in these cases the state will not be written to the CSV file. When no tests from the suite fail, all the states are written to the file, and the number of rows equals that of the No-Mod data, since we execute the same test suite both for the No-Mod version and for the modified versions, e.g., Mod1, Mod3, and Mod5.</p>
      <p>As the Table 3 column "Regression test" shows, only four of the seven modifications have failed tests (Mod2, Mod4, Mod6, and Mod7). We see that the number of failed tests is the same for the Mod2 and Mod4 datasets. Also, the number of failed tests is the same for Mod6 and Mod7. The state data columns show the same behaviour regarding the number of rows generated. The common aspect between all these datasets is the isEmpty method modification. In particular, Mod2 and Mod4 have 1452 failed tests. The Mod4 modification is the combination of the modified methods peek and isEmpty; however, the regression tests report the same failures for both datasets. This fact indicates that the regression tests only spot the isEmpty modification, and that the failed tests of Mod4 belong to the isEmpty modification only.</p>
      <p>As with Mod1, no regression test failed for Mod3, in which the "size" method is modified. The numbers of failed tests in Mod7 and Mod6 differ from those of Mod2 and Mod4. Are the regression tests spotting faults in the other modified methods, i.e., "peek" and "size", when combined in this way? The modification of size() returns an incorrect size for Stack class instances that contain one or more objects, i.e., when the size of the Stack is greater than 0, the modification returns the correct size plus one. For instance, if size = 1, the modified version returns size = 2. That is why Mod6 increases the number of failed tests: the modified size method increases the number of size = 2 occurrences and thereby triggers the modified isEmpty() whenever the reported size of the Stack is 2 or 0. From Table 3 we can ask ourselves why the regression tests fail only for the isEmpty() method. When analysing in detail the report provided by the regression tests, we find that during the execution of the tests there is an exception.</p>
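      <p>As an illustration of the kind of seeded faults described above, consider this small Python stand-in for the Java Stack class. It is our sketch, not the authors' Java code, combining the size() and isEmpty() faults in the style of the Mod6 variant:</p>

```python
# Hedged illustration of the seeded faults described above, using a small
# Python stand-in for the Java Stack class (not the authors' Java code).
class Stack:
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop()
    def size(self):
        return len(self._items)
    def is_empty(self):
        return len(self._items) == 0

class ModifiedStack(Stack):
    """Combines two of the described faults (a Mod6-style variant)."""
    def size(self):
        # size fault: correct size plus one whenever the stack is non-empty
        n = len(self._items)
        return n + 1 if n > 0 else 0
    def is_empty(self):
        # isEmpty fault: wrongly True when the reported size is 2 or 0
        return self.size() in (2, 0)

s = ModifiedStack()
s.push("a")  # true size is 1, but the faulty size() reports 2
faulty_state = (s.size(), s.is_empty())
```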
<p>4.3. RQ1: How effective is the rule mining approach?</p>
      <p>In Table 3, the column "Total # of unique rows" indicates that the number of unique rows increases when the number of features increases. This is because, by adding more features, we increase the heterogeneity in the data. The number of new unique rows refers to those rows that are different from the unmodified SUT version. The idea of using state information is based on the assumption that the relationships in the state data when testing a new version of the SUT must remain unchanged or must not change significantly. Therefore, we can conclude that the rows of the modified versions that differ from the unmodified version are failures. Up to this point, our method is capable of detecting failures without the need to use ARM.</p>
      <p>4.4. RQ2: What information regarding fault localisation can the method offer?</p>
      <p>Regarding the first research question, we concluded that failure detection effectiveness improves by comparing state data even without using ARM. However, neither analysing the traces of failing tests (all of them failed when executing pop()) nor inspecting the information provided in the new unique rows gave any helpful information that would guide fault localisation. Therefore, we set our hopes on a systematic analysis of the rules that are violated by new unique rows.</p>
      <p>As a starting point for fault localisation, it is necessary that at least one rule be violated by at least one new unique row. Table 5 summarises the total number of rules generated per dataset and the number of violated rules among all the rules generated.</p>
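      <p>The "new unique rows" failure check described above can be sketched as a set difference over the state data. This is our illustration with invented rows, not the authors' scripts; it also drops duplicates, as in Activity 2.3.2:</p>

```python
# Our sketch of the "new unique rows" check: any state row of the modified
# version that never occurs in the unmodified version's state data is
# flagged as a failure, before any rules are applied.
def new_unique_rows(unmodified_rows, modified_rows):
    baseline = {tuple(sorted(r.items())) for r in unmodified_rows}
    seen, flagged = set(), []
    for r in modified_rows:
        key = tuple(sorted(r.items()))
        if key not in baseline and key not in seen:
            seen.add(key)
            flagged.append(r)
    return flagged

no_mod = [{"size": "1", "isEmpty": "False"}, {"size": "0", "isEmpty": "True"}]
mod = [{"size": "1", "isEmpty": "False"},
       {"size": "2", "isEmpty": "True"},   # faulty-looking new state
       {"size": "2", "isEmpty": "True"}]   # duplicate, dropped
flagged = new_unique_rows(no_mod, mod)
```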
<p>We are interested in understanding whether the information provided by the resulting ARM model would have confirmed the failure detection that is already given when identifying new rows in the state dataset of a modified SUT. Therefore, we measured the proportion of rules that are violated by new rows. In Table 3, the column "Total # of new unique rows that violate rules" shows this information.</p>
      <p>As Table 5 shows, for FS-8 to FS-10, more than a hundred rules need to be analysed to be able to localise the fault. To reduce and optimise the number of rules, we construct a rule hierarchy. Let us consider the following set of rules: (i) A, B, D → C; (ii) A, B → C; (iii) B, D → C; and (iv) D → C. Rule (i) implicitly contains the same items as rules (ii), (iii) and (iv). Therefore, keeping only rule (i) is sufficient for interpretation purposes.</p>
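      <p>The rule-hierarchy reduction can be sketched as keeping only rules whose item sets are maximal. This is our reconstruction of the idea, using the four rules from the example above with illustrative item labels:</p>

```python
# Our reconstruction of the rule-hierarchy idea: a rule is kept only if the
# set of items it mentions (LHS plus RHS) is not a proper subset of the
# items of another rule; subsumed rules add nothing for interpretation.
def reduce_rules(rules):
    kept = []
    for i, (lhs, rhs) in enumerate(rules):
        items = set(lhs) | set(rhs)
        subsumed = any(
            items != (set(l2) | set(r2)) and items.issubset(set(l2) | set(r2))
            for j, (l2, r2) in enumerate(rules) if j != i
        )
        if not subsumed:
            kept.append((lhs, rhs))
    return kept

# The four rules from the text: (i) A,B,D -> C subsumes (ii), (iii), (iv)
rules = [({"A", "B", "D"}, {"C"}), ({"A", "B"}, {"C"}),
         ({"B", "D"}, {"C"}), ({"D"}, {"C"})]
kept = reduce_rules(rules)
```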
<p>Once the number of rules is reduced, we can analyse them more efficiently. The key to analysing the set of violated rules is to answer the following question: which item is violating the rule? A rule violation occurs when the items in a row match the LHS of the rule but at least one item in the same row does not match the RHS of the same rule. Based on the above, the item that generates the violation of a rule must be present more frequently in the RHS than in the LHS of the set of violated rules. Then, to locate the fault, we quantify the occurrence of the main methods in the LHS with respect to their occurrence in the RHS. If the LHS/RHS ratio approaches zero, the fault has been located. Let us consider the set of rules violated by Mod1 of FS-9 (Table 6); the total number of violated rules is 10. Please note the following nomenclature: A: peek, B: isEmpty, C: size, D: calledMethod, and E: pushInput. Let us compute the LHS/RHS relation per item, i.e., A, B, D and E: A = 3/7 ≈ 0.43, B = 2/2 = 1, D = 6/2 = 3, and E = 6/3 = 2. The value closest to zero is A = 3/7 ≈ 0.43, which corresponds to the peek method. We can conclude that the fault has been located and that it corresponds to the peek method.</p>
<p>Table 7 reports the modifications located using our approach. According to the results shown in the table, six out of seven modifications could be located when using nine or ten features. An interesting aspect that stands out from the results is the localisation of the modifications where the isEmpty method is involved: when using fewer features, e.g., FS-4, the isEmpty modification is always located, which is not the case when using more features.</p>
      <p>Note that our method can only point out one fault per analysis. Therefore, one must apply the analysis repeatedly. Once a fault has been located, it must be removed, the tests must be run again, and if there are still new unique rows in the state dataset, we must try to spot the next fault. Thus, if we had started out with the case where all three methods were modified and had used the models with 8, 9, and 10 features, we would have first located the fault in size() (with all feature sets), then in peek() (with FS-9 and FS-10), and then in isEmpty() (with FS-8).</p>
      <sec id="sec-4-1">
        <title>5. Threats to Validity</title>
        <p>In the context of our proof-of-concept validation, two types of threats to validity are most relevant: threats to internal and external validity.</p>
<p>With regard to internal validity, it is not fully clear in which situations the iteratively applied heuristic that we propose to localise faults is successful, i.e., points to a rule element that carries useful information. It is clear that our method can only capture behaviour of the program that has not been changed and, thus, behaves equally at the level of granularity at which we capture object states. In addition, our positive results might depend on the number and type of faults injected. A more systematic and more comprehensive analysis is required to explore the limitations of our method with regard to both failure exposure and fault localisation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>External validity</title>
        <p>With regard to external validity, our study is rather limited since we only use one well-understood class in our experiments. Thus, the actual scope of effectiveness of our proposed method for object-oriented programs is yet to be determined.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Related Work</title>
      <p>A good overview of methods proposed to automatically generate test oracles can be found in [2] and [4]. As far as we know, ARM has not yet been used for this purpose by other researchers. A field that has recently received attention in the context of test oracle generation is metamorphic testing, as metamorphic relations can be used as test oracles [10].</p>
    </sec>
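To illustrate the last point: a metamorphic relation acts as an oracle by checking a property across related executions instead of predicting a single expected output. A minimal sketch in Python, using the textbook relation sin(x) = sin(π − x) (an illustrative example, not a relation taken from [10]):

```python
import math

def satisfies_sine_relation(f, x, tol=1e-9):
    """Oracle: any correct implementation f of sine must satisfy f(x) == f(pi - x)."""
    return abs(f(x) - f(math.pi - x)) <= tol

# The correct implementation satisfies the relation...
assert satisfies_sine_relation(math.sin, 0.3)
# ...while a faulty small-angle approximation violates it, exposing the
# failure without needing a reference value for sin(0.3) itself.
assert not satisfies_sine_relation(lambda x: x, 0.3)
```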
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was partly funded by the Estonian Center of Excellence in ICT research (EXCITE), the IT Academy Programme for ICT Research Development, the Austrian ministries BMVIT and BMDW, and the Province of Upper Austria under the COMET (Competence Centers for Excellent Technologies) Programme managed by FFG, and by the group grant PRG887 of the Estonian Research Council.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We presented a new ARM-based method for the detection and localization of faults using state data from SUT executions. In our proof-of-concept, we used the Stack class from the Java collection framework as the SUT. To test our method, we generated seven faulty versions of the SUT. The results obtained show that our approach is capable of detecting failures and locating faults. The ARM approach mainly benefits fault localization.</p>
      <p>An advantage of our method is that we tested it not on a SUT with only integer inputs and outputs but on a class under test that can have inputs of any type, so that the state data is also of mixed types. Thus, our method can be generalized to inputs of any type, not only integers. This removes some limitations on the type of SUT that can be analyzed.</p>
      <p>One of the weaknesses of our method is the need for a test driver that we use to extract state data during test suite execution. To generate the test driver for the SUT, we have to identify the state extracting methods manually. For efficiency reasons, it would be better to have an automatic identification of state extracting methods. Unfortunately, there is no simple way to do this. Also, in the case of manual identification, for some classes it may not be clear which methods should be marked as state extracting methods.</p>
      <p>Given the limitations of our study, more experiments have to be conducted to empirically test our proposed method for fault detection and localization. We are currently focusing on extending our experiments in two directions. First, we will add more kinds of fault injections to test the sensitivity of our method with regards to the type of faults in a program. We will systematize this by using mutation. Second, we will apply our proposed method to more classes in the Java collections framework and beyond.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] T. M. Abdellatif, L. F. Capretz, D. Ho, Software analytics to software practice: A systematic literature review, in: 2015 IEEE/ACM 1st Int'l Workshop on Big Data Software Engineering, 2015, pp. 30–36.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, M. Souza, A machine learning approach to generate test oracles, in: Proc. of the XXXII Brazilian Symp. on Softw. Eng., SBES '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 142–151.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] K. Patel, R. M. Hierons, A partial oracle for uniformity statistics, Softw. Quality Journal 27 (2019) 1419–1447.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, S. Yoo, The oracle problem in software testing: A survey, IEEE Trans. on Softw. Eng. 41 (2015) 507–525.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] L. Zhou, S. Yau, Efficient association rule mining among both frequent and infrequent items, Computers and Mathematics with Applications 54 (2007) 737–749.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] S. K. Solanki, J. T. Patel, A survey on association rule mining, in: 2015 Fifth Int'l Conf. on Advanced Computing Communication Technologies, 2015, pp. 212–216.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. of the 20th Int'l Conf. on Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994, pp. 487–499.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] G. McCluskey, Efficient-apriori documentation, 2018.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] A. Bhandari, A. Gupta, D. Das, Improvised apriori algorithm using frequent pattern tree for real time applications in data mining, Procedia Computer Science 46 (2015) 644–651.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] B. Zhang, et al., Automatic discovery and cleansing of numerical metamorphic relations, in: Proc. 35th IEEE International Conference on Software Maintenance and Evolution (ICSME 2019), 2019, pp. 235–245.</mixed-citation></ref>
    </ref-list>
  </back>
</article>