<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Workshop on Quantitative Approaches to Software Quality</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using Rule Mining for Automatic Test Oracle Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alejandra Duque-Torres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Shalygina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietmar Pfahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rudolf Ramler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, University of Tartu</institution>
          ,
          <addr-line>Tartu</addr-line>
          ,
          <country country="EE">Estonia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Competence Center Hagenberg GmbH</institution>
          ,
          <addr-line>Hagenberg</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <issue>2020</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Software testing is essential for checking the quality of software, but it is also a costly and time-consuming activity. The mechanism that determines the correct output of the System Under Test (SUT) for a given input is called a test oracle. The test oracle problem is a known bottleneck in situations where tests are generated automatically and no model of the correct behaviour of the SUT exists. To overcome this bottleneck, we developed a method which generates test oracles by comparing information extracted from object state data created during the execution of two subsequent versions of the SUT. In our initial proof-of-concept, we derive the relevant information in the form of rules by using the Association Rule Mining (ARM) technique. We validate our method on the Stack class from a custom version of the Java Collection classes and discuss the lessons learned from our experiment. The test suite that we use in our experiment to execute the different SUT versions is automatically generated using Randoop; other approaches to generate object state data could be used instead. Our proof-of-concept demonstrates that our method is applicable and that we can detect the presence of failures that are missed by regression testing alone. Automatic analysis of the set of violated association rules provides valuable information for localising faults in the SUT by directly pointing to the faulty method. This kind of information cannot be found in the execution traces of failing tests.</p>
      </abstract>
      <kwd-group>
<kwd>Software testing</kwd>
        <kwd>test oracle</kwd>
        <kwd>association rule mining</kwd>
        <kwd>test oracle automation</kwd>
        <kwd>machine learning methods in software testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Software testing is an essential activity for quality assurance in the software development process, as it helps ensure the correct operation of the final software [1]. However, software testing has historically been recognised as a time-consuming, tedious, and expensive activity, given the size and complexity of large-scale software systems [2]. The cost and time involved in testing can be managed through test automation. Test automation refers to writing special programs that are aimed at detecting defects in the System Under Test (SUT) and to using these programs together with standard software solutions to control the execution of test suites. Test automation can thus improve test efficiency and effectiveness.</p>
      <p>Software testing, automated or not, has four major steps: test case generation, predicting the outcomes of the test cases, executing the SUT with the test cases to obtain the actual outcome, and comparing the expected outcome against the actual outcome to obtain a verdict (pass/fail) [3]. In these steps there are two major challenges: finding effective test inputs, i.e., inputs that can reveal faults in the SUT, and determining what the correct output after execution of the test cases should be. The second challenge refers to the test oracle problem. A test oracle is a mechanism that determines the correct output of the SUT for a given input [4]. Although substantial research has been conducted on providing test oracles automatically, apart from model-driven testing, the oracle problem is largely unsolved.</p>
      <p>Motivated by the above, we developed a method to derive test oracles based on information contained in object state data produced during the execution of the SUT. Object state data is the set of values of all defined attributes of an object at a certain point in time. We assume that most programs have objects with a mutable state and that the execution of methods can modify the state of the program. The idea of using the state information is rooted in the assumption that relations contained in the state data when testing a new version of the SUT should remain unchanged as compared to a previous version.</p>
      <p>Our proposed method employs Association Rule Mining (ARM). In our context, the purpose of ARM is to mine interesting relations in the state data of the SUT. ARM is an unsupervised machine learning (ML) method [5]. The algorithms used in ARM attempt to find relationships or associations between categorical variables in large transactional data sets [6]. In particular, we were interested in understanding whether the information provided by the resulting model can help to verify the correct operation of new versions of the SUT, to which we normally apply existing tests for regression testing. More specifically, we wonder if we can detect and locate faults in new versions of the SUT. We tackle our goal by answering the following research questions:</p>
      <p>RQ1: How efective is the rule mining approach? . This
research question investigates to what extent ARM is
able to detect failures.</p>
      <p>RQ2:What information regarding fault localisation can
the method ofer? This research question explores to what
extend the information contained in association rules
helps developers locate faults in the code.</p>
      <p>In our experiments, we use the Stack Class of the Java
Collection framework as SUT. This class was chosen as
its state behaviour is well known and easy to manipulate.</p>
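      <p>The four major testing steps mentioned above can be sketched with a toy example. The following snippet is ours, not from the paper, and uses a hypothetical expected-output oracle for a trivial SUT:</p>

```python
# Illustrative example (ours, not from the paper): the four testing steps
# with a trivial SUT and a hypothetical expected-output oracle.
def sut_abs(x):
    # the System Under Test: an absolute-value function
    return x if x >= 0 else -x

def generate_tests():
    # step 1: test case generation (here simply a fixed input list)
    return [-2, 0, 3]

def oracle(x):
    # step 2: predict the expected outcome for each input
    return abs(x)

def run_and_compare(inputs):
    # steps 3 and 4: execute the SUT and compare actual vs. expected outcome
    return {x: ("pass" if sut_abs(x) == oracle(x) else "fail") for x in inputs}

verdicts = run_and_compare(generate_tests())
```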
    </sec>
    <sec id="sec-2">
      <title>2. Association Rule Mining</title>
      <p>ARM is a rule-based unsupervised ML method that allows discovering relations between variables or items in large databases. ARM has been used in other fields, such as business analysis, medical diagnosis, and census data, to find out previously unknown patterns [6]. The ARM process consists of at least two major steps: finding all the frequent itemsets that satisfy a minimum support threshold, and generating strong association rules from the derived frequent itemsets by applying a minimum confidence threshold.</p>
      <p>A large variety of ARM algorithms exists [7]. In our experiments, we use the Apriori algorithm from the Python3 Efficient-Apriori library [8]. It is well known that the Apriori algorithm is exhaustive, that is, it finds all the rules with the specified support and confidence. In addition, ARM does not require labelled data and is, thus, fully unsupervised. Below we define important terminology regarding ARM:</p>
      <p>Itemset: Let I = {i1, . . . , in} be the set of different items in the dataset D. An itemset is a set of k different items.</p>
      <p>Association rule: Consider a dataset D with N transactions, each containing a set of items. An association rule X → Y exposes a relationship between the itemsets X and Y.</p>
      <p>Support: The support is the percentage of transactions in the dataset D that contain both itemsets X and Y: supp(X → Y) = P(X ∪ Y) = freq(X ∪ Y)/N.</p>
      <p>Confidence: The confidence is the percentage of transactions in the dataset D containing itemset X that also contain itemset Y. The confidence is calculated using the conditional probability, which is further expressed in terms of itemset support: conf(X → Y) = P(Y | X) = supp(X ∪ Y)/supp(X).</p>
      <p>Lift: The lift measures how frequently X and Y occur together relative to what would be expected if both were statistically independent of each other. The lift of a rule is defined as lift(X → Y) = conf(X → Y)/supp(Y). A lift value of one indicates that X and Y appear together exactly as frequently as expected under the assumption of conditional independence.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>3.1. Phase I - Rule Set Generation</p>
      <p>Phase I starts with the extraction of the state data. Then, feature selection and encoding are performed so that all the features become appropriate for rule mining. The features are encoded as categorical where needed. When all the required operations on the raw data have been performed, the state data from the first version of the SUT is available and the rule mining process starts, yielding a set of rules. The whole process can be split into three main steps, detailed below.</p>
      <p>Step 1.1 - State data acquisition: This step comprises two activities.</p>
      <p>Activity 1.1.1 (Produce tests) is responsible for the generation of tests. To perform this activity, we use Randoop to generate unit tests automatically. Randoop is a popular random unit test generator for Java. Randoop creates method sequences incrementally by randomly selecting a method call to apply and using arguments from previously constructed sequences. When a new sequence has been created, it is executed and then checked against the contracts. Two important parameters that we use in our experiments are the test limit and the random seed. The test limit parameter limits the number of tests in the test suite. The random seed parameter allows us to produce multiple different test suites, since Randoop is deterministic by default. Therefore, these two parameters allow us to generate many test suites of different sizes containing various test cases.</p>
      <p>Activity 1.1.2 (Execute the test suite) is responsible for state tracking and saving the raw data. To track the states of the SUT while running the test suite and save them to a text file for later analysis, we built a special program, which we call the Test Driver. The idea behind the Test Driver is that the methods of the SUT can be logically divided into two categories: methods that change the state of the class instance of the SUT, and methods that reflect the state. The latter are so-called state methods. The Test Driver tracks and stores the information returned by the state methods, which are called immediately after each test case execution. The information is saved in a CSV file.</p>
      <p>Step 1.2 - State data pre-processing: This data pre-processing step is made up of three activities.</p>
      <p>Activity 1.2.1 (Sorting, augmenting and cleaning) is responsible for ensuring that the data is correct, consistent and usable. This activity has three main functions: sort, aug, and clean. The sort function sorts the dataset based on the TestId and InstanceId; this is done to find interesting sequences in the data and to be able to model those sequences. When the dataset is ordered, it is possible to add more information, e.g., characteristics that indicate the previous state. This is done through the aug function. The clean function removes the rows that are not needed as well as inconsistent rows.</p>
      <p>Activity 1.2.2 (Encoding) is in charge of preparing the data according to the requirements of the rule mining algorithm. For example, Apriori [9], which is the algorithm used in this paper, works only with categorical features. Thus, Activity 1.2.2 categorises and generalises the numerical inputs into string representations.</p>
      <p>Activity 1.2.3 (Feature selection) is an extra activity which allows exploring the performance of the method when different features are used. For instance, in this paper we used five different datasets which contain different numbers of features.</p>
      <p>Step 1.3 - Rule mining: This step is responsible for generating the set of rules by using the Apriori ARM algorithm.</p>
      <p>3.2. Phase II</p>
      <p>Phase II is in charge of applying the rule set to the new versions of the SUT. Like Phase I, Phase II comprises three steps.</p>
      <p>Step 2.1 - State data acquisition: Unlike Step 1.1, this step has only one activity, the test execution activity. In Phase I, we assume that the first version of the SUT is correct; then, we build the test suites using Randoop and the Test Driver. In Step 2.1, the same tests generated in Activity 1.1.1 are used to test the new versions of the SUT.</p>
      <p>Step 2.2 - State data pre-processing: This step performs the same activities as Step 1.2 to prepare the state dataset of the new SUT version.</p>
      <p>Step 2.3 - Apply the rule set to the new SUT version: This step comprises three activities. Activity 2.3.1 is in charge of comparing and selecting the rows that are exclusively different from those of the first version of the SUT. Activity 2.3.2 removes duplicate rows; this is done for optimisation, since two or more identical rows will produce the same results. Activity 2.3.3 (Apply the rule set against the new unique rows) applies the rule set against the new unique rows. A rule has two sides: in the rule (X → Y), X is the left-hand side (LHS) and Y is the right-hand side (RHS). The procedure for applying the rule set against the new unique rows works as follows: (1) pick a rule from the set of rules, (2) take its LHS, (3) select the rows whose values match the LHS, (4) check whether these rows also match the RHS, (5) save the rows which do not match, and (6) repeat steps (1)-(5) for every rule. In the end, we know that whenever the state dataset contains values that violate rules, the new version of the SUT is not correct. Since only those rows in the state dataset of the modified SUT that differ from rows in the state dataset of the unmodified (correct) SUT have the potential to violate rules, it makes sense to only analyse the differing rows.</p>
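      <p>The core loop of the two phases can be sketched in Python. The snippet below is our minimal reconstruction, not the authors' implementation: it mines single-item rules by exhaustive counting (instead of the Efficient-Apriori library) with the thresholds used in this paper (minimum support 0.2, confidence 1), and then flags rows of a new version that match a rule's LHS but contradict its RHS:</p>

```python
# Minimal sketch of the method's core loop (our reconstruction, not the
# authors' implementation): mine single-item association rules from the
# version-1 state rows, then flag version-2 rows that match a rule's LHS
# but contradict its RHS.
from itertools import permutations

def mine_rules(rows, min_support=0.2, min_confidence=1.0):
    n = len(rows)
    items = sorted({(f, v) for row in rows for f, v in row.items()})
    rules = []
    for lhs, rhs in permutations(items, 2):
        if lhs[0] == rhs[0]:
            continue  # LHS and RHS must concern different features
        both = sum(1 for r in rows if r[lhs[0]] == lhs[1] and r[rhs[0]] == rhs[1])
        lhs_count = sum(1 for r in rows if r[lhs[0]] == lhs[1])
        if lhs_count and both / n >= min_support and both / lhs_count >= min_confidence:
            rules.append((lhs, rhs))
    return rules

def violations(rules, rows):
    out = []
    for row in rows:
        for lhs, rhs in rules:
            if row[lhs[0]] == lhs[1] and row[rhs[0]] != rhs[1]:
                out.append((row, (lhs, rhs)))
    return out

v1_states = [{"size": "0", "isEmpty": "True"},  {"size": "1", "isEmpty": "False"},
             {"size": "2", "isEmpty": "False"}, {"size": "0", "isEmpty": "True"}]
rules = mine_rules(v1_states)
# A faulty version reporting isEmpty=True for size 2 violates mined rules:
bad = violations(rules, [{"size": "2", "isEmpty": "True"}])
```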
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>In our experiment, we use the Stack class of the Java Collection framework as the SUT. This class was chosen as its state behaviour is well known and easy to manipulate. The Stack implementation contains seven public methods: push() puts an item on the top of the stack, pop() removes and returns an item from the top, clear() clears the stack, peek() returns an item from the top of the stack without deleting it, isEmpty() tests if the stack is empty, size() returns the size of the stack and, finally, toString() returns a string representation of the object.</p>
      <p>The methods push(), pop() and clear() modify instances of the Stack class when they are called. On the other hand, peek(), isEmpty(), size() and toString() provide information about the Stack object state when they are called and, thus, are state methods. The test driver for the Stack class creates an instance of the Stack class and contains public methods push(), pop() and clear() which call the original push(), pop() and clear() on the instance of the Stack class. Additionally, the driver implements peek(), isEmpty(), size() and toString(), but these methods are private. Furthermore, the driver has a method for writing states to the CSV file. This method is used whenever the push(), pop() or clear() methods are called during test execution. This set-up allows us to run Randoop test suite generation not on the original Stack class but on the driver class, where the public methods are the ones that modify the Stack objects. Thus, only the push(), pop(), and clear() methods are called in the test sequences, and the state data captured by peek(), isEmpty(), size() and toString() is saved to a CSV file.</p>
      <p>The full set of data generated during our experiments as well as all scripts can be found in our GitHub repo (https://github.com/aduquet/Using-Rule-Mining-forAutomatic-Test-Oracle-Generation).</p>
      <p>4.1. Phase I - Rule Set Generation</p>
      <p>Step 1.1 - State data acquisition: Two different reports are the output of this step, the Test Report (pass/failed) and the State Report. The regression testing generates the Test Report; the Test Driver generates the State Report. The State Report provides seven main features: testID, instanceID, size, isEmpty, peek_obj_type, pushInput, and calledMethod. The features testID and instanceID identify the tests generated by Randoop. One test can have multiple instances; thus, instanceID identifies the instances belonging to the same test. The feature size tells the size of the Stack and is a numerical feature. The isEmpty feature contains the values "True" or "False": isEmpty is "True" when the Stack is empty, otherwise it is "False". peek_obj_type tells us the element at the top of the Stack. calledMethod tells us which method was called, i.e., push, pop, or clear.</p>
      <p>Step 1.2 - State data pre-processing: In this step, we sorted the dataset based on testID and instanceID; this is done to find the sequences of Stack sizes and to be able to model those sequences. When the dataset is ordered, it is possible to add more information. For example, it is possible to add characteristics that indicate the previous state. To distinguish them from the original features, we add a _ at the end of the feature name, meaning "previous". The previous-state features are thus named size_, isEmpty_, peek_obj_type_, and calledMethod_. Then, we removed the unnecessary rows, that is, rows with no state information. For instance, the Test Driver writes a row with the value "Constructor" in the feature calledMethod, which indicates that a new Stack was created. This information is not related to the state and thus should be dropped. Finally, we encoded the features as categorical where needed. In the context of our data, we encoded the features size and size_ since they are not categorical features.</p>
      <p>We created five different datasets which contain different numbers of features. The created datasets are named with the prefix FS, which stands for Feature Selection. To distinguish the different datasets, they have been named with the number of characteristics that were used, i.e., FS-X where X is the number of features. The datasets created are the following: FS-3 comprises the features size, isEmpty, and peek_obj_type. FS-4 contains the features used in FS-3 plus calledMethod. FS-8 comprises the features used in FS-4 and their previous values, which are size_, isEmpty_, peek_obj_type_, and calledMethod_. FS-9 uses the FS-8 features and the feature pushInput. Finally, FS-10 uses all the features.</p>
      <p>Step 1.3 - Rule mining: We apply the Apriori algorithm with a minimum support threshold of 0.2 and a confidence threshold of 1 to each dataset. Table 1 provides a comparison between the number of rules and the support and lift values for each dataset. As per Table 1, we can observe that the average support ratio decreases when the number of features used is increased; the average moves closer to the threshold value that was set. Table 1 also shows that the number of rules that can be generated does not have linear behaviour, since it depends on the number of items belonging to each feature. From the lift columns of Table 1, we can observe that the lift ratio increases when the number of features used is increased, which is the opposite of the support ratio. A lower lift ratio means that the probability of occurrence of a given rule is weak, since its LHS and RHS are close to being independent of each other.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Summary of the support and lift metrics of the rule sets extracted from the Stack class data using ARM with different numbers of features (DS = dataset, NR = number of rules).</p></caption>
        <table>
          <thead>
            <tr><th>DS</th><th>NR</th><th>Support Max</th><th>Support Mean</th><th>Support Min</th><th>Lift Max</th><th>Lift Mean</th><th>Lift Min</th></tr>
          </thead>
          <tbody>
            <tr><td>FS-3</td><td>14</td><td>0.403</td><td>0.381</td><td>0.283</td><td>2.539</td><td>2.391</td><td>1.946</td></tr>
            <tr><td>FS-4</td><td>36</td><td>0.337</td><td>0.273</td><td>0.225</td><td>2.539</td><td>2.475</td><td>2.243</td></tr>
            <tr><td>FS-8</td><td>676</td><td>0.305</td><td>0.231</td><td>0.205</td><td>3.808</td><td>3.237</td><td>2.169</td></tr>
            <tr><td>FS-9</td><td>439</td><td>0.269</td><td>0.227</td><td>0.201</td><td>4.392</td><td>3.404</td><td>2.545</td></tr>
            <tr><td>FS-10</td><td>1450</td><td>0.279</td><td>0.225</td><td>0.203</td><td>4.404</td><td>3.492</td><td>2.442</td></tr>
          </tbody>
        </table>
      </table-wrap>
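      <p>The sorting and previous-state augmentation performed in Step 1.2 can be sketched as follows. This is our illustration; the record layout is assumed from the State Report description, not taken verbatim from the authors' scripts:</p>

```python
# Our illustration of the sorting and "aug" functions of Step 1.2: sort the
# raw log by testID and instanceID, then append previous-state features
# suffixed with "_". The record layout is assumed, not the authors' code.
def augment(rows):
    rows = sorted(rows, key=lambda r: (r["testID"], r["instanceID"]))
    out, prev = [], None
    for r in rows:
        cur = dict(r)
        same_instance = prev is not None and \
            (prev["testID"], prev["instanceID"]) == (r["testID"], r["instanceID"])
        for f in ("size", "isEmpty", "calledMethod"):
            # None marks the first record of an instance (no previous state)
            cur[f + "_"] = prev[f] if same_instance else None
        out.append(cur)
        prev = r
    return out

log = [{"testID": 1, "instanceID": 1, "size": 1, "isEmpty": "False", "calledMethod": "push"},
       {"testID": 1, "instanceID": 1, "size": 0, "isEmpty": "True", "calledMethod": "pop"}]
augmented = augment(log)
```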
<p>4.2. Phase II</p>
      <p>Step 2.1 - State data acquisition: In this step, we created new versions of the Stack class by introducing defects into the class under test. In the Stack class we consider three state methods: peek(), size() and isEmpty(). We modify each of them individually and also make all possible combinations of these modifications. Table 2 summarises the modifications made to the SUT and provides the meaning of the terminology used to refer to the different versions of the SUT. These modifications are done on purpose and manually, as we want to understand the potential of ARM to detect and locate faults using the state information of the SUT.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>Summary of the modifications injected into the Stack class (a check mark indicates that the method was modified).</p></caption>
        <table>
          <thead>
            <tr><th>Modification</th><th>peek()</th><th>size()</th><th>isEmpty()</th></tr>
          </thead>
          <tbody>
            <tr><td>Mod1</td><td>✓</td><td>–</td><td>–</td></tr>
            <tr><td>Mod2</td><td>–</td><td>–</td><td>✓</td></tr>
            <tr><td>Mod3</td><td>–</td><td>✓</td><td>–</td></tr>
            <tr><td>Mod4</td><td>✓</td><td>–</td><td>✓</td></tr>
            <tr><td>Mod5</td><td>✓</td><td>✓</td><td>–</td></tr>
            <tr><td>Mod6</td><td>–</td><td>✓</td><td>✓</td></tr>
            <tr><td>Mod7</td><td>✓</td><td>✓</td><td>✓</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Some of the modifications affect the state such that it is quite easy to detect that something is wrong. For example, the isEmpty() method is modified such that it returns isEmpty=True when the size of the Stack class instance is 2 or 0. Thus, we would get an obviously faulty state size="2.0", is_empty="true". The modification of peek() does not return the object on the top of a Stack class instance but the previous one when the stack size is greater than or equal to 2. The modification of size() returns an incorrect size for Stack class instances that contain one or more objects. In the latter two cases, the states would look like correct ones, and the dataset would not contain faulty-looking rows.</p>
      <p>Step 2.2 - State data pre-processing: In this step, the same process as in Step 1.2 was performed on the new data.</p>
      <p>Step 2.3 - Apply the rule set to the new SUT versions: We applied the rule set against the state data of the new versions by following the activities described in Section 3.2.</p>
      <p>Table 3 summarises the unmodified version and the seven modifications of the Stack class regarding the results obtained by the regression tests (column Regression test, pass/failed) and the information obtained during the test execution generated by the Test Driver, i.e., the State Data. We notice from Table 3 that some datasets that correspond to modifications, e.g., Mod2, Mod4, Mod6, and Mod7, do not have the same number of rows as the No-Mod dataset. This is because some tests from the test suite fail during the test execution; in these cases the state will not be written to the CSV file. When no tests from the suite fail, all the states are written to the file, and the number of rows equals that of the No-Mod data, since we execute the same test suite both for the No-Mod version and for the modified versions, e.g., Mod1, Mod3, and Mod5.</p>
      <p>As the Table 3 column "Regression test" shows, only four of the seven modifications have failed tests (Mod2, Mod4, Mod6, and Mod7). We see that the number of failed tests is the same for the Mod2 and Mod4 datasets. Also, the number of failed tests is the same for Mod6 and Mod7. The state data columns show the same behaviour regarding the number of rows generated. The common aspect between all these datasets is the isEmpty method modification. In particular, Mod2 and Mod4 have 1452 failed tests. The Mod4 modification is the combination of the modified methods peek and isEmpty; however, the regression tests report the same failures for both datasets. This fact indicates that the regression tests only spot the isEmpty modification, and that the failed tests of Mod4 belong to the isEmpty modification only.</p>
      <p>As with Mod1, no regression test failed for Mod3, in which the "size" method is modified. The numbers of failed tests in Mod7 and Mod6 differ from those of Mod2 and Mod4. Are the regression tests spotting faults in the other modified methods, i.e., "peek" and "size", when combined in this way? The modification of size() returns an incorrect size for Stack class instances that contain one or more objects, i.e., when the size of the Stack is greater than 0, the modification returns the correct size plus one. For instance, if size = 1, the modified version returns size = 2. That is why Mod6 increases the number of failed tests: the modified size method increases the number of size = 2 occurrences and thereby triggers the modified isEmpty() whenever the reported size of the Stack is 2 or 0. From Table 3 we can ask ourselves why the regression tests fail only for the isEmpty() method. When analysing in detail the report provided by the regression tests, we find that during the execution of the tests there is an exception.</p>
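      <p>As an illustration of the kind of seeded faults described above, consider this small Python stand-in for the Java Stack class. It is our sketch, not the authors' Java code, combining the size() and isEmpty() faults in the style of the Mod6 variant:</p>

```python
# Hedged illustration of the seeded faults described above, using a small
# Python stand-in for the Java Stack class (not the authors' Java code).
class Stack:
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop()
    def size(self):
        return len(self._items)
    def is_empty(self):
        return len(self._items) == 0

class ModifiedStack(Stack):
    """Combines two of the described faults (a Mod6-style variant)."""
    def size(self):
        # size fault: correct size plus one whenever the stack is non-empty
        n = len(self._items)
        return n + 1 if n > 0 else 0
    def is_empty(self):
        # isEmpty fault: wrongly True when the reported size is 2 or 0
        return self.size() in (2, 0)

s = ModifiedStack()
s.push("a")  # true size is 1, but the faulty size() reports 2
faulty_state = (s.size(), s.is_empty())
```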
<p>4.3. RQ1: How effective is the rule mining approach?</p>
      <p>In Table 3, the column "Total # of unique rows" indicates that the number of unique rows increases when the number of features increases. This is because, by adding more features, we increase the heterogeneity in the data. The number of new unique rows refers to those rows that are different from the unmodified SUT version. The idea of using state information is based on the assumption that the relationships in the state data when testing a new version of the SUT must remain unchanged or must not change significantly. Therefore, we can conclude that the rows of the modified versions that differ from the unmodified version are failures. Up to this point, our method is capable of detecting failures without the need to use ARM.</p>
      <p>4.4. RQ2: What information regarding fault localisation can the method offer?</p>
      <p>Regarding the first research question, we concluded that failure detection effectiveness improves by comparing state data even without using ARM. However, neither analysing the traces of failing tests (all of them failed when executing pop()) nor inspecting the information provided in the new unique rows gave any helpful information that would guide fault localisation. Therefore, we set our hopes on a systematic analysis of the rules that are violated by new unique rows.</p>
      <p>As a starting point for fault localisation, it is necessary that at least one rule be violated by at least one new unique row. Table 5 summarises the total number of rules generated per dataset and the number of violated rules among all the rules generated.</p>
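      <p>The "new unique rows" failure check described above can be sketched as a set difference over the state data. This is our illustration with invented rows, not the authors' scripts; it also drops duplicates, as in Activity 2.3.2:</p>

```python
# Our sketch of the "new unique rows" check: any state row of the modified
# version that never occurs in the unmodified version's state data is
# flagged as a failure, before any rules are applied.
def new_unique_rows(unmodified_rows, modified_rows):
    baseline = {tuple(sorted(r.items())) for r in unmodified_rows}
    seen, flagged = set(), []
    for r in modified_rows:
        key = tuple(sorted(r.items()))
        if key not in baseline and key not in seen:
            seen.add(key)
            flagged.append(r)
    return flagged

no_mod = [{"size": "1", "isEmpty": "False"}, {"size": "0", "isEmpty": "True"}]
mod = [{"size": "1", "isEmpty": "False"},
       {"size": "2", "isEmpty": "True"},   # faulty-looking new state
       {"size": "2", "isEmpty": "True"}]   # duplicate, dropped
flagged = new_unique_rows(no_mod, mod)
```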
<p>We are interested in understanding whether the information provided by the resulting ARM model would have confirmed the failure detection that is already given when identifying new rows in the state dataset of a modified SUT. Therefore, we measured the proportion of rules that are violated by new rows. In Table 3, the column "Total # of new unique rows that violate rules" shows this information.</p>
      <p>As Table 5 shows, for FS-8 to FS-10, more than a hundred rules need to be analysed to be able to localise the fault. To reduce and optimise the number of rules, we construct a rule hierarchy. Let us consider the following set of rules: (i) A, B, D → C; (ii) A, B → C; (iii) B, D → C; and (iv) D → C. Rule (i) implicitly contains the same items as rules (ii), (iii) and (iv). Therefore, keeping only rule (i) is sufficient for interpretation purposes.</p>
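      <p>The rule-hierarchy reduction can be sketched as keeping only rules whose item sets are maximal. This is our reconstruction of the idea, using the four rules from the example above with illustrative item labels:</p>

```python
# Our reconstruction of the rule-hierarchy idea: a rule is kept only if the
# set of items it mentions (LHS plus RHS) is not a proper subset of the
# items of another rule; subsumed rules add nothing for interpretation.
def reduce_rules(rules):
    kept = []
    for i, (lhs, rhs) in enumerate(rules):
        items = set(lhs) | set(rhs)
        subsumed = any(
            items != (set(l2) | set(r2)) and items.issubset(set(l2) | set(r2))
            for j, (l2, r2) in enumerate(rules) if j != i
        )
        if not subsumed:
            kept.append((lhs, rhs))
    return kept

# The four rules from the text: (i) A,B,D -> C subsumes (ii), (iii), (iv)
rules = [({"A", "B", "D"}, {"C"}), ({"A", "B"}, {"C"}),
         ({"B", "D"}, {"C"}), ({"D"}, {"C"})]
kept = reduce_rules(rules)
```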
<p>Once the number of rules is reduced, we can analyse them more efficiently. The key to analysing the set of violated rules is to answer the following question: which item is violating the rule? A rule violation occurs when the items in a row match the LHS of the rule but at least one item in the same row does not match the RHS of the same rule. Based on the above, the item that generates the violation of a rule must be present more frequently in the RHS than in the LHS of the set of violated rules. Then, to locate the fault, we quantify the occurrence of the main methods in the LHS with respect to their occurrence in the RHS. If the LHS/RHS ratio approaches zero, the fault has been located. Let us consider the set of rules violated by Mod1 of FS-9 (Table 6); the total number of violated rules is 10. Please note the following nomenclature: A: peek, B: isEmpty, C: size, D: calledMethod, and E: pushInput. Let us compute the LHS/RHS relation per item, i.e., A, B, D and E: A = 3/7 ≈ 0.43, B = 2/2 = 1, D = 6/2 = 3, and E = 6/3 = 2. The value closest to zero is A = 3/7 ≈ 0.43, which corresponds to the peek method. We can conclude that the fault has been located and that it corresponds to the peek method.</p>
<p>Table 7 reports the modifications located using our approach. According to the results shown in the table, six out of seven modifications could be located when using nine or ten features. An interesting aspect that stands out from the results is the localisation of the modifications where the isEmpty method is involved: when using fewer features, e.g., FS-4, the isEmpty modification is always located, which is not the case when using more features.</p>
      <p>Note that our method can only point out one fault per analysis. Therefore, one must apply the analysis repeatedly. Once a fault has been located, it must be removed, the tests must be run again, and if there are still new unique rows in the state dataset, we must try to spot the next fault. Thus, if we had started out with the case where all three methods were modified and had used the models with 8, 9, and 10 features, we would have first located the fault in size() (with all feature sets), then in peek() (with FS-9 and FS-10), and then in isEmpty() (with FS-8).</p>
      <sec id="sec-4-1">
        <title>5. Threats to Validity</title>
        <p>In the context of our proof-of-concept validation, two types of threats to validity are most relevant: threats to internal and external validity.</p>
<p>With regard to internal validity, it is not fully clear in which situations the iteratively applied heuristic that we propose to localise faults is successful, i.e., points to a rule element that carries useful information. It is clear that our method can only capture behaviour of the program that has not been changed and, thus, behaves equally at the level of granularity at which we capture object states. In addition, our positive results might depend on the number and type of faults injected. A more systematic and more comprehensive analysis is required to explore the limitations of our method with regard to both failure exposure and fault localisation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>External validity</title>
        <p>With regard to external validity, our study is rather limited since we only use one well-understood class in our experiments. Thus, the actual scope of effectiveness of our proposed method for object-oriented programs is yet to be determined.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Related Work</title>
      <p>A good overview of methods proposed to automatically generate test oracles can be found in [2] and [4]. As far as we know, ARM has not yet been used for this purpose by other researchers. A field that has recently received attention in the context of test oracle generation is metamorphic testing, as metamorphic relations can be used as test oracles [10].</p>
    </sec>
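To illustrate the last point: a metamorphic relation acts as an oracle by checking a property across related executions instead of predicting a single expected output. A minimal sketch in Python, using the textbook relation sin(x) = sin(π − x) (an illustrative example, not a relation taken from [10]):

```python
import math

def satisfies_sine_relation(f, x, tol=1e-9):
    """Oracle: any correct implementation f of sine must satisfy f(x) == f(pi - x)."""
    return abs(f(x) - f(math.pi - x)) <= tol

# The correct implementation satisfies the relation...
assert satisfies_sine_relation(math.sin, 0.3)
# ...while a faulty small-angle approximation violates it, exposing the
# failure without needing a reference value for sin(0.3) itself.
assert not satisfies_sine_relation(lambda x: x, 0.3)
```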
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was partly funded by the Estonian Center of Excellence in ICT research (EXCITE), the IT Academy Programme for ICT Research Development, the Austrian ministries BMVIT and BMDW, and the Province of Upper Austria under the COMET (Competence Centers for Excellent Technologies) Programme managed by FFG, and by the group grant PRG887 of the Estonian Research Council.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We presented a new ARM-based method for the detection and localization of faults using state data from SUT executions. In our proof-of-concept, we used the Stack class from the Java collection framework as the SUT. To test our method, we generated seven faulty versions of the SUT. The results obtained show that our approach is capable of detecting failures and locating faults. The ARM approach mainly benefits fault localization.</p>
      <p>An advantage of our method is that we tested it not on a SUT with only integer inputs and outputs but on a class under test that can have inputs of any type, so that the state data is also of mixed types. Thus, our method can be generalized to inputs of any type, not only integers. This removes some limitations on the type of SUT that can be analyzed.</p>
      <p>One of the weaknesses of our method is the need for a test driver that we use to extract state data during test suite execution. To generate the test driver for the SUT, we have to identify the state extracting methods manually. For efficiency reasons, it would be better to have an automatic identification of state extracting methods. Unfortunately, there is no simple way to do this. Also, in the case of manual identification, for some classes it may not be clear which methods should be marked as state extracting methods.</p>
      <p>Given the limitations of our study, more experiments have to be conducted to empirically test our proposed method for fault detection and localization. We are currently focusing on extending our experiments in two directions. First, we will add more kinds of fault injections to test the sensitivity of our method with regards to the type of faults in a program. We will systematize this by using mutation. Second, we will apply our proposed method to more classes in the Java collections framework and beyond.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] T. M. Abdellatif, L. F. Capretz, D. Ho, Software analytics to software practice: A systematic literature review, in: 2015 IEEE/ACM 1st Int'l Workshop on Big Data Software Engineering, 2015, pp. 30–36.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, M. Souza, A machine learning approach to generate test oracles, in: Proc. of the XXXII Brazilian Symp. on Softw. Eng., SBES '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 142–151.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] K. Patel, R. M. Hierons, A partial oracle for uniformity statistics, Softw. Quality Journal 27 (2019) 1419–1447.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, S. Yoo, The oracle problem in software testing: A survey, IEEE Trans. on Softw. Eng. 41 (2015) 507–525.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] L. Zhou, S. Yau, Efficient association rule mining among both frequent and infrequent items, Computers and Mathematics with Applications 54 (2007) 737–749.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] S. K. Solanki, J. T. Patel, A survey on association rule mining, in: 2015 Fifth Int'l Conf. on Advanced Computing Communication Technologies, 2015, pp. 212–216.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. of the 20th Int'l Conf. on Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994, pp. 487–499.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] G. McCluskey, Efficient-apriori documentation, 2018.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] A. Bhandari, A. Gupta, D. Das, Improvised apriori algorithm using frequent pattern tree for real time applications in data mining, Procedia Computer Science 46 (2015) 644–651.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] B. Zhang, et al., Automatic discovery and cleansing of numerical metamorphic relations, in: Proc. 35th IEEE International Conference on Software Maintenance and Evolution (ICSME 2019), 2019, pp. 235–245.</mixed-citation></ref>
    </ref-list>
  </back>
</article>