=Paper= {{Paper |id=Vol-2767/paper04 |storemode=property |title=Using Rule Mining for Automatic Test Oracle Generation |pdfUrl=https://ceur-ws.org/Vol-2767/03-QuASoQ-2020.pdf |volume=Vol-2767 |authors=Alejandra Duque-Torres,Dietmar Pfahl,Anastasiia Shalygina,Rudolf Ramler |dblpUrl=https://dblp.org/rec/conf/apsec/Duque-TorresPSR20 }} ==Using Rule Mining for Automatic Test Oracle Generation== https://ceur-ws.org/Vol-2767/03-QuASoQ-2020.pdf
                                               8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020)




Using Rule Mining for Automatic Test Oracle Generation
Alejandra Duque-Torresa , Anastasiia Shalyginaa , Dietmar Pfahla and Rudolf Ramlerb
a
    Institute of Computer Science, University of Tartu, Tartu, Estonia
b
    Software Competence Center Hagenberg GmbH, Hagenberg, Austria


                                             Abstract
                                             Software testing is essential for checking the quality of software but it is also a costly and time-consuming activity. The
                                             mechanism to determine the correct output of the System Under Test (SUT) for a given input space is called test oracle.
                                             The test oracle problem is a known bottleneck in situations where tests are generated automatically and no model of the
                                             correct behaviour of the SUT exists. To overcome this bottleneck, we developed a method which generates test oracles
                                             by comparing information extracted from object state data created during the execution of two subsequent versions of the
                                             SUT. In our initial proof-of-concept, we derive the relevant information in the form of rules by using the Association Rule
                                             Mining (ARM) technique. As a proof-of-concept, we validate our method on the Stack class from a custom version of the
                                             Java Collection classes and discuss the lessons learned from our experiment. The test suite that we use in our experiment to
                                             execute the different SUT versions is automatically generated using Randoop. Other approaches to generate object state data
                                             could be used instead. Our proof-of-concept demonstrates that our method is applicable and that we can detect the presence
                                             of failures that are missed by regression testing alone. Automatic analysis of the set of violated association rules provides
                                             valuable information for localizing faults in the SUT by directly pointing to the faulty method. This kind of information
                                             cannot be found in the execution traces of failing tests.

                                             Keywords
                                             Software testing, test oracle, association rule mining, test oracle automation, machine learning methods in software testing



QuASoQ’20: 8th International Workshop on Quantitative Approaches to Software Quality, December 1, 2020, Singapore
Email: duquet@ut.ee (A. Duque-Torres); anastasiia.shalygina@gmail.com (A. Shalygina); dietmar.pfahl@ut.ee (D. Pfahl); rudolf.ramler@scch.at (R. Ramler)
ORCID: 0000-0002-1133-284X (A. Duque-Torres); 0000-0003-2400-501X (D. Pfahl); 0000-0001-9903-6107 (R. Ramler)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Software testing is an essential activity for quality assurance in the software development process, as it helps ensure the correct operation of the final software [1]. However, software testing has historically been recognised as a time-consuming, tedious, and expensive activity given the size and complexity of large-scale software systems [2]. The cost and time involved in testing can be managed through test automation. Test automation refers to writing special programs that are aimed at detecting defects in the System Under Test (SUT) and to using these programs together with standard software solutions to control the execution of test suites. Test automation can thus improve test efficiency and effectiveness.

Software testing, automated or not, has four major steps: generating test cases, predicting the outcomes of the test cases, executing the SUT with the test cases to obtain the actual outcome, and comparing the expected outcome against the actual outcome to obtain a verdict (pass/fail) [3]. These steps involve two major challenges: finding effective test inputs, i.e., inputs that can reveal faults in the SUT, and determining what the correct output should be after execution of the test cases. The second challenge refers to the test oracle problem. A test oracle is a mechanism that determines the correct output of the SUT for a given input [4]. Although substantial research has been conducted on providing test oracles automatically, apart from model-driven testing, the oracle problem is largely unsolved.

Motivated by the above, we developed a method to derive test oracles based on information contained in object state data produced during the execution of the SUT. Object state data is the set of values of all defined attributes of an object at a certain point in time. We assume that most programs have objects with a mutable state, and that the execution of methods can modify the state of the program. The idea of using state information is rooted in the assumption that relations contained in the state data should remain unchanged when testing a new version of the SUT, as compared to a previous version.

Our proposed method employs Association Rule Mining (ARM). In our context, the purpose of ARM is to mine interesting relations in the state data of the SUT. ARM is an unsupervised machine learning (ML) method [5]. The algorithms used in ARM attempt to find relationships or associations between categorical variables in large transactional data sets [6]. In particular, we were interested in understanding whether the information provided by the resulting model can help to verify the correct operation of new versions of the SUT to which we normally apply existing tests for regression testing. More specifically, we wonder whether we can detect and locate faults in new versions of the SUT. We tackle our goal by answering the following research questions:

RQ1: How effective is the rule mining approach? This research question investigates to what extent ARM is able to detect failures.

RQ2: What information regarding fault localisation can the method offer? This research question explores to what extent the information contained in association rules helps developers locate faults in the code.

In our experiments, we use the Stack class of the Java Collection framework as SUT. This class was chosen as its state behaviour is well known and easy to manipulate.


2. Association Rule Mining

ARM is a rule-based unsupervised ML method that allows discovering relations between variables or items in large databases. ARM has been used in other fields, such as business analysis, medical diagnosis, and census data, to find previously unknown patterns [6]. The ARM process consists of at least two major steps: finding all the frequent itemsets that satisfy a minimum support threshold and generating strong association rules from the derived frequent itemsets by applying a minimum confidence threshold.

A large variety of ARM algorithms exist [7]. In our experiments, we use the Apriori algorithm from the Python3 Efficient-Apriori library [8]. It is well known that the Apriori algorithm is exhaustive, that is, it finds all the rules with the specified support and confidence. In addition, ARM does not require labelled data and is, thus, fully unsupervised. Below we define important terminology regarding ARM:

Itemset: Let I = {X, . . . , Y, Z} be a set of different items in the dataset D. An itemset is a set of k different items.

Association rule: Consider a dataset D having n transactions, each containing a set of items. An association rule exposes a relationship between the items.

Support: The support is the percentage of transactions in the dataset D that contain both itemsets X and Y. The support of an association rule X → Y is: support(X → Y) = support(X ∪ Y) = P(X ∪ Y)

Confidence: The confidence is the percentage of transactions in the dataset D with itemset X that also contain the itemset Y. The confidence is calculated using the conditional probability, which can be expressed in terms of itemset support: confidence(X → Y) = P(Y|X) = support(X ∪ Y)/support(X)

Lift: Lift measures how frequently X and Y occur together relative to what would be expected if they were statistically independent. The lift of a rule X → Y is defined as lift(X → Y) = confidence(X → Y)/support(Y). A lift value of one indicates that X and Y appear together exactly as often as expected under the assumption of conditional independence.


3. Method

Figure 1 presents an overview of the method for rule-mining based test oracle generation. Overall, the proposed method comprises two phases. Phase I is responsible for the ruleset generation, i.e., the rule mining part. The output of this phase is the ruleset. Phase II is in charge of applying the ruleset to the new SUT versions. Thus, the output of this phase can be seen as a fault report for the new SUT versions. Below we describe both phases in detail.

3.1. Phase I - Rule Set Generation

Phase I starts with the extraction of the state data. Then, feature selection and encoding are performed so that all the features become appropriate for rule mining. The features are encoded as categorical where needed. When all the required operations on the raw data have been performed, the state data from the first version of the SUT is obtained, and the rule mining process starts. After that, one gets a set of rules. The whole process can be split into three main steps, which are detailed below:

Step 1.1 - State data acquisition: This step comprises two activities:

Activity 1.1.1 (Produce tests) is responsible for the generation of tests. To perform this activity, we use Randoop to generate unit tests automatically. Randoop is a popular random unit test generator for Java¹. Randoop creates method sequences incrementally by randomly selecting a method call to apply and using arguments from previously constructed sequences. When a new sequence has been created, it is executed and then checked against contracts. Two important parameters that we use in our experiments are the test limit and the random seed. The test limit parameter limits the number of tests in the test suite. The random seed parameter allows us to produce multiple different test suites, since Randoop is deterministic by default. Therefore, these two parameters allow us to generate many test suites of different sizes containing various test cases.

1 https://randoop.github.io/randoop/

Activity 1.1.2 (Execute the test suite) is responsible for state tracking and saving raw data. To track the states of the SUT while running the test suite and save them to a text file for later analysis, we built a special program that tracks and saves the state information of the SUT. We call this program the Test Driver. The idea behind the Test Driver is that the methods of the SUT




can be logically divided into two categories: methods that change the state of the class instance of the SUT, and methods that reflect the state. The latter are so-called state methods. The test driver tracks and stores the information returned by the state methods when they are called immediately after the test case execution. The information is saved in a CSV file.

Figure 1: Overview of the method for rule-mining based test oracle generation

Step 1.2 - State data pre-processing: This data pre-processing step is made up of three activities:

Activity 1.2.1 (Sorting, augmenting and cleaning) is responsible for ensuring that the data is correct, consistent and usable. This activity has three main functions: sort, aug, and clean. The sort function sorts the dataset based on TestId and InstanceId; this is done to find interesting sequences in the data and to be able to model those sequences. When the dataset is ordered, it is possible to add more information, e.g., characteristics that indicate the previous state. This is done through the aug function. The clean function removes rows that are not needed or inconsistent.

Activity 1.2.2 (Encoding) is in charge of preparing the data according to the requirements of the rule mining algorithm. For example, Apriori [9], which is the algorithm used in this paper, works only with categorical features. Thus, Activity 1.2.2 categorises and generalises the numerical inputs into string representations.

Activity 1.2.3 (Feature selection) is an extra activity which allows exploring how the method performs when different features are used. For instance, in this paper we used five different datasets which contain different numbers of features.

Step 1.3 - Rule mining: This step is responsible for generating the set of rules by using the Apriori ARM algorithm.

3.2. Phase II

Phase II is in charge of applying the ruleset to the new versions. Like Phase I, Phase II comprises three steps.

Step 2.1 - State data acquisition: Unlike Step 1.1, this step has only one activity, the test execution activity. In Phase I, we assume that the first version of the SUT is correct; then we build test suites using Randoop and the test driver. In Step 2.1, the same tests generated in Activity 1.1.1 are used to test the new versions of the SUT.

Step 2.2 - State data pre-processing: This step performs the same activities as Step 1.2 to prepare the state dataset of the new SUT version.

Step 2.3 - Apply the ruleset to the new SUT version: This step comprises three activities. Activity 2.3.1 is in charge of comparing and selecting the rows that differ from those of the first version of the SUT. Activity 2.3.2 removes duplicate rows. This is done for optimisation: if two or more rows are the same, they will produce the same results.

Activity 2.3.3 (Apply the ruleset against the new unique rows) is responsible for applying the ruleset against the new unique rows. A rule has two sides, e.g., in the rule (X → Y), X is the left-hand side (LHS) of the rule, and Y is the right-hand side (RHS). The procedure for applying the ruleset against the new unique rows works as follows: i) pick a rule from the set of rules, ii) take its LHS, iii) select the rows whose values match the LHS, iv) check whether these values also match the RHS, v) save the values which do not match, and vi) repeat steps i-v for every rule. In the end, we know that whenever the state dataset contains values that violate rules, the new version of the SUT is not correct. Since only those rows in the state dataset of the modified SUT that differ from rows in the state dataset of the unmodified (correct) SUT have the potential to violate rules, it makes sense to analyze only the differing rows.
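The rule-checking procedure of Activity 2.3.3 (steps i-vi) can be sketched in a few lines of Python. This is a minimal illustration only: the rule format, feature items, and example rows below are invented stand-ins, not the authors' actual ruleset or data layout.

```python
# Hedged sketch of Activity 2.3.3: check each (unique, changed) state row of
# the new SUT version against every mined rule X -> Y. Rows that match a
# rule's LHS but not its RHS are recorded as violations.
ruleset = [
    # (LHS items, RHS items) - e.g. "if isEmpty is True then size is zero"
    ({"isEmpty=True"}, {"size=zero"}),
    ({"calledMethod=push"}, {"isEmpty=False"}),
]

new_rows = [
    {"isEmpty=True", "size=zero", "calledMethod=clear"},  # consistent row
    {"isEmpty=True", "size=small", "calledMethod=push"},  # violates both rules
]

def apply_ruleset(ruleset, rows):
    """Steps i-vi from Section 3.2: for each rule, select the rows matching
    the LHS and save those whose values do not also match the RHS."""
    violations = []
    for lhs, rhs in ruleset:                        # i) pick a rule
        for row in rows:                            # ii-iii) rows matching LHS
            if lhs <= row and not rhs <= row:       # iv) check against the RHS
                violations.append((lhs, rhs, row))  # v) save the mismatches
    return violations                               # any violation flags a fault

for lhs, rhs, row in apply_ruleset(ruleset, new_rows):
    print(f"rule {sorted(lhs)} -> {sorted(rhs)} violated by row {sorted(row)}")
```

Any non-empty result marks the new SUT version as faulty, and the violated rule itself points at the state method involved.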




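The metrics defined in Section 2 and the mining of Step 1.3 can likewise be sketched in plain Python. The toy transactions below are invented, loosely modelled on the paper's encoded Stack features; the paper itself uses the Efficient-Apriori library rather than this hand-rolled enumeration, which for brevity only produces single-item rules.

```python
from itertools import combinations

# Toy encoded "state data": each transaction is a set of feature=value items
# (illustrative names, not the authors' exact encoding).
transactions = [
    {"size=zero", "isEmpty=True", "calledMethod=clear"},
    {"size=zero", "isEmpty=True", "calledMethod=pop"},
    {"size=small", "isEmpty=False", "calledMethod=push"},
    {"size=small", "isEmpty=False", "calledMethod=push"},
    {"size=zero", "isEmpty=True", "calledMethod=clear"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def mine_rules(min_support=0.2, min_confidence=1.0):
    """Enumerate single-antecedent rules X -> Y and keep those meeting the
    support and confidence thresholds used in Step 1.3 (0.2 and 1)."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        for lhs, rhs in ((x, y), (y, x)):
            sup = support({lhs, rhs})        # support(X -> Y) = P(X u Y)
            if sup < min_support:
                continue
            conf = sup / support({lhs})      # confidence = support(X u Y)/support(X)
            lift = conf / support({rhs})     # lift = confidence / support(Y)
            if conf >= min_confidence:
                rules.append((lhs, rhs, round(sup, 3), round(conf, 3), round(lift, 3)))
    return rules

for lhs, rhs, sup, conf, lift in mine_rules():
    print(f"{lhs} -> {rhs}  support={sup} confidence={conf} lift={lift}")
```

On this toy data the sketch recovers, for example, the rule isEmpty=True -> size=zero with confidence 1, mirroring the kind of state invariant the method mines from the Stack class.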




4. Results

The full set of data generated during our experiments as well as all scripts can be found in our GitHub repo². In our experiment, we use the Stack class of the Java Collection framework as SUT. This class was chosen as its state behaviour is well known and easy to manipulate. The Stack implementation contains seven public methods: push() puts an item on the top of the stack, pop() removes and returns the item at the top, clear() clears the stack, peek() returns the item at the top of the stack without deleting it, isEmpty() tests whether the stack is empty, size() returns the size of the stack and, finally, toString() returns a string representation of the object.

2 https://github.com/aduquet/Using-Rule-Mining-for-Automatic-Test-Oracle-Generation

The methods push(), pop() and clear() modify instances of the Stack class when they are called. On the other hand, peek(), isEmpty(), size() and toString() provide information about the Stack object state when they are called and, thus, peek(), isEmpty(), size() and toString() are state methods. The test driver for the Stack class creates an instance of the Stack class and contains public methods push(), pop() and clear() which call the original push(), pop() and clear() on that instance. Additionally, the driver implements peek(), isEmpty(), size() and toString(), but these methods are private. Furthermore, the driver has a method for writing states to the CSV file. This method is used whenever the push(), pop() or clear() methods are called during test execution. This set-up allows us to run Randoop test suite generation not on the original Stack class but on the driver class, where the public methods are the ones that modify the Stack objects. Thus, only the push(), pop(), and clear() methods are called in the test sequences, and the state data captured by peek(), isEmpty(), size() and toString() is saved to a CSV file.

4.1. Phase I - Rule Set Generation

Step 1.1 - State data acquisition: Two different reports are the output of this step: the Test Report (pass/failed) and the State Report. The regression testing generates the Test Report; the Test Driver generates the State Report. The State Report provides seven main features: testID, instanceID, size, isEmpty, peek_obj_type, pushInput, and calledMethod. The features testID and instanceID identify the tests generated by Randoop. One test can have multiple instances; thus, instanceID identifies the instances belonging to the same test. The feature size tells the size of the Stack and is a numerical feature. The isEmpty feature contains the values "True" or "False": isEmpty is "True" when the Stack is empty, otherwise it is "False". peek_obj_type tells us the element at the top of the Stack. calledMethod tells us which method was called, i.e., push, pop, or clear.

Step 1.2 - State data pre-processing: In this step, we sorted the dataset based on testID and instanceID; this is done to find the sequences of Stack sizes and to be able to model those sequences. When the dataset is ordered, it is possible to add more information. For example, it is possible to add characteristics that indicate the previous state. To distinguish them from the original features, we add _p ("previous") at the end of the feature name. The previous states are thus named size_p, isEmpty_p, peek_obj_type_p, and calledMethod_p. Then, we removed the unnecessary rows, that is, rows with no state information. For instance, the Test Driver writes a row with the value "Constructor" in the feature calledMethod, which indicates that a new Stack was created. This information is not related to the state and thus should be dropped. Finally, we encoded the features as categorical where needed. In the context of our data, we encoded the features size and size_p since they are not categorical features.

We create five different datasets which contain different numbers of features. The created datasets are named with the prefix FS, which stands for Feature Selection. To distinguish the different datasets, they have been named with the number of features that were used, i.e., FS-X where X is the number of features. The datasets created are the following: FS-3 comprises the features size, isEmpty, and peek_obj_type. FS-4 contains the features used in FS-3 plus calledMethod. FS-8 comprises the features used in FS-4 and their previous values, i.e., size_p, isEmpty_p, peek_obj_type_p, and calledMethod_p. FS-9 uses the FS-8 features and the feature pushInput. Finally, FS-10 uses all the features.

Step 1.3 - Rule mining: We apply the Apriori algorithm with minimal support and maximum confidence thresholds, i.e., 0.2 and 1, to each dataset. Table 1 provides a comparison between the number of rules, support and lift values for each dataset. As per Table 1, we can observe that the average support ratio decreases when the number of features used is increased. The average is close to the threshold value that was set. Table 1 also shows that the number of rules that can be generated does not behave linearly, since it depends on the number of items belonging to each feature. From the lift column of Table 1, we can observe that the lift ratio increases when the number of features used is increased. This is the opposite of the support ratio. A lower lift ratio means that the probability of occurrence of a given rule is weak, since its LHS and RHS are close to being independent of each other.

Table 1
Summary of the support and lift metrics of the rulesets extracted from the Stack class data using ARM with different numbers of features

  DS†     NR∀          Support                     Lift
                Max     Mean    Min       Max     Mean    Min
  FS-3      14  0.403   0.381   0.283     2.539   2.391   1.946
  FS-4      36  0.337   0.273   0.225     2.539   2.475   2.243
  FS-8     439  0.269   0.227   0.205     3.808   3.404   2.545
  FS-9     676  0.305   0.231   0.201     4.392   3.237   2.169
  FS-10   1450  0.279   0.225   0.203     4.404   3.492   2.442
  † Data Set, ∀ Number of rules

4.2. Phase II

Step 2.1 - State data pre-processing: In this step, we created new versions of the Stack class by introducing defects into the class under test. In the Stack class we use three state methods: peek(), size() and isEmpty(). We modify each of them individually and also make all possible combinations of these modifications. Table 2 summarises the main modifications made to the SUT and defines the terminology used to refer to the different versions of the SUT. These modifications are made deliberately and manually, as we want to understand the potential of ARM to detect and locate faults using the state information of the SUT.

Table 2
Summary of the modifications injected into the Stack class

  Modification    peek()    isEmpty()    size()
  Mod1-p            ✓           –           –
  Mod2-e            –           ✓           –
  Mod3-s            –           –           ✓
  Mod4-pe           ✓           ✓           –
  Mod5-ps           ✓           –           ✓
  Mod6-es           –           ✓           ✓
  Mod7-pes          ✓           ✓           ✓

Some of the modifications affect the state such that

4.3. RQ1: How effective is the rule mining approach?

Table 3 summarises the unmodified version and the seven modifications of the Stack class with respect to the results obtained by the regression tests (columns "Regression test", pass/failed) and the information obtained during the test execution generated by the Test Driver, i.e., the state data. We notice from Table 3 that some datasets that correspond to modifications, e.g., Mod2-e, Mod4-pe, Mod6-es, and Mod7-pes, do not have the same number of rows as the Not-Mod dataset. This is because some tests from the test suite fail during the test execution; in these cases the state will not be written to the CSV file. When no tests from the suite fail, all the states are written to the file. Therefore, the number of rows will be equal to the Not-Mod data, since we execute the same test suite both for the Not-Mod and for the modified data extraction, e.g., Mod1-p, Mod3-s, and Mod5-ps.

As the "Regression test" column of Table 3 shows, only four of the seven modifications have failed tests (Mod2-e, Mod4-pe, Mod6-es, and Mod7-pes). We see that the number of failed tests is the same for the Mod2-e and Mod4-pe datasets. Also, the number of failed tests is the same for Mod6-es and Mod7-pes. The state data columns show the same behaviour, but with respect to the number of rows generated. The common aspect between all these datasets is the isEmpty method modification. In particular, Mod2-e and Mod4-pe have 1452 failed tests. The Mod4-pe modification is the combination of the modified methods peek and isEmpty. However, it seems that the regression tests spot only the fault related to isEmpty. Furthermore, for Mod1-p, which is the modification of peek, no regression tests failed. This confirms that the failed regression tests of Mod4-pe belong to the isEmpty modification only.

Same as Mod1-p, no regression test failed in
it will be quite easy to detect that something is wrong.         Mod3βˆ’π‘  . In this modification "size" method is modified.
For example, isEmpty() method is modified such that it           The number of tests failed in Mod7βˆ’π‘π‘’π‘  and Mod6βˆ’π‘’π‘ 
returns isEmpty=True in the cases when size of the Stack         are different from Mod2βˆ’π‘’ and Mod4βˆ’π‘π‘’ . Are the regres-
class instance is 2 or 0. Thus, we would get an obviously        sion tests spotting faults in the other modified methods, i.e.,
faulty state size="2.0", is_empty="true". The modification       "peek" and "size" when combined in this way? The modi-
of peek() will not return the object on the top of a Stack       fication of size() returns the incorrect size for the Stack
class instance but the previous one in the cases when            class instances that contain one or more objects, i.e., when
stack size is greater or equal than 2. Modification of size()    the size of the Stack is greater than 0, the modification
would return incorrect size for the Stack class instances        would return the correct size plus one. For instance, if
that contain one or more objects. Thus, the states would         the 𝑠𝑖𝑧𝑒 = 1, the modified version will return 𝑠𝑖𝑧𝑒 = 2.
look like the correct ones, and the dataset would not            That is why Mod6βˆ’π‘’π‘  increase the number of failed tests
contain faulty-looking rows.                                     because the size modified method increases the number
   Step 2.2 - State data pre-processing: In this step, the       of 𝑠𝑖𝑧𝑒 = 2, then triggers the modified isEmpty() when
same process of Step 1.2 was performed on the new data.          the size of the Stack class is 2 or 0. From the Table 3
   Step 2.3 - Apply the rule-set to the new SUT ver-             we can ask ourself, why the regression tests are failing
sions: We applied the ruleset against the state data of          only in the isEmpty() method? When analysing in detail
new versions by following the activities described in Sec-       the report provided by regression test, we can find that
tion 3.2.                                                        during the execution of the test, there is an exception




Table 3
Summary of the unmodified version and the seven modifications of the Stack class regarding the results obtained by the regression test, and the information obtained during the test execution generated by the test driver

               Regression test    Total #   Total # of          Total # of UR†                  Total # of new UR†          Total # of new UR† that violate rules
  Dataset      Pass     Failed    of rows   new rows    FS-3  FS-4  FS-8  FS-9  FS-10    FS-3  FS-4  FS-8  FS-9  FS-10    FS-3  FS-4  FS-8  FS-9  FS-10
  No-Mod       2037        0       71604        0        47    76   320   320   320        0     0     0     0      0       0     0     0     0      0
  Mod1-p       2037        0       71604     11882       45    73   232   308   398       28    43    97   166    189       0     0     0    38     38
  Mod2-e        585      1452      37786      6390       25    33   108   108   108        1     2    20    20     20       1     2    20    20     20
  Mod3-s       2037        0       71604     43405       47    76   320   320   320       46    74   237   237    237       0     0    19    19     19
  Mod4-pe       585      1452      37786      7792       26    33   106   119   133       14    20    46    59     66       1     2    31    42     49
  Mod5-ps      2037        0       71604     43405       44    73   232   308   398       44    71   156   232    279       0     0    19    50     57
  Mod6-es        25      2012       7160      5497        5     9    17    17    17        5     7     8     8      8       1     2     6     6      6
  Mod7-pes       25      2012       7160      5497        3     7    14    16    17        3     5     6     8      8       1     2     4     7      7
  † Unique rows




that checks whether the Stack is empty or not. The modification of the isEmpty() method creates the situation in which this exception is thrown, so the test does not finish its execution and is reported as a failure.

Table 4
Percentage of new unique rows (among all new unique rows) that violate at least one rule

               % of new unique rows that violate rules
  Dataset     FS-3     FS-4     FS-8     FS-9     FS-10
  No-Mod        -        -        -        -         -
  Mod1-p        0        0        0      22.89    20.11
  Mod2-e       100      100      100      100      100
  Mod3-s        0        0       8.02     8.02     8.02
  Mod4-pe      7.14     10      67.39    71.19    74.24
  Mod5-ps       0        0      12.18    21.55    20.43
  Mod6-es      20       28.57    75       75       75
  Mod7-pes    33.33     40      66.67    87.50    87.5

   In Table 3, the column "Total # of UR" (unique rows) indicates that the number of unique rows increases when the number of features increases. This is because, by adding more features, we increase the heterogeneity of the data. The number of new unique rows refers to those rows that differ from the unmodified SUT version. The idea of using state information is based on the assumption that the relationships in the state data must remain unchanged, or must not change significantly, when testing a new version of the SUT. Therefore, we can conclude that the rows of the modified versions that differ from the unmodified version indicate failures. Up to this point, our method is capable of detecting failures without the need to use ARM.
   We are interested in understanding whether the information provided by the resulting ARM model would have confirmed the failure detection that is already given by identifying new rows in the state dataset of a modified SUT. Therefore, we measured the proportion of rules that are violated by new rows. In Table 3, the column "Total # of new UR that violate rules" shows the total number of new unique rows that violate at least one rule. It turns out that only for the isEmpty() modification do all new rows always also trigger a rule violation. This is less often the case when other methods are modified (either individually or in combination). Only when the largest feature sets (FS-9 and FS-10) are used is there always at least one new row in the dataset that also violates at least one rule. This result is weaker than what we can already see by just looking at new rows but, recalling that none of the regression tests failed when only the methods size() and peek() were modified, rule violations seem to occur in a more balanced way. This lets us hope that we might be able to exploit this when answering research question RQ2.

4.4. RQ2: What information regarding fault localisation can the method offer?

Regarding the first research question, we concluded that failure detection effectiveness improves by comparing state data even without using ARM. However, neither analysing the traces of failing tests (all of them failed when executing pop()) nor inspecting the information provided by the new unique rows offered any helpful information that would guide fault localization. Therefore, we set our hopes on a systematic analysis of the rules that are violated by new unique rows.
   As a starting point for fault localization, it is necessary that at least one rule be violated by at least one new unique row. Table 5 summarises the total number of rules generated per dataset and the number of violated rules among all the rules generated. As Table 5 shows, from FS-8 to FS-10, more than a hundred rules need to be analysed to be able to localise the fault. To reduce and optimise the number of rules, we construct a rule hierarchy. Let us consider the following set of rules: (i) A, B, C → D, (ii) A, B → D, (iii) C, B → D, and (iv) C → D. Rule (i) implicitly contains the same items as rules (ii), (iii), and (iv). Therefore, having only rule (i) is sufficient for




Table 5
Total number of rules generated per each dataset, and the number of rules violated among all the rules generated

                     Total # of rules violated (the row matches the LHS but NOT the RHS)
  DS      RG†     Mod1-p      Mod2-e      Mod3-s      Mod4-pe     Mod5-ps     Mod6-es     Mod7-pes
                 RV⊥  OSR⋆   RV⊥  OSR⋆   RV⊥  OSR⋆   RV⊥  OSR⋆   RV⊥  OSR⋆   RV⊥  OSR⋆   RV⊥  OSR⋆
  FS-3     14     0     0     3     -     0     0     3     -     0     0     3     -     3     -
  FS-4     36     0     0     4     -     0     0     4     -     0     0     4     -     4     -
  FS-8    439     0     0    73    16    95    95    73    56    95    95   113    98   113    98
  FS-9    676    12    10   142    61    95    95   224   191   307   297   297   269   297   272
  FS-10  1450    12    10   224    65   285   285   236   191   297   397   314   269   325   272
  † Rules generated, ⊥ Number of rules violated, ⋆ Optimal set of rules




interpretation purposes.
   Once the number of rules is reduced, we can analyse them more efficiently. The key to being able to analyse the set of violated rules is to answer the following question: which item is violating the rule? A rule violation occurs when some items in a row match the LHS of a rule, but at least one item in the same row does not match the RHS of the same rule. Based on the above, the item that generates the violations must be present more frequently in the RHS than in the LHS of the set of violated rules. Then, to locate the fault, we quantify the occurrence of the main methods in the LHS with respect to their occurrence in the RHS. If the LHS/RHS ratio approaches zero, the fault has been located. Let us consider the set of rules violated by Mod1-p with FS-9, shown in Table 6; the total number of violated rules is 10. Please note the following nomenclature: A: peek, B: isEmpty, C: size, D: calledMethod, and E: inputPush. Let us compute the ratio LHS/RHS for each item, i.e., A, B, D, and E: A = 3/7 = 0.428, B = 2/2 = 1, D = 6/2 = 3, and E = 6/3 = 2. The value closest to zero is A = 3/7 = 0.428, which corresponds to the peek method. We can conclude that the fault has been located and that it corresponds to the peek method.

Table 6
Set of violated rules of Mod1-p with FS-9

  LHS                                 RHS
  D = push, E = objType               A = objType
  E = objType                         D = push, A = objType
  B = True, E = objType               A = objType
  D = push                            B = True, A = objType
  D = push, B = False, E = objType    A = objType
  E = objType                         D = push, B = False, A = objType
  D = push, A = objType               E = objType
  D = push, B = False, A = objType    E = objType
  D = push, A = objType               E = objType, B = False
  E = objType                         A = objType

   Table 7 reports the modifications located using our approach. According to the results shown in the table, six out of seven modifications could be located when using nine or ten features. An interesting aspect that stands out from the results is the localization of the modifications in which the isEmpty method is involved. When using fewer features, e.g., FS-4, the isEmpty modification is always located, which is not the case when using more features.

Table 7
Fault localised using the ARM approach

  Modification        Modification localised by ARM
  "Bug"        FS-3   FS-4   FS-8     FS-9        FS-10
  Mod1-p        ✗      ✗      ✗        p           p
  Mod2-e        -      e      e     pushInput   pushInput
  Mod3-s        ✗      ✗      s        s           s
  Mod4-pe       ✗      e    method     p           p
  Mod5-ps       ✗      ✗      s        s           s
  Mod6-es       -      e      s        s           s
  Mod7-pes      -      e      s        s           s

   Note that our method can only point out one fault per analysis. Therefore, one must apply the analysis repeatedly. Once a fault has been located, it must be removed, the tests must be run again, and if there are still new unique rows in the state dataset, we must try to spot the next fault. Thus, if we had started out with the case where all three methods were modified, and we had used models with 8, 9, and 10 features, we would have first located the fault in size() (with all feature sets), then in peek() (with FS-9 and FS-10), and then in isEmpty() (with FS-8).

5. Threats to Validity

In the context of our proof-of-concept validation, two types of threats to validity are most relevant: threats to internal and to external validity.
   With regards to internal validity, it is not fully clear in which situations the iteratively applied heuristic that we propose to localize faults is successful, i.e., points to a rule element that carries useful information. It is clear that our method can only capture behaviour of the program that has not been changed and thus behaves identically at the level of granularity at which we capture object states. In addition, our positive results might depend on the number and type of faults injected. A more systematic and more comprehensive analysis is required to explore the limitations of our method with regards to both failure exposure and fault localization.
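The iteratively applied localisation heuristic in question can be sketched in a few lines. The encoding below is our own illustration, not the implementation used in the study: each rule violated by Mod1-p with FS-9 (Table 6) is reduced to the letters of the items it mentions (A: peek, B: isEmpty, D: calledMethod, E: inputPush), and the LHS/RHS occurrence ratio is computed per item.

```python
from collections import Counter

# Violated rules for Mod1-p with FS-9 (Table 6), reduced to item letters.
# A: peek, B: isEmpty, D: calledMethod, E: inputPush.
VIOLATED_RULES = [
    ({"D", "E"}, {"A"}),
    ({"E"}, {"D", "A"}),
    ({"B", "E"}, {"A"}),
    ({"D"}, {"B", "A"}),
    ({"D", "B", "E"}, {"A"}),
    ({"E"}, {"D", "B", "A"}),
    ({"D", "A"}, {"E"}),
    ({"D", "B", "A"}, {"E"}),
    ({"D", "A"}, {"E", "B"}),
    ({"E"}, {"A"}),
]

def locate_fault(rules):
    """Return the item with the smallest LHS/RHS occurrence ratio.

    The item that generates the violations appears more often in the
    RHS than in the LHS of the violated rules, so the smallest ratio
    points to the faulty method.
    """
    lhs_counts, rhs_counts = Counter(), Counter()
    for lhs, rhs in rules:
        lhs_counts.update(lhs)
        rhs_counts.update(rhs)
    ratios = {item: lhs_counts[item] / rhs_counts[item]
              for item in rhs_counts if item in lhs_counts}
    return min(ratios, key=ratios.get), ratios

item, ratios = locate_fault(VIOLATED_RULES)
# item == "A": the smallest ratio (3/7, about 0.43) points to peek.
```

For these rules, the smallest ratio is obtained for item A (3/7 ≈ 0.43), i.e., the heuristic points to the peek method, in line with the worked example above.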
   With regards to external validity, our study is rather limited, since we only use one well-understood class in our experiments. Thus, the actual scope of effectiveness of our proposed method for object-oriented programs is yet to be determined.

6. Related Work

A good overview of methods proposed to automatically generate test oracles can be found in [2] and [4]. As far as we know, ARM has not yet been used for this purpose by other researchers. A field that has recently received attention in the context of test oracle generation is metamorphic testing, as metamorphic relations can be used as test oracles [10].

7. Conclusion

We presented a new ARM-based method for the detection and localization of faults using state data from SUT executions. In our proof-of-concept, we used the Stack class from the Java collection framework as the SUT. To test our method, we generated seven faulty versions of the SUT. The results obtained show that our approach is capable of detecting failures and locating faults. The ARM approach mainly benefits fault localization.
   An advantage of our method is that it was tested not on a SUT with only integer inputs and outputs but on a class under test that can have inputs of any type, so that the state data is also of mixed data types. Thus, our method can be generalized to inputs of any type, not only integers. This removes some limitations on the type of SUT that can be analyzed.
   One of the weaknesses of our method is the need for a test driver, which we use to extract state data during the test suite execution. To generate the test driver for the SUT, we have to identify the state-extracting methods manually. For efficiency reasons, it would be better to have an automatic identification of state-extracting methods. Unfortunately, there is no simple way to do this. Also, in the case of manual identification, for some classes it may not be clear which methods should be marked as state-extracting methods.
   Given the limitations of our study, more experiments have to be conducted to empirically test our proposed method for fault detection and localization. We are currently focusing on extending our experiments in two directions. First, we will add more kinds of fault injections to test the sensitivity of our method with regards to the type of faults in a program. We will systematize this by using mutation. Second, we will apply our proposed method to more classes in the Java collections framework and beyond.

Acknowledgments

This research was partly funded by the Estonian Center of Excellence in ICT research (EXCITE), the IT Academy Programme for ICT Research Development, the Austrian ministries BMVIT and BMDW, and the Province of Upper Austria under the COMET (Competence Centers for Excellent Technologies) Programme managed by FFG, and by the group grant PRG887 of the Estonian Research Council.

References

 [1] T. M. Abdellatif, L. F. Capretz, D. Ho, Software analytics to software practice: A systematic literature review, in: 2015 IEEE/ACM 1st Int'l Workshop on Big Data Software Engineering, 2015, pp. 30–36.
 [2] R. Braga, P. S. Neto, R. Rabêlo, J. Santiago, M. Souza, A machine learning approach to generate test oracles, in: Proc. of the XXXII Brazilian Symp. on Softw. Eng., SBES '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 142–151.
 [3] K. Patel, R. M. Hierons, A partial oracle for uniformity statistics, Softw. Quality Journal 27 (2019) 1419–1447.
 [4] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, S. Yoo, The oracle problem in software testing: A survey, IEEE Trans. on Softw. Eng. 41 (2015) 507–525.
 [5] L. Zhou, S. Yau, Efficient association rule mining among both frequent and infrequent items, Computers and Mathematics with Applications 54 (2007) 737–749.
 [6] S. K. Solanki, J. T. Patel, A survey on association rule mining, in: 2015 Fifth Int'l Conf. on Advanced Computing Communication Technologies, 2015, pp. 212–216.
 [7] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proc. of the 20th Int'l Conf. on Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994, pp. 487–499.
 [8] G. McCluskey, Efficient-apriori documentation, 2018.
 [9] A. Bhandari, A. Gupta, D. Das, Improvised apriori algorithm using frequent pattern tree for real time applications in data mining, Procedia Computer Science 46 (2015) 644–651.
[10] B. Zhang, et al., Automatic discovery and cleansing of numerical metamorphic relations, in: Proc. 35th IEEE International Conference on Software Maintenance and Evolution (ICSME 2019), 2019, pp. 235–245.



