Evaluating Fault Localization Techniques with Bug
Signatures and Joined Predicates
Roman Milewski1 , Simon Heiden1 and Lars Grunske1
1 Software Engineering, Humboldt-Universität zu Berlin, Germany


Abstract
Predicate-based fault localization techniques have an advantage over other approaches by not only
determining the location of a fault but also potentially giving the developer additional information to
understand it. In this paper, we evaluate the accuracy of predicate-based bug signatures based on the
Defects4J benchmark. Additionally, we try to improve the predicate-based approach by extending it
with joined predicates, a technique for combining multiple predicates, to extract even more information.
To validate our results, we compare our approaches with established spectrum-based fault localization
methods.

Keywords
SBFL, predicate-based fault localization, bug signatures




1. Introduction
Searching for bugs, debugging, and fixing bugs are important parts of the software development
cycle. A lot of time and effort is spent on minimizing the number of bugs in a program, but this
is usually a costly and difficult process [1].
   While some bugs can be found by static analysis, e.g., when a compiler fails to compile a
program, many bugs only manifest under special conditions or are simply not detectable at all by
compilers or static analysis [2]. For those bugs, the only detection approach is dynamic analysis,
i.e., extracting information about the program during execution. Basic methods for finding bugs
include the usage of print statements to extract information about the state of variables or
the path taken through the program. More advanced methods include using a debugger or
slicing [3]. All these methods give the programmer more information about the program states
that lead to the bug, or reduce the amount of code that needs to be examined, thus helping
the programmer to identify the faulty code.
   There have been multiple approaches to automate parts of the fault localization process.
A popular approach, shared by most techniques, is collecting data during correct and faulty
executions of the program code and then analyzing it to produce information about the possible
bug locations [4]. Tarantula, one of the first automated fault localization techniques, compares
the executed lines in failed and successful program executions and assigns lines executed by
more failed runs a higher score [5]. There exist more advanced statistics-based approaches

29th International Workshop on Concurrency, Specification and Programming (CS&P’21)
milewskr@hu-berlin.de (R. Milewski); heiden@informatik.hu-berlin.de (S. Heiden);
grunske@informatik.hu-berlin.de (L. Grunske)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
that use, e.g., path spectra [6, 7], def-use pairs [8, 9], information-flow pairs [9], dependence
chains [10], (potential) invariants [11, 12, 13, 14, 15, 16], predicates [17, 18, 19, 20], predicate
pairs [21] or method call sequences [22]. Many of these techniques provide the user with more
information than only the potential locations of the bug and, thus, provide more context to help
the user understand the cause of the bug.
   One technique leading the way in providing this context is statistical debugging, introduced
by Liblit et al. [17], which isolates bugs by profiling several runs of the buggy program and then
using statistical analysis to pinpoint the likely cause of failure. It uses predicates to provide extra
information, as each predicate can hold information about a state of the program or information
about the control flow of the program. Predicates can be used to generate bug signatures [23],
which are sets of said predicates, providing information about the cause or effect of a bug or
other information helpful for debugging. This provides the context for where and how the bug
occurs and, thus, can be used to understand and locate the bug itself.
   As an example, when confronted with a bug which leads to a program crash, the usual entry
point for the programmer is the stack trace (a report of the active stack frames) and, thus, the
location the program failed in. Often, the programmer is now presented with a scenario in
which it is obvious why the code failed at this location (e.g., a null reference) but not how the
program managed to arrive at this state. The programmer now has to work backwards through
the stack trace, trying to find the location of the bug that caused the failure. However, the bug
is often not directly part of the stack trace. It may be in a function that manipulated some
variables but has already returned or – in a multi-threaded scenario – it may have happened in
another thread.
   A bug signature tries to explain a bug by showing the programmer a set of relevant predicates
that have been observed during the execution of the program prior to the crash. These sets are
extracted by comparing failed program executions to normal executions (i.e., without a crash)
and have a high chance of being correlated to the cause of the program failure [23]. This gives
the programmer additional entry points into the code, at locations with some relevance to the
bug, and may thus expedite the process of identifying the bug. Additionally, the bug signatures
proposed by Hsu et al. [23] consist of predicates which provide information about either the
state of the program, its control flow or its data flow prior to the crash. This is additional
information that the programmer can use to reconstruct the program execution more easily
and which may help them locate the bug.
   In this paper, we want to explore the usage of predicates in a Java environment. We imple-
mented a tool for instrumenting Java bytecode to collect the predicate data during execution
and a mining algorithm that can find the top-k bug signatures from this data, and, finally, we
compare the bug localization performance of our approaches with a state-of-the-art bug local-
ization approach. Additionally, we explore the idea of joined predicates, i.e., predicate chains
consisting of multiple predicates, as an improvement for predicated bug signatures.

RQ1: Can a bug signature based approach be used for bug localization?

RQ2: Can a bug signature based approach be enhanced by using joined predicates?
2. Background
In this section, we explain important concepts of predicates, itemsets and generators used in
this work.

2.1. Predicates
Predicates are part of an approach developed by Liblit et al. [17]. Here, a program is instrumented
at previously defined instrumentation sites. At each of these sites, one or multiple predicates are
evaluated (as true or false) and the evaluation results are tracked. Liblit et al. [17] used the
following instrumentation sites:

      branches: At each conditional statement, one predicate tracks whether the true branch
      is taken at runtime, and one predicate tracks the false branch.

      returns: Sometimes, function return values are used to track success or failure (either
      directly as a boolean or with a numeric value). At each method call returning a scalar
      value, six predicates are used: < 0, ≤ 0, = 0, ≠ 0, ≥ 0, > 0.

      scalar pairs: At each assignment to a scalar variable 𝑥, for each other scalar variable 𝑦𝑖
      in scope, the following six predicates are possible: 𝑥 < 𝑦𝑖 , 𝑥 ≤ 𝑦𝑖 , 𝑥 = 𝑦𝑖 , 𝑥 ≠ 𝑦𝑖 ,
      𝑥 ≥ 𝑦𝑖 and 𝑥 > 𝑦𝑖 .

Additionally, we define and track the following predicates:

      nullness: At each assignment of an object to a variable, we track whether the object
      equals null.
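
   To make the predicate types above more concrete, the following small sketch (our own
illustrative example; class and variable names are made up and not taken from the evaluated
subjects) annotates, as comments, the predicates that an instrumenter following the scheme
above could track:

import java.util.List;

public class PredicateSiteExample {

    static String describe(List<String> items, int threshold) {
        int size = items.size();     // returns: size < 0, size <= 0, size == 0,
                                     //          size != 0, size >= 0, size > 0
        String label = null;         // nullness: label == null after the assignment
        if (size > threshold) {      // branches: one predicate for the true branch,
                                     //           one for the false branch
            label = "large";
        }
        int rest = size - threshold; // scalar pairs: rest < size, rest <= size, rest == size,
                                     //               rest != size, rest >= size, rest > size
                                     //               (and analogously for rest vs. threshold)
        return label + ":" + rest;
    }

    public static void main(String[] args) {
        System.out.println(describe(List.of("a", "b", "c"), 2));
    }
}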

2.2. Itemsets, Generators and Gr-Tree
In this work, we will use the following terminology for itemsets and generators:

    • An itemset 𝐼 is an unordered set of distinct items 𝐼 = {𝑖1 , 𝑖2 , ..., 𝑖𝑚 }.

    • A transaction 𝑡 is a set of items, and 𝑡 ⊆ 𝐼.

    • A transaction database 𝑇 is a set of transactions 𝑇 = {𝑡1 , 𝑡2 , ..., 𝑡𝑛 }.

    • A frequent itemset is an itemset that appears in at least a given minimum number of
      transactions (the support threshold) from a transaction database.

    • The support of an itemset is the number of transactions in a transaction database that
      contain the itemset.

    • A generator is an itemset 𝑋 such that there does not exist an itemset 𝑌 strictly included
      in 𝑋 that has the same support.
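
   As a small worked example (our own, with made-up items), consider a transaction database
T = {t1, t2, t3} over the items {a, b, c} with t1 = {a, b}, t2 = {a, b, c} and t3 = {b, c}. Then
{a} has support 2 and is a generator (its only proper subset, the empty set, has support 3),
whereas {a, b} also has support 2 and is not a generator, because its proper subset {a} has the
same support. The following sketch computes the support used in this check:

import java.util.List;
import java.util.Set;

public class SupportExample {

    // Number of transactions that contain the given itemset.
    static long support(List<Set<String>> transactions, Set<String> itemset) {
        return transactions.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
                Set.of("a", "b"),       // t1
                Set.of("a", "b", "c"),  // t2
                Set.of("b", "c"));      // t3

        System.out.println(support(db, Set.of("a")));      // 2
        System.out.println(support(db, Set.of("a", "b"))); // 2 -> {a, b} is not a generator
        System.out.println(support(db, Set.of("c")));      // 2
        System.out.println(support(db, Set.of("b", "c"))); // 2 -> {b, c} is not a generator
    }
}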
   Frequent itemset mining algorithms are a data mining technique used to find all frequent
itemsets for a given transaction database. In a naive approach, the search space is exponential
in the number of items in the transaction database. Starting with the Apriori algorithm [24],
better algorithms for finding frequent itemsets have been developed. One important difference
between most frequent itemset mining algorithms and the one our approach is based on [25]
is that support is split into positive support and negative support. A transaction has positive
support if it is the result of a successful run, and negative support if it is the result of a failed
run.
   Gr-trees (generator trees) and a depth-first Gr-growth algorithm for mining frequent gen-
erators were introduced in [26]. A Gr-tree is a typical trie structure, providing a compact
representation of a transaction database. Each Gr-tree has a prefix that is the distinct itemset
prefixing all items in the Gr-tree. In [25] and [27], the authors provide improved algorithms,
albeit still using a Gr-tree as a basis. We base our implementation on the work done in [25].


3. Fault Localization with Mined Bug Signatures via Predicates
   and Joined Predicates
3.1. Generating Predicates
In this first step, the goal is to modify the existing Java code in such a way that we can gather
predicate information during subsequent executions.

3.1.1. Instrumentation with ASM
The ASM library1 [28] is one of the more popular frameworks for instrumenting Java code. We
used its ability to manipulate the bytecode of already compiled Java classes to instrument Java
programs so that they generate predicate data when they are run afterwards. The core ASM
library provides an event-based representation of a compiled Java class, with each element of
the class being represented by an event. It uses the visitor design pattern, in which each class
is passed to one (or multiple) ClassVisitors which can modify and transform it. The same
happens for fields and methods. We implement our own visitors, which react to the events
emitted by ASM while parsing the bytecode. Now, when ASM reads a class and visits a relevant
instruction, we can generate predicates and insert instructions for triggering the evaluation of
those predicates: if a predicate evaluates to true during the execution of the code, the program
saves this information.
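
   As an illustration of this visitor-based instrumentation, the following minimal sketch (not
our actual instrumenter) wraps a ClassVisitor and inserts a call to a hypothetical runtime
class bugsig.runtime.Probe before every conditional jump instruction; a full implementation
would additionally place probes on both the jump target and the fall-through path to record
whether the true or the false branch was taken:

import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.Label;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class BranchProbeClassVisitor extends ClassVisitor {

    private int nextSiteId = 0; // simple running id for each instrumented branch site

    public BranchProbeClassVisitor(ClassVisitor next) {
        super(Opcodes.ASM9, next);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String descriptor,
                                     String signature, String[] exceptions) {
        MethodVisitor mv = super.visitMethod(access, name, descriptor, signature, exceptions);
        return new MethodVisitor(Opcodes.ASM9, mv) {
            @Override
            public void visitJumpInsn(int opcode, Label label) {
                if (opcode != Opcodes.GOTO && opcode != Opcodes.JSR) {
                    // Push the site id and call the runtime probe that records
                    // that this branch site was reached.
                    super.visitLdcInsn(nextSiteId++);
                    super.visitMethodInsn(Opcodes.INVOKESTATIC, "bugsig/runtime/Probe",
                            "branchSiteReached", "(I)V", false);
                }
                super.visitJumpInsn(opcode, label);
            }
        };
    }
}

Such a visitor would typically be chained between a ClassReader and a ClassWriter, e.g. via
new ClassReader(bytes).accept(new BranchProbeClassVisitor(writer), 0).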

3.1.2. Joined Predicates
In addition to the predicates introduced in [17], our approach aims to evaluate the usefulness
of joined predicates. In our context, joined predicates are composed of multiple "traditional"
predicates and carry information about their order of appearance in the program flow.
   As a motivating example, in the buggy code shown in Listing 1, a classical approach would
place predicates in lines 3 and 5, among others. The included bug only manifests if both branches
   1 https://asm.ow2.io
                                     Listing 1: Example code with tests

 1    private boolean anyTrue(boolean first, boolean second) {
 2       int x = 0;
 3       if (first)
 4          x++;
 5       if (second)
 6          x++;
 7       return x == 1; // bug, should be: x >= 1
 8    }
 9    public void test1() { assert(!anyTrue(false, false)); }
10    public void test2() { assert(anyTrue(true, false)); assert(anyTrue(false, true)); }
11    public void test3() { assert(anyTrue(true, true)); } // this fails




at lines 3 and 5 are true, but not in any other combination. If we run all example tests from Listing 1,
the predicate database would contain the same predicates L:3[true] and L:5[true] after running
either test 2 or test 3, making them impossible to distinguish based on only this information.
Our approach would now generate 4 additional joined predicates: 1) L:3[true] ≻ L:5[true],
2) L:3[false] ≻ L:5[true], 3) L:3[true] ≻ L:5[false], and 4) L:3[false] ≻ L:5[false].
   These joined predicates are different from the two separate predicates, as they additionally
encode the order of execution. For example, L:3[true] ≻ L:5[true] means: the branch at line 3
was evaluated to true, and then the next branch statement was at line 5 and was also true.
This joined predicate only evaluates to true in the failed test case, allowing us to distinguish
it from the other test cases.

3.2. Gathering execution data
In the next step, we execute the instrumented code to generate the trace data. During execution,
we collect predicate execution data for, e.g., each test in a test suite, while also classifying each
trace as successful or failed, depending on the test outcome. The approach expects multiple
execution profiles and at least one failed execution to work correctly.
   The current prototype of our approach uses a simple rule for generating joined predicates:
each pair of two simple predicates is a possible joined predicate. During execution of the
instrumented code, each time a predicate gets evaluated, we dynamically create a new joined
predicate from the last evaluated simple predicate and the currently triggered one and add it
to our list of joined predicates, if it has not been encountered previously. It is possible to use
more complex rules for the creation of joined predicates, e.g., using the previous two predicates
to create a joined predicate consisting of three simple ones or even only considering specific
types of predicates.
   During the execution of tests, all predicates for an instrumented location are directly evaluated
by the executing code, and the results are stored in a list. This list contains all predicates and
joined predicates that were triggered during a run and is exported at the end of the run.
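
   A minimal, single-threaded sketch of this pairing rule (names and the exact encoding of
predicates are hypothetical; our prototype differs in detail) could look as follows:

import java.util.LinkedHashSet;
import java.util.Set;

public class JoinedPredicateRecorder {

    // All simple and joined predicates triggered during the current run.
    private final Set<String> triggered = new LinkedHashSet<>();
    private String lastPredicate = null;

    // Called by the instrumented code whenever a predicate at an instrumentation
    // site is evaluated; the id encodes the result, e.g. "L:3[true]".
    public void predicateEvaluated(String predicateId) {
        triggered.add(predicateId);
        if (lastPredicate != null) {
            // Joined predicate encoding the order of the two evaluations; the set
            // takes care of not adding it twice.
            triggered.add(lastPredicate + " > " + predicateId);
        }
        lastPredicate = predicateId;
    }

    // Exported at the end of the run, together with the pass/fail label of the test.
    public Set<String> export() {
        return triggered;
    }
}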
3.3. Mining bug signatures
In this step, we try to find the top-k discriminative bug signatures. For this, we use an adaptation
of the mining algorithm presented in [25].
   The input is a database of transactions, with each transaction representing a single run of
the program and the information whether that run was successful. A Gr-tree (see subsection 2.2) is
constructed from the database. The Gr-tree stores all its items in its head table and links them
to their corresponding nodes in the tree structure. The output is a list containing the top-k
discriminative signatures.
   Our mining algorithm implements the pseudocode of the mining algorithm described in [25].
Most differences stem from the Java implementation and do not alter the basic idea. The algorithm
also has some parameters controlling the size of the mined bug signatures, the cutoff point for
significance, and k, the size of the returned list. We used the same default parameters as defined
in [25].


4. Experimental Setup
4.1. Choosing a base for a comparison
The result of bug signature identification differs from that of other fault localization techniques.
Other fault localization techniques output a list of program elements ordered by suspiciousness,
while a bug signature approach instead returns a ranked list of bug signatures. Each bug
signature consists of one or multiple program elements, describing a supposed bug context.
The advantage of this additional information is difficult to quantify, as each programmer may
process the information from such a bug context in a unique way and most likely differently than
a program would. A bug signature can therefore consist of program elements that do not have
an immediately obvious relation to the bug under examination.
   We decided to quantitatively compare our approaches to spectrum-based fault localization
(SBFL), a popular fault localization technique. Thus, we compare the following approaches:

    • SBFL: state-of-the-art SBFL as implemented in BugLoRD2, using JaCoCo3 to collect
      execution profiles (test coverage data) and using the DStar metric [29] to calculate
      suspiciousness scores
    • Predicates: our approach using predicates
    • Joined Predicates: our approach using predicates and joined predicates (see subsubsec-
      tion 3.1.2)



4.2. The scoring algorithm
The result of our approach, described in section 3, is a ranked list of bug signatures. The result
of SBFL is a ranked list of source code lines. As a bug signature can contain information about
one or more code locations (note that a bug signature may also contain additional information

   2 https://github.com/hub-se/BugLoRD
   3 https://www.eclemma.org/jacoco/index.html
that has no equivalent in SBFL), we developed an algorithm to calculate suspiciousness scores
for a bug signature. A common scoring approach for fault localization experiments is a metric
based on static program dependencies, first proposed by Renieres and Reiss [30] and used by
others [31, 32]. In the first step, a program dependence graph (PDG), a graph that contains a node
for each expression in the program, is constructed from the source code. Next, nodes in the
PDG get marked as "faulty" if they are related to the bug under inspection. Using the results of the
fault localization, for each prediction, one or more nodes can be marked as "reported" if they
contain the predicted program element. Now, a score can be calculated as the fraction of the
PDG that would need to be examined to get from a "reported" node to a "faulty" one by shortest
path.
   We used Soot4 [33, 34] to generate intra-procedural PDGs. Soot stores the code for each
method in a Body. Each Body contains a chain of Units which represent the actual code inside
of a method. Each Unit represents a statement in the original source code. We now define a
codelocation as the line of code in the source code and all following lines not belonging to
another Unit.


Definition 1 - codelocation
Given a reference to a line of Java source code (e.g., MyClass:7) and a corresponding Java method
in a Soot representation, a codelocation is the source code line of a Unit corresponding to
the referenced line and all lines after that, until the next Unit or the end of its method.

The result of our algorithm is a ranked list of bug signatures. Each bug signature contains
one or multiple predicates. Each predicate has a reference to the line of source code where
it was evaluated. Now, we can generate codelocations from a bug signature by using all
references to lines of source code inside the bug signature. We generate the codelocations
for the bug under examination (using the source code lines associated with the bug) and the
codelocations from the bug signatures reported by our algorithm (or the SBFL results).
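
   As a sketch of how such a line reference can be related to a Soot Unit (assuming the class
has already been loaded into Soot's Scene; this is not our exact implementation), consider:

import soot.Body;
import soot.Scene;
import soot.SootClass;
import soot.SootMethod;
import soot.Unit;

public class CodeLocationLookup {

    // Returns the first Unit whose line-number tag matches the referenced source
    // line (e.g. the line of an instrumented predicate), or null if none is found.
    static Unit findUnitForLine(String className, int sourceLine) {
        SootClass cls = Scene.v().getSootClass(className);
        for (SootMethod method : cls.getMethods()) {
            if (!method.isConcrete()) {
                continue;
            }
            Body body = method.retrieveActiveBody();
            for (Unit unit : body.getUnits()) {
                if (unit.getJavaSourceStartLineNumber() == sourceLine) {
                    return unit; // first Unit of the corresponding codelocation
                }
            }
        }
        return null;
    }
}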



Definition 2 - path cost
The path cost is the sum of all edge costs on the shortest path between two codelocations.

In our setting, each edge between two classes weighs 25, each edge between two methods
weighs 10 and each edge between two Units inside a method weighs 1. The higher cost of class
and method transitions represents the additional mental effort while debugging. The numbers
are chosen based on personal experience and can be adjusted.
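
   The path cost itself is then a standard shortest-path computation over this weighted graph.
A plain Dijkstra sketch (with a simplified, generic graph representation of our own; the real
graph is built from Soot's Units, methods and classes) could look like this:

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class PathCost {

    record Edge(String target, int weight) { } // weight: 25 (class), 10 (method), 1 (Unit)

    record Entry(String node, long dist) { }

    // Sum of all edge costs on the shortest path between two codelocations,
    // or Long.MAX_VALUE if the target is unreachable.
    static long pathCost(Map<String, List<Edge>> graph, String from, String to) {
        Map<String, Long> dist = new HashMap<>();
        PriorityQueue<Entry> queue = new PriorityQueue<>(Comparator.comparingLong(Entry::dist));
        dist.put(from, 0L);
        queue.add(new Entry(from, 0L));
        while (!queue.isEmpty()) {
            Entry current = queue.poll();
            if (current.dist() > dist.getOrDefault(current.node(), Long.MAX_VALUE)) {
                continue; // stale queue entry
            }
            if (current.node().equals(to)) {
                return current.dist();
            }
            for (Edge edge : graph.getOrDefault(current.node(), List.of())) {
                long candidate = current.dist() + edge.weight();
                if (candidate < dist.getOrDefault(edge.target(), Long.MAX_VALUE)) {
                    dist.put(edge.target(), candidate);
                    queue.add(new Entry(edge.target(), candidate));
                }
            }
        }
        return Long.MAX_VALUE;
    }
}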
  Both SBFL and our approach rank their output by an internal suspiciousness score (see [25]
for a definition of discriminative significance (DS) and [35] for a definition of the used SBFL
metrics). Often, multiple elements get assigned the exact same score, due to the nature of
the used formulas, the limited availability of diverse enough execution data (i.e., test cases
with diverse execution profiles) or simply due to identical execution behavior that is dictated
by the implementation itself. Even worse (from an evaluation perspective), a bug signature, by
definition, usually contains multiple predicates and/or joined predicates and thus contains
multiple codelocations. This means that the actual position of a codelocation in a ranking
is nondeterministic. For each such case, we therefore have a best and a worst case. In the best
case, the codelocation that will lead to the smallest score is evaluated first, and in the worst
case, it is evaluated last among all codelocations with the same suspiciousness score.

   4 https://soot-oss.github.io/soot/

              minCodeLocationsBefore(c_i) = |{ c_j ∈ C | susp_DS(c_j) > susp_DS(c_i) }|
              maxCodeLocationsBefore(c_i) = |{ c_j ∈ C | susp_DS(c_j) ≥ susp_DS(c_i) }|

An alternative, more correct definition for maxCodeLocationsBefore(c_i) would subtract 1 to
not count c_i twice. Now, by combining the path cost and the codelocations, we can define our
EvaluationScore:

     Score_best  = minCodeLocationsBefore + pathCost · (√minCodeLocationsBefore + 1)

     Score_worst = maxCodeLocationsBefore + pathCost · (√maxCodeLocationsBefore + 1)

     Score_avg   = 1/2 · (Score_best + Score_worst)

Note that 𝑆𝑐𝑜𝑟𝑒𝑎𝑣𝑔 is just the average of 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 and 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡 . To calculate a real average
EvaluationScore, one would use averageMinCodeLocationsBefore and the path cost. In our
definition, the path cost is also included in the average.
   The EvaluationScore is based on the usefulness of a bug prediction to a programmer while
debugging. The EvaluationScore is 0, a perfect score, if the first reported codelocation is
exactly the line of source code associated with the bug. The EvaluationScore grows by one for
every additional codelocation to check. Additionally, the EvaluationScore is scaled with the
number of codelocations already checked, as a programmer checking the first few locations
will be motivated to look deeper, but after having already looked at multiple locations,
might stop their investigation sooner. So, while there might be an incredibly long path in the
Unit-graph linking the suspected codelocation and the bug location, it is highly unlikely that
this connection would be useful to the programmer. That is, the more the path cost between a
reported codelocation and the bug increases, the less likely the programmer is to track down
the bug from it.
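
   Read as code, the score computation from the formulas above boils down to the following
sketch (a direct transcription; the path cost and the min/max counts are computed elsewhere):

public class EvaluationScore {

    static double scoreBest(int minCodeLocationsBefore, double pathCost) {
        return minCodeLocationsBefore
                + pathCost * (Math.sqrt(minCodeLocationsBefore) + 1);
    }

    static double scoreWorst(int maxCodeLocationsBefore, double pathCost) {
        return maxCodeLocationsBefore
                + pathCost * (Math.sqrt(maxCodeLocationsBefore) + 1);
    }

    static double scoreAvg(int minBefore, int maxBefore, double pathCost) {
        return 0.5 * (scoreBest(minBefore, pathCost) + scoreWorst(maxBefore, pathCost));
    }

    public static void main(String[] args) {
        // Perfect prediction: ranked first (no better-ranked codelocations) and the
        // reported codelocation is exactly the buggy line (path cost 0) -> score 0.
        System.out.println(scoreBest(0, 0.0));
    }
}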

4.3. Evaluation Subjects
The Defects4J Benchmark5 [36] is a collection of open source projects with each being available
in multiple buggy versions. Each buggy version consists of the respective source code and a
change set with changes exclusively related to the bug. This omission of other changes (e.g.,
refactorings) in the change set makes it more reliable as a source for the lines of code related to
the bug. Sobreira et al. [37] did an analysis of many Defects4J bugs and showed a method for
linking a bug to lines of code. The large number of real (not artificially created) bugs from many
different projects makes this a sensible benchmark choice for our purposes.
   We used the bugs of version 2.0.0 of the projects listed in Table 1. From the original set of
bugs, we had to exclude 21 bugs from our final results. The most common reasons were problems
   5 https://github.com/rjust/defects4j
Table 1
Used projects from the Defects4J Benchmark

                           project                    size [LOC]   #bugs

                           jfreechart (Chart)              96k       26
                           commons-cli (Cli)                2k       39
                           commons-codec (Codec)            3k       18
                           commons-csv (Csv)                1k       16
                           gson (Gson)                      6k       18
                           commons-lang (Lang)             22k       64
                           commons-math (Math)             84k      106
                           joda-time (Time)                90k       26

                           Total                                    313
                           Applicable after Instrumentation         292
                           Applicable after Runtime Eval.           162


with the compilation of the source code, crashes of the instrumenter, and JVM crashes during test
execution. During the runtime evaluation, we encountered some additional problems, because
we could not relate the Unit containing the line of source code to the given bug/prediction, or
because the evaluation timed out. This leaves us with 162 bugs which were used for
the following figures.


5. Results
In Figure 1, we see that 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 (EvaluationScore if the reported location is examined first
among all locations ranked equally; see section 4 for more details) for both joined predicates and
predicates is lower (better) than for SBFL. 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡 is more similar for all three approaches,
with the joined predicates being significantly worse than simple predicates. If we look at the
included p-values, we can see that the EvaluationScores for both predicate approaches are
significantly different in our data set. The only non-significant difference is between the worst
EvaluationScore for joined predicates and SBFL.
   In Table 2, we can see the mean values for 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 and 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡 . For a better visualization,
we can look at Figure 3(left), where the total EvaluationScores for each approach are summed
up. Both predicate approaches have significantly lower total EvaluationScores than the SBFL
approach when comparing 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 . When comparing 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡 , the results are more even,
with only simple predicates being significantly different from the other two. If we look at the
median values in Table 2, we can see the very small differences between all three approaches
for 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 . This shows us that a lot of bugs have similar, very small scores, and most of the
differences stem from a few bugs with high (i.e., bad) scores. When comparing the medians for
𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡 , the median for joined predicates is nearly twice as large as the medians of the other
approaches.

Figure 1: Summary of results (𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 and 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡 for each approach, with p-values from
paired Wilcoxon signed-rank tests)
    In Figure 3(right), we further split up our results regarding the different Defects4J projects
used. It shows the different mean values for 𝑆𝑐𝑜𝑟𝑒𝑎𝑣𝑔 separated by project. Here, we can see
that both predicate approaches have lower (i.e., better) results for nearly all projects, with only
’cli’ in favor of SBFL.
    Because a lot of the scores are close to or equal to 0, i.e., the best result, we additionally
looked at the density of scores. Figure 2 shows the density distribution of 𝑆𝑐𝑜𝑟𝑒𝑎𝑣𝑔 from 0 to
75. The dashed lines show the mean values for each approach. Here, we see that the simple
predicate approach has the highest density in the 0-10-range. This represents a lot of very
low 𝑆𝑐𝑜𝑟𝑒𝑎𝑣𝑔 results – results where the correct codelocation was highly ranked, while not
having too many similarly ranked codelocations. SBFL seems to have both very good scores
and a lot of very bad ones, leading to a significantly worse mean value.

Table 2
Mean and median values for 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡 and 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡

      Approach              mean 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡    median 𝑆𝑐𝑜𝑟𝑒𝑏𝑒𝑠𝑡    mean 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡    median 𝑆𝑐𝑜𝑟𝑒𝑤𝑜𝑟𝑠𝑡

      Joined Predicates          3.97               0                 41.25              26.33
      Predicates                 2.94               0                 26.14              12.50
      SBFL                      18.94               1                 48.34              13.50
Figure 2: Density plot for 𝑆𝑐𝑜𝑟𝑒𝑎𝑣𝑔 with x-axis limited to 75

Figure 3: Left: Total best and worst EvaluationScore per approach. Right: Mean values for
𝑆𝑐𝑜𝑟𝑒𝑎𝑣𝑔 , separated by Defects4J project


5.1. Spread and Path Cost
Next, we analyze the EvaluationScore in more detail to better understand the differences
between the predicate based and SBFL approaches. For this, we split up the EvaluationScore
into its two main components:
                • path cost: number of Units in the graph between the reported codelocation and the
                  buggy codelocation

                • penalty: number of codelocations to be examined before finding the reported
                  codelocation in the ranking


   We see in Figure 4(left) that SBFL has a significantly lower path cost than the two predicate
based approaches. This is expected, as SBFL directly outputs a line of code as its result, while a bug
signature from a predicate based approach also contains additional information about the state
of the program in a faulty run. The code lines we extract and use are actually the locations of
the instrumentation sites for the predicates that the bug signature consists of. Since not every
executable line is a valid instrumentation site for a predicate, but some of the evaluated bugs
are located on such lines of code, the predicate based approaches cannot reach a path cost of
0 in those cases.
Figure 4: Left: Comparison of path cost (number of Units in the graph between the two
codelocations). Right: Comparison of spread (maxCodeLocationsBefore − minCodeLocationsBefore)


   Comparing the spread in Figure 4(right), one can see that joined predicates differ significantly
from the other approaches. This leads to the conclusion that joined predicates have more ties in
the ranking of their results, which makes sense: in addition to bug signatures having the same
rank, all predicates in a bug signature will have the same rank as the bug signature itself.
Additionally, a bug signature consisting of three predicates can have up to three codelocations,
while a bug signature with joined predicates increases this number by one for each joined
predicate that is part of the bug signature. For example, a bug signature consisting of two joined
and one simple predicate can have up to five codelocations.
   An additional observation we made during our experiments is their partly infeasibly high run
time. Some bug signature mining runs took multiple days of computation time, making the
approach in its current form largely unusable in a real-world scenario – at least without further
optimization.


6. Conclusions
After having analyzed the data from section 5, we summarize the following conclusions:

RQ1: Can a bug signature based approach be used for bug localization?

Yes, as we have seen in, e.g., Figure 1 and Figure 3(left), a bug signature (predicate) based
approach consistently gets a lower EvaluationScore than SBFL.
   These are welcome results: while we expected them, since bug signature approaches have
been shown to be effective in other languages (e.g., [23, 25] evaluated predicate based approaches
on C programs), there were multiple additional, Java-specific difficulties with regard to the
instrumentation.

RQ2: Can a bug signature based approach be enhanced by using joined predicates?

Maybe, but we did not manage to find any significant improvement by using our (rather simple)
joined predicate approach over the single predicate approach. In our studies, we only evaluated
joined predicates with a simple "two-following" generation rule, and we did not yet exhaust other
possible rules for generating joined predicates. While a three-following rule might prove too
time-intensive, we have many ideas for more elaborate rules concerning two predicates.
Another factor could be our definition of the EvaluationScore. As we discussed in section 4,
we did not find a way to use an established scoring algorithm to evaluate bug signatures. While
the problems we faced were beyond the scope of this project, most of them could be solved in
the future, allowing for a new evaluation of our results or future results.
   We still strongly believe that a bug signature based approach can be enhanced by using joined
predicates, but with joined predicates that go beyond a simple combination of two consecutive
predicates into a joined one.
   Another point concerns the differences between SBFL and the predicate approaches visible in
Figure 4(left). SBFL has a significantly lower average path cost than the two predicate approaches.
We can explain this with the different output formats used by the two techniques. The fact
that the predicate based approaches still get better overall scores must therefore mean that they
perform better in the non-path-cost part of our EvaluationScore.
   Furthermore, the bug signature mining algorithm described in [25], on which our approaches
are based, has a parameter for controlling how many predicates can be part of a single bug
signature. A bigger size limit significantly increases the computation time, which is why we
used a size limit of 3, as recommended in [25]. This could have been chosen too small, as in
the joined predicate approach, the joined predicates now compete with the others for a spot
inside a bug signature. On the one hand, while a bigger bug signature may potentially be more
helpful to a real programmer, it would negatively impact our score, as the score does not
differentiate between processing multiple predicates inside one bug signature or multiple small
bug signatures containing the same predicates. Even worse for our evaluation (and likely any
other based on lines of source code), just one "correct" predicate inside a bug signature is all
that is needed to score it positively, while all the others inside the bug signature are ignored. On
the other hand, the authors of [25] noted that increasing the size of the mined bug signatures
increases the predictive power, but also the computation time.
   Especially noteworthy is that all the extra information a programmer is supposed to get out
of a bug signature, e.g., that it contains a predicate 𝑣𝑎𝑟 = 𝑛𝑢𝑙𝑙 for a 𝑣𝑎𝑟 where 𝑛𝑢𝑙𝑙 is not
expected, can lead to a different investigation path than just the information to investigate that
line. An evaluation method considering those aspects could produce completely different results
than ours.
   Acknowledgements Lars Grunske would like to thank the intensive care unit of the Achen-
bach hospital in Königs Wusterhausen for the tremendous help during his Covid-19 case.


References
 [1] M. Zhivich, R. K. Cunningham, The real cost of software errors, IEEE Security & Privacy
     Magazine 7 (2009) 87–90. doi:10.1109/msp.2009.56.
 [2] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, J. Penix, Using static analysis to
     find bugs, IEEE Software 25 (2008) 22–29. doi:10.1109/ms.2008.130.
 [3] M. Perscheid, B. Siegmund, M. Taeumel, R. Hirschfeld, Studying the advancement in
     debugging practice of professional software developers, Software Quality Journal 25 (2016)
     83–110. doi:10.1007/s11219-015-9294-2.
 [4] W. E. Wong, R. Gao, Y. Li, R. Abreu, F. Wotawa, A survey on software fault localization,
     IEEE Transactions on Software Engineering 42 (2016) 707–740. doi:10.1109/tse.2016.
     2521368.
 [5] J. A. Jones, M. J. Harrold, Empirical evaluation of the tarantula automatic fault-localization
     technique, in: Proceedings of the 20th IEEE/ACM international Conference on Automated
     software engineering - ASE '05, ACM Press, 2005, pp. 273–282. doi:10.1145/1101908.
     1101949.
 [6] T. Reps, T. Ball, M. Das, J. Larus, The use of program profiling for software maintenance
     with applications to the year 2000 problem, ACM SIGSOFT Software Engineering Notes
     22 (1997) 432–449. doi:10.1145/267896.267925.
 [7] T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, K. Vaswani, Holmes: Effective statistical
     debugging via efficient path profiling, in: 2009 IEEE 31st International Conference on
     Software Engineering, IEEE, IEEE, 2009, pp. 34–44. doi:10.1109/icse.2009.5070506.
 [8] R. Santelices, J. A. Jones, Y. Yu, M. J. Harrold, Lightweight fault-localization using multiple
     coverage types, in: 2009 IEEE 31st International Conference on Software Engineering,
     IEEE, IEEE, 2009, pp. 56–66. doi:10.1109/icse.2009.5070508.
 [9] W. Masri, Fault localization based on information flow coverage, Software Testing,
     Verification and Reliability 20 (2009) 121–147. doi:10.1002/stvr.409.
[10] R. A. Assi, W. Masri, Identifying Failure-Correlated Dependence Chains, in: 2011 IEEE
     Fourth International Conference on Software Testing, Verification and Validation Work-
     shops, 2011, pp. 607–616. doi:10.1109/ICSTW.2011.42.
[11] T. B. Le, D. Lo, C. L. Goues, L. Grunske, A learning-to-rank based fault localization
     approach using likely invariants, in: Proceedings of the 25th International Symposium
     on Software Testing and Analysis, ISSTA 2016, ACM, 2016, pp. 177–188. doi:10.1145/
     2931037.2931049.
[12] M. D. Ernst, J. Cockrell, W. G. Griswold, D. Notkin, Dynamically discovering likely program
     invariants to support program evolution, IEEE Transactions on Software Engineering 27
     (2001) 99–123. doi:10.1109/32.908957.
[13] S. Hangal, M. S. Lam, Tracking down software bugs using automatic anomaly detection,
     in: Proceedings of the 24th International Conference on Software Engineering, ICSE ’02,
     Association for Computing Machinery, New York, NY, USA, 2002, pp. 291–301. doi:10.
     1145/581339.581377.
[14] B. Pytlik, M. Renieris, S. Krishnamurthi, S. P. Reiss, Automated fault localization using
      potential invariants, CoRR cs.SE/0310040 (2003). arXiv:cs/0310040.
[15] S. K. Sahoo, J. Criswell, C. Geigle, V. Adve, Using likely invariants for automated software
     fault localization, in: Proceedings of the eighteenth international conference on Archi-
     tectural support for programming languages and operating systems, 2013, pp. 139–152.
     doi:10.1145/2451116.2451131.
[16] M. A. Alipour, A. Groce, Extended program invariants: applications in testing and fault
     localization, in: Proceedings of the Ninth International Workshop on Dynamic Analysis,
     2012, pp. 7–11. doi:10.1145/2338966.2336799.
[17] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, M. I. Jordan, Scalable statistical bug isolation,
     ACM SIGPLAN Notices 40 (2005) 15–26. doi:10.1145/1064978.1065014.
[18] C. Liu, X. Yan, L. Fei, J. Han, S. P. Midkiff, Sober: Statistical model-based bug lo-
     calization, in: Proceedings of the 10th European software engineering conference
     held jointly with 13th ACM SIGSOFT international symposium on Foundations of soft-
     ware engineering - ESEC/FSE-13, Proceedings of 13th ACM SIGSOFT International
     Symposium on Foundations of Software Engineering, ACM Press, 2005, pp. 286–295.
     doi:10.1145/1081706.1081753.
[19] C. Liu, L. Fei, X. Yan, J. Han, S. P. Midkiff, Statistical debugging: A hypothesis testing-based
     approach, IEEE Transactions on Software Engineering 32 (2006) 831–848. doi:10.1109/
     TSE.2006.105.
[20] P. A. Nainar, T. Chen, J. Rosin, B. Liblit, Statistical debugging using compound boolean
     predicates, in: D. S. Rosenblum, S. G. Elbaum (Eds.), Proceedings of the 2007 international
     symposium on Software testing and analysis - ISSTA '07, ACM Press, 2007, pp. 5–15.
     doi:10.1145/1273463.1273467.
[21] Z. You, Z. Qin, Z. Zheng, Statistical fault localization using execution sequence, in: 2012
     International Conference on Machine Learning and Cybernetics, volume 3, IEEE, IEEE,
     2012, pp. 899–905. doi:10.1109/icmlc.2012.6359473.
[22] V. Dallmeier, C. Lindig, A. Zeller, Lightweight defect localization for java, in: ECOOP
     2005 - Object-Oriented Programming, Springer Berlin Heidelberg, 2005, pp. 528–550.
     doi:10.1007/11531142_23.
[23] H.-Y. Hsu, J. A. Jones, A. Orso, Rapid: Identifying bug signatures to support debugging
     activities, in: 2008 23rd IEEE/ACM International Conference on Automated Software
     Engineering, IEEE, IEEE, 2008, pp. 439–442. doi:10.1109/ase.2008.68.
[24] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases,
     in: VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases,
     September 12-15, 1994, Santiago de Chile, Chile, Morgan Kaufmann, 1994, pp. 487–499.
     URL: http://www.vldb.org/conf/1994/P487.PDF. doi:10.5555/645920.672836.
[25] C. Sun, S.-C. Khoo, Mining succinct predicated bug signatures, in: Proceedings of the 2013
     9th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2013, ACM Press,
     2013, pp. 576–586. doi:10.1145/2491411.2491449.
[26] J. Li, H. Li, L. Wong, J. Pei, G. Dong, Minimum description length principle: Generators
     are preferable to closed patterns., in: Proceedings of the 21st National Conference on
     Artificial Intelligence - Volume 1, volume 1 of AAAI’06, 2006, pp. 409–414.
[27] Z. Zuo, S.-C. Khoo, C. Sun, Efficient predicated bug signature mining via hierarchical
     instrumentation, in: Proceedings of the 2014 International Symposium on Software
     Testing and Analysis - ISSTA 2014, ACM Press, 2014, pp. 215–224. doi:10.1145/2610384.
     2610400.
[28] E. Bruneton, ASM 4.0: A Java bytecode engineering library, 2007.
[29] W. E. Wong, V. Debroy, R. Gao, Y. Li, The DStar method for effective software fault
     localization, IEEE Transactions on Reliability 63 (2014) 290–308. doi:10.1109/tr.2013.
     2285319.
[30] M. Renieres, S. Reiss, Fault localization with nearest neighbor queries, in: 18th IEEE
     International Conference on Automated Software Engineering, 2003. Proceedings., ASE’03,
     IEEE Comput. Soc, 2003, pp. 30–39. doi:10.1109/ase.2003.1240292.
[31] H. Cleve, A. Zeller, Locating causes of program failures, in: Proceedings of the 27th
     international conference on Software engineering - ICSE '05, ICSE ’05, ACM Press, 2005,
     pp. 342–351. doi:10.1145/1062455.1062522.
[32] P. A. Nainar, T. Chen, J. Rosin, B. Liblit, Statistical debugging using compound boolean
     predicates, in: D. S. Rosenblum, S. G. Elbaum (Eds.), Proceedings of the 2007 international
     symposium on Software testing and analysis - ISSTA '07, ACM Press, 2007, pp. 5–15.
     doi:10.1145/1273463.1273467.
[33] R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, V. Sundaresan, Soot, in: CASCON
     First Decade High Impact Papers on - CASCON '10, ACM Press, Mississauga, Ontario,
     Canada, 2010, p. 13. doi:10.1145/1925805.1925818.
[34] P. Lam, E. Bodden, O. Lhoták, L. Hendren, The soot framework for java program analysis:
     a retrospective, in: Cetus Users and Compiler Infastructure Workshop (CETUS 2011),
     volume 15, 2011, p. 35.
[35] S. Heiden, L. Grunske, T. Kehrer, F. Keller, A. Hoorn, A. Filieri, D. Lo, An evaluation of pure
     spectrum-based fault localization techniques for large-scale software systems, Software:
     Practice and Experience 49 (2019) 1197–1224. doi:10.1002/spe.2703.
[36] R. Just, D. Jalali, M. D. Ernst, Defects4j: a database of existing faults to enable controlled
     testing studies for java programs, in: Proceedings of the 2014 International Symposium on
     Software Testing and Analysis - ISSTA 2014, ACM Press, 2014, pp. 437–440. doi:10.1145/
     2610384.2628055.
[37] V. Sobreira, T. Durieux, F. Madeiral, M. Monperrus, M. de Almeida Maia, Dissection of
     a bug dataset: Anatomy of 395 patches from defects4j, in: 2018 IEEE 25th International
     Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2018, pp.
     130–140. doi:10.1109/saner.2018.8330203.