<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data-Augmented Software Diagnosis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amir Elmishali</string-name>
          <email>amir9979@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roni Stern</string-name>
          <email>roni.stern@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meir Kalech</string-name>
          <email>kalech@bgu.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ben Gurion University of the Negev</institution>
        </aff>
      </contrib-group>
      <fpage>247</fpage>
      <lpage>252</lpage>
      <abstract>
        <p>The task of software diagnosis algorithms is to identify which software components are faulty, based on the observed behavior of the system. Software diagnosis algorithms have been studied in the Artificial Intelligence community, using model-based and spectrum-based approaches. In this work we show how software fault prediction algorithms, which have been studied in the software engineering literature, can be used to improve software diagnosis. Software fault prediction algorithms use machine learning techniques to predict which software components are likely to contain faults. The resulting data-augmented diagnosis algorithm we propose overcomes key problems in software diagnosis algorithms: ranking diagnoses and distinguishing between high-probability and low-probability diagnoses. This significantly reduces the outputted list of diagnoses. We demonstrate the efficiency of the proposed approach empirically on both synthetic bugs and bugs extracted from the Eclipse open source project. Results show that the accuracy of the found diagnoses is substantially improved when using the proposed combination of software fault prediction and software diagnosis algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Software is prevalent in practically all fields of life, and its
complexity is growing. Unfortunately, software failures are
common and their impact can be very costly. As a result,
there is a growing need for automated tools to identify
software failures and isolate the faulty software components,
such as classes and functions, that have caused the failure.
We focus on the latter task, of isolating faults in software
components, and refer to this task as software diagnosis.</p>
      <p>Model-based diagnosis (MBD) is an approach to
automated diagnosis that uses a model of the diagnosed system
to infer possible diagnoses, i.e., possible explanations of the
observed system failure. While MBD was successfully
applied to a range of domains [1; 2; 3; 4], it has not been
applied successfully yet to software. The reason for this is that
in software development, there is usually no formal model
of the developed software. To this end, a scalable software
diagnosis algorithm called Barinel has been proposed [5].
Barinel is a combination of MBD and Spectrum-based Fault
Localization (SFL). SFL considers traces of executions, and
finds diagnoses by considering the correlation between
execution traces and which executions have failed. While very
scalable, Barinel suffers from one key disadvantage: it can
return a very large set of possible diagnoses for the
software developer to choose from. To handle this disadvantage,
Abreu et al. [5] proposed a Bayesian approach to compute
a likelihood score for each diagnosis. Then, diagnoses are
prioritized according to their likelihood scores.</p>
      <p>Thanks to the open source movement and current
software engineering tools such as version control and issue
tracking systems, there is much more information about a
diagnosed system than revealed by the traces of performed
tests. For example, version control systems store all
revisions of every source file, and it is quite common that a
bug occurs in a source file that was recently revised. Barinel
is agnostic to this data. We propose a data-driven approach
to better prioritize the set of diagnoses returned by Barinel.</p>
      <p>In particular, we use methods from the software
engineering literature to learn from collected data how to
predict which software components are expected to be faulty.
These predictions are then integrated into Barinel to better
prioritize the diagnoses it outputs and provide more accurate
estimates of each diagnosis likelihood.</p>
      <p>The resulting data-augmented diagnosis algorithm is part
of a broader software troubleshooting paradigm that we call
Learn, Diagnose, and Plan (LDP). In this paradigm,
illustrated in Figure 1(a), the troubleshooting algorithm learns
which source files are likely to fail from past faults,
previous source code revisions, and other sources. When a test
fails, a data-augmented diagnosis algorithm considers the
observed failed and passed tests to suggest likely diagnoses
leveraging the knowledge learned from past data. If further
tests are necessary to determine which software component
caused the failure, such tests are planned automatically,
taking into consideration the diagnoses found. This process
continues until a sufficiently accurate diagnosis is found.</p>
      <p>In this work we implemented this paradigm and simulated
its execution on a popular open source software project – the
Eclipse CDT. Information from the Git version control and
the Bugzilla issue tracking systems was used, as illustrated
in Figure 1(b) and explained in the experimental results.</p>
      <p>Results show a huge advantage of using our
data-augmented diagnoser over Barinel with uniform priors, both
for finding more accurate diagnoses and for better
selecting tests for troubleshooting. Moreover, to demonstrate the
potential benefit of our data-augmented approach, we also
experimented with a synthetic fault prediction model that
correctly identifies the faulty component. As expected,
using the synthetic fault prediction model is better than using
the learned fault prediction model, thus suggesting room for
further improvements in future work. To our knowledge,
this is the first work to successfully integrate a data-driven
approach into software diagnosis.</p>
      <p>[Figure 1: (a) The Learn, Diagnose, and Plan paradigm; (b) our current implementation.]</p>
    </sec>
    <sec id="sec-2">
      <title>Model-Based Diagnosis for Software</title>
      <p>The input to classical MBD algorithms is a tuple
⟨SD, COMPS, OBS⟩, where SD is a formal description
of the diagnosed system’s behavior, COMPS is the set of
components in the system that may be faulty, and OBS
is a set of observations. A diagnosis problem arises when
SD and OBS are inconsistent with the assumption that all
the components in COMPS are healthy. The output of an
MBD algorithm is a set of diagnoses.</p>
      <p>Definition 1 (Diagnosis). A set of components Δ ⊆ COMPS is a diagnosis if
∧_{C∈Δ} ¬h(C) ∧ ∧_{C′∉Δ} h(C′) ∧ SD ∧ OBS
is consistent, i.e., if assuming that the components in Δ are
faulty, then SD is consistent with OBS.</p>
      <p>The set of components (COMPS) in software diagnosis
can be, for example, the set of classes, or all functions, or
even a component per line of code. Low level granularity of
components, e.g., setting each line of code as a component,
will result in very focused diagnoses (e.g., pointing on the
exact line of code that was faulty). Focusing the diagnoses
in such way comes at a price of an increase in the
computational effort. Automatically choosing the most suitable level
of granularity is a topic for future work.</p>
      <p>Observations (OBS) in software diagnosis are observed
executions of tests. Every observed test t is labeled as
“passed” or “failed”, denoted by passed(t) and f ailed(t),
respectively. This labeling is done manually by the tester or
automatically in case of automated tests (e.g., failed
assertions).</p>
      <p>There are two main approaches for applying MBD to
software diagnosis, each defining SD somewhat differently.
The first approach requires SD to be a logical model of the
correct functionality of every software component [6]. This
approach allows using logical reasoning techniques to infer
diagnoses. The main drawbacks of this approach are that it
does not scale well and that modeling the behavior of software
components is often infeasible.</p>
      <sec id="sec-2-1">
        <title>2.1 SFL for Software Diagnosis</title>
        <p>An alternative approach to software diagnosis has been
proposed by Abreu et al. [5; 7], based on spectrum-based fault
localization (SFL). In this SFL-based approach, there is no
need for a logical model of the correct functionality of every
software component in the system. Instead, the traces of the
observed tests are considered.</p>
        <p>Definition 2 (Trace). A trace of a test t, denoted by trace(t),
is the sequence of components involved in executing t.</p>
        <p>Traces of tests can be collected in practice with
common software profilers (e.g., Java’s JVMTI). Recent work
showed how test traces can be collected with low
overhead [8]. Also, many implemented applications maintain
a log with some form of this information.</p>
        <p>In the SFL-based approach, SD is implicitly defined in
SFL by the assumption that a test will pass if all the
components in its trace are not faulty. Let h(C) denote the health
predicate for a component C, i.e., h(C) is true if C is not
faulty. Then we can formally define SD in the SFL-based
approach with the following set of Horn clauses:
∀test: ( ∧_{C∈trace(test)} h(C) ) → passed(test).
Thus, if a test failed then we can infer that at least one of the
components in its trace is faulty. In fact, a trace of a failed
test is a conflict.</p>
        <p>Definition 3 (Conflict). A set of components Γ ⊆ COMPS
is a conflict if ∧_{C∈Γ} h(C) ∧ SD ∧ OBS is inconsistent.</p>
        <p>Many MBD algorithms use conflicts to direct the search
towards diagnoses, exploiting the fact that a diagnosis must
be a hitting set of all the conflicts [9; 10; 11]. Intuitively,
since at least one component in every conflict is faulty, only
a hitting set of all conflicts can explain the unexpected
observation (failed test).</p>
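        <p>To make the conflict and hitting-set view concrete, the following sketch (in Python, with hypothetical traces and outcomes) derives conflicts from the traces of failed tests and enumerates candidate diagnoses as minimal hitting sets by brute force. It illustrates the concept only; Barinel itself relies on a fast approximate hitting set algorithm rather than exhaustive enumeration.</p>
        <preformat>
from itertools import combinations

# Hypothetical observed tests: trace (set of components) and outcome.
tests = {
    "t1": ({"A", "B", "C"}, "failed"),
    "t2": ({"B", "D"}, "passed"),
    "t3": ({"C", "D", "E"}, "failed"),
}

# The trace of every failed test is a conflict:
# at least one of its components must be faulty.
conflicts = [trace for trace, outcome in tests.values() if outcome == "failed"]
components = set().union(*(trace for trace, _ in tests.values()))

def is_hitting_set(candidate, conflicts):
    """A diagnosis must intersect every conflict."""
    return all(candidate.intersection(conflict) for conflict in conflicts)

# Brute-force enumeration of minimal hitting sets, smallest first.
diagnoses = []
for size in range(1, len(components) + 1):
    for subset in combinations(sorted(components), size):
        candidate = set(subset)
        if is_hitting_set(candidate, conflicts) and \
           not any(candidate.issuperset(d) for d in diagnoses):
            diagnoses.append(candidate)

print(diagnoses)  # e.g. [{'C'}, {'A', 'D'}, {'A', 'E'}, {'B', 'D'}, {'B', 'E'}]
        </preformat>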
        <p>
          Barinel is a recently proposed software MBD
algorithm [5] based on exactly this concept: considering traces
of tests with failed outcome as conflicts and returning their
hitting sets as diagnoses. With a fast hitting set algorithm,
such as the STACCATO hitting set algorithm proposed by
Abreu et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], Barinel can scale well to large systems.
The main drawback of using Barinel is that it often outputs
a large set of diagnoses, thus providing weaker guidance to
the programmer who is assigned to solve the observed bug.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Prioritizing Diagnoses</title>
        <p>To address this problem, Barinel computes a score for every
diagnosis it returns, estimating the likelihood that it is true.
This serves as a way to prioritize the large set of diagnoses
returned by Barinel.</p>
        <p>The exact details of how this score is computed are given
by Abreu et al. [5]. For the purpose of this paper, it is
important to note that the score computation used by Barinel
is Bayesian: it computes for a given diagnosis the posterior
probability that it is correct given the observed passed and
failed tests. As a Bayesian approach, Barinel also requires
some assumption about the prior probability of each
component to be faulty. Prior works using Barinel have set these
priors uniformly for all components. In this work, we
propose a data-driven way to set these priors more intelligently
and demonstrate experimentally that this has a huge impact
on the overall performance of the resulting diagnoser.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data-Augmented Software Diagnosis</title>
      <p>The prior probabilities used by Barinel represent the a-priori
probability of a component to be faulty, without considering
any observed system behavior. Fortunately, there is a line of
work on software fault prediction in the software
engineering literature that deals exactly with this question: which
software components are more likely to have a bug. We
propose to use these software fault predictions as priors to be
used by Barinel. First, we provide some background on
software fault prediction.</p>
      <sec id="sec-3-1">
        <title>Software Fault Prediction</title>
        <p>Fault prediction in software is a classification problem.
Given a software component, the goal is to determine its
class – healthy or faulty. Supervised machine learning
algorithms are commonly used these days to solve classification
problems. They work as follows. As input, they are given a
set of instances, in our case these are software components,
and their correct labeling, i.e., the correct class for each
instance. They output a classification model, which maps an
instance to a class.</p>
        <p>Learning algorithms extract features from a given instance,
and try to learn from the given labeled instances the relation
between the features of an instance and its class. Key to
the success of machine learning algorithms is the choice of
features used. Many possible features were proposed in the
literature for software fault prediction.</p>
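        <p>As a concrete illustration of this learning step, the sketch below trains a fault prediction classifier in Python with scikit-learn (the paper's own experiments used Weka); the feature vectors, feature meanings, and labels are hypothetical placeholders for the metric families described next.</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-file feature vectors, e.g. [lines_of_code, mccabe_complexity,
# num_public_methods, revisions_in_last_release, file_age_in_days].
X_train = np.array([
    [120,   4,  3,  1, 900],
    [870,  25, 14,  7, 120],
    [310,   9,  6,  2, 400],
    [1500, 40, 22, 12,  60],
])
y_train = np.array([0, 1, 0, 1])  # 1 = file was faulty in a past release

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predicted class and fault probability for a new (hypothetical) source file.
X_new = np.array([[640, 18, 9, 5, 200]])
print(clf.predict(X_new), clf.predict_proba(X_new))
        </preformat>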
        <p>
          Radjenovic et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] surveyed the features used by
existing software fault prediction algorithms and categorized them
into three families. Traditional. These features are
traditional software complexity metrics, such as number of lines
of code, McCabe [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and Halstead [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] complexity
measures.
        </p>
        <p>Object Oriented. These features are software
complexity metrics that are specifically designed for object oriented
programs. This includes metrics like cohesion and coupling
levels and depth of inheritance.</p>
        <p>Process. These features are computed from the software
change history. They try to capture the dynamics of the
software development process, considering metrics such as lines
added and deleted in the previous version and the age of the
software component.</p>
        <p>
          It is not clear from the literature which combination of
features yields the most accurate fault predictions. In a
preliminary set of experiments we found that a
combination of 68 features from those listed by Radjenovic et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
performed best. This list of features included the McCabe [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
and Halstead [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] complexity measures; several object-oriented
measures such as the number of methods overriding
a superclass, the number of public methods, the number of other
classes referenced, and whether the class is abstract; and several
process features such as the age of the source file, the number
of revisions made to it in the last release, the number of
developers who contributed to its development, and the number of
lines changed since the latest version.
        </p>
        <p>As shown in the experimental results section, the
resulting fault prediction model was accurate enough for the
overall data-augmented software diagnoser to be more
effective than Barinel with uniform priors. However, a better
combination of features may well exist, and finding one
is a topic for future work. The main novelty of
our work is in integrating a software fault prediction model
with Barinel.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Integrating the Fault Prediction Model</title>
        <p>The software fault prediction model generated as described
above is a classifier, accepting as input a software
component and outputting a binary prediction: is the component
predicted to be faulty or not. Barinel, however, requires
a real number that estimates the prior probability of each
component to be faulty.</p>
        <p>To obtain this estimated prior from the fault prediction
model, we rely on the fact that most prediction models also
output a confidence score, indicating the model’s confidence
about the classified class. Let conf(C) denote this
confidence for component C. We use conf(C) for Barinel’s
prior if C is classified as faulty, and 1 − conf(C) otherwise.</p>
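        <p>A minimal sketch of this conversion (the prediction model object and its classify method are hypothetical stand-ins for the learned classifier):</p>
        <preformat>
class StubFaultPredictor:
    """Placeholder for the learned fault prediction model."""
    def classify(self, component):
        # Returns (predicted_faulty, confidence); hard-coded for illustration.
        return (component == "Parser.java", 0.8)

def fault_prior(component, model):
    """Barinel prior for a component: the model's confidence if the component
    is classified as faulty, and 1 - confidence otherwise."""
    predicted_faulty, confidence = model.classify(component)
    return confidence if predicted_faulty else 1.0 - confidence

model = StubFaultPredictor()
print(fault_prior("Parser.java", model))  # 0.8
print(fault_prior("Utils.java", model))   # 0.2 (classified healthy with confidence 0.8)
        </preformat>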
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>To demonstrate the benefits of the proposed data-augmented
approach, we implemented it and evaluated it as follows.</p>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>As a benchmark, we used the source files, tests, and
bugs reported for the Eclipse CDT open source software
project (eclipse.org/cdt). Eclipse CDT is a popular
open source Integrated Development Environment (IDE) for
C/C++. The first release dates back to December 2003 and
the latest release we consider, labeled CDT 8.2.0, was
released in June 2013. It consists of 8,502 source code files
and has had more than 10,129 bugs reported so far (for all
releases). In addition, there are 3,493 automated tests coded
using the JUnit unit testing framework.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Determining Faulty Files</title>
        <p>
          Eclipse CDT is developed using the Git version control
system and the Bugzilla issue tracking system. Git maintains
all versions of each source file in a repository. This
enables computing process metrics for every version of every
source file. Similarly, Bugzilla is used to maintain all
reported bugs. Some source file versions are marked in the
Git repository as versions in which a specific bug was fixed.
The Git repository for Eclipse CDT contained matching
versions of source files for 6,730 out of 10,129 bugs reported as
fixed in Bugzilla. We performed our experiments on these
6,730 bugs.
For both learning and testing a fault prediction model, we
require a mapping between each reported bug and the source files
that were faulty and caused it. One possible assumption is
that every source file revision that is marked as fixing bug
X is a faulty file that caused X. We call this the “All files”
assumption. The “All files” assumption may overestimate
the number of faulty files, as some of these files may have
been modified for other reasons, not related to the bug.
Even if all changes in a revision are related to fixing a bug,
it still does not mean that all these files are faulty; properties
files and XML configuration files are typical examples. As a
crude heuristic to overcome this, we also experiment with
an alternative assumption that we call the “Most modified”
assumption. In the “Most modified” assumption, for a given
bug X we consider only a single source file as faulty from
all the files associated with bug X: the one in which the
revision made to that source file was the most extensive. The
extensiveness of a revision is measured by the number of lines
added, updated, and deleted in the source file in that revision.
Below we present experiments for both the “All files” and “Most
modified” assumptions. Śliwerski et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposed a more
elaborate method to heuristically identify the source files
that caused the bug, when analyzing a similar data set.
        </p>
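        <p>A small sketch of the “Most modified” heuristic under the stated assumption that extensiveness is the total number of lines added, updated, and deleted; the per-file change counts are hypothetical inputs that would be extracted from the Git history.</p>
        <preformat>
def most_modified_file(changes):
    """changes maps a file name to (lines_added, lines_updated, lines_deleted)
    in the revision(s) fixing a given bug; returns the single file assumed
    faulty under the 'Most modified' assumption."""
    return max(changes, key=lambda f: sum(changes[f]))

# Hypothetical change counts for the files touched when fixing one bug.
bug_changes = {
    "Parser.java": (40, 12, 8),
    "plugin.xml": (2, 0, 0),
    "Utils.java": (5, 3, 1),
}
print(most_modified_file(bug_changes))  # Parser.java
        </preformat>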
      </sec>
      <sec id="sec-4-3">
        <title>Training and Testing Set</title>
        <p>The source files and reported bugs from 5 releases, 8.0.0–
8.1.1, were used to train the model of our data-augmented
diagnoser, and the source files and reported bugs from
release 8.1.2 were used to evaluate it.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Comparing Fault Prediction Accuracy</title>
        <p>As a preliminary step, we evaluated the quality of the fault
prediction models used by our data-augmented diagnoser on
our Eclipse CDT benchmark.</p>
        <p>Table 1: Fault prediction quality, reported as Precision / Recall / F-Measure / AUC.
All files: Random Forest 0.56 / 0.09 / 0.16 / 0.84; J48 0.44 / 0.17 / 0.25 / 0.61; Naive Bayes 0.27 / 0.31 / 0.29 / 0.80.
Most modified: Random Forest 0.44 / 0.04 / 0.08 / 0.76; J48 0.15 / 0.03 / 0.05 / 0.55; Naive Bayes 0.08 / 0.31 / 0.12 / 0.715.</p>
        <p>We used the Weka software package
(www.cs.waikato.ac.nz/ml/weka) to experiment with several
learning algorithms and compare the resulting fault
prediction models. Specifically, we evaluated the following
learning algorithms: Random Forest, J48 (Weka’s
implementation of a decision tree learning algorithm), and Naive Bayes.
Table 1 shows the precision, recall, F-measure, and AUC
of the fault prediction models generated by each of these
learning algorithms. These are standard metrics for
evaluating classifiers. In brief, precision is the ratio of faulty files
among all files identified by the evaluated model as faulty.
Recall is the number of faulty files identified as such by the
evaluated model divided by the total number of faulty files.
F-measure is a known combination of precision and recall.
The AUC metric addresses the known tradeoff between
recall and precision, where high recall often comes at the price
of low precision. This tradeoff can be controlled by setting
different sensitivity thresholds to the evaluated model. AUC
is the area under the curve plotting the accuracy as a
function of the recall (every point is a different threshold value).</p>
        <p>
          All metrics range between zero and one (where one is
optimal) and are standard metrics in machine learning and
information retrieval. The unfamiliar reader can find more
details in Machine Learning books, e.g. Mitchell’s classical
book [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>The results for both “All files” and “Most modified”
assumptions show that the Random Forest classifier obtained
the overall best results. This is consistent with findings in
many recent works. Thus, in the results reported henceforth, we only
used the model generated by the Random Forest classifier
in our data-augmented diagnoser. The precision and
especially recall results are fairly low. This is understandable,
as most files are healthy, and thus the training set is very
imbalanced. This is a known inhibitor to performance of
standard learning algorithms. We have experimented with
several known methods to handle this imbalanced setting,
such as SMOTE and random undersampling, but these did
not produce substantially better results. However, as we
show below, even this imperfect prediction model is able
to improve the existing data-agnostic software diagnosis
algorithm. Note that we also experimented with other popular
learning algorithms such as Support Vector Machine (SVM)
and Artificial Neural Network (ANN), but their results were
worse than those shown in Table 1.</p>
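        <p>For reference, a sketch of one such rebalancing step using the imbalanced-learn package on synthetic data (an illustrative assumption; the paper's experiments were run in Weka):</p>
        <preformat>
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Hypothetical imbalanced data standing in for the file-level feature vectors:
# about 5% of the instances belong to the faulty class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class instances to balance the training set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
        </preformat>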
        <p>Next, we evaluate the performance of our data-augmented
diagnoser in two diagnostic tasks: finding diagnoses and
guiding test generation.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Diagnosis Task</title>
        <p>First, we compared the data-agnostic diagnoser with the
proposed data-augmented diagnoser in the task of finding
accurate diagnoses. The input is a set of tests, with their
traces and outcomes, and the output is a set of diagnoses,
each diagnosis having a score that estimates its correctness.
This score was computed by Barinel as described earlier in
the paper, where the data-agnostic diagnoser uses uniform
priors and the proposed data-augmented diagnoser uses the
predicted fault probabilities from the learned model.</p>
        <p>Table 2: Weighted precision and recall of the returned diagnoses, reported as Precision / Recall.
Most modified: Data-agnostic 0.72 / 0.27; Data-augmented 0.90 / 0.32; Syn. (0.6,0.01) 0.97 / 0.39; Syn. (0.6,0.1) 0.84 / 0.35; Syn. (0.6,0.2) 0.77 / 0.34; Syn. (0.6,0.3) 0.73 / 0.33; Syn. (0.6,0.4) 0.69 / 0.32.
All files: Data-agnostic 0.55 / 0.26; Data-augmented 0.73 / 0.35; Syn. (0.6,0.01) 0.96 / 0.45; Syn. (0.6,0.1) 0.89 / 0.42; Syn. (0.6,0.2) 0.83 / 0.39; Syn. (0.6,0.3) 0.78 / 0.37; Syn. (0.6,0.4) 0.74 / 0.36.</p>
        <p>To compare the set of diagnoses returned by the
different diagnosers, we computed the weighted average of their
precision and recall. This was computed as follows. First,
the precision and recall for every diagnosis were computed.
Then, we averaged these values, weighted by the score given
to the diagnoses by Barinel. This enables aggregating the
precision and recall of a set of diagnoses while
incorporating which diagnoses are regarded as more likely according
to Barinel. For brevity, we will refer to this weighted
average precision and weighted average recall as simply
precision and recall.</p>
        <p>Table 2 shows the precision and recall results of the
data-agnostic diagnoser and our data-augmented diagnoser, for
both “Most modified” and “All files” assumptions. Each
result in the table is an average over the precision and
recall obtained for 50 problem instances. A problem instance
consists of (1) a bug from one of the bugs reported for
release 8.1.2. of Eclipse CDT, and (2) a set of 25 tests, chosen
randomly, while ensuring that at least one test passes
through the faulty files.</p>
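        <p>A sketch of this aggregation, under the assumption that the precision of a single diagnosis is the fraction of its components that are truly faulty and its recall is the fraction of the truly faulty components it contains (the paper does not spell out the per-diagnosis definitions, so this is one reasonable reading):</p>
        <preformat>
def weighted_precision_recall(diagnoses, faulty):
    """diagnoses: list of (components, score) pairs returned by the diagnoser;
    faulty: the set of truly faulty components. Returns the score-weighted
    average precision and recall of the diagnosis set."""
    total = sum(score for _, score in diagnoses)
    precision = recall = 0.0
    for diag, score in diagnoses:
        hits = len(diag.intersection(faulty))
        precision += score * hits / len(diag)
        recall += score * hits / len(faulty)
    return precision / total, recall / total

diagnoses = [({"A"}, 0.6), ({"A", "B"}, 0.3), ({"C"}, 0.1)]  # hypothetical output
faulty = {"A"}
print(weighted_precision_recall(diagnoses, faulty))  # approximately (0.75, 0.9)
        </preformat>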
        <p>Both the precision and recall of the data-augmented and
data-agnostic diagnosers support the main hypothesis of this
work: a data-augmented diagnoser can yield substantially
better diagnoses than a data-agnostic diagnoser. For
example, the precision of the data-augmented diagnoser under the
“Most modified” assumption is 0.9 while that of the
data-agnostic diagnoser is only 0.72. The superior performance
of the data-augmented diagnoser is shown for both “Most
modified” and “All files” assumptions. Another observation
that can be made from the results in Table 2 is that while the
precision of the data-augmented diagnoses is very high and
is substantially better than that of the data-agnostic
diagnoser, the improvement in recall is relatively more modest.
This can be explained by the precision and recall results of
the learned model, shown in Table 1 and discussed earlier.
There too, the recall results were far worse than the
precision results (recall that we are using the model learned by
the Random Forest learning algorithm). It is possible that
learning a model with higher recall may result in higher
recall for the resulting diagnoses. We explore the impact of
learning a more accurate fault prediction model next.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Synthetic Priors</title>
        <p>
          The data-augmented diagnoser is based on the priors
generated by the learned fault prediction model. Building
better fault prediction models is an active field of study [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
and thus future fault prediction models may be more
accurate than the ones used by our data-augmented diagnoser.
To evaluate the benefit of a more accurate fault prediction
model on our data-augmented diagnoser, we created a
synthetic fault prediction model, in which faulty source files
get Pf probability and healthy source files get Ph, where
Pf and Ph are parameters. Setting Ph = Pf would cause
the data-augmented diagnoser to behave exactly like the
data-agnostic diagnoser, setting the same prior probability
of being faulty for all source files. By
contrast, setting Ph = 0 and Pf = 1 represents an optimal
fault prediction model, which exactly predicts which files are
faulty and which are healthy.
        </p>
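        <p>The synthetic model itself is trivial; a short sketch for completeness (when generating a problem instance, the set of truly faulty files is known):</p>
        <preformat>
def synthetic_prior(component, faulty_files, p_f=0.6, p_h=0.01):
    """Synthetic fault prediction model: truly faulty files get prior Pf,
    healthy files get prior Ph."""
    return p_f if component in faulty_files else p_h

faulty_files = {"Parser.java"}
print(synthetic_prior("Parser.java", faulty_files))  # 0.6
print(synthetic_prior("Utils.java", faulty_files))   # 0.01
        </preformat>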
        <p>The lines marked “Syn. (X,Y)” in Table 2 mark the
performance of the data-augmented diagnoser when using
this synthetic fault prediction model, where X = Pf and
Y = Ph. Note that we experimented with many values of
Pf and Ph, and presented above a representative subset of
these results.</p>
        <p>As expected, lowering the value of Ph results in
better diagnoses being found. Setting a very low Ph
value improves the precision significantly, up to almost
perfect precision (0.97 and 0.96 for the “Most modified” and
“All files” assumptions, respectively). The recall results, while also
improving as we lower Ph, do not reach a very high value. For
Ph = 0.01, the obtained recall is only 0.39 and 0.45 for
the “Most modified” and “All files” assumptions, respectively.</p>
        <p>A possible explanation for these low recall results lies in
the fact that all the evaluated diagnosers use the Barinel
diagnosis algorithm with different fault priors. Barinel uses
these priors only to prioritize diagnoses, but the diagnoses it
considers are hitting sets of the traces of failed tests. Thus, if two
faulty components appear in the same trace, only one of
them will be detected even if both have a very high likelihood
of being faulty according to the fault prediction model.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Considering More Tests</title>
        <p>Next, we investigate the impact of adding more tests on the
accuracy of the returned diagnoses.</p>
        <p>Figure 2 shows the precision and recall results (Figures 2
(a) and (b), respectively), as a function of the number of
observed tests. We compared the different diagnosers, given
25, 40, 70, 100, and 130 observed tests.</p>
        <p>The results show two interesting trends in both precision
and recall. First, as expected, the data-agnostic diagnoser
performs worse than the data-augmented diagnoser, which
in turn performs worse than the diagnoser using a synthetic
fault prediction model, with Ph = 0.01. This supports our
main hypothesis: that data-augmented diagnosers can be
better than a data-agnostic diagnoser. Also, the better
performance of Syn. (0.6, 0.01) demonstrates that future
research on improving the fault prediction model will result
in a better diagnoser.</p>
        <p>
          The second trend is that adding more tests reduces the
precision and recall of the returned diagnoses. This, at
first glance, seems counter-intuitive, as we would expect
more tests to allow finding more accurate diagnoses and
thus higher recall and precision. This non-intuitive result
can be explained by how tests were chosen. As explained
above, the observed tests were chosen randomly, only
verifying that at least one test passes through each faulty source
file. Adding randomly selected tests adds noise to the
diagnoser. By contrast, intelligent methods to choose which
tests to add can improve the accuracy of the diagnoses [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
This is explored in the next section. Another reason for the
degraded performance when adding more tests is that more
tests may pass through more faulty source files, in addition
to those from the specific reported bug used to generate the
problem instance in the first place. Thus, adding more tests
increases the number of faulty source files to detect.
        </p>
      </sec>
      <sec id="sec-4-8">
        <title>Troubleshooting Task</title>
        <p>
          Efficient diagnosers are key components of
troubleshooting algorithms. Troubleshooting algorithms choose which
tests to perform to find the most accurate diagnosis.
Zamir et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] proposed several troubleshooting algorithms
specifically designed to work with Barinel for
troubleshooting software bugs. In the preliminary study below, we
evaluated the impact of our data-augmented diagnoser on the
overall performance of troubleshooting algorithms.
Specifically, we implemented the so-called highest probability
(HP) troubleshooting algorithm, in which tests are chosen
in the following manner. HP chooses a test that is expected
to pass through the source file having the highest probability
of being faulty, given the diagnosis probabilities.
        </p>
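        <p>A sketch of the HP selection step as described above: per-component fault probabilities are aggregated from the scored diagnoses, and the next test is one whose expected trace covers the component with the highest probability. The aggregation rule and the candidate-test coverage map are hypothetical illustration choices.</p>
        <preformat>
def component_probabilities(diagnoses):
    """Aggregate a per-component fault probability by summing the normalized
    scores of all diagnoses that contain the component."""
    total = sum(score for _, score in diagnoses)
    probs = {}
    for diag, score in diagnoses:
        for comp in diag:
            probs[comp] = probs.get(comp, 0.0) + score / total
    return probs

def choose_next_test(diagnoses, candidate_tests):
    """HP: pick a test expected to pass through the source file with the
    highest fault probability; candidate_tests maps each candidate test
    to the set of components it is expected to cover."""
    probs = component_probabilities(diagnoses)
    best = max(probs, key=probs.get)
    for test, coverage in candidate_tests.items():
        if best in coverage:
            return test
    return None

diagnoses = [({"A"}, 0.6), ({"B", "C"}, 0.25), ({"C"}, 0.15)]        # hypothetical
candidate_tests = {"t4": {"A", "D"}, "t5": {"B", "C"}, "t6": {"D"}}  # hypothetical
print(choose_next_test(diagnoses, candidate_tests))  # t4 (covers A, prob. 0.6)
        </preformat>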
        <p>We ran the HP troubleshooting algorithm with each of
the diagnosers mentioned above (all rows in Table 2). We
compared the HP troubleshooting algorithm using different
diagnosers by counting the number of tests that were required to
reach a diagnosis with a score higher than 0.7.</p>
        <p>Table 3 shows the average number of tests performed by
the HP troubleshooting algorithm until it halts (with a
suitable diagnosis). The results show the same over-arching
theme: the data-augmented diagnoser is much better than
the data-agnostic diagnoser for this troubleshooting task.
Also, using the synthetic fault prediction model can result
in even further improvement, thus suggesting future work
for improving the learned fault prediction model.</p>
        <p>Table 3: Average number of tests performed by the HP troubleshooting algorithm until it halts.
Most modified: Data-agnostic 20.24; Data-augmented 10.80; Syn. (0.6,0.01) 3.94; Syn. (0.6,0.1) 15.44; Syn. (0.6,0.2) 19.78; Syn. (0.6,0.3) 20.90; Syn. (0.6,0.4) 20.74.
All files: Data-agnostic 18.06; Data-augmented 15.45; Syn. (0.6,0.01) 14.91; Syn. (0.6,0.1) 17.83; Syn. (0.6,0.2) 18.99; Syn. (0.6,0.3) 19.24; Syn. (0.6,0.4) 19.18.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>We presented a method for using information about the
diagnosed system to improve Barinel, a scalable and effective
software diagnosis algorithm [7]. In particular, we
incorporated a software fault prediction model into Barinel. The
resulting data-augmented diagnoser is shown to outperform
Barinel without such a fault prediction model. This was
verified experimentally using a real source code system
(Eclipse CDT), real reported bugs, and information from
the software’s source control repository. Results also
suggest that future work on improving the learned fault
prediction model will result in improved diagnosis accuracy.
In addition, it is worthwhile to incorporate the proposed
data-augmented diagnosis methods with other proposed
improvements of SFL-based software diagnosis, such as
those proposed by Hofer et al. [19; 20].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Brian</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
            and
            <given-names>P. Pandurang</given-names>
          </string-name>
          <string-name>
            <surname>Nayak</surname>
          </string-name>
          .
          <article-title>A model-based approach to reactive self-configuring systems</article-title>
          .
          <source>In Conference on Artificial Intelligence (AAAI)</source>
          , pages
          <fpage>971</fpage>
          -
          <lpage>978</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Feldman</surname>
          </string-name>
          , Helena Vicente de Castro, Arjan van Gemund,
          <string-name>
            <given-names>and Gregory</given-names>
            <surname>Provan</surname>
          </string-name>
          .
          <article-title>Model-based diagnostic decision-support system for satellites</article-title>
          .
          <source>In IEEE Aerospace Conference</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Peter</given-names>
            <surname>Struss</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Price</surname>
          </string-name>
          .
          <article-title>Model-based systems in the automotive industry</article-title>
          .
          <source>AI magazine</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>17</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Schmitz</surname>
          </string-name>
          .
          <article-title>Model-based diagnosis of spreadsheet programs: a constraint-based debugging approach</article-title>
          .
          <source>Automated Software Engineering</source>
          ,
          <volume>1</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Rui</given-names>
            <surname>Abreu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Zoeteweij</surname>
          </string-name>
          , and
          <string-name>
            <surname>Arjan J. C. van Gemund</surname>
          </string-name>
          .
          <article-title>Simultaneous debugging of software faults</article-title>
          .
          <source>Journal of Systems and Software</source>
          ,
          <volume>84</volume>
          (
          <issue>4</issue>
          ):
          <fpage>573</fpage>
          -
          <lpage>586</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Franz</given-names>
            <surname>Wotawa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Nica</surname>
          </string-name>
          .
          <article-title>Program debugging using constraints - is it feasible? Quality Software</article-title>
          , International Conference on,
          <volume>0</volume>
          :
          <fpage>236</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>Spectrum-based multiple fault localization</article-title>
          .
          <source>In Automated Software Engineering (ASE)</source>
          , pages
          <fpage>88</fpage>
          -
          <lpage>99</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Perez</surname>
          </string-name>
          , Rui Abreu, and
          <string-name>
            <given-names>André</given-names>
            <surname>Riboira</surname>
          </string-name>
          .
          <article-title>A dynamic code coverage approach to maximize fault localization efficiency</article-title>
          .
          <source>Journal of Systems and Software</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Johan de Kleer and Brian</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Diagnosing multiple faults</article-title>
          .
          <source>Artif</source>
          . Intell.,
          <volume>32</volume>
          (
          <issue>1</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>130</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Brian</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
            and
            <given-names>Robert J.</given-names>
          </string-name>
          <string-name>
            <surname>Ragno</surname>
          </string-name>
          .
          <string-name>
            <surname>Conflict-directed</surname>
            <given-names>A</given-names>
          </string-name>
          *
          <article-title>and its role in model-based embedded systems</article-title>
          .
          <source>Discrete Appl</source>
          . Math.,
          <volume>155</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1562</fpage>
          -
          <lpage>1595</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Roni</surname>
            <given-names>Stern</given-names>
          </string-name>
          , Meir Kalech, Alexander Feldman, and
          <string-name>
            <surname>Gregory</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Provan</surname>
          </string-name>
          .
          <article-title>Exploring the duality in conflict-directed model-based diagnosis</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Rui</given-names>
            <surname>Abreu</surname>
          </string-name>
          and
          <string-name>
            <surname>Arjan JC van Gemund</surname>
          </string-name>
          .
          <article-title>A low-cost approximate minimal hitting set algorithm and its application to model-based diagnosis</article-title>
          .
          <source>In SARA</source>
          , volume
          <volume>9</volume>
          , pages
          <fpage>2</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Danijel</surname>
            <given-names>Radjenovic</given-names>
          </string-name>
          , Marjan Hericko, Richard Torkar, and
          <string-name>
            <given-names>Ales</given-names>
            <surname>Zivkovic</surname>
          </string-name>
          .
          <article-title>Software fault prediction metrics: A systematic literature review</article-title>
          .
          <source>Information &amp; Software Technology</source>
          ,
          <volume>55</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1397</fpage>
          -
          <lpage>1418</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Thomas</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>McCabe</surname>
          </string-name>
          .
          <article-title>A complexity measure</article-title>
          .
          <source>IEEE Trans. Software Eng.</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>308</fpage>
          -
          <lpage>320</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Maurice</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Halstead</surname>
          </string-name>
          .
          <source>Elements of Software Science (Operating and Programming Systems Series)</source>
          . Elsevier Science Inc., New York, NY, USA,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Jacek</given-names>
            <surname>Śliwerski</surname>
          </string-name>
          , Thomas Zimmermann, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Zeller</surname>
          </string-name>
          .
          <article-title>When do changes induce fixes?</article-title>
          .
          <source>ACM SIGSOFT Software Engineering Notes</source>
          ,
          <volume>30</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Mitchell</surname>
          </string-name>
          .
          <article-title>Machine learning</article-title>
          .
          <source>McGraw Hill</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roni</given-names>
            <surname>Stern</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Meir</given-names>
            <surname>Kalech</surname>
          </string-name>
          .
          <article-title>Using modelbased diagnosis to improve software testing</article-title>
          .
          <source>In AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Birgit</surname>
            <given-names>Hofer</given-names>
          </string-name>
          , Franz Wotawa, and
          <string-name>
            <given-names>Rui</given-names>
            <surname>Abreu</surname>
          </string-name>
          .
          <article-title>Ai for the win: Improving spectrum-based fault localization</article-title>
          .
          <source>ACM SIGSOFT Software Engineering Notes</source>
          ,
          <volume>37</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Birgit</given-names>
            <surname>Hofer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Franz</given-names>
            <surname>Wotawa</surname>
          </string-name>
          .
          <article-title>Spectrum enhanced dynamic slicing for better fault localization</article-title>
          .
          <source>In ECAI</source>
          , pages
          <fpage>420</fpage>
          -
          <lpage>425</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>