=Paper= {{Paper |id=Vol-2289/paper8 |storemode=property |title=QR-Augmented Spectrum-based Fault Localization |pdfUrl=https://ceur-ws.org/Vol-2289/paper8.pdf |volume=Vol-2289 |authors=Alexandre Perez,Rui Abreu |dblpUrl=https://dblp.org/rec/conf/safeprocess/PerezA18 }} ==QR-Augmented Spectrum-based Fault Localization== https://ceur-ws.org/Vol-2289/paper8.pdf
                        QR-Augmented Spectrum-based Fault Localization*

                                           Alexandre Perez1 , Rui Abreu2
                             1
                               University of Porto and HASLab, INESC-TEC, Portugal
                                2
                                  IST, University of Lisbon and INESC-ID, Portugal
                            alexandre.perez@fe.up.pt, rui@computer.org


                          Abstract                                   issue pointed out by 4 is the fact that many SFL studies as-
                                                                     sume perfect fault understanding — that is, these studies
     Spectrum-based fault localization (SFL) corre-                  expect that once users inspect a faulty component, they will
     lates a system’s components with observed fail-                 correctly identify it as such —, which does not always hold
     ures. By reasoning about coverage, SFL allows                   in practice [6].
     for a lightweight way of pinpointing faults. This                  This paper proposes an approach that inspects the state
     abstraction comes at the cost of missing certain                of system components, with the intent of augmenting re-
     faults, such as errors of omission, and failing to              ports generated by SFL techniques and hence providing
     provide enough contextual information to explain                more diagnostic information. Recording the state of indi-
     why components are considered suspicious. We                    vidual components in each execution quickly becomes in-
     propose an approach, named Q-SFL, that lever-                   tractable, even for a lightweight approach as SFL. There-
     ages qualitative reasoning to augment the infor-                fore, we leverage Qualitative Reasoning (QR), which pro-
     mation made available to SFL techniques. It qual-               vides a way of describing a set of values by their discrete,
     itatively partitions system components, and treats              behavioral qualities, to enable the reasoning about a sys-
     each qualitative state as a new SFL component to                tem’s behavior without exact quantitative information [7;
     be used when diagnosing. Our empirical evalua-                  8]. Precise numerical quantities are avoided and replaced
     tion shows that augmenting SFL with qualitative                 by qualitative descriptions — such as, for instance: high,
     components can improve diagnostic accuracy in                   low, zero, increasing or decreasing.
     54% of the considered real-world subjects.                         We apply QR to the SFL analysis, in an approach named
                                                                     Q-SFL, enabling the introduction of quantitative landmarks
               In Memoriam: Danny Bobrow.                            that will partition the domains of relevant components into
                                                                     a set of qualitative descriptions, and insert new SFL com-
1   Introduction                                                     ponents for each of these descriptions. As behavioral qual-
                                                                     ities are now considered as components, their involvement
SFL [1] was shown to be a lightweight, yet effective, tech-          in system executions is therefore recorded and ranked ac-
nique for locating faults in a system. It consists of keeping        cording to their similarity to observed failures, enriching the
a record of which components are involved in each system             SFL report as a result. This can have benefits in fault com-
execution and subsequently ranking those components ac-              prehension — because qualitative properties are considered
cording to their similarity to failing executions. The intu-         besides merely recording involvement — and even improve
ition being that a faulty component is very likely to be in-         diagnostic report accuracy — whenever a qualitative state
volved in failing executions and not as likely to be covered         is more correlated with failing behavior than its enclosing
in nominal ones. Over the years, many extensions were pro-           system component.
posed to improve SFL’s applicability and effectiveness, such            We perform an empirical evaluation of Q-SFL with real-
as exploring which similarity coefficients yield better fault        world faults from the Defects4J [9] catalog of faulty soft-
localization results [2] and handling multiple or intermittent       ware programs. Results show that Q-SFL has the potential
faults [3].                                                          to improve the accuracy of SFL reports —with 54% of con-
   Despite the developments and achievements in SFL re-              sidered subjects exhibiting a lower effort to diagnose faults.
search, we are unable to find many accounts of successful            Although the results are promising, we discuss several mat-
transitions of this technology into the industry at large. We        ters that need further research —namely, uncovering a land-
argue that this is motivated largely by the issues raised by 4       marking strategy that exhibits consistently better results and
in their 4 user study of automated debugging techniques [4;          studying to what extent fault comprehension is improved.
5]. Namely, the authors found that there is significant inter-          This paper’s contributions are:
est drop-off after users inspect a small number of compo-                • An approach, named Q-SFL, inspired by qualitative
nents from the ranked list of potential faults. This issue is              reasoning (QR) research, to augment program spectra
exacerbated as the scale of the system increases. Another                  used by SFL techniques by partitioning system compo-
   *                                                                       nents into a set of qualitative states which are treated as
     This is a preliminary version of the paper titled “Leveraging
Qualitative Reasoning to Improve SFL”, to be presented at IJCAI-           SFL components.
ECAI 2018.                                                               • Empirical evidence that QR-enhanced spectra can re-
        duce the effort to diagnose real software bugs in 54%       Landmark L1:                Landmark L2:                Landmark L3:             Landmark L4:
                                                                    T = -273.15°C                  T = 0°C                   T = 100°C                 T = +∞°C
        of considered subjects.                                                       Q1-                         Q2-                       Q3-


2       Background                                                                  Q1: Solid                  Q2: Liquid                  Q3: Gas

This section briefly summarizes the concepts upon which
                                                                                      Q1+                        Q2+                     Q3+
our approach is based on.                                           Absolute zero           Freezing/Melting        Boiling/Condensation                Infinity
                                                                                                  Point                     Point
2.1      Spectrum-based Fault Localization (SFL)
In SFL, the following is given:                                     Figure 1: Example of a possible qualitative discretization of water
   • A finite set C = {c1 , ..., cM } of M system compo-            temperature.
     nents;
   • A finite set T = {t1 , ..., tN } of N transactions, which      In cases where values for gl are not available, they can be
     can be seen as records of a system execution;                  estimated by maximizing Pr((A, e) | dk ) under parameters
   • An error vector e = {e1 , ..., eN } of transaction out-        {gl | l ∈ dk } [13].
     comes, where ei = 1 if ti failed, ei = 0 otherwise;               To measure the accuracy of SFL approaches, the cost of
                                                                    diagnosis (Cd ) metric is often used [4]. It measures the
   • A M × N activity matrix A, where Aij = 1 if compo-             number of candidates to be inspected until the real fault is
     nent cj is involved in transaction ti , Aij = 0 otherwise.     reached, assuming candidates are inspected by descending
The goal of SFL is to pinpoint which (sets of) components           order of probability. A Cd of 0 indicates an ideal diagnosis
are more likely to have caused the system to fail. Earlier ap-      where the real fault is at the top of the ranked list of candi-
proaches to SFL measure the similarity between a compo-             dates.
nent’s involvement in transactions and the error vector [1;
2]. Later on, spectrum-based reasoning (SR) was intro-              2.2    Qualitative Reasoning (QR)
duced [3], leveraging a Bayesian reasoning framework to di-         QR creates a discrete representation of the continuous
agnose, even when multiple, intermittent faults are present.        world [7; 8; 15], enabling the reasoning of space, time, and
The two main steps of SR are candidate generation and can-          quantity with merely a small amount of information. It is
didate ranking:                                                     motivated by the fact that humans are able to draw conclu-
Candidate Generation                                                sions about the physical world around them with limited in-
The first step in SR is to generate a set D = {d1 , ..., dk }       formation, without the need of solving complex differential
of diagnosis candidates. A diagnosis candidate dk ⊆ C is            equations.
valid if every failed transaction involved at least one com-           Figure 1 provides an example of a potential discretization
ponent from dk . Candidate dk is minimal if no valid can-           of the water temperature into three qualitative values: Q1,
didate is contained in dk . We are only interested in min-          Q2 and Q3. Our representation resolution — the granular-
imal candidates, as they can subsume all others. Heuris-            ity of the information detail — coincides with that of the
tic approaches to finding these minimal hitting sets include        three physical states of matter that water can assume: solid,
S TACCATO [10], S AFARI [11] and M HS2 [12].                        liquid, and gas. Note that the established resolution will ul-
                                                                    timately define the granularity of the conclusions one can
Candidate Ranking
                                                                    draw from QR. To define the qualitative states, one needs
For each candidate dk , their fault probability is calculated
                                                                    to establish landmarks. Landmarks are constant quantita-
using the Naïve Bayes rule1 [13]
                                                                    tive values that establish a point of comparison [16]. In
                                Y Pr((Ai , ei ) | dk )
   Pr(dk | (A, e)) = Pr(dk ) ·                            (1)       this example, we know that if the water is in the liquid
                                          Pr(Ai )                   state (Q2), then its temperature is somewhere between land-
                                 i ∈ 1..N
                                                                    mark L2 — corresponding to 0°C, the freezing point of wa-
Let Ai be short for {Aij |1 ∈ 1..M }, representing compo-           ter — and landmark L3 — 100°C, its boiling point. Sim-
nent activity in ith transaction. Pr(Ai ) is a normalizing term     ilarly, we can derive that ice (Q1) temperature assumes a
that is identical for all candidates. Let pl denote the prior       value between the absolute zero (L1) and the melting point
probability2 that a component cl is at fault. The prior prob-       (L2); and that water vapor (Q3) ranges between the conden-
ability for a candidate dk is given by                              sation point (L3) and positive infinity (L4).
                          Y          Y
              Pr(dk ) =        pl ·        (1 − pl )         (2)       QR also supports the representation of derivatives be-
                        l ∈ dk      l ∈ C\dk
                                                                    tween two quantities. They are usually represented with
                                                                    ’+’ and ’-’ signs, denoting value increases and decreases,
  Pr(Ai , ei | dk ) is used to bias the prior probability taking    respectively. This enables the use of sign algebra to rea-
observations into account. Let gl (referred to as component         son about direct influence and proportionality between two
goodness) denote the probability that cl performs nominally         qualitative values. Derivatives also enable envisionments.
                          Y
                                        gl     if ei = 0           An envisionment establishes a set of transitions between
                                                                    qualitative states [15], essentially modeling the abstracted
                         
                         
                           l ∈ (dk ∩Ai )
  Pr((Ai , ei ) | dk ) =            Y                        (3)    world. A possible transition in our example’s envisionment
                          1  −             gl otherwise
                         
                                                                   is the following: given that we observe Q2+ — that is, we
                                 l ∈ (dk ∩Ai )
                                                                    observe that the liquid water’s temperature is rising —, then
    1
     Probabilities are calculated assuming conditional indepen-     we know that the only possible following states are Q2 (con-
dence throughout the process.                                       tinues in the liquid state) and Q3 (condensates into vapor),
   2
     In the case of software diagnosis, one can approximate pl as   but never Q1 (freezes into ice).
1/1000, i.e., 1 fault for each 1000 lines of code [14].                Summarily, with a QR framework, we establish a way to
(i) represent quantities through discrete states, (ii) provide                                         A0      t1    t2     t3     t4
a way to compare values between these states, (iii) enable                                             c1      1     1      1      1
derivations and sign algebra, and (iv) model envisionments             A    t1   t2   t3    t4         c01     0     0      1      1
detailing possible transitions between states. With such a             c1   1    1    1     1          c001    1     1      0      0
framework, we can model, plan, simulate and reason about               c2   0    0    1     1          c2      0     0      1      1
                                                                       c3   1    1    1     0          c3      1     1      1      0
a multitude of intricate problems in an abstract way.
                                                                       c4   0    0    1     0          c03     1     0      0      0
                                                                       e    1    1    0     0          c003    0     1      1      0
3     Approach                                                        (a) Regular spectrum.            c4      0     0      1      0
                                                                     Pr({c1 } | (A, e)) = 0.30         e0      1     1      0      0
This section motivates the need to augment standard
spectrum-based fault localization approaches with more               Pr({c3 } | (A, e)) = 0.70     (b) QR-augmented spectrum.
                                                                                                     Pr({c1 } | (A0 , e0 )) = 0.05
contextual information, and details our Q-SFL approach to
                                                                                                    Pr({c001 } | (A0 , e0 )) = 0.82
achieve it by qualitatively partitioning SFL components.                                             Pr({c3 } | (A0 , e0 )) = 0.12
                                                                                                   Pr({c03 , c003 } | (A0 , e0 )) = 0.01
3.1    Limitations of Spectrum-based Analyses
SFL faces several issues preventing it from widespread                  Figure 2: Example of coverage partitioning via QR.
adoption and use. Not the least of which is the lack of con-
textual information, essential for understanding why diag-        partitioning to inspect existing components during each sys-
nostic candidates are considered suspicious. This has been        tem execution and assign them a qualitative state. Each of
pointed out in the software fault localization literature [4].    these qualitative states is then considered as a separate SFL
As SFL reasons about failures at the spectrum level, it only      component whose involvement per transaction is recorded
has access to whether a component was involved or not             and fed into the SFL framework for diagnosis. Note that
in each system transaction. While this enables a flexible,        we use software diagnosis to help describe Q-SFL and, later
lightweight analysis, the necessary abstraction can impose        on, to evaluate it. Despite this, the Q-SFL concept applies
a tradeoff both in accuracy and comprehension. Although           to other SFL use cases where one can inspect the state of
there have been efforts to incorporate more data into the di-     components, as is the case of electronic circuit diagnosis,
agnostic process — by modeling component behavior that            and others.
considers the system’s state and previous diagnoses [17]; or         Figure 2 depicts how QR can be employed to enhance
by leveraging prediction models trained from issue track-         program spectra. Figure 2a shows a 4 transaction by 4
ing data [18] — they were focused on conditioning the fault       component spectrum, along with resulting diagnostic scores
probability of existing diagnostic candidates, increasing ac-     from applying reasoning-based SFL. Candidate generation,
curacy but not necessarily increasing the ability to compre-      following the methodology described in Section 2.1 yields
hend the diagnostic report.                                       two candidate diagnoses — components c1 and c3 can in-
   Aside from comprehension, and because SFL only rea-            dependently explain the observed failures as both cover all
sons about component involvement in failing transactions,         failing test cases. For this example, suppose that c1 is the
omission errors — such as bound checks — also become              faulty component. Since c1 is involved in more passing
difficult to diagnose [19]. The abstract nature of the spec-      transactions, the SFL framework will assign it a lower fault
tra that is fed into current SFL frameworks also leads to         probability than c3 . To improve the accuracy of the SFL
the formation of ambiguity groups and facilitates the occur-      framework, one needs more contextual information about
rence of coincidental correctness. An ambiguity group is a        component executions.
group of components with identical involvement in all trans-         We envision three different types of landmarking strate-
actions [20]. Since they exhibit the same coverage pattern,       gies that can be employed to define qualitative state bound-
no component within an ambiguity group can be uniquely            aries: (i) manual landmarking, where the system’s develop-
identified as the root cause of failure, potentially hindering    ers manually define what are the possible qualitative states
accuracy. Coincidental correctness refers to the event when       for a given component; (ii) static landmarking, where land-
no failure is detected, even though a fault has been executed     marks depend on the type of a component; and (iii) dynamic
[21]. Depending on the component granularity selected for
                                                                  landmarking, where a component’s value is inspected at
the analysis, coincidental correctness can happen at a fre-       runtime, and partitioned into a set of categories. Examples
quent rate. In particular, when two tests share the same cov-     of dynamic strategies will be presented in Section 4.
erage path, but produce different outcomes, it becomes sig-          Figure 2b depicts the QR-augmented spectrum, where
nificantly harder to distinguish them without further contex-     components representing qualitative partitions of both c1
tual information. Coincidental correctness can potentially        and c3 are added to the original spectrum. An example of
lead to exonerating real faults as they are observed to be-       such partitioning using static landmarking: if c1 represents
have nominally.                                                   a software procedure that contains a numeric parameter i,
                                                                  we can create two qualitative components c01 and c001 that
3.2    Q-SFL                                                      represent invocations of c1 with i ≥ 0 and i < 0, respec-
We argue that all of the issues described above can be at least   tively. This is a sign-based static partitioning strategy. Note
attenuated if we supplement the SFL framework with more           that the original components c1 and c3 are not removed from
contextual information about the system under analysis to         the QR-augmented spectrum, as partitions may not provide
perform the diagnosis. Our Q-SFL approach consists of par-        further fault isolation.
titioning several SFL components into multiple, meaning-             If we are to diagnose the new spectrum from Figure 2b,
ful, qualitatively distinct subcomponents, to be used in the      component c001 is now the top-ranked diagnostic candidate.
fault localization. We leverage the QR concept of domain          This QR-augmented spectrum avoids spurious inspections
of component c3 , and provides additional contextual infor-        where Cd is the cost of diagnosis, as explained in Sec-
mation about the fault, namely that i < 0 is often observed        tion 2.1. A positive ∆Cd means that the faulty component
in failing transactions.                                           has risen in the ranking reported by SFL techniques when
   By landmarking data units associated with SFL compo-            QR is used, yielding a lowered cost of diagnosing.
nents so that they are assigned a qualitative state at runtime,
we are providing more context to the diagnostic process, and       4.2   Results and Discussion
in some cases, consequently reducing the diagnostic effort.        We were able to automatically partition the faulty method
Such partitioning is also of crucial importance towards mini-      in 167 D4J subjects. The remaining D4J subjects were dis-
mizing the impact and frequency of ambiguity grouping and          carded because (i) the faulty method does not contain pa-
coincidental correctness, as new, distinct components are          rameters nor does it return a value; because (ii) the faulty
added to the system’s spectrum.                                    method only contains non-primitive, non-null, complex-
                                                                   typed parameters, which cannot be handled by the set of
4       Evaluation                                                 partitioning strategies described in Section 4.1; or because
                                                                   (iii) the aforementioned partitioning strategies were unable
To evaluate our approach, we compare the cost of diagnos-
                                                                   to create qualitative states whose coverage differs from their
ing a collection of faulty software programs using regular
                                                                   enclosing method.
spectra against using QR-augmented spectra.
                                                                      Our first research question is
4.1      Methodology
We have sourced experimental subjects from the Defects4J3            RQ1: Does augmenting spectra with qualitative com-
(D4J) database. D4J is a catalog of 395 real, reproducible           ponents improve their diagnosability?
software bugs from 6 open-source projects — namely
JFreeChart, Google Closure compiler, Apache Commons                In RQ1, we are concerned with finding if there exist. qual-
Lang, Apache Commons Math, Mockito, and Joda-Time.                 itative partitionings able to improve the fault localization
For each bug, a developer-written, fault-revealing test suite      ranking to the extent that faulty components are inspected
is made available.                                                 earlier — thus decreasing developer wasted effort in a de-
    We run the fault-revealing test suite of each buggy D4J        bugging task. Hence, for each D4J subject, we choose
subject, gathering method-level coverage and test outcomes,        as the landmarking strategy to consider in the evaluation
to construct its spectrum. Besides coverage, we also record        the one that is able to create the largest set of distinct,
the value of all primitive-type arguments and return values        non-ambiguous qualitative components out of the faulty
for every method call. This enables us to experiment with          method(s). The breakdown of selected partitioning strate-
different qualitative partitioning strategies in an offline man-   gies per subject is as follows: Sign partitioning: 102 (61%
ner.                                                               of subjects); X-means: 25 (15%); k-NN: 8 (5%); Linear
    Using the recorded argument and return value data, we          Regression: 1 (1%); Logistic Regression: 4 (2%); De-
create multiple (automated) partitioning models resulting in       cision Tree: 11 (7%); Random Forest: 16 (10%). Our
several Q-SFL variants. A static partitioning variant using        sign-partitioning default strategy was used to qualitatively
automated sign partitioning based on the variable’s type, as       enhance the majority of considered subjects, while other
described in Section 3.2, was considered. For dynamic par-         strategies such as linear classification and logistic regres-
titioning, several clustering and classification algorithms4       sion were rarely selected. We believe the reason that su-
were considered: k-NN, linear classification, logistic regres-     pervised learning approaches — which were fed test case
sion, decision trees, random forest, and x-means clustering        outcomes as the target class label — only exhibited superior
Test outcomes are used as the class labels in the case of          performance in 40 subjects (24%) is due to the fact that the
supervised models. Note that we are not using the afore-           number of failing tests in test suites is often much smaller
mentioned models for prediction, but rather as a partitioning      than the amount of passing tests, weakening the resulting
scheme based on observed values. Hence, we do not break            partitioning model.
our data into training and test sets, as is customary in pre-         Figure 3 shows a scatter plot with the ∆Cd of all subjects
diction scenarios. Because we use automated, domain inde-          under analysis. Shown in a red background are the 15 sub-
pendent partitioning, only primitive types are considered in       jects (9%) with a negative ∆Cd — meaning that the report
the evaluation.                                                    has suffered a decrease in accuracy after augmenting the
    To evaluate a QR-enhanced spectrum against its respec-         spectra. The majority of these subjects belong to the Clo-
tive original spectrum, we first reduce the Q-SFL diagnos-         sure project. The 62 subjects (37%) with ∆Cd = 0, where
tic report to method components. This reduction is done            the faulty component has remained in the same position of
by considering the highest fault probability of any subcom-        the ranking, are shown in a white background. Lastly, 90
ponent belonging to each method, to effectively be able to         subjects (54%) that exhibited a positive ∆Cd — cases where
compare method-level diagnostic effort between the two ap-         QR-enhanced spectra improved diagnosability — are shown
proaches. A change in diagnostic effort is measured using          in green. All in all, Q-SFL is at least as effective as the orig-
                                                                   inal approach in 92% of scenarios.
          ∆Cd = Cd (Original) − Cd (QR-Enhanced)            (4)       Table 1 presents statistics computed to assess whether the
    3
      Defects4J 1.1.0 is available at https://github.com/          observed metrics yield statistically significant results. QR-
rjust/defects4j (accessed June 2018).                              enhanced spectra exhibits an overall lower effort to diagnose
    4
      We chose popular classification algorithms implemented in    when compared to the original spectra, with less variance.
the Scikit-learn package. X-means, as implemented in the           To assess significance, we first performed the Shapiro-Wilk
pyclustering package, was selected as it can automatically         test for normality of effort data in both the original spec-
decide the optimal number of clusters to use [22].                 tra case and QR case. With 99% confidence, the test’s re-
       103
                    ΔCd < 0                              ΔCd ≤ 0
                              ΔCd ≥ 0                                  ΔCd > 0                                                             140
                                                                                                                                                                                                        ΔCd < 0
                                                                                                                                                                                                        ΔCd = 0
       102
                                                                                                                                                                                                        ΔCd > 0
                                                                                                                                           120

       101
                                                                                                                                           100




                                                                                                                              # Subjects
       100                                                                                                                                 80
ΔCd




        0
                                                                                                                                           60
      −100


                                                                                                                                           40
                                                                                                               Closure
      −101                                                                                                     Commons Lang
                                                                                                               Commons Math                20
                                                                                                               JFreeChart
                                                                                                               JodaTime
            2
      −10                                                                                                      Mockito
                                                                                                                                            0
                                                                                                                                                 Sign   X-means   k-NN      Linear    Logistic   Tree    Forest
                0             20          40        60               80              100           120   140        160
                                                                   Subject Number                                                                                        Strategies




Figure 3: Difference in Cd between original and QR-enhanced spectra per sub- Figure 4: Breakdown of diagnostic performance per par-
ject.                                                                        titioning strategy.


                                         Table 1: Statistical tests.                                              gories for every partitioning strategy considered in this eval-
                                                                                                                  uation. This bar plot tells us that no single strategy achieves
                                               Original                             QR-enhanced                   the same number of positive ∆Cd scenarios as the partition
                                               Spectra                                Spectra                     cardinality selection scheme employed to answer RQ1 and
                Mean Cd                         60.28                                      37.56                  to produce Figure 3. Furthermore, strategies that were of-
                                                                                                                  ten picked by that criterion (namely, sign partitioning and
                Median Cd                        6.00                                      2.50                   X-means strategies) also show an increased number of neg-
                                                                                                                  ative ∆Cd scenarios when compared to others. This leads us
         Cd Variance                           2.10×104                               1.56×104                    to conclude that no single strategy (out of the ones that were
                                                                                                                  analyzed) is able to consistently show improved diagnoses.
                                              W = 0.46              W = 0.32
        Shapiro-Wilk
                                        p-value = 2.20×10−22 p-value = 1.10×10−24
                                                                                                                     RQ2? No, at least for the automated landmarking strate-
             Wilcoxon                                     Z = 5.45                                                gies considered in the evaluation, there is no evidence that
            Signed-rank                             p-value = 5.10×10−10                                          a single automated strategy can consistently outperform the
                                                                                                                  original spectra. However, since Q-SFL can improve diag-
                                                                                                                  nosability, as per the answer to RQ1, we presume that man-
sults tell us that the distributions are not normal. Given                                                        ual or more complex, context-aware, automated white-box
that Cd is not normally distributed and that each observation                                                     strategies — which can perform static and dynamic source
is paired, we use the non-parametrical statistical hypothesis                                                     code analysis — are much more suited to outperform the
test Wilcoxon signed-rank. Our null-hypothesis is that the                                                        original spectra due to more effective and more informed
median difference between the two observations (i.e., ∆Cd )                                                       partitioning.
is zero. The fifth row from Table 1 shows the resulting Z
statistic and p-value of Wilcoxon’s test. With 99% confi-                                                         5       Related Work
dence, we refute the null-hypothesis.
   RQ1? Yes, augmenting faulty spectra with new compo-                                                            There have been previous forays into enhancing the diag-
nents resulting from qualitative landmarking of method pa-                                                        nostic report of automated fault localization techniques to
rameter (and method return) values yields a statistically sig-                                                    either improve their accuracy or comprehension of the fail-
nificant improved diagnostic report.                                                                              ing component.
   To be able to answer RQ1, we have selected for each sub-                                                          SFL approaches to debugging typically present their re-
ject the strategy with the highest number of qualitative parti-                                                   port to users as a list of suspicious components that is sorted
tions targeting the faulty method, as we were only concerned                                                      according to the likelihood of being faulty. 1 have proposed
with the existence of a partitioning strategy that would im-                                                      a visual way of depicting the results of a similarity-based
prove diagnosability. However, in practice, it is not realistic                                                   software SFL diagnosis, color-coding each component ac-
to know a-priori what the faulty method is5 . Hence, our                                                          cording to their suspiciousness score [1]. 24 expand on the
second research question is                                                                                       visual concept by leveraging tree-based visualizations that
                                                                                                                  innately exploit the tree-like structure of Java code, natu-
                                                                                                                  rally aggregating neighboring components and aiding explo-
      RQ2: Is there a particular automated landmarking                                                            ration of suspicious code regions [24]. Another approach to
      strategy that consistently shows improved diagnosabil-                                                      improve the comprehension of faults was proposed by 25,
      ity?                                                                                                        called Whyline, which allows the users to obtain evidence
                                                                                                                  about the program’s execution before forming an explana-
  Figure 4 shows a breakdown of the number of subjects                                                            tion of the cause by providing the ability to ask “why did”
that fall into the ∆Cd < 0, ∆Cd = 0 and ∆Cd > 0 cate-                                                             and “why didn’t” questions about program output [25].
                                                                                                                     26 have proposed an extension to SFL to improve com-
        5                                                                                                         prehension. It uses integration coverage data, by way of
    Although some effort has been put forth to hierarchically de-
bug programs using SFL [23].                                                                                      capturing method invocation pairs, to guide the fault local-
ization process. By calculating the fault likelihood of com-           extent that qualitative domain partitioning aids fault
ponent pairs, the authors are able to generate roadmaps for            understanding.
component investigation, guiding users through likely faulty
paths and increasing the amount of contextual cues [26].          Acknowledgments
   Advancements in bug prediction [27] have enabled its
use within automated fault localization processes. 28 pro-        This material is based upon work supported by the scholar-
pose an ensemble approach to fault localization that exploits     ship number SFRH/BD/95339/2013 from Fundação para a
information from versioning systems, bug tracking reposi-         Ciência e Tecnologia (FCT).
tories and structured information retrieval from the source
code [28]. 17 rely on kernel density estimation models of         References
component behavior and previous diagnoses to better esti-
mate the component goodness parameter in spectrum-based           [1] James A. Jones, Mary Jean Harrold, and John T.
reasoning [17]. 18 also modify the traditional spectrum-              Stasko. Visualization of test information to assist fault
based reasoning framework by leveraging a fault prediction            localization. In ICSE’02, pages 467–477, 2002.
model trained with historical information from the project’s      [2] Lucia, David Lo, Lingxiao Jiang, Ferdian Thung, and
versioning system and bug tracker to compute the prior                Aditya Budi. Extended comprehensive study of asso-
probability distribution of diagnostic candidates [18].               ciation measures for fault localization. Journal of Soft-
   Augmenting fault-localization via slicing has also been            ware: Evolution and Process, 26(2):172–219, 2014.
proposed. 29 have proposed the use of dynamic backward
slices — comprised of statements that directly or indirectly      [3] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van
effect the computation of the output value through data- or           Gemund. Spectrum-based multiple fault localization.
control-dependency chains — as components in similarity-              In ASE’09, pages 88–99, 2009.
based SFL [29]. 30 propose an approach that leverages a           [4] Chris Parnin and Alessandro Orso. Are automated de-
model-based slicing-hitting-set-computation — which com-              bugging techniques actually helping programmers? In
putes the dynamic slices of all faulty variables in all failed        ISSTA’11, pages 199–209, 2011.
test cases, derives minimal diagnostic candidates from the
slices and computes fault probabilities for each statement        [5] Aaron Ang, Alexandre Perez, Arie van Deursen, and
based on number of the diagnoses that contain it [30].                Rui Abreu. Revisiting the practical use of automated
                                                                      software fault localization techniques. In IWPD’17,
                                                                      pages 175–182, 2017.
6     Conclusion
                                                                  [6] Carlos Gouveia, José Campos, and Rui Abreu. Using
This paper proposes a new approach to spectrum-based fault            HTML5 visualizations in software fault localization.
localization that leverages qualitative reasoning (QR). The           In VISSOFT’13, pages 1–10, 2013.
Q-SFL approach splits components form the software sys-
tem under analysis into a set of qualitative states through       [7] Kenneth D. Forbus. Qualitative reasoning. In The
the creation of qualitative landmarks that partition a com-           Computer Science and Engineering Handbook, pages
ponent’s domain. These qualitative states are then consid-            715–733. 1997.
ered as SFL components to be ranked using traditional fault-      [8] Brian C. Williams and Johan de Kleer. Qualitative rea-
localization methodologies. Since we treat these qualitative          soning about physical systems: A return to roots. Ar-
states as components, our diagnostic reports not only rec-            tificial Intelligence, 51(1-3):1–9, 1991.
ommend likely fault locations, but also provide an insight
on what behaviors the faulty components assume when fail-         [9] René Just, Darioush Jalali, and Michael D. Ernst. De-
ures are detected, facilitating the comprehension of the fault.       fects4j: a database of existing faults to enable con-
   We evaluate the approach on subjects from the Defects4J            trolled testing studies for java programs. In ISSTA’14,
catalog of real faults from medium and large-sized open               pages 437–440, 2014.
source software projects. Results show that spectra which         [10] Rui Abreu and Arjan J. C. van Gemund. A low-cost
were augmented using qualitative partitioning of method pa-            approximate minimal hitting set algorithm and its ap-
rameters shows a (statistically significant) improvement in            plication to model-based diagnosis. In SARA’09, 2009.
the diagnostic accuracy in 54% of scenarios. However, we
                                                                  [11] Alexander Feldman, Gregory M. Provan, and Arjan
also found no evidence of automated partitioning strategies
that were consistently better than the original spectra, mean-         J. C. van Gemund. Computing minimal diagnoses by
ing that more intricate, context-aware partitioning strategies         greedy stochastic search. In AAAI’08, pages 911–918,
will likely be necessary for practical applications of the ap-         2008.
proach.                                                           [12] Nuno Cardoso and Rui Abreu. MHS2: A map-reduce
   This work lays the first stone in a series of efforts to            heuristic-driven minimal hitting set search algorithm.
more deeply integrate reasoning-based AI approaches into               In MUSEPAT’13, pages 25–36, 2013.
spectrum-based fault localization. It paves the way for fur-      [13] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van
ther efforts by the fault localization research community,             Gemund. A new bayesian approach to multiple inter-
namely by:                                                             mittent fault diagnosis. In IJCAI’09, pages 653–658,
    1. Improving automated landmarking by expanding its                2009.
       application to complex non-primitive objects and by        [14] John Carey, Neil Gross, Marcia Stepanek, and Otis
       exploring ensembles of multiple strategies.
                                                                       Port. Software hell. In Business Week, pages 391–411,
    2. Conducting a systematic user study investigating the            1999.
[15] Johan de Kleer. Multiple representations of knowledge    [23] Alexandre Perez, Rui Abreu, and André Riboira. A
     in a mechanics problem-solver. In IJCAI’77, pages             dynamic code coverage approach to maximize fault lo-
     299–304, 1977.                                                calization efficiency. Journal of Systems and Software,
[16] Benjamin Kuipers. Qualitative simulation. Artificial          90:18–28, 2014.
     Intelligence, 29(3):289–338, 1986.                       [24] José Campos, André Riboira, Alexandre Perez, and
[17] Nuno Cardoso and Rui Abreu. A kernel density                  Rui Abreu. GZoltar: an eclipse plug-in for testing and
     estimate-based approach to component goodness mod-            debugging. In ASE’12, pages 378–381, 2012.
     eling. In AAAI’13, 2013.                                 [25] Andrew Jensen Ko and Brad A. Myers. Designing the
[18] Amir Elmishali, Roni Stern, and Meir Kalech. Data-            whyline: a debugging interface for asking questions
     augmented software diagnosis. In AAAI’16, pages               about program behavior. In CHI’04, pages 151–158,
     4003–4009, 2016.                                              2004.
[19] Xiaofeng Xu, Vidroha Debroy, W. Eric Wong, and           [26] Higor Amario de Souza and Marcos Lordello Chaim.
     Donghui Guo. Ties within fault localization rankings:         Adding context to fault localization with integration
     Exposing and addressing the problem. International            coverage. In ASE’13, pages 628–633, 2013.
     Journal of Software Engineering and Knowledge En-        [27] Marco D’Ambros, Michele Lanza, and Romain
     gineering, 21(6):803–827, 2011.                               Robbes. An extensive comparison of bug prediction
[20] G. N. Stenbakken, T. M. Souders, and G. W. Stewart.           approaches. In MSR’10, pages 31–41, 2010.
     Ambiguity groups and testability. IEEE Transactions      [28] Shaowei Wang and David Lo. Version history, simi-
     on Instrumentation and Measurement, 38(5):941–947,            lar report, and structure: putting them together for im-
     1989.                                                         proved bug localization. In ICPC’14, pages 53–63,
[21] Debra J. Richardson and Margaret C. Thompson. An              2014.
     analysis of test data selection criteria using the RE-   [29] Xiaoguang Mao, Yan Lei, Ziying Dai, Yuhua Qi, and
     LAY model of fault detection. IEEE Transactions on            Chengsong Wang. Slice-based statistical fault local-
     Software Engineering, 19(6):533–553, 1993.                    ization. Journal of Systems and Software, 89:51–62,
[22] Dan Pelleg and Andrew W. Moore. X-means: Extend-              2014.
     ing k-means with efficient estimation of the number of   [30] Birgit Hofer and Franz Wotawa. Spectrum enhanced
     clusters. In ICML’00, pages 727–734, 2000.                    dynamic slicing for better fault localization.        In
                                                                   ECAI’12, pages 420–425, 2012.