=Paper= {{Paper |id=Vol-2289/paper8 |storemode=property |title=QR-Augmented Spectrum-based Fault Localization |pdfUrl=https://ceur-ws.org/Vol-2289/paper8.pdf |volume=Vol-2289 |authors=Alexandre Perez,Rui Abreu |dblpUrl=https://dblp.org/rec/conf/safeprocess/PerezA18 }} ==QR-Augmented Spectrum-based Fault Localization== https://ceur-ws.org/Vol-2289/paper8.pdf

QR-Augmented Spectrum-based Fault Localization*

Alexandre Perez1 , Rui Abreu2
1
University of Porto and HASLab, INESC-TEC, Portugal
2
IST, University of Lisbon and INESC-ID, Portugal
alexandre.perez@fe.up.pt, rui@computer.org

Abstract issue pointed out by 4 is the fact that many SFL studies as-
sume perfect fault understanding — that is, these studies
Spectrum-based fault localization (SFL) corre- expect that once users inspect a faulty component, they will
lates a system’s components with observed fail- correctly identify it as such —, which does not always hold
ures. By reasoning about coverage, SFL allows in practice [6].
for a lightweight way of pinpointing faults. This This paper proposes an approach that inspects the state
abstraction comes at the cost of missing certain of system components, with the intent of augmenting re-
faults, such as errors of omission, and failing to ports generated by SFL techniques and hence providing
provide enough contextual information to explain more diagnostic information. Recording the state of indi-
why components are considered suspicious. We vidual components in each execution quickly becomes in-
propose an approach, named Q-SFL, that lever- tractable, even for a lightweight approach as SFL. There-
ages qualitative reasoning to augment the infor- fore, we leverage Qualitative Reasoning (QR), which pro-
mation made available to SFL techniques. It qual- vides a way of describing a set of values by their discrete,
itatively partitions system components, and treats behavioral qualities, to enable the reasoning about a sys-
each qualitative state as a new SFL component to tem’s behavior without exact quantitative information [7;
be used when diagnosing. Our empirical evalua- 8]. Precise numerical quantities are avoided and replaced
tion shows that augmenting SFL with qualitative by qualitative descriptions — such as, for instance: high,
components can improve diagnostic accuracy in low, zero, increasing or decreasing.
54% of the considered real-world subjects. We apply QR to the SFL analysis, in an approach named
Q-SFL, enabling the introduction of quantitative landmarks
In Memoriam: Danny Bobrow. that will partition the domains of relevant components into
a set of qualitative descriptions, and insert new SFL com-
1 Introduction ponents for each of these descriptions. As behavioral qual-
ities are now considered as components, their involvement
SFL [1] was shown to be a lightweight, yet effective, tech- in system executions is therefore recorded and ranked ac-
nique for locating faults in a system. It consists of keeping cording to their similarity to observed failures, enriching the
a record of which components are involved in each system SFL report as a result. This can have benefits in fault com-
execution and subsequently ranking those components ac- prehension — because qualitative properties are considered
cording to their similarity to failing executions. The intu- besides merely recording involvement — and even improve
ition being that a faulty component is very likely to be in- diagnostic report accuracy — whenever a qualitative state
volved in failing executions and not as likely to be covered is more correlated with failing behavior than its enclosing
in nominal ones. Over the years, many extensions were pro- system component.
posed to improve SFL’s applicability and effectiveness, such We perform an empirical evaluation of Q-SFL with real-
as exploring which similarity coefficients yield better fault world faults from the Defects4J [9] catalog of faulty soft-
localization results [2] and handling multiple or intermittent ware programs. Results show that Q-SFL has the potential
faults [3]. to improve the accuracy of SFL reports —with 54% of con-
Despite the developments and achievements in SFL re- sidered subjects exhibiting a lower effort to diagnose faults.
search, we are unable to find many accounts of successful Although the results are promising, we discuss several mat-
transitions of this technology into the industry at large. We ters that need further research —namely, uncovering a land-
argue that this is motivated largely by the issues raised by 4 marking strategy that exhibits consistently better results and
in their 4 user study of automated debugging techniques [4; studying to what extent fault comprehension is improved.
5]. Namely, the authors found that there is significant inter- This paper’s contributions are:
est drop-off after users inspect a small number of compo- • An approach, named Q-SFL, inspired by qualitative
nents from the ranked list of potential faults. This issue is reasoning (QR) research, to augment program spectra
exacerbated as the scale of the system increases. Another used by SFL techniques by partitioning system compo-
* nents into a set of qualitative states which are treated as
This is a preliminary version of the paper titled “Leveraging
Qualitative Reasoning to Improve SFL”, to be presented at IJCAI- SFL components.
ECAI 2018. • Empirical evidence that QR-enhanced spectra can re-
duce the effort to diagnose real software bugs in 54% Landmark L1: Landmark L2: Landmark L3: Landmark L4:
T = -273.15°C T = 0°C T = 100°C T = +∞°C
of considered subjects. Q1- Q2- Q3-

2 Background Q1: Solid Q2: Liquid Q3: Gas

This section briefly summarizes the concepts upon which
Q1+ Q2+ Q3+
our approach is based on. Absolute zero Freezing/Melting Boiling/Condensation Infinity
Point Point
2.1 Spectrum-based Fault Localization (SFL)
In SFL, the following is given: Figure 1: Example of a possible qualitative discretization of water
• A finite set C = {c1 , ..., cM } of M system compo- temperature.
nents;
• A finite set T = {t1 , ..., tN } of N transactions, which In cases where values for gl are not available, they can be
can be seen as records of a system execution; estimated by maximizing Pr((A, e) | dk ) under parameters
• An error vector e = {e1 , ..., eN } of transaction out- {gl | l ∈ dk } [13].
comes, where ei = 1 if ti failed, ei = 0 otherwise; To measure the accuracy of SFL approaches, the cost of
diagnosis (Cd ) metric is often used [4]. It measures the
• A M × N activity matrix A, where Aij = 1 if compo- number of candidates to be inspected until the real fault is
nent cj is involved in transaction ti , Aij = 0 otherwise. reached, assuming candidates are inspected by descending
The goal of SFL is to pinpoint which (sets of) components order of probability. A Cd of 0 indicates an ideal diagnosis
are more likely to have caused the system to fail. Earlier ap- where the real fault is at the top of the ranked list of candi-
proaches to SFL measure the similarity between a compo- dates.
nent’s involvement in transactions and the error vector [1;
2]. Later on, spectrum-based reasoning (SR) was intro- 2.2 Qualitative Reasoning (QR)
duced [3], leveraging a Bayesian reasoning framework to di- QR creates a discrete representation of the continuous
agnose, even when multiple, intermittent faults are present. world [7; 8; 15], enabling the reasoning of space, time, and
The two main steps of SR are candidate generation and can- quantity with merely a small amount of information. It is
didate ranking: motivated by the fact that humans are able to draw conclu-
Candidate Generation sions about the physical world around them with limited in-
The first step in SR is to generate a set D = {d1 , ..., dk } formation, without the need of solving complex differential
of diagnosis candidates. A diagnosis candidate dk ⊆ C is equations.
valid if every failed transaction involved at least one com- Figure 1 provides an example of a potential discretization
ponent from dk . Candidate dk is minimal if no valid can- of the water temperature into three qualitative values: Q1,
didate is contained in dk . We are only interested in min- Q2 and Q3. Our representation resolution — the granular-
imal candidates, as they can subsume all others. Heuris- ity of the information detail — coincides with that of the
tic approaches to finding these minimal hitting sets include three physical states of matter that water can assume: solid,
S TACCATO [10], S AFARI [11] and M HS2 [12]. liquid, and gas. Note that the established resolution will ul-
timately define the granularity of the conclusions one can
Candidate Ranking
draw from QR. To define the qualitative states, one needs
For each candidate dk , their fault probability is calculated
to establish landmarks. Landmarks are constant quantita-
using the Naïve Bayes rule1 [13]
tive values that establish a point of comparison [16]. In
Y Pr((Ai , ei ) | dk )
Pr(dk | (A, e)) = Pr(dk ) · (1) this example, we know that if the water is in the liquid
Pr(Ai ) state (Q2), then its temperature is somewhere between land-
i ∈ 1..N
mark L2 — corresponding to 0°C, the freezing point of wa-
Let Ai be short for {Aij |1 ∈ 1..M }, representing compo- ter — and landmark L3 — 100°C, its boiling point. Sim-
nent activity in ith transaction. Pr(Ai ) is a normalizing term ilarly, we can derive that ice (Q1) temperature assumes a
that is identical for all candidates. Let pl denote the prior value between the absolute zero (L1) and the melting point
probability2 that a component cl is at fault. The prior prob- (L2); and that water vapor (Q3) ranges between the conden-
ability for a candidate dk is given by sation point (L3) and positive infinity (L4).
Y Y
Pr(dk ) = pl · (1 − pl ) (2) QR also supports the representation of derivatives be-
l ∈ dk l ∈ C\dk
tween two quantities. They are usually represented with
’+’ and ’-’ signs, denoting value increases and decreases,
Pr(Ai , ei | dk ) is used to bias the prior probability taking respectively. This enables the use of sign algebra to rea-
observations into account. Let gl (referred to as component son about direct influence and proportionality between two
goodness) denote the probability that cl performs nominally qualitative values. Derivatives also enable envisionments.
 Y
 gl if ei = 0 An envisionment establishes a set of transitions between
qualitative states [15], essentially modeling the abstracted


l ∈ (dk ∩Ai )
Pr((Ai , ei ) | dk ) = Y (3) world. A possible transition in our example’s envisionment
 1 − gl otherwise

 is the following: given that we observe Q2+ — that is, we
l ∈ (dk ∩Ai )
observe that the liquid water’s temperature is rising —, then
1
Probabilities are calculated assuming conditional indepen- we know that the only possible following states are Q2 (con-
dence throughout the process. tinues in the liquid state) and Q3 (condensates into vapor),
2
In the case of software diagnosis, one can approximate pl as but never Q1 (freezes into ice).
1/1000, i.e., 1 fault for each 1000 lines of code [14]. Summarily, with a QR framework, we establish a way to
(i) represent quantities through discrete states, (ii) provide A0 t1 t2 t3 t4
a way to compare values between these states, (iii) enable c1 1 1 1 1
derivations and sign algebra, and (iv) model envisionments A t1 t2 t3 t4 c01 0 0 1 1
detailing possible transitions between states. With such a c1 1 1 1 1 c001 1 1 0 0
framework, we can model, plan, simulate and reason about c2 0 0 1 1 c2 0 0 1 1
c3 1 1 1 0 c3 1 1 1 0
a multitude of intricate problems in an abstract way.
c4 0 0 1 0 c03 1 0 0 0
e 1 1 0 0 c003 0 1 1 0
3 Approach (a) Regular spectrum. c4 0 0 1 0
Pr({c1 } | (A, e)) = 0.30 e0 1 1 0 0
This section motivates the need to augment standard
spectrum-based fault localization approaches with more Pr({c3 } | (A, e)) = 0.70 (b) QR-augmented spectrum.
Pr({c1 } | (A0 , e0 )) = 0.05
contextual information, and details our Q-SFL approach to
Pr({c001 } | (A0 , e0 )) = 0.82
achieve it by qualitatively partitioning SFL components. Pr({c3 } | (A0 , e0 )) = 0.12
Pr({c03 , c003 } | (A0 , e0 )) = 0.01
3.1 Limitations of Spectrum-based Analyses
SFL faces several issues preventing it from widespread Figure 2: Example of coverage partitioning via QR.
adoption and use. Not the least of which is the lack of con-
textual information, essential for understanding why diag- partitioning to inspect existing components during each sys-
nostic candidates are considered suspicious. This has been tem execution and assign them a qualitative state. Each of
pointed out in the software fault localization literature [4]. these qualitative states is then considered as a separate SFL
As SFL reasons about failures at the spectrum level, it only component whose involvement per transaction is recorded
has access to whether a component was involved or not and fed into the SFL framework for diagnosis. Note that
in each system transaction. While this enables a flexible, we use software diagnosis to help describe Q-SFL and, later
lightweight analysis, the necessary abstraction can impose on, to evaluate it. Despite this, the Q-SFL concept applies
a tradeoff both in accuracy and comprehension. Although to other SFL use cases where one can inspect the state of
there have been efforts to incorporate more data into the di- components, as is the case of electronic circuit diagnosis,
agnostic process — by modeling component behavior that and others.
considers the system’s state and previous diagnoses [17]; or Figure 2 depicts how QR can be employed to enhance
by leveraging prediction models trained from issue track- program spectra. Figure 2a shows a 4 transaction by 4
ing data [18] — they were focused on conditioning the fault component spectrum, along with resulting diagnostic scores
probability of existing diagnostic candidates, increasing ac- from applying reasoning-based SFL. Candidate generation,
curacy but not necessarily increasing the ability to compre- following the methodology described in Section 2.1 yields
hend the diagnostic report. two candidate diagnoses — components c1 and c3 can in-
Aside from comprehension, and because SFL only rea- dependently explain the observed failures as both cover all
sons about component involvement in failing transactions, failing test cases. For this example, suppose that c1 is the
omission errors — such as bound checks — also become faulty component. Since c1 is involved in more passing
difficult to diagnose [19]. The abstract nature of the spec- transactions, the SFL framework will assign it a lower fault
tra that is fed into current SFL frameworks also leads to probability than c3 . To improve the accuracy of the SFL
the formation of ambiguity groups and facilitates the occur- framework, one needs more contextual information about
rence of coincidental correctness. An ambiguity group is a component executions.
group of components with identical involvement in all trans- We envision three different types of landmarking strate-
actions [20]. Since they exhibit the same coverage pattern, gies that can be employed to define qualitative state bound-
no component within an ambiguity group can be uniquely aries: (i) manual landmarking, where the system’s develop-
identified as the root cause of failure, potentially hindering ers manually define what are the possible qualitative states
accuracy. Coincidental correctness refers to the event when for a given component; (ii) static landmarking, where land-
no failure is detected, even though a fault has been executed marks depend on the type of a component; and (iii) dynamic
[21]. Depending on the component granularity selected for
landmarking, where a component’s value is inspected at
the analysis, coincidental correctness can happen at a fre- runtime, and partitioned into a set of categories. Examples
quent rate. In particular, when two tests share the same cov- of dynamic strategies will be presented in Section 4.
erage path, but produce different outcomes, it becomes sig- Figure 2b depicts the QR-augmented spectrum, where
nificantly harder to distinguish them without further contex- components representing qualitative partitions of both c1
tual information. Coincidental correctness can potentially and c3 are added to the original spectrum. An example of
lead to exonerating real faults as they are observed to be- such partitioning using static landmarking: if c1 represents
have nominally. a software procedure that contains a numeric parameter i,
we can create two qualitative components c01 and c001 that
3.2 Q-SFL represent invocations of c1 with i ≥ 0 and i < 0, respec-
We argue that all of the issues described above can be at least tively. This is a sign-based static partitioning strategy. Note
attenuated if we supplement the SFL framework with more that the original components c1 and c3 are not removed from
contextual information about the system under analysis to the QR-augmented spectrum, as partitions may not provide
perform the diagnosis. Our Q-SFL approach consists of par- further fault isolation.
titioning several SFL components into multiple, meaning- If we are to diagnose the new spectrum from Figure 2b,
ful, qualitatively distinct subcomponents, to be used in the component c001 is now the top-ranked diagnostic candidate.
fault localization. We leverage the QR concept of domain This QR-augmented spectrum avoids spurious inspections
of component c3 , and provides additional contextual infor- where Cd is the cost of diagnosis, as explained in Sec-
mation about the fault, namely that i < 0 is often observed tion 2.1. A positive ∆Cd means that the faulty component
in failing transactions. has risen in the ranking reported by SFL techniques when
By landmarking data units associated with SFL compo- QR is used, yielding a lowered cost of diagnosing.
nents so that they are assigned a qualitative state at runtime,
we are providing more context to the diagnostic process, and 4.2 Results and Discussion
in some cases, consequently reducing the diagnostic effort. We were able to automatically partition the faulty method
Such partitioning is also of crucial importance towards mini- in 167 D4J subjects. The remaining D4J subjects were dis-
mizing the impact and frequency of ambiguity grouping and carded because (i) the faulty method does not contain pa-
coincidental correctness, as new, distinct components are rameters nor does it return a value; because (ii) the faulty
added to the system’s spectrum. method only contains non-primitive, non-null, complex-
typed parameters, which cannot be handled by the set of
4 Evaluation partitioning strategies described in Section 4.1; or because
(iii) the aforementioned partitioning strategies were unable
To evaluate our approach, we compare the cost of diagnos-
to create qualitative states whose coverage differs from their
ing a collection of faulty software programs using regular
enclosing method.
spectra against using QR-augmented spectra.
Our first research question is
4.1 Methodology
We have sourced experimental subjects from the Defects4J3 RQ1: Does augmenting spectra with qualitative com-
(D4J) database. D4J is a catalog of 395 real, reproducible ponents improve their diagnosability?
software bugs from 6 open-source projects — namely
JFreeChart, Google Closure compiler, Apache Commons In RQ1, we are concerned with finding if there exist. qual-
Lang, Apache Commons Math, Mockito, and Joda-Time. itative partitionings able to improve the fault localization
For each bug, a developer-written, fault-revealing test suite ranking to the extent that faulty components are inspected
is made available. earlier — thus decreasing developer wasted effort in a de-
We run the fault-revealing test suite of each buggy D4J bugging task. Hence, for each D4J subject, we choose
subject, gathering method-level coverage and test outcomes, as the landmarking strategy to consider in the evaluation
to construct its spectrum. Besides coverage, we also record the one that is able to create the largest set of distinct,
the value of all primitive-type arguments and return values non-ambiguous qualitative components out of the faulty
for every method call. This enables us to experiment with method(s). The breakdown of selected partitioning strate-
different qualitative partitioning strategies in an offline man- gies per subject is as follows: Sign partitioning: 102 (61%
ner. of subjects); X-means: 25 (15%); k-NN: 8 (5%); Linear
Using the recorded argument and return value data, we Regression: 1 (1%); Logistic Regression: 4 (2%); De-
create multiple (automated) partitioning models resulting in cision Tree: 11 (7%); Random Forest: 16 (10%). Our
several Q-SFL variants. A static partitioning variant using sign-partitioning default strategy was used to qualitatively
automated sign partitioning based on the variable’s type, as enhance the majority of considered subjects, while other
described in Section 3.2, was considered. For dynamic par- strategies such as linear classification and logistic regres-
titioning, several clustering and classification algorithms4 sion were rarely selected. We believe the reason that su-
were considered: k-NN, linear classification, logistic regres- pervised learning approaches — which were fed test case
sion, decision trees, random forest, and x-means clustering outcomes as the target class label — only exhibited superior
Test outcomes are used as the class labels in the case of performance in 40 subjects (24%) is due to the fact that the
supervised models. Note that we are not using the afore- number of failing tests in test suites is often much smaller
mentioned models for prediction, but rather as a partitioning than the amount of passing tests, weakening the resulting
scheme based on observed values. Hence, we do not break partitioning model.
our data into training and test sets, as is customary in pre- Figure 3 shows a scatter plot with the ∆Cd of all subjects
diction scenarios. Because we use automated, domain inde- under analysis. Shown in a red background are the 15 sub-
pendent partitioning, only primitive types are considered in jects (9%) with a negative ∆Cd — meaning that the report
the evaluation. has suffered a decrease in accuracy after augmenting the
To evaluate a QR-enhanced spectrum against its respec- spectra. The majority of these subjects belong to the Clo-
tive original spectrum, we first reduce the Q-SFL diagnos- sure project. The 62 subjects (37%) with ∆Cd = 0, where
tic report to method components. This reduction is done the faulty component has remained in the same position of
by considering the highest fault probability of any subcom- the ranking, are shown in a white background. Lastly, 90
ponent belonging to each method, to effectively be able to subjects (54%) that exhibited a positive ∆Cd — cases where
compare method-level diagnostic effort between the two ap- QR-enhanced spectra improved diagnosability — are shown
proaches. A change in diagnostic effort is measured using in green. All in all, Q-SFL is at least as effective as the orig-
inal approach in 92% of scenarios.
∆Cd = Cd (Original) − Cd (QR-Enhanced) (4) Table 1 presents statistics computed to assess whether the
3
Defects4J 1.1.0 is available at https://github.com/ observed metrics yield statistically significant results. QR-
rjust/defects4j (accessed June 2018). enhanced spectra exhibits an overall lower effort to diagnose
4
We chose popular classification algorithms implemented in when compared to the original spectra, with less variance.
the Scikit-learn package. X-means, as implemented in the To assess significance, we first performed the Shapiro-Wilk
pyclustering package, was selected as it can automatically test for normality of effort data in both the original spec-
decide the optimal number of clusters to use [22]. tra case and QR case. With 99% confidence, the test’s re-
103
ΔCd < 0 ΔCd ≤ 0
ΔCd ≥ 0 ΔCd > 0 140
ΔCd < 0
ΔCd = 0
102
ΔCd > 0
120

101
100

# Subjects
100 80
ΔCd

0
60
−100

40
Closure
−101 Commons Lang
Commons Math 20
JFreeChart
JodaTime
2
−10 Mockito
0
Sign X-means k-NN Linear Logistic Tree Forest
0 20 40 60 80 100 120 140 160
Subject Number Strategies

Figure 3: Difference in Cd between original and QR-enhanced spectra per sub- Figure 4: Breakdown of diagnostic performance per par-
ject. titioning strategy.

Table 1: Statistical tests. gories for every partitioning strategy considered in this eval-
uation. This bar plot tells us that no single strategy achieves
Original QR-enhanced the same number of positive ∆Cd scenarios as the partition
Spectra Spectra cardinality selection scheme employed to answer RQ1 and
Mean Cd 60.28 37.56 to produce Figure 3. Furthermore, strategies that were of-
ten picked by that criterion (namely, sign partitioning and
Median Cd 6.00 2.50 X-means strategies) also show an increased number of neg-
ative ∆Cd scenarios when compared to others. This leads us
Cd Variance 2.10×104 1.56×104 to conclude that no single strategy (out of the ones that were
analyzed) is able to consistently show improved diagnoses.
W = 0.46 W = 0.32
Shapiro-Wilk
p-value = 2.20×10−22 p-value = 1.10×10−24
RQ2? No, at least for the automated landmarking strate-
Wilcoxon Z = 5.45 gies considered in the evaluation, there is no evidence that
Signed-rank p-value = 5.10×10−10 a single automated strategy can consistently outperform the
original spectra. However, since Q-SFL can improve diag-
nosability, as per the answer to RQ1, we presume that man-
sults tell us that the distributions are not normal. Given ual or more complex, context-aware, automated white-box
that Cd is not normally distributed and that each observation strategies — which can perform static and dynamic source
is paired, we use the non-parametrical statistical hypothesis code analysis — are much more suited to outperform the
test Wilcoxon signed-rank. Our null-hypothesis is that the original spectra due to more effective and more informed
median difference between the two observations (i.e., ∆Cd ) partitioning.
is zero. The fifth row from Table 1 shows the resulting Z
statistic and p-value of Wilcoxon’s test. With 99% confi- 5 Related Work
dence, we refute the null-hypothesis.
RQ1? Yes, augmenting faulty spectra with new compo- There have been previous forays into enhancing the diag-
nents resulting from qualitative landmarking of method pa- nostic report of automated fault localization techniques to
rameter (and method return) values yields a statistically sig- either improve their accuracy or comprehension of the fail-
nificant improved diagnostic report. ing component.
To be able to answer RQ1, we have selected for each sub- SFL approaches to debugging typically present their re-
ject the strategy with the highest number of qualitative parti- port to users as a list of suspicious components that is sorted
tions targeting the faulty method, as we were only concerned according to the likelihood of being faulty. 1 have proposed
with the existence of a partitioning strategy that would im- a visual way of depicting the results of a similarity-based
prove diagnosability. However, in practice, it is not realistic software SFL diagnosis, color-coding each component ac-
to know a-priori what the faulty method is5 . Hence, our cording to their suspiciousness score [1]. 24 expand on the
second research question is visual concept by leveraging tree-based visualizations that
innately exploit the tree-like structure of Java code, natu-
rally aggregating neighboring components and aiding explo-
RQ2: Is there a particular automated landmarking ration of suspicious code regions [24]. Another approach to
strategy that consistently shows improved diagnosabil- improve the comprehension of faults was proposed by 25,
ity? called Whyline, which allows the users to obtain evidence
about the program’s execution before forming an explana-
Figure 4 shows a breakdown of the number of subjects tion of the cause by providing the ability to ask “why did”
that fall into the ∆Cd < 0, ∆Cd = 0 and ∆Cd > 0 cate- and “why didn’t” questions about program output [25].
26 have proposed an extension to SFL to improve com-
5 prehension. It uses integration coverage data, by way of
Although some effort has been put forth to hierarchically de-
bug programs using SFL [23]. capturing method invocation pairs, to guide the fault local-
ization process. By calculating the fault likelihood of com- extent that qualitative domain partitioning aids fault
ponent pairs, the authors are able to generate roadmaps for understanding.
component investigation, guiding users through likely faulty
paths and increasing the amount of contextual cues [26]. Acknowledgments
Advancements in bug prediction [27] have enabled its
use within automated fault localization processes. 28 pro- This material is based upon work supported by the scholar-
pose an ensemble approach to fault localization that exploits ship number SFRH/BD/95339/2013 from Fundação para a
information from versioning systems, bug tracking reposi- Ciência e Tecnologia (FCT).
tories and structured information retrieval from the source
code [28]. 17 rely on kernel density estimation models of References
component behavior and previous diagnoses to better esti-
mate the component goodness parameter in spectrum-based [1] James A. Jones, Mary Jean Harrold, and John T.
reasoning [17]. 18 also modify the traditional spectrum- Stasko. Visualization of test information to assist fault
based reasoning framework by leveraging a fault prediction localization. In ICSE’02, pages 467–477, 2002.
model trained with historical information from the project’s [2] Lucia, David Lo, Lingxiao Jiang, Ferdian Thung, and
versioning system and bug tracker to compute the prior Aditya Budi. Extended comprehensive study of asso-
probability distribution of diagnostic candidates [18]. ciation measures for fault localization. Journal of Soft-
Augmenting fault-localization via slicing has also been ware: Evolution and Process, 26(2):172–219, 2014.
proposed. 29 have proposed the use of dynamic backward
slices — comprised of statements that directly or indirectly [3] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van
effect the computation of the output value through data- or Gemund. Spectrum-based multiple fault localization.
control-dependency chains — as components in similarity- In ASE’09, pages 88–99, 2009.
based SFL [29]. 30 propose an approach that leverages a [4] Chris Parnin and Alessandro Orso. Are automated de-
model-based slicing-hitting-set-computation — which com- bugging techniques actually helping programmers? In
putes the dynamic slices of all faulty variables in all failed ISSTA’11, pages 199–209, 2011.
test cases, derives minimal diagnostic candidates from the
slices and computes fault probabilities for each statement [5] Aaron Ang, Alexandre Perez, Arie van Deursen, and
based on number of the diagnoses that contain it [30]. Rui Abreu. Revisiting the practical use of automated
software fault localization techniques. In IWPD’17,
pages 175–182, 2017.
6 Conclusion
[6] Carlos Gouveia, José Campos, and Rui Abreu. Using
This paper proposes a new approach to spectrum-based fault HTML5 visualizations in software fault localization.
localization that leverages qualitative reasoning (QR). The In VISSOFT’13, pages 1–10, 2013.
Q-SFL approach splits components form the software sys-
tem under analysis into a set of qualitative states through [7] Kenneth D. Forbus. Qualitative reasoning. In The
the creation of qualitative landmarks that partition a com- Computer Science and Engineering Handbook, pages
ponent’s domain. These qualitative states are then consid- 715–733. 1997.
ered as SFL components to be ranked using traditional fault- [8] Brian C. Williams and Johan de Kleer. Qualitative rea-
localization methodologies. Since we treat these qualitative soning about physical systems: A return to roots. Ar-
states as components, our diagnostic reports not only rec- tificial Intelligence, 51(1-3):1–9, 1991.
ommend likely fault locations, but also provide an insight
on what behaviors the faulty components assume when fail- [9] René Just, Darioush Jalali, and Michael D. Ernst. De-
ures are detected, facilitating the comprehension of the fault. fects4j: a database of existing faults to enable con-
We evaluate the approach on subjects from the Defects4J trolled testing studies for java programs. In ISSTA’14,
catalog of real faults from medium and large-sized open pages 437–440, 2014.
source software projects. Results show that spectra which [10] Rui Abreu and Arjan J. C. van Gemund. A low-cost
were augmented using qualitative partitioning of method pa- approximate minimal hitting set algorithm and its ap-
rameters shows a (statistically significant) improvement in plication to model-based diagnosis. In SARA’09, 2009.
the diagnostic accuracy in 54% of scenarios. However, we
[11] Alexander Feldman, Gregory M. Provan, and Arjan
also found no evidence of automated partitioning strategies
that were consistently better than the original spectra, mean- J. C. van Gemund. Computing minimal diagnoses by
ing that more intricate, context-aware partitioning strategies greedy stochastic search. In AAAI’08, pages 911–918,
will likely be necessary for practical applications of the ap- 2008.
proach. [12] Nuno Cardoso and Rui Abreu. MHS2: A map-reduce
This work lays the first stone in a series of efforts to heuristic-driven minimal hitting set search algorithm.
more deeply integrate reasoning-based AI approaches into In MUSEPAT’13, pages 25–36, 2013.
spectrum-based fault localization. It paves the way for fur- [13] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van
ther efforts by the fault localization research community, Gemund. A new bayesian approach to multiple inter-
namely by: mittent fault diagnosis. In IJCAI’09, pages 653–658,
1. Improving automated landmarking by expanding its 2009.
application to complex non-primitive objects and by [14] John Carey, Neil Gross, Marcia Stepanek, and Otis
exploring ensembles of multiple strategies.
Port. Software hell. In Business Week, pages 391–411,
2. Conducting a systematic user study investigating the 1999.
[15] Johan de Kleer. Multiple representations of knowledge [23] Alexandre Perez, Rui Abreu, and André Riboira. A
in a mechanics problem-solver. In IJCAI’77, pages dynamic code coverage approach to maximize fault lo-
299–304, 1977. calization efficiency. Journal of Systems and Software,
[16] Benjamin Kuipers. Qualitative simulation. Artificial 90:18–28, 2014.
Intelligence, 29(3):289–338, 1986. [24] José Campos, André Riboira, Alexandre Perez, and
[17] Nuno Cardoso and Rui Abreu. A kernel density Rui Abreu. GZoltar: an eclipse plug-in for testing and
estimate-based approach to component goodness mod- debugging. In ASE’12, pages 378–381, 2012.
eling. In AAAI’13, 2013. [25] Andrew Jensen Ko and Brad A. Myers. Designing the
[18] Amir Elmishali, Roni Stern, and Meir Kalech. Data- whyline: a debugging interface for asking questions
augmented software diagnosis. In AAAI’16, pages about program behavior. In CHI’04, pages 151–158,
4003–4009, 2016. 2004.
[19] Xiaofeng Xu, Vidroha Debroy, W. Eric Wong, and [26] Higor Amario de Souza and Marcos Lordello Chaim.
Donghui Guo. Ties within fault localization rankings: Adding context to fault localization with integration
Exposing and addressing the problem. International coverage. In ASE’13, pages 628–633, 2013.
Journal of Software Engineering and Knowledge En- [27] Marco D’Ambros, Michele Lanza, and Romain
gineering, 21(6):803–827, 2011. Robbes. An extensive comparison of bug prediction
[20] G. N. Stenbakken, T. M. Souders, and G. W. Stewart. approaches. In MSR’10, pages 31–41, 2010.
Ambiguity groups and testability. IEEE Transactions [28] Shaowei Wang and David Lo. Version history, simi-
on Instrumentation and Measurement, 38(5):941–947, lar report, and structure: putting them together for im-
1989. proved bug localization. In ICPC’14, pages 53–63,
[21] Debra J. Richardson and Margaret C. Thompson. An 2014.
analysis of test data selection criteria using the RE- [29] Xiaoguang Mao, Yan Lei, Ziying Dai, Yuhua Qi, and
LAY model of fault detection. IEEE Transactions on Chengsong Wang. Slice-based statistical fault local-
Software Engineering, 19(6):533–553, 1993. ization. Journal of Systems and Software, 89:51–62,
[22] Dan Pelleg and Andrew W. Moore. X-means: Extend- 2014.
ing k-means with efficient estimation of the number of [30] Birgit Hofer and Franz Wotawa. Spectrum enhanced
clusters. In ICML’00, pages 727–734, 2000. dynamic slicing for better fault localization. In
ECAI’12, pages 420–425, 2012.