Toward Interactive Spreadsheet Debugging

Dietmar Jannach, TU Dortmund, Germany (dietmar.jannach@tu-dortmund.de)
Thomas Schmitz, TU Dortmund, Germany (thomas.schmitz@tu-dortmund.de)
Kostyantyn Shchekotykhin, University Klagenfurt, Austria (kostya@ifit.uni-klu.ac.at)

ABSTRACT
Spreadsheet applications are often developed in a comparably unstructured process without rigorous quality assurance mechanisms. Faults in spreadsheets are therefore common, and finding the true causes of an unexpected calculation outcome can be tedious even for small spreadsheets. The goal of the Exquisite project is to provide spreadsheet developers with better tool support for fault identification. Exquisite is based on an algorithmic debugging approach relying on the principles of Model-Based Diagnosis and is designed as a plug-in to MS Excel. In this paper, we give an overview of the project, outline open challenges, and sketch different approaches for the interactive minimization of the set of fault candidates.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Spreadsheets; D.2.8 [Software Engineering]: Testing and Debugging

1. INTRODUCTION
Spreadsheet applications are mostly developed in an unstructured, ad-hoc process without detailed domain analysis, principled design, or in-depth testing. As a result, spreadsheets are often of limited quality and contain faults, which is particularly problematic when they are used as decision-making aids. Over the last years, researchers have proposed a number of ways of transferring principles, practices, and techniques of software engineering to the spreadsheet domain, including modeling approaches, better test support, refactoring, and techniques for problem visualization, fault localization, and repair [2, 5, 9, 13, 15].

The Exquisite project [12] continues these lines of research and proposes a constraint-based approach for algorithmic spreadsheet debugging. Technically, the main idea is to translate the spreadsheet under investigation as well as user-specified test cases into a Constraint Satisfaction Problem (CSP) and then use Model-Based Diagnosis (MBD) [14] to find the diagnosis candidates. In terms of a CSP, each candidate is a set of constraints that have to be modified to correct a failure. In our previous work, we demonstrated the general feasibility of the approach and presented details of an MS Excel plug-in, which allows the user to interactively specify test cases, run the diagnosis process, and then explore the possible candidates identified by our algorithm [12].

Using constraint reasoning and diagnosis approaches for spreadsheet debugging – partially in combination with other techniques – was also considered in [3, 4, 11]. While these techniques showed promising results in helping users to locate faults in spreadsheets, a number of challenges remain. In this paper, we address the question of how the end user can be better supported in situations when many diagnosis candidates are returned by the reasoning engine. We sketch different interactive candidate discrimination approaches in which the user is queried by the system about the correctness of individual cells' values and formulas.

2. DIAGNOSING SPREADSHEETS

Algorithmic Approach.
In [14], Reiter proposed a domain-independent and logic-based characterization of the MBD problem. A diagnosable system comprises a set of interconnected components Comps, each of which can possibly fail. A system description SD specifies how components behave when they work correctly, i.e., given some inputs, the definitions in SD and Comps determine the expected outputs. In case the expected outputs deviate from what is actually observed, the diagnosis problem consists of identifying a subset of Comps which, if assumed faulty, explains the observations.

Figure 1: A spreadsheet with an unexpected output

The main idea can be transferred to the spreadsheet domain as follows [12]. In the example shown in Figure 1, the intended formula in cell C1 should be an addition, but the developer made a typo. When testing the spreadsheet with the inputs {A1=1, A2=6} and the expected output {C1=20}, the user notices an unexpected output (36) in C1. MBD reasoning now aims to find minimal subsets of the possibly faulty components – in our case the cells with formulas – which can explain the observed discrepancy. A closer investigation of the problem reveals that only two minimal explanations exist in our example if we only allow integer values: "C1 is faulty" and "B2 is faulty". The formula in cell B1 alone cannot be the sole cause of the problem with B2 and C1 being correct, as 18 is not a factor of the expected value 20, i.e., there is no solution to the equation B1 · 18 = 20 with B1 ∈ N. Note that we assume that the constant values in the spreadsheet are correct. However, our approach can easily be extended to deal with erroneous constants.

In [12], we describe a plug-in component for MS Excel which relies on an enhanced and parallelized version of this diagnostic procedure, additional dependency-based search space pruning, and a technique for fast conflict detection.

We have evaluated our approach in two ways. (A) We analyzed the required running times using a number of spreadsheets in which we injected faults (mutations). The results showed that our method can find the diagnoses for many small- and mid-sized spreadsheets containing about 100 formulas within a few seconds. (B) We conducted a user study in the form of an error detection exercise involving 24 subjects. The results showed that participants who used the Exquisite tool were both more effective and more efficient than those who only relied on MS Excel's standard fault localization mechanisms. A post-experiment questionnaire indicated that both groups would appreciate better tool support for fault localization in commercial tools [12].

The Problem of Discriminating Between Diagnoses.
While we consider our results so far to be promising, an open issue is that the set of diagnosis candidates can be large. Finding the true cause of the error within a larger set of possible explanations can be tedious and can ultimately make the approach impractical when a user has to inspect too many alternatives. Since this is a general problem of MBD, the question of how to help the user discriminate between the candidates and focus on the most probable ones was the focus of several works from the early days of MBD research [7]. In the next section, we propose two possible remedies for this problem in the spreadsheet domain.
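The diagnosis step described in Section 2 can be illustrated with a small, self-contained sketch. The formulas below (B1 = A1 + 1, B2 = 3 * A2, and the buggy C1 = B1 * B2 instead of the intended B1 + B2) are assumptions chosen only to reproduce the values of the Figure 1 example, and a brute-force search over a small integer domain stands in for a real constraint solver:

```python
from itertools import combinations, product

# Hypothetical encoding of the Figure 1 example. Only B1 = 2, B2 = 18,
# C1 = B1 * B2 (the typo; intended: B1 + B2) and the test case are fixed
# by the text; the input formulas for B1 and B2 are invented to match.
formulas = {
    "B1": lambda v: v["A1"] + 1,        # assumed; yields 2 for A1 = 1
    "B2": lambda v: 3 * v["A2"],        # assumed; yields 18 for A2 = 6
    "C1": lambda v: v["B1"] * v["B2"],  # the typo: should be B1 + B2
}
order = ["B1", "B2", "C1"]              # topological evaluation order
test = {"A1": 1, "A2": 6}               # test-case inputs
expected = {"C1": 20}                   # expected output

def consistent(faulty, domain=range(50)):
    """Can the expected output be reached if the formulas of the cells in
    'faulty' are relaxed to free integer variables over 'domain'?"""
    for values in product(domain, repeat=len(faulty)):
        v = dict(test)
        v.update(zip(faulty, values))
        for cell in order:
            if cell not in faulty:
                v[cell] = formulas[cell](v)
        if all(v[c] == e for c, e in expected.items()):
            return True
    return False

# Collect minimal diagnoses: smallest sets of formula cells that, if
# assumed faulty, explain the deviation from the expected output.
diagnoses = []
for size in range(1, len(order) + 1):
    for cand in combinations(order, size):
        if not any(set(d) <= set(cand) for d in diagnoses) and consistent(cand):
            diagnoses.append(cand)
    if diagnoses:
        break  # this sketch stops at the smallest cardinality

print(diagnoses)  # -> [('B2',), ('C1',)]
```

Running the sketch reports exactly the two minimal explanations discussed above, {B2} and {C1}, while {B1} is ruled out because B1 · 18 = 20 has no integer solution.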
3. TOWARD INTERACTIVE DEBUGGING
Early works in MBD research were dealing with fault diagnosis of electrical circuits. In this domain, an engineer can make additional measurements, e.g., of the voltage at certain nodes. These measurements can then help to reduce the set of candidates because some observations may rule out certain explanations for the observed behavior.

Each measurement, however, induces additional costs or effort from the user. One goal of past research was thus to automatically determine "good" measurement points, i.e., those which help to narrow down the candidate space fast and thus minimize the number of required measurements. In [7], for example, an approach based on information theory was proposed in which the possible measurement points were ranked according to the expected information gain.

3.1 Example
Figure 2 exemplifies how additional measurements (inputs by the user) can help us to find the cause of a fault. The example is based on the one from Figure 1. The user has corrected the formula in C1 and added a formula in D1 that should multiply the value of C1 by 10. Again, a typo was made, and the observed result in D1 is 30 instead of the expected value of 200 for the input values {A1=1, A2=6}. Given this unluckily chosen test case, MBD returns four possible single-element candidates {B1}, {B2}, {C1}, and {D1}, i.e., every formula could be the cause. To narrow down this set, we could query the user about the correctness of the intermediate results in the cells B1, B2, and C1.

Figure 2: Another faulty spreadsheet

If we ask the user for the correct value of B1, the user will answer "2". Based on that information, B1 can be ruled out as a diagnosis, and the problem must be somewhere else. However, if we had asked the user for the correct value of C1, the user would have answered "20", and we could immediately infer that {D1} remains as the only possible diagnosis.

The question arises how we can automatically determine which questions we should ask the user. Next, we sketch two possible strategies for interactive querying.

3.2 Querying for Cell Values
The first approach (Algorithm 1) is based on interactively asking the user about the correct values of intermediate cells, as done in the example.¹

¹ Actually, a user-provided range restriction for C1 (15 ≤ C1 ≤ 25) would have been sufficient in the example.

Algorithm 1: Querying cell values
  Input: A faulty spreadsheet P, a test case T
  S = Diagnoses for P given T
  while |S| > 1 do
    foreach intermediate cell c ∈ P not asked so far do
      val = computed value of c given T
      count(c) = 0
      foreach Diagnosis d ∈ S do
        CSP′ = CSP of P given T \ Constraints(d)
        if CSP′ ∪ {c = val} has a solution then
          inc(count(c))
    Query user for expected value v of the cell c where count(c) is minimal
    T = T ∪ {c = v}
    S = Diagnoses for P given T

The goal of the algorithm is to minimize the number of required interactions. Therefore, as long as there is more than one diagnosis, we determine which question would help us most to reduce the set of remaining diagnoses. To do so, we check for each possible question (intermediate cell c) how many diagnoses would remain if we knew that the cell value val is correct given the test case T. Since every diagnosis candidate d corresponds to a relaxed version CSP′ of the original CSP, where the latter is a translation of the spreadsheet P and the test case T, we check if CSP′ together with the assignment {c = val} has a solution. Next, we ask the user for the correct value of the cell for which the smallest number of remaining diagnoses was observed. The user-provided value is then added to the set of values known to be correct for T, and the process is repeated.

To test the approach, we used a number of spreadsheets containing faults from [12], measured how many interactions it requires to isolate the single correct diagnosis using Algorithm 1, and compared it to a random measurement strategy.

  Scenario         #C   #F    #D   I_Rand   Min
  Sales forecast   143    1    89    46.7    11
  Hospital form     38    1    15     7.6     5
  Revenue calc.    110    3   200    11.8     9
  Example Fig. 3   170    1    85    50.3    12

  Table 1: Results for querying cell values.

The results given in Table 1 show that for the tested examples the number of required interactions can be measurably lowered compared to a random strategy. The sales forecast spreadsheet, for example, comprises 143 formulas (#C) and contains 1 fault (#F). Using our approach, only 11 cells (Min) have to be inspected by the user to find the diagnosis explaining a fault within 89 diagnosis candidates (#D). Repeated runs of the randomized strategy led to more than 45 interactions (I_Rand) on average.
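The cell-value query selection of Algorithm 1 can be sketched as follows for the Figure 2 example. The concrete formulas, including the assumed typo D1 = C1 + 10 instead of the intended C1 * 10, are illustrative guesses that reproduce the observed value 30; a real implementation would call a constraint solver rather than the brute-force consistency check used here:

```python
from itertools import product

# Hypothetical encoding of the Figure 2 example: C1 was corrected to an
# addition, and D1 should be C1 * 10, but an assumed typo made it C1 + 10,
# which reproduces the observed value 30 (expected: 200).
formulas = {
    "B1": lambda v: v["A1"] + 1,
    "B2": lambda v: 3 * v["A2"],
    "C1": lambda v: v["B1"] + v["B2"],
    "D1": lambda v: v["C1"] + 10,      # typo: should be C1 * 10
}
order = ["B1", "B2", "C1", "D1"]
test = {"A1": 1, "A2": 6}
expected = {"D1": 200}
diagnoses = [{"B1"}, {"B2"}, {"C1"}, {"D1"}]  # the four single-fault candidates

def computed():
    """Evaluate the (buggy) spreadsheet on the test inputs."""
    v = dict(test)
    for cell in order:
        v[cell] = formulas[cell](v)
    return v

def consistent(faulty, extra, domain=range(201)):
    """Does relaxing the formulas of 'faulty' admit a solution satisfying
    the test case, the expected output, and the 'extra' assignments
    (a tentatively confirmed cell value)?"""
    free = sorted(faulty)
    for values in product(domain, repeat=len(free)):
        v = dict(test)
        v.update(zip(free, values))
        for cell in order:
            if cell not in faulty:
                v[cell] = formulas[cell](v)
        if (all(v[c] == x for c, x in expected.items())
                and all(v[c] == x for c, x in extra.items())):
            return True
    return False

# Query selection of Algorithm 1: for each intermediate cell, count how
# many diagnoses would survive if its computed value were confirmed, and
# ask the user about the cell with the smallest count.
vals = computed()
counts = {c: sum(consistent(d, {c: vals[c]}) for d in diagnoses)
          for c in ["B1", "B2", "C1"]}
best = min(counts, key=counts.get)
print(counts, "-> ask about", best)
```

Confirming the computed value of C1 (20) is consistent with only one candidate, {D1}, so C1 is the best cell to ask about — matching the observation in Section 3.1 that this single answer isolates the faulty formula.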
As our preliminary evaluation shows, the developed heuristic decreases the number of user interactions required to find the correct diagnosis. In our future work, we will also consider other heuristics. Note, however, that there are also problem settings in which all possible queries lead to the same reduction of the candidate space. Nonetheless, as the approach appears to be helpful at least in some cases, we plan to explore the following extensions in the future.

- Instead of asking for expected cell values, we can ask for the correctness of individual calculated values or for range specifications. This requires less effort by the user but does not guarantee that one single candidate can be isolated.

- Additional test cases can help to rule out some candidates. Thus, we plan to explore techniques for automated test-case generation. As spreadsheets often consist of multiple blocks of calculations which have only few links to other parts of the program, one technique could be to generate test cases for such smaller fragments, which are easier for the user to validate.

3.3 Querying for Formula Correctness
Calculating expected values for intermediate cells can be difficult for users, as they also have to consider the cells preceding the one under investigation. Thus, we propose additional strategies in which we ask for the correctness of individual formulas. Answering such queries can in the best case be done by inspecting only one particular formula.

1. We can query the user about the elements of the most probable diagnoses, as done in [16], e.g., by limiting the search depth and by estimating fault probabilities.

2. In case of multiple-fault diagnoses, we can ask the user to inspect those formulas first that appear in the most diagnoses. If one cell appears in all diagnoses, it must definitely contain an error.

3. After having queried the user about the correctness of one particular formula, we can search for copy-equivalent formulas and ask the user to confirm the correctness of these formulas.

The rationale of this last strategy, which we will now discuss in more detail, is that in many real-world spreadsheets, structurally identical formulas exist in neighboring cells, which, for example, perform a row-wise aggregation of cell values. Such repetitive structures are one of the major reasons that the number of diagnosis candidates grows quickly. Thus, when the user has inspected one formula, we can ask him if the given answer also applies to all copy-equivalent formulas, which we can automatically detect.

Figure 3: A small extract of a faulty spreadsheet with structurally identical formulas

In the example spreadsheet shown in Figure 3, the user made a mistake in cell M13, entering a minus instead of the intended plus symbol. A basic MBD method would, in the worst case and depending on the test cases, return every single formula as an equally ranked diagnosis candidate. When applying the value-based query strategy of Section 3.2, the user would be asked to give feedback on the values of M1 to M12, which, however, requires a lot of manual calculations.

With the techniques proposed in this section, the formulas of the spreadsheet would first be ranked based on their fault probability. Let us assume that our heuristics say that users most probably make mistakes when writing IF-statements. In addition, the formula in M13 is syntactically more complex than those in M1 to M12 and thus more probably faulty. Based on this ranking, we would, for example, ask the user to inspect the formula of G1 first. Given the feedback that the formula is correct, we can ask the user to check the copy-equivalent formulas of G1 to L12. This task can be done very easily by the user by navigating through these cells and checking whether the formulas properly reflect the intended semantics, i.e., that the formulas were copied correctly. After that, according to the heuristic, the user is asked to inspect the formula in M13, which is actually the faulty one. In this example, we thus only needed 3 user interactions to find the cause of the error. In a random strategy, the user would have to inspect up to half of the formulas on average, depending on the test case. The evaluation of the described techniques is part of our ongoing work.

3.4 User Acceptance Issues
Independent of the chosen strategy, user studies have to be performed to assess which kinds of user input one can realistically expect, e.g., for which problem scenarios the user should be able to provide expected (ranges of) values for intermediate cells. In addition, the spreadsheet inspection exercise conducted in [12] indicates that users follow different fault localization strategies: some users, for example, start from the inputs, whereas others begin at the "result cells". Any interactive querying strategy should therefore be carefully designed and assessed with real users. As part of our future work, we furthermore plan to develop heuristics to select one of several possible debugging techniques depending on the user's problem identification strategy.
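A minimal sketch of the copy-equivalence detection used by the third strategy of Section 3.3: two formulas are treated as copy-equivalent if they become identical after rewriting each cell reference as a row/column offset from the formula's own cell (the idea behind the R1C1 notation). The formulas below are hypothetical stand-ins for the Figure 3 spreadsheet, and the regular expression is a simplification; a production version would parse formulas properly and also handle absolute references such as $A$1:

```python
import re

# Hypothetical copy-equivalence check: rewrite every A1-style reference as
# an offset from the formula's host cell and compare the results. The
# formulas mimic the Figure 3 situation, where M13 should be a copy of
# M1..M12 but contains a minus instead of a plus.
REF = re.compile(r"([A-Z]+)([0-9]+)")

def col_num(col):
    """Convert a column label (A, B, ..., Z, AA, ...) to a number."""
    n = 0
    for ch in col:
        n = n * 26 + ord(ch) - ord("A") + 1
    return n

def normalize(formula, host):
    """Rewrite each reference in 'formula' relative to the cell 'host'."""
    m = REF.match(host)
    host_col, host_row = col_num(m.group(1)), int(m.group(2))
    def rel(ref):
        return "R[%d]C[%d]" % (int(ref.group(2)) - host_row,
                               col_num(ref.group(1)) - host_col)
    return REF.sub(rel, formula)

def copy_equivalent(f1, cell1, f2, cell2):
    return normalize(f1, cell1) == normalize(f2, cell2)

print(copy_equivalent("K1+L1", "M1", "K12+L12", "M12"))  # -> True
print(copy_equivalent("K1+L1", "M1", "K13-L13", "M13"))  # -> False
```

Under this normalization, K1+L1 in M1 and K12+L12 in M12 both become R[0]C[-2]+R[0]C[-1] and are grouped together, while the faulty subtraction in M13 falls outside the group and remains an individual inspection target.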
4. ADDITIONAL CHALLENGES
Other open issues in the context of MBD-based debugging that we plan to investigate in future work include the following aspects.

Probability-Based Ranking.
Another approach to discriminate between diagnoses is to try to rank the sometimes numerous candidates in a way that those considered the most probable are listed first. Typically, one would for example list diagnosis candidates of smaller cardinality first, assuming that single faults are more probable than double faults. In addition, we can use fault statistics from the literature for different types of faults to estimate the probability of each diagnosis. In the spreadsheet domain, we could also rely on indicators like formula complexity or other spreadsheet smells [10], the location of the cell within the spreadsheet's overall structure, results from Spectrum-Based Fault Localization [11], or the number of recent changes made to a formula. User studies in the form of spreadsheet construction exercises, as done in [6], can help to identify or validate such heuristics.

Problem Encoding and Running Times.
For larger problem instances, the required running times for the diagnosis can exceed what is acceptable for interactive debugging. Faster commercial constraint solvers can alleviate this problem to some extent. However, automated problem decomposition and dependency analysis methods also represent a powerful means, to be explored further, for reducing the search complexity.

Another open issue is that in works relying on a CSP encoding of the spreadsheets, e.g., [3, 11] and our work, the calculations are limited to integers, which is caused by the limited floating-point support of free constraint solvers. More research is required in this area, including the incorporation of alternative reasoning approaches like, e.g., linear optimization.

User Interface Design.
Finally, as spreadsheet developers are usually not programmers, the user interface (UI) design plays a central role, and suitable UI metaphors and a corresponding non-technical terminology have to be developed. In Exquisite, we tried to leave the user as much as possible within the known MS Excel environment. Certain concepts like "test cases" are, however, not present in modern spreadsheet tools and require some learning effort from the developer. The recent work of [8] indicates that users are willing to spend some extra effort, e.g., in test case specification, to end up with more fault-free spreadsheets.

How the interaction mechanisms should actually be designed to be usable at least by experienced users is largely open in our view. In previous spreadsheet testing and debugging approaches like [1] or [2], for example, additional input was required from the user. In-depth studies about the usability of these extensions to standard spreadsheet environments are quite rare.
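The probability-based ranking discussed above can be sketched minimally as follows, assuming each cell has been assigned a prior fault probability (all numbers below are invented for illustration): a diagnosis is scored by the product of the priors of its cells and the complements of the priors of all remaining cells, which automatically prefers single faults over double faults when priors are small:

```python
# Hypothetical probability-based ranking. Each cell gets an assumed prior
# fault probability; a diagnosis is scored by the product of the priors of
# its cells and the complements (1 - prior) of all remaining cells.
priors = {"B1": 0.05, "B2": 0.05, "C1": 0.2, "D1": 0.1}  # invented numbers

def score(diagnosis):
    p = 1.0
    for cell, prior in priors.items():
        p *= prior if cell in diagnosis else (1.0 - prior)
    return p

# Rank some example diagnosis candidates, most probable first. The
# double-fault candidate ends up last, as its score multiplies two priors.
candidates = [{"B2"}, {"B1", "D1"}, {"C1"}]
ranked = sorted(candidates, key=score, reverse=True)
print(ranked)  # {'C1'} first, the double fault {'B1', 'D1'} last
```

In practice, the per-cell priors would come from the fault statistics and indicators listed above (formula complexity, smells, change history) rather than from fixed constants.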
5. SUMMARY
In this paper, we have discussed perspectives of constraint- and model-based approaches for algorithmic spreadsheet debugging. Based on our insights obtained so far from the Exquisite project, we have identified a number of open challenges in the domain and outlined approaches for interactive spreadsheet debugging.

Acknowledgements
This work was supported by the EU through the programme "Europäischer Fonds für regionale Entwicklung - Investition in unsere Zukunft" under contract number 300251802.

6. REFERENCES
[1] R. Abraham and M. Erwig. AutoTest: A Tool for Automatic Test Case Generation in Spreadsheets. In Proc. VL/HCC 2006, pages 43–50, 2006.
[2] R. Abraham and M. Erwig. GoalDebug: A Spreadsheet Debugger for End Users. In Proc. ICSE 2007, pages 251–260, 2007.
[3] R. Abreu, A. Riboira, and F. Wotawa. Constraint-based Debugging of Spreadsheets. In Proc. CIbSE 2012, pages 1–14, 2012.
[4] S. Außerlechner, S. Fruhmann, W. Wieser, B. Hofer, R. Spörk, C. Mühlbacher, and F. Wotawa. The Right Choice Matters! SMT Solving Substantially Improves Model-Based Debugging of Spreadsheets. In Proc. QSIC 2013, pages 139–148, 2013.
[5] S. Badame and D. Dig. Refactoring meets Spreadsheet Formulas. In Proc. ICSM 2012, pages 399–409, 2012.
[6] P. S. Brown and J. D. Gould. An Experimental Study of People Creating Spreadsheets. ACM TOIS, 5(3):258–272, 1987.
[7] J. de Kleer and B. C. Williams. Diagnosing Multiple Faults. Artificial Intelligence, 32(1):97–130, 1987.
[8] F. Hermans. Improving Spreadsheet Test Practices. In Proc. CASCON 2013, pages 56–69, 2013.
[9] F. Hermans, M. Pinzger, and A. van Deursen. Supporting Professional Spreadsheet Users by Generating Leveled Dataflow Diagrams. In Proc. ICSE 2011, pages 451–460, 2011.
[10] F. Hermans, M. Pinzger, and A. van Deursen. Detecting Code Smells in Spreadsheet Formulas. In Proc. ICSM 2012, pages 409–418, 2012.
[11] B. Hofer, A. Riboira, F. Wotawa, R. Abreu, and E. Getzner. On the Empirical Evaluation of Fault Localization Techniques for Spreadsheets. In Proc. FASE 2013, pages 68–82, 2013.
[12] D. Jannach and T. Schmitz. Model-Based Diagnosis of Spreadsheet Programs – A Constraint-Based Debugging Approach. Automated Software Engineering, to appear, 2014.
[13] D. Jannach, T. Schmitz, B. Hofer, and F. Wotawa. Avoiding, Finding and Fixing Spreadsheet Errors – A Survey of Automated Approaches for Spreadsheet QA. Journal of Systems and Software, to appear, 2014.
[14] R. Reiter. A Theory of Diagnosis from First Principles. Artificial Intelligence, 32(1):57–95, 1987.
[15] G. Rothermel, L. Li, C. DuPuis, and M. Burnett. What You See Is What You Test: A Methodology for Testing Form-Based Visual Programs. In Proc. ICSE 1998, pages 198–207, 1998.
[16] K. Shchekotykhin, G. Friedrich, P. Fleiss, and P. Rodler. Interactive Ontology Debugging: Two Query Strategies for Efficient Fault Localization. Journal of Web Semantics, 12–13:88–103, 2012.