Toward Interactive Spreadsheet Debugging

Dietmar Jannach, TU Dortmund, Germany (dietmar.jannach@tu-dortmund.de)
Thomas Schmitz, TU Dortmund, Germany (thomas.schmitz@tu-dortmund.de)
Kostyantyn Shchekotykhin, University Klagenfurt, Austria (kostya@ifit.uni-klu.ac.at)

ABSTRACT
Spreadsheet applications are often developed in a comparably unstructured process without rigorous quality assurance mechanisms. Faults in spreadsheets are therefore common, and finding the true causes of an unexpected calculation outcome can be tedious even for small spreadsheets. The goal of the Exquisite project is to provide spreadsheet developers with better tool support for fault identification. Exquisite is based on an algorithmic debugging approach relying on the principles of Model-Based Diagnosis and is designed as a plug-in to MS Excel. In this paper, we give an overview of the project, outline open challenges, and sketch different approaches for the interactive minimization of the set of fault candidates.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Spreadsheets; D.2.8 [Software Engineering]: Testing and Debugging

1. INTRODUCTION
Spreadsheet applications are mostly developed in an unstructured, ad-hoc process without detailed domain analysis, principled design, or in-depth testing. As a result, spreadsheets are often of limited quality and contain faults, which is particularly problematic when they are used as decision-making aids. Over the last years, researchers have proposed a number of ways of transferring principles, practices, and techniques of software engineering to the spreadsheet domain, including modeling approaches, better test support, refactoring, and techniques for problem visualization, fault localization, and repair [2, 5, 9, 13, 15].

The Exquisite project [12] continues these lines of research and proposes a constraint-based approach for algorithmic spreadsheet debugging. Technically, the main idea is to translate the spreadsheet under investigation as well as user-specified test cases into a Constraint Satisfaction Problem (CSP) and then use Model-Based Diagnosis (MBD) [14] to find the diagnosis candidates. In terms of a CSP, each candidate is a set of constraints that have to be modified to correct a failure. In our previous work, we demonstrated the general feasibility of the approach and presented details of an MS Excel plug-in, which allows the user to interactively specify test cases, run the diagnosis process, and then explore the possible candidates identified by our algorithm [12].

Using constraint reasoning and diagnosis approaches for spreadsheet debugging – partially in combination with other techniques – was also considered in [3, 4, 11]. While these techniques showed promising results in helping users to locate faults in spreadsheets, a number of challenges remain. In this paper, we address the question of how the end user can be better supported in situations when many diagnosis candidates are returned by the reasoning engine. We sketch different interactive candidate discrimination approaches in which the user is queried by the system about the correctness of individual cells' values and formulas.

2. DIAGNOSING SPREADSHEETS

Algorithmic Approach.
In [14], Reiter proposed a domain-independent and logic-based characterization of the MBD problem. A diagnosable system comprises a set of interconnected components Comps, each of which can possibly fail. A system description SD specifies how components behave when they work correctly, i.e., given some inputs, the definitions in SD and Comps determine the expected outputs. In case the expected outputs deviate from what is actually observed, the diagnosis problem consists of identifying a subset of Comps which, if assumed faulty, explains the observations.

Figure 1: A spreadsheet with an unexpected output

The main idea can be transferred to the spreadsheet domain as follows [12]. In the example shown in Figure 1, the intended formula in cell C1 should be an addition, but the developer made a typo. When testing the spreadsheet with the inputs {A1=1, A2=6} and the expected output {C1=20}, the user notices an unexpected output (36) in C1. MBD reasoning now aims to find minimal subsets of the possibly faulty components – in our case the cells with formulas – which can explain the observed discrepancy. A closer investigation of the problem reveals that only two minimal explanations exist in our example if we only allow integer values: "C1 is faulty" and "B2 is faulty". The formula in cell B1 alone cannot be the sole cause of the problem with B2 and C1 being correct, as 18 is not a factor of the expected value 20, i.e., there is no solution to the equation B1 · 18 = 20 with B1 ∈ N. Note that we assume that the constant values in the spreadsheet are correct. However, our approach can easily be extended to deal with erroneous constants.

In [12], we describe a plug-in component for MS Excel which relies on an enhanced and parallelized version of this diagnostic procedure, additional dependency-based search space pruning, and a technique for fast conflict detection.

We have evaluated our approach in two ways. (A) We analyzed the required running times using a number of spreadsheets in which we injected faults (mutations). The results showed that our method can find the diagnoses for many small- and mid-sized spreadsheets containing about 100 formulas within a few seconds. (B) We conducted a user study in the form of an error detection exercise involving 24 subjects. The results showed that participants who used the Exquisite tool were both more effective and more efficient than those who only relied on MS Excel's standard fault localization mechanisms. A post-experiment questionnaire indicated that both groups would appreciate better tool support for fault localization in commercial tools [12].

The Problem of Discriminating Between Diagnoses.
While we consider our results so far to be promising, an open issue is that the set of diagnosis candidates can be large. Finding the true cause of the error within a larger set of possible explanations can be tedious and can ultimately make the approach impractical when a user has to inspect too many alternatives. Since this is a general problem of MBD, the question of how to help the user discriminate between the candidates and focus on the most probable ones was the focus of several works from the early days of MBD research [7]. In the next section, we propose two possible remedies for this problem in the spreadsheet domain.
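The diagnosis step described in Section 2 can be illustrated with a small, self-contained sketch. The formulas below (B1 = A1 + 1, B2 = 3 * A2, and the buggy C1 = B1 * B2 instead of the intended B1 + B2) are assumptions chosen only to reproduce the values of the Figure 1 example, and a brute-force search over a small integer domain stands in for a real constraint solver:

```python
from itertools import combinations, product

# Hypothetical encoding of the Figure 1 example. Only B1 = 2, B2 = 18,
# C1 = B1 * B2 (the typo; intended: B1 + B2) and the test case are fixed
# by the text; the input formulas for B1 and B2 are invented to match.
formulas = {
    "B1": lambda v: v["A1"] + 1,        # assumed; yields 2 for A1 = 1
    "B2": lambda v: 3 * v["A2"],        # assumed; yields 18 for A2 = 6
    "C1": lambda v: v["B1"] * v["B2"],  # the typo: should be B1 + B2
}
order = ["B1", "B2", "C1"]              # topological evaluation order
test = {"A1": 1, "A2": 6}               # test-case inputs
expected = {"C1": 20}                   # expected output

def consistent(faulty, domain=range(50)):
    """Can the expected output be reached if the formulas of the cells in
    'faulty' are relaxed to free integer variables over 'domain'?"""
    for values in product(domain, repeat=len(faulty)):
        v = dict(test)
        v.update(zip(faulty, values))
        for cell in order:
            if cell not in faulty:
                v[cell] = formulas[cell](v)
        if all(v[c] == e for c, e in expected.items()):
            return True
    return False

# Collect minimal diagnoses: smallest sets of formula cells that, if
# assumed faulty, explain the deviation from the expected output.
diagnoses = []
for size in range(1, len(order) + 1):
    for cand in combinations(order, size):
        if not any(set(d) <= set(cand) for d in diagnoses) and consistent(cand):
            diagnoses.append(cand)
    if diagnoses:
        break  # this sketch stops at the smallest cardinality

print(diagnoses)  # -> [('B2',), ('C1',)]
```

Running the sketch reports exactly the two minimal explanations discussed above, {B2} and {C1}, while {B1} is ruled out because B1 · 18 = 20 has no integer solution.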
3. TOWARD INTERACTIVE DEBUGGING
Early works in MBD research were dealing with fault diagnosis of electrical circuits. In this domain, an engineer can make additional measurements, e.g., of the voltage at certain nodes. These measurements can then help to reduce the set of candidates because some observations may rule out certain explanations for the observed behavior.

Each measurement, however, induces additional costs or effort from the user. One goal of past research was thus to automatically determine "good" measurement points, i.e., those which help to narrow down the candidate space fast and thus minimize the number of required measurements. In [7], for example, an approach based on information theory was proposed in which the possible measurement points were ranked according to the expected information gain.

3.1 Example
Figure 2 exemplifies how additional measurements (inputs by the user) can help us to find the cause of a fault. The example is based on the one from Figure 1. The user has corrected the formula in C1 and added a formula in D1 that should multiply the value of C1 by 10. Again, a typo was made, and the observed result in D1 is 30 instead of the expected value of 200 for the input values {A1=1, A2=6}. Given this unluckily chosen test case, MBD returns four possible single-element candidates {B1}, {B2}, {C1}, and {D1}, i.e., every formula could be the cause. To narrow down this set, we could query the user about the correctness of the intermediate results in the cells B1, B2, and C1.

Figure 2: Another faulty spreadsheet

If we ask the user for the correct value of B1, the user will answer "2". Based on that information, B1 can be ruled out as a diagnosis, and the problem must be somewhere else. However, if we had asked the user for the correct value of C1, the user would have answered "20", and we could immediately infer that {D1} remains as the only possible diagnosis.

The question arises how we can automatically determine which questions we should ask the user. Next, we sketch two possible strategies for interactive querying.

3.2 Querying for Cell Values
The first approach (Algorithm 1) is based on interactively asking the user about the correct values of intermediate cells, as done in the example.¹

¹ Actually, a user-provided range restriction for C1 (15 ≤ C1 ≤ 25) would have been sufficient in the example.

Algorithm 1: Querying cell values
  Input: A faulty spreadsheet P, a test case T
  S = Diagnoses for P given T
  while |S| > 1 do
    foreach intermediate cell c ∈ P not asked so far do
      val = computed value of c given T
      count(c) = 0
      foreach Diagnosis d ∈ S do
        CSP′ = CSP of P given T \ Constraints(d)
        if CSP′ ∪ {c = val} has a solution then
          inc(count(c))
    Query user for expected value v of the cell c where count(c) is minimal
    T = T ∪ {c = v}
    S = Diagnoses for P given T

The goal of the algorithm is to minimize the number of required interactions. Therefore, as long as there is more than one diagnosis, we determine which question would help us most to reduce the set of remaining diagnoses. To do so, we check for each possible question (intermediate cell c) how many diagnoses would remain if we knew that the cell value val is correct given the test case T. Since every diagnosis candidate d corresponds to a relaxed version CSP′ of the original CSP, where the latter is a translation of the spreadsheet P and the test case T, we check if CSP′ together with the assignment {c = val} has a solution. Next, we ask the user for the correct value of the cell for which the smallest number of remaining diagnoses was observed. The user-provided value is then added to the set of values known to be correct for T, and the process is repeated.

To test the approach, we used a number of spreadsheets containing faults from [12], measured how many interactions it requires to isolate the single correct diagnosis using Algorithm 1, and compared it to a random measurement strategy.

  Scenario         #C   #F    #D   I_Rand   Min
  Sales forecast   143    1    89    46.7    11
  Hospital form     38    1    15     7.6     5
  Revenue calc.    110    3   200    11.8     9
  Example Fig. 3   170    1    85    50.3    12

  Table 1: Results for querying cell values.

The results given in Table 1 show that for the tested examples the number of required interactions can be measurably lowered compared to a random strategy. The sales forecast spreadsheet, for example, comprises 143 formulas (#C) and contains 1 fault (#F). Using our approach, only 11 cells (Min) have to be inspected by the user to find the diagnosis explaining a fault within 89 diagnosis candidates (#D). Repeated runs of the randomized strategy led to more than 45 interactions (I_Rand) on average.
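The cell-value query selection of Algorithm 1 can be sketched as follows for the Figure 2 example. The concrete formulas, including the assumed typo D1 = C1 + 10 instead of the intended C1 * 10, are illustrative guesses that reproduce the observed value 30; a real implementation would call a constraint solver rather than the brute-force consistency check used here:

```python
from itertools import product

# Hypothetical encoding of the Figure 2 example: C1 was corrected to an
# addition, and D1 should be C1 * 10, but an assumed typo made it C1 + 10,
# which reproduces the observed value 30 (expected: 200).
formulas = {
    "B1": lambda v: v["A1"] + 1,
    "B2": lambda v: 3 * v["A2"],
    "C1": lambda v: v["B1"] + v["B2"],
    "D1": lambda v: v["C1"] + 10,      # typo: should be C1 * 10
}
order = ["B1", "B2", "C1", "D1"]
test = {"A1": 1, "A2": 6}
expected = {"D1": 200}
diagnoses = [{"B1"}, {"B2"}, {"C1"}, {"D1"}]  # the four single-fault candidates

def computed():
    """Evaluate the (buggy) spreadsheet on the test inputs."""
    v = dict(test)
    for cell in order:
        v[cell] = formulas[cell](v)
    return v

def consistent(faulty, extra, domain=range(201)):
    """Does relaxing the formulas of 'faulty' admit a solution satisfying
    the test case, the expected output, and the 'extra' assignments
    (a tentatively confirmed cell value)?"""
    free = sorted(faulty)
    for values in product(domain, repeat=len(free)):
        v = dict(test)
        v.update(zip(free, values))
        for cell in order:
            if cell not in faulty:
                v[cell] = formulas[cell](v)
        if (all(v[c] == x for c, x in expected.items())
                and all(v[c] == x for c, x in extra.items())):
            return True
    return False

# Query selection of Algorithm 1: for each intermediate cell, count how
# many diagnoses would survive if its computed value were confirmed, and
# ask the user about the cell with the smallest count.
vals = computed()
counts = {c: sum(consistent(d, {c: vals[c]}) for d in diagnoses)
          for c in ["B1", "B2", "C1"]}
best = min(counts, key=counts.get)
print(counts, "-> ask about", best)
```

Confirming the computed value of C1 (20) is consistent with only one candidate, {D1}, so C1 is the best cell to ask about — matching the observation in Section 3.1 that this single answer isolates the faulty formula.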
As our preliminary evaluation shows, the developed heuristic decreases the number of user interactions required to find the correct diagnosis. In our future work, we will also consider other heuristics. Note, however, that there are also problem settings in which all possible queries lead to the same reduction of the candidate space. Nonetheless, as the approach appears to be helpful at least in some cases, we plan to explore the following extensions in the future.

- Instead of asking for expected cell values, we can ask for the correctness of individual calculated values or for range specifications. This requires less effort by the user but does not guarantee that one single candidate can be isolated.

- Additional test cases can help to rule out some candidates. Thus, we plan to explore techniques for automated test-case generation. As spreadsheets often consist of multiple blocks of calculations which have only few links to other parts of the program, one technique could be to generate test cases for such smaller fragments, which are easier for the user to validate.

3.3 Querying for Formula Correctness
Calculating expected values for intermediate cells can be difficult for users, as they also have to consider the cells preceding the one under investigation. Thus, we propose additional strategies in which we ask for the correctness of individual formulas. Answering such queries can in the best case be done by inspecting only one particular formula.

1. We can query the user about the elements of the most probable diagnoses, as done in [16], e.g., by limiting the search depth and by estimating fault probabilities.

2. In case of multiple-fault diagnoses, we can ask the user to inspect those formulas first that appear in the most diagnoses. If one cell appears in all diagnoses, it must definitely contain an error.

3. After having queried the user about the correctness of one particular formula, we can search for copy-equivalent formulas and ask the user to confirm the correctness of these formulas.

The rationale of this last strategy, which we will now discuss in more detail, is that in many real-world spreadsheets, structurally identical formulas exist in neighboring cells, which, for example, perform a row-wise aggregation of cell values. Such repetitive structures are one of the major reasons that the number of diagnosis candidates grows quickly. Thus, when the user has inspected one formula, we can ask him if the given answer also applies to all copy-equivalent formulas, which we can automatically detect.

Figure 3: A small extract of a faulty spreadsheet with structurally identical formulas

In the example spreadsheet shown in Figure 3, the user made a mistake in cell M13, entering a minus instead of the intended plus symbol. A basic MBD method would, in the worst case and depending on the test cases, return every single formula as an equally ranked diagnosis candidate. When applying the value-based query strategy of Section 3.2, the user would be asked to give feedback on the values of M1 to M12, which, however, requires a lot of manual calculations.

With the techniques proposed in this section, the formulas of the spreadsheet would first be ranked based on their fault probability. Let us assume that our heuristics say that users most probably make mistakes when writing IF-statements. In addition, the formula in M13 is syntactically more complex than those in M1 to M12 and thus more probably faulty. Based on this ranking, we would, for example, ask the user to inspect the formula of G1 first. Given the feedback that the formula is correct, we can ask the user to check the copy-equivalent formulas of G1 to L12. This task can be done very easily by the user by navigating through these cells and checking whether the formulas properly reflect the intended semantics, i.e., that the formulas were copied correctly. After that, according to the heuristic, the user is asked to inspect the formula in M13, which is actually the faulty one. In this example, we thus only needed 3 user interactions to find the cause of the error. In a random strategy, the user would have to inspect up to half of the formulas on average, depending on the test case. The evaluation of the described techniques is part of our ongoing work.

3.4 User Acceptance Issues
Independent of the chosen strategy, user studies have to be performed to assess which kinds of user input one can realistically expect, e.g., for which problem scenarios the user should be able to provide expected (ranges of) values for intermediate cells. In addition, the spreadsheet inspection exercise conducted in [12] indicates that users follow different fault localization strategies: some users, for example, start from the inputs, whereas others begin at the "result cells". Any interactive querying strategy should therefore be carefully designed and assessed with real users. As part of our future work, we furthermore plan to develop heuristics to select one of several possible debugging techniques depending on the user's problem identification strategy.
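A minimal sketch of the copy-equivalence detection used by the third strategy of Section 3.3: two formulas are treated as copy-equivalent if they become identical after rewriting each cell reference as a row/column offset from the formula's own cell (the idea behind the R1C1 notation). The formulas below are hypothetical stand-ins for the Figure 3 spreadsheet, and the regular expression is a simplification; a production version would parse formulas properly and also handle absolute references such as $A$1:

```python
import re

# Hypothetical copy-equivalence check: rewrite every A1-style reference as
# an offset from the formula's host cell and compare the results. The
# formulas mimic the Figure 3 situation, where M13 should be a copy of
# M1..M12 but contains a minus instead of a plus.
REF = re.compile(r"([A-Z]+)([0-9]+)")

def col_num(col):
    """Convert a column label (A, B, ..., Z, AA, ...) to a number."""
    n = 0
    for ch in col:
        n = n * 26 + ord(ch) - ord("A") + 1
    return n

def normalize(formula, host):
    """Rewrite each reference in 'formula' relative to the cell 'host'."""
    m = REF.match(host)
    host_col, host_row = col_num(m.group(1)), int(m.group(2))
    def rel(ref):
        return "R[%d]C[%d]" % (int(ref.group(2)) - host_row,
                               col_num(ref.group(1)) - host_col)
    return REF.sub(rel, formula)

def copy_equivalent(f1, cell1, f2, cell2):
    return normalize(f1, cell1) == normalize(f2, cell2)

print(copy_equivalent("K1+L1", "M1", "K12+L12", "M12"))  # -> True
print(copy_equivalent("K1+L1", "M1", "K13-L13", "M13"))  # -> False
```

Under this normalization, K1+L1 in M1 and K12+L12 in M12 both become R[0]C[-2]+R[0]C[-1] and are grouped together, while the faulty subtraction in M13 falls outside the group and remains an individual inspection target.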
4. ADDITIONAL CHALLENGES
Other open issues in the context of MBD-based debugging that we plan to investigate in future work include the following aspects.

Probability-Based Ranking.
Another approach to discriminate between diagnoses is to try to rank the sometimes numerous candidates in a way that those considered the most probable are listed first. Typically, one would for example list diagnosis candidates of smaller cardinality first, assuming that single faults are more probable than double faults. In addition, we can use fault statistics from the literature for different types of faults to estimate the probability of each diagnosis. In the spreadsheet domain, we could also rely on indicators like formula complexity or other spreadsheet smells [10], the location of the cell within the spreadsheet's overall structure, results from Spectrum-Based Fault Localization [11], or the number of recent changes made to a formula. User studies in the form of spreadsheet construction exercises, as done in [6], can help to identify or validate such heuristics.

Problem Encoding and Running Times.
For larger problem instances, the required running times for the diagnosis can exceed what is acceptable for interactive debugging. Faster commercial constraint solvers can alleviate this problem to some extent. However, automated problem decomposition and dependency analysis methods also represent a powerful means, to be explored further, for reducing the search complexity.

Another open issue is that in works relying on a CSP encoding of the spreadsheets, e.g., [3, 11] and our work, the calculations are limited to integers, which is caused by the limited floating-point support of free constraint solvers. More research is required in this area, including the incorporation of alternative reasoning approaches like, e.g., linear optimization.

User Interface Design.
Finally, as spreadsheet developers are usually not programmers, the user interface (UI) design plays a central role, and suitable UI metaphors and a corresponding non-technical terminology have to be developed. In Exquisite, we tried to leave the user as much as possible within the known MS Excel environment. Certain concepts like "test cases" are, however, not present in modern spreadsheet tools and require some learning effort from the developer. The recent work of [8] indicates that users are willing to spend some extra effort, e.g., in test case specification, to end up with more fault-free spreadsheets.

How the interaction mechanisms should actually be designed to be usable at least by experienced users is largely open in our view. In previous spreadsheet testing and debugging approaches like [1] or [2], for example, additional input was required from the user. In-depth studies about the usability of these extensions to standard spreadsheet environments are quite rare.
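The probability-based ranking discussed above can be sketched minimally as follows, assuming each cell has been assigned a prior fault probability (all numbers below are invented for illustration): a diagnosis is scored by the product of the priors of its cells and the complements of the priors of all remaining cells, which automatically prefers single faults over double faults when priors are small:

```python
# Hypothetical probability-based ranking. Each cell gets an assumed prior
# fault probability; a diagnosis is scored by the product of the priors of
# its cells and the complements (1 - prior) of all remaining cells.
priors = {"B1": 0.05, "B2": 0.05, "C1": 0.2, "D1": 0.1}  # invented numbers

def score(diagnosis):
    p = 1.0
    for cell, prior in priors.items():
        p *= prior if cell in diagnosis else (1.0 - prior)
    return p

# Rank some example diagnosis candidates, most probable first. The
# double-fault candidate ends up last, as its score multiplies two priors.
candidates = [{"B2"}, {"B1", "D1"}, {"C1"}]
ranked = sorted(candidates, key=score, reverse=True)
print(ranked)  # {'C1'} first, the double fault {'B1', 'D1'} last
```

In practice, the per-cell priors would come from the fault statistics and indicators listed above (formula complexity, smells, change history) rather than from fixed constants.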
5. SUMMARY
In this paper, we have discussed perspectives of constraint- and model-based approaches for algorithmic spreadsheet debugging. Based on our insights obtained so far from the Exquisite project, we have identified a number of open challenges in the domain and outlined approaches for interactive spreadsheet debugging.

Acknowledgements
This work was supported by the EU through the programme "Europäischer Fonds für regionale Entwicklung - Investition in unsere Zukunft" under contract number 300251802.

6. REFERENCES
[1] R. Abraham and M. Erwig. AutoTest: A Tool for Automatic Test Case Generation in Spreadsheets. In Proc. VL/HCC 2006, pages 43–50, 2006.
[2] R. Abraham and M. Erwig. GoalDebug: A Spreadsheet Debugger for End Users. In Proc. ICSE 2007, pages 251–260, 2007.
[3] R. Abreu, A. Riboira, and F. Wotawa. Constraint-based Debugging of Spreadsheets. In Proc. CIbSE 2012, pages 1–14, 2012.
[4] S. Außerlechner, S. Fruhmann, W. Wieser, B. Hofer, R. Spörk, C. Mühlbacher, and F. Wotawa. The Right Choice Matters! SMT Solving Substantially Improves Model-Based Debugging of Spreadsheets. In Proc. QSIC 2013, pages 139–148, 2013.
[5] S. Badame and D. Dig. Refactoring meets Spreadsheet Formulas. In Proc. ICSM 2012, pages 399–409, 2012.
[6] P. S. Brown and J. D. Gould. An Experimental Study of People Creating Spreadsheets. ACM TOIS, 5(3):258–272, 1987.
[7] J. de Kleer and B. C. Williams. Diagnosing Multiple Faults. Artificial Intelligence, 32(1):97–130, 1987.
[8] F. Hermans. Improving Spreadsheet Test Practices. In Proc. CASCON 2013, pages 56–69, 2013.
[9] F. Hermans, M. Pinzger, and A. van Deursen. Supporting Professional Spreadsheet Users by Generating Leveled Dataflow Diagrams. In Proc. ICSE 2011, pages 451–460, 2011.
[10] F. Hermans, M. Pinzger, and A. van Deursen. Detecting Code Smells in Spreadsheet Formulas. In Proc. ICSM 2012, pages 409–418, 2012.
[11] B. Hofer, A. Riboira, F. Wotawa, R. Abreu, and E. Getzner. On the Empirical Evaluation of Fault Localization Techniques for Spreadsheets. In Proc. FASE 2013, pages 68–82, 2013.
[12] D. Jannach and T. Schmitz. Model-Based Diagnosis of Spreadsheet Programs – A Constraint-Based Debugging Approach. Automated Software Engineering, to appear, 2014.
[13] D. Jannach, T. Schmitz, B. Hofer, and F. Wotawa. Avoiding, Finding and Fixing Spreadsheet Errors – A Survey of Automated Approaches for Spreadsheet QA. Journal of Systems and Software, to appear, 2014.
[14] R. Reiter. A Theory of Diagnosis from First Principles. Artificial Intelligence, 32(1):57–95, 1987.
[15] G. Rothermel, L. Li, C. DuPuis, and M. Burnett. What You See Is What You Test: A Methodology for Testing Form-Based Visual Programs. In Proc. ICSE 1998, pages 198–207, 1998.
[16] K. Shchekotykhin, G. Friedrich, P. Fleiss, and P. Rodler. Interactive Ontology Debugging: Two Query Strategies for Efficient Fault Localization. Journal of Web Semantics, 12–13:88–103, 2012.