Tool-supported fault localization in spreadsheets: Limitations of current evaluation practice

Birgit Hofer, Graz University of Technology, 8010 Graz, Austria, bhofer@ist.tugraz.at
Dietmar Jannach, TU Dortmund, 44221 Dortmund, Germany, dietmar.jannach@udo.edu
Thomas Schmitz, TU Dortmund, 44221 Dortmund, Germany, thomas.schmitz@udo.edu
Kostyantyn Shchekotykhin, University Klagenfurt, Austria, kostya@ifit.uni-klu.ac.at
Franz Wotawa, Graz University of Technology, 8010 Graz, Austria, wotawa@ist.tugraz.at

ABSTRACT

In recent years, researchers have developed a number of techniques to assist the user in locating a fault within a spreadsheet. The evaluation of these approaches is often based on spreadsheets into which artificial errors are injected. In this position paper, we summarize different shortcomings of these forms of evaluation and sketch possible remedies, including the development of a publicly available spreadsheet corpus for benchmarking as well as user and field studies to assess the true value of the proposed techniques.

Categories and Subject Descriptors

H.4.1 [Information Systems Applications]: Office Automation—Spreadsheets; D.2.5 [Software Engineering]: Testing and Debugging—Debugging aids

General Terms

Spreadsheets, Debugging, Fault Localization

1. INTRODUCTION

Locating the true causes of why a given spreadsheet program does not compute the expected outcomes can be a tedious task. Over the last years, researchers have developed a number of methods that support the user in the fault localization and correction (debugging) process. The techniques range from the visualization of suspicious cells or regions of the spreadsheet, and the application of known practices from software engineering such as spectrum-based fault localization (SFL) or slicing, to declarative and constraint-based reasoning techniques [1, 3, 6, 7, 9, 11, 12, 16].

However, there are a number of challenges common to all these approaches. Unlike other computer science sub-areas, such as natural language processing, information retrieval or automated planning and scheduling, no standard benchmarks exist for spreadsheet debugging methods. The absence of commonly used benchmarks prevents the direct comparison of spreadsheet debugging approaches. Furthermore, fault localization and debugging for spreadsheets require the design of a user-debugger interface. An important question in this context is: what input or interaction can realistically be expected from the user? Finally, the main question to be answered is whether or not automated debugging techniques actually help the developer, as discussed in [14] for imperative programs.

In this position paper, we discuss some limitations of the current research practice in the field and outline potential ways to improve it in the future.

2. LACK OF BENCHMARK PROBLEMS

To demonstrate the usefulness of a new debugging technique, we need spreadsheets containing faults. Since no public set of such spreadsheets exists, researchers often create their own suite of benchmark problems, e.g., by applying mutation operators to existing correct spreadsheets [2]. Unfortunately, these problems are only rarely made publicly available. This makes a comparative evaluation of approaches difficult, and it is often unclear whether the proposed technique is applicable to a wider class of spreadsheets.

In some papers, spreadsheets from the EUSES corpus¹ are used for evaluations. As no information exists about the intended semantics of these spreadsheets, mutations are applied in order to obtain faulty versions of the spreadsheets. The spreadsheets in this corpus are, however, quite diverse, e.g., with respect to their size or the types of the used formulas. Often only a subset of the documents is used in the evaluations, and the selection of this subset is not well justified. Even when the benchmark problems are publicly shared, like the ones used in [10], they may have special characteristics that are advantageous for a certain method and, e.g., contain only one single fault or use only certain functions or cell data types.

A corpus of diverse benchmark problems is strongly needed for spreadsheet debugging to make different research approaches better comparable and to be able to identify shortcomings of existing approaches. Such a corpus could be incrementally built by researchers sharing their real-world and artificial benchmark problems. In addition, since it is not always clear whether typical spreadsheet mutation operators truly correspond to the mistakes developers make, insights and practices from the Information Systems field should be better integrated into our research. This in particular includes the use of spreadsheet construction exercises in laboratory settings that help us identify which kinds of mistakes users make and what their debugging strategies are, see, e.g., [4].

¹ http://esquared.unl.edu/wikka.php?wakka=EUSESSpreadsheetCorpus
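To make the mutation-based fault injection discussed above concrete, the following is a minimal Python sketch that applies one simple formula mutation to a correct spreadsheet. The file name, the use of the openpyxl library, and the chosen mutation (replacing SUM by AVERAGE) are assumptions for illustration only and do not reproduce the operator catalog defined in [2].

    import random
    from openpyxl import load_workbook  # assumed third-party dependency

    # Load a correct spreadsheet; "budget.xlsx" is a hypothetical example file.
    wb = load_workbook("budget.xlsx")
    ws = wb.active

    # Formulas are stored as strings starting with "="; collect those using SUM.
    formula_cells = [cell for row in ws.iter_rows() for cell in row
                     if isinstance(cell.value, str) and "SUM(" in cell.value]

    if formula_cells:
        # Inject a single artificial fault into one randomly chosen formula.
        victim = random.choice(formula_cells)
        victim.value = victim.value.replace("SUM", "AVERAGE", 1)
        # Save the mutant; the original file remains the fault-free reference.
        wb.save("budget_mutant.xlsx")

Pairing such a mutated spreadsheet with its fault-free original yields the kind of artificially faulty benchmark instance on which many of the evaluations discussed in this paper rely.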
3. USABILITY AND USER ACCEPTANCE

Spreadsheet debugging research is often based on offline experimental designs, e.g., measuring how many of the injected faults are successfully located with a given technique, see, e.g., [5]. In some cases, plug-ins for spreadsheet environments are developed, as in [1] or [11]. Similar to plug-ins used for other purposes, e.g., spreadsheet testing, the usability of these plug-ins for end users is seldom the focus of the research. The proposed plug-ins typically require various types of input from the user at different stages of the debugging process. Some of these inputs have to be provided at the beginning of the process and some can be requested by the debugger during fault localization. Typical inputs of a debugger include statements about the correctness of values or formulas in individual cells [10], information about expected values for certain cells [1, 3], the specification of multiple test cases [11], etc.

In many cases, it remains unclear whether an average spreadsheet developer will be willing or able to provide these inputs, since concepts like test cases do not exist in the spreadsheet paradigm. Therefore, researchers have to ensure that a developer interprets the requests from the debugger correctly and provides the appropriate inputs expected by the debugger. One additional problem in this context is that user inputs, e.g., the test case specifications, are usually considered to be reliable, and most existing approaches have no built-in means to deal with errors in these inputs.

Overall, we argue that offline experimental evaluations should be paired with user studies whenever possible, as done, e.g., in [8] or [11]. Such studies should help us validate whether our approaches are based on realistic assumptions and are acceptable at least for ambitious users after some training. At the same time, observations of the users' behavior during debugging can be used to learn about their problem solving strategies and to evaluate whether the tool actually helped to find a fault.

Again, insights and practices from both the fields of Information Systems and Human Computer Interaction should be the basis for these forms of experiments.
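For concreteness, the following sketch shows one possible way the user inputs listed above could be encoded. It is a hypothetical format chosen for illustration; it is not the input format of any of the cited tools, and the cell references and values are invented.

    # Hypothetical encoding of typical debugger inputs (illustration only).

    # A test case: values for input cells and expected values for formula cells.
    test_case = {
        "inputs":   {"B2": 100, "B3": 250, "B4": 75},
        "expected": {"B6": 425},          # e.g., if B6 = SUM(B2:B4)
    }

    # Per-cell correctness statements the user may give during fault localization.
    cell_verdicts = {
        "B6": "incorrect",                # computed value deviates from expectation
        "C6": "correct",
    }

Even such a small specification presupposes that the user can articulate expected values and judge cell correctness, which is exactly the assumption questioned above.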
4. FIELD RESEARCH

In addition to user studies in laboratory environments, research on real spreadsheets as suggested in [15] is required to determine potential differences between the experimental usage of the proposed debugging methods and the everyday use of such tools in companies or institutes. Error rates and error types found in practice could differ from what is observed in user studies, whose participants in many cases are students. In [13], e.g., a construction exercise with business managers was conducted to determine error rates. In addition, the user acceptance of fault localization tools could vary strongly because of the different expectations of professional users with respect to the tools they use. To ensure usability for real users, existing spreadsheets can be examined and questionnaires with users can be conducted, as done, e.g., in [7].

5. CONCLUSIONS

A number of proposals have been made in the recent literature to assist the user in the process of locating faults in a given spreadsheet. In this position paper, we have identified some limitations of current research practice regarding the comparability and reproducibility of the results. As possible remedies to these shortcomings, we advocate the development of a corpus of benchmark problems and the increased adoption of user studies of various types as an evaluation instrument. As experimental settings differ from real life, we additionally propose to use field studies to obtain insights into how debugging methods are used in companies.
6. REFERENCES

[1] R. Abraham and M. Erwig. GoalDebug: A Spreadsheet Debugger for End Users. In Proc. ICSE 2007, pages 251–260, 2007.
[2] R. Abraham and M. Erwig. Mutation Operators for Spreadsheets. IEEE Trans. on Softw. Eng., 35(1):94–108, 2009.
[3] R. Abreu, A. Riboira, and F. Wotawa. Constraint-based debugging of spreadsheets. In Proc. CibSE'12, pages 1–14, 2012.
[4] P. S. Brown and J. D. Gould. An Experimental Study of People Creating Spreadsheets. ACM TOIS, 5(3):258–272, 1987.
[5] C. Chambers and M. Erwig. Automatic Detection of Dimension Errors in Spreadsheets. J. Vis. Lang. & Comp., 20(4):269–283, 2009.
[6] J. Cunha, J. P. Fernandes, H. Ribeiro, and J. Saraiva. Towards a catalog of spreadsheet smells. In Proc. ICCSA'12, pages 202–216, 2012.
[7] F. Hermans, M. Pinzger, and A. van Deursen. Supporting Professional Spreadsheet Users by Generating Leveled Dataflow Diagrams. In Proc. ICSE 2011, pages 451–460, 2011.
[8] F. Hermans, M. Pinzger, and A. van Deursen. Detecting and Visualizing Inter-Worksheet Smells in Spreadsheets. In Proc. ICSE 2012, pages 441–451, 2012.
[9] F. Hermans, M. Pinzger, and A. van Deursen. Detecting Code Smells in Spreadsheet Formulas. In Proc. ICSM 2012, pages 409–418, 2012.
[10] B. Hofer, A. Riboira, F. Wotawa, R. Abreu, and E. Getzner. On the Empirical Evaluation of Fault Localization Techniques for Spreadsheets. In Proc. FASE 2013, pages 68–82, 2013.
[11] D. Jannach and T. Schmitz. Model-based diagnosis of spreadsheet programs - A constraint-based debugging approach. Autom. Softw. Eng., to appear, 2014.
[12] D. Jannach, T. Schmitz, B. Hofer, and F. Wotawa. Avoiding, finding and fixing spreadsheet errors - a survey of automated approaches for spreadsheet QA. Journal of Systems and Software, to appear, 2014.
[13] F. Karlsson. Using two heads in practice. In Proc. WEUSE 2008, pages 43–47, 2008.
[14] C. Parnin and A. Orso. Are Automated Debugging Techniques Actually Helping Programmers? In Proc. ISSTA 2011, pages 199–209, 2011.
[15] S. G. Powell, K. R. Baker, and B. Lawson. A critical review of the literature on spreadsheet errors. Decision Support Systems, 46(1):128–138, 2008.
[16] J. Reichwein, G. Rothermel, and M. Burnett. Slicing Spreadsheets: An Integrated Methodology for Spreadsheet Testing and Debugging. In Proc. DSL 1999, pages 25–38, 1999.