Tool-supported fault localization in spreadsheets: Limitations of current evaluation practice

Birgit Hofer, Graz University of Technology, 8010 Graz, Austria, bhofer@ist.tugraz.at
Dietmar Jannach, TU Dortmund, 44221 Dortmund, Germany, dietmar.jannach@udo.edu
Thomas Schmitz, TU Dortmund, 44221 Dortmund, Germany, thomas.schmitz@udo.edu
Kostyantyn Shchekotykhin, University Klagenfurt, Austria, kostya@ifit.uni-klu.ac.at
Franz Wotawa, Graz University of Technology, 8010 Graz, Austria, wotawa@ist.tugraz.at

ABSTRACT

In recent years, researchers have developed a number of techniques to assist the user in locating a fault within a spreadsheet. The evaluation of these approaches is often based on spreadsheets into which artificial errors are injected. In this position paper, we summarize different shortcomings of these forms of evaluation and sketch possible remedies, including the development of a publicly available spreadsheet corpus for benchmarking as well as user and field studies to assess the true value of the proposed techniques.

Categories and Subject Descriptors

H.4.1 [Information Systems Applications]: Office Automation—Spreadsheets; D.2.5 [Software Engineering]: Testing and Debugging—Debugging aids

General Terms

Spreadsheets, Debugging, Fault Localization

1. INTRODUCTION

Locating the true causes of why a given spreadsheet program does not compute the expected outcomes can be a tedious task. Over the last years, researchers have developed a number of methods that support the user in the fault localization and correction (debugging) process. The techniques range from the visualization of suspicious cells or regions of the spreadsheet, and the application of known practices from software engineering such as spectrum-based fault localization (SFL) or slicing, to declarative and constraint-based reasoning techniques [1, 3, 6, 7, 9, 11, 12, 16].

However, there are a number of challenges common to all these approaches. Unlike other computer science sub-areas, such as natural language processing, information retrieval or automated planning and scheduling, no standard benchmarks exist for spreadsheet debugging methods. The absence of commonly used benchmarks prevents the direct comparison of spreadsheet debugging approaches. Furthermore, fault localization and debugging for spreadsheets require the design of a user-debugger interface. An important question in this context is: what input or interaction can realistically be expected from the user? Finally, the main question to be answered is whether or not automated debugging techniques actually help the developer, as discussed in [14] for imperative programs.

In this position paper, we discuss some limitations of the current research practice in the field and outline potential ways to improve it in the future.

2. LACK OF BENCHMARK PROBLEMS

To demonstrate the usefulness of a new debugging technique, we need spreadsheets containing faults. Since no public set of such spreadsheets exists, researchers often create their own suite of benchmark problems, e.g., by applying mutation operators to existing correct spreadsheets [2]. Unfortunately, these problems are only rarely made publicly available. This makes a comparative evaluation of approaches difficult, and it is often unclear whether the proposed technique is applicable to a wider class of spreadsheets.

In some papers, spreadsheets from the EUSES corpus¹ are used for evaluations. As no information exists about the intended semantics of these spreadsheets, mutations are applied in order to obtain faulty versions of the spreadsheets. The spreadsheets in this corpus are, however, quite diverse, e.g., with respect to their size or the types of the used formulas. Often only a subset of the documents is used in the evaluations, and the selection of this subset is not well justified. Even when the benchmark problems are publicly shared, like the ones used in [10], they may have special characteristics that are advantageous for a certain method and, e.g., contain only one single fault or use only certain functions or cell data types.

A corpus of diverse benchmark problems is strongly needed for spreadsheet debugging to make different research approaches better comparable and to be able to identify shortcomings of existing approaches. Such a corpus could be incrementally built by researchers sharing their real-world and artificial benchmark problems. In addition, since it is not always clear whether typical spreadsheet mutation operators truly correspond to the mistakes developers make, insights and practices from the Information Systems field should be better integrated into our research. This in particular includes the use of spreadsheet construction exercises in laboratory settings that help us identify which kinds of mistakes users make and what their debugging strategies are, see, e.g., [4].

¹ http://esquared.unl.edu/wikka.php?wakka=EUSESSpreadsheetCorpus
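To make the mutation-based fault injection discussed above concrete, the following is a minimal Python sketch that applies one simple formula mutation to a correct spreadsheet. The file name, the use of the openpyxl library, and the chosen mutation (replacing SUM by AVERAGE) are assumptions for illustration only and do not reproduce the operator catalog defined in [2].

    import random
    from openpyxl import load_workbook  # assumed third-party dependency

    # Load a correct spreadsheet; "budget.xlsx" is a hypothetical example file.
    wb = load_workbook("budget.xlsx")
    ws = wb.active

    # Formulas are stored as strings starting with "="; collect those using SUM.
    formula_cells = [cell for row in ws.iter_rows() for cell in row
                     if isinstance(cell.value, str) and "SUM(" in cell.value]

    if formula_cells:
        # Inject a single artificial fault into one randomly chosen formula.
        victim = random.choice(formula_cells)
        victim.value = victim.value.replace("SUM", "AVERAGE", 1)
        # Save the mutant; the original file remains the fault-free reference.
        wb.save("budget_mutant.xlsx")

Pairing such a mutated spreadsheet with its fault-free original yields the kind of artificially faulty benchmark instance on which many of the evaluations discussed in this paper rely.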
3. USABILITY AND USER ACCEPTANCE

Spreadsheet debugging research is often based on offline experimental designs, e.g., measuring how many of the injected faults are successfully located with a given technique, see, e.g., [5]. In some cases, plug-ins for spreadsheet environments are developed, as in [1] or [11]. Similar to plug-ins used for other purposes, e.g., spreadsheet testing, the usability of these plug-ins for end users is seldom the focus of the research. The proposed plug-ins typically require various types of input from the user at different stages of the debugging process. Some of these inputs have to be provided at the beginning of the process and some can be requested by the debugger during fault localization. Typical inputs of a debugger include statements about the correctness of values or formulas in individual cells [10], information about expected values for certain cells [1, 3], the specification of multiple test cases [11], etc.

In many cases, it remains unclear whether an average spreadsheet developer will be willing or able to provide these inputs, since concepts like test cases do not exist in the spreadsheet paradigm. Therefore, researchers have to ensure that a developer interprets the requests from the debugger correctly and provides the appropriate inputs expected by the debugger. One additional problem in this context is that user inputs, e.g., the test case specifications, are usually considered to be reliable, and most existing approaches have no built-in means to deal with errors in these inputs.

Overall, we argue that offline experimental evaluations should be paired with user studies whenever possible, as done, e.g., in [8] or [11]. Such studies should help us validate whether our approaches are based on realistic assumptions and are acceptable at least for ambitious users after some training. At the same time, observations of the users' behavior during debugging can be used to learn about their problem solving strategies and to evaluate whether the tool actually helped to find a fault.

Again, insights and practices from both the fields of Information Systems and Human Computer Interaction should be the basis for these forms of experiments.
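For concreteness, the following sketch shows one possible way the user inputs listed above could be encoded. It is a hypothetical format chosen for illustration; it is not the input format of any of the cited tools, and the cell references and values are invented.

    # Hypothetical encoding of typical debugger inputs (illustration only).

    # A test case: values for input cells and expected values for formula cells.
    test_case = {
        "inputs":   {"B2": 100, "B3": 250, "B4": 75},
        "expected": {"B6": 425},          # e.g., if B6 = SUM(B2:B4)
    }

    # Per-cell correctness statements the user may give during fault localization.
    cell_verdicts = {
        "B6": "incorrect",                # computed value deviates from expectation
        "C6": "correct",
    }

Even such a small specification presupposes that the user can articulate expected values and judge cell correctness, which is exactly the assumption questioned above.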
4. FIELD RESEARCH

In addition to user studies in laboratory environments, research on real spreadsheets as suggested in [15] is required to determine potential differences between the experimental usage of the proposed debugging methods and the everyday use of such tools in companies or institutes. Error rates and error types found in practice could differ from what is observed in user studies, whose participants in many cases are students. In [13], e.g., a construction exercise with business managers was conducted to determine error rates. In addition, the user acceptance of fault localization tools could vary strongly because of the different expectations of professional users with respect to the tools they use. To ensure usability for real users, existing spreadsheets can be examined and questionnaires with users can be conducted, as done, e.g., in [7].

5. CONCLUSIONS

A number of proposals have been made in the recent literature to assist the user in the process of locating faults in a given spreadsheet. In this position paper, we have identified some limitations of current research practice regarding the comparability and reproducibility of the results. As possible remedies to these shortcomings, we advocate the development of a corpus of benchmark problems and the increased adoption of user studies of various types as an evaluation instrument. As experimental settings differ from real life, we additionally propose to use field studies to obtain insights into how debugging methods are used in companies.
6. REFERENCES

[1] R. Abraham and M. Erwig. GoalDebug: A Spreadsheet Debugger for End Users. In Proc. ICSE 2007, pages 251–260, 2007.
[2] R. Abraham and M. Erwig. Mutation Operators for Spreadsheets. IEEE Trans. on Softw. Eng., 35(1):94–108, 2009.
[3] R. Abreu, A. Riboira, and F. Wotawa. Constraint-based debugging of spreadsheets. In Proc. CibSE'12, pages 1–14, 2012.
[4] P. S. Brown and J. D. Gould. An Experimental Study of People Creating Spreadsheets. ACM TOIS, 5(3):258–272, 1987.
[5] C. Chambers and M. Erwig. Automatic Detection of Dimension Errors in Spreadsheets. J. Vis. Lang. & Comp., 20(4):269–283, 2009.
[6] J. Cunha, J. P. Fernandes, H. Ribeiro, and J. Saraiva. Towards a catalog of spreadsheet smells. In Proc. ICCSA'12, pages 202–216, 2012.
[7] F. Hermans, M. Pinzger, and A. van Deursen. Supporting Professional Spreadsheet Users by Generating Leveled Dataflow Diagrams. In Proc. ICSE 2011, pages 451–460, 2011.
[8] F. Hermans, M. Pinzger, and A. van Deursen. Detecting and Visualizing Inter-Worksheet Smells in Spreadsheets. In Proc. ICSE 2012, pages 441–451, 2012.
[9] F. Hermans, M. Pinzger, and A. van Deursen. Detecting Code Smells in Spreadsheet Formulas. In Proc. ICSM 2012, pages 409–418, 2012.
[10] B. Hofer, A. Riboira, F. Wotawa, R. Abreu, and E. Getzner. On the Empirical Evaluation of Fault Localization Techniques for Spreadsheets. In Proc. FASE 2013, pages 68–82, 2013.
[11] D. Jannach and T. Schmitz. Model-based diagnosis of spreadsheet programs - A constraint-based debugging approach. Autom. Softw. Eng., to appear, 2014.
[12] D. Jannach, T. Schmitz, B. Hofer, and F. Wotawa. Avoiding, finding and fixing spreadsheet errors - a survey of automated approaches for spreadsheet QA. Journal of Systems and Software, to appear, 2014.
[13] F. Karlsson. Using two heads in practice. In Proc. WEUSE 2008, pages 43–47, 2008.
[14] C. Parnin and A. Orso. Are Automated Debugging Techniques Actually Helping Programmers? In Proc. ISSTA 2011, pages 199–209, 2011.
[15] S. G. Powell, K. R. Baker, and B. Lawson. A critical review of the literature on spreadsheet errors. Decision Support Systems, 46(1):128–138, 2008.
[16] J. Reichwein, G. Rothermel, and M. Burnett. Slicing Spreadsheets: An Integrated Methodology for Spreadsheet Testing and Debugging. In Proc. DSL 1999, pages 25–38, 1999.