Dependence Tracing Techniques for Spreadsheets: An Investigation Sohon Roy Felienne Hermans Delft University of Technology Delft University of Technology S.Roy-1@tudelft.nl F.F.J.Hermans@tudelft.nl ABSTRACT Spreadsheet cells contain data but also may contain formulas that spreadsheets, the visual structure that is perceived from just refer to data from other cells, perform operations on them, and looking at the cells is referred to as spreadsheet surface structure render the results directly to show it to the user. In order to [2] comparable to the anatomical structure of the human body. understand the structure of spreadsheets, one needs to understand However calculations are performed based on formulas and the the formulas that control cell-to-cell dataflow. Understanding this formulas connect the cells to form another kind of structure called cell-to-cell inter-relation or dependence tracing is easier done in the computational/deep structure that is comparable to the visual manners and therefore quite a few techniques have been nervous system of the human body. These two structures are often proposed over the years. This paper aims to report the results of not similar and at times can be radically different. The deep an investigative study of such techniques. The study is a first step structure reflects the data flow in the spreadsheet and is basically of an attempt to evaluate the relevance of these techniques from the cell-to-cell inter-dependence. In the understanding of a the point of view of their benefits and effectiveness in the context spreadsheet, this cell-to-cell inter-dependence plays a key role. of real world spreadsheet users. Results obtained from such a Without having a clear idea of cell-to-cell inter-dependence, the study will have the potential for motivating the conception of modification of a fairly complex spreadsheet becomes impossible newer and better techniques, in case it is found that the need for without ample risks of errors. It is considerably easier to them is still not fully catered. understand for a user if the referred cell(s) in a formula are indicated in an enhanced manner with visualization techniques, instead of having to manually inspect each and every formula and Categories and Subject Descriptors trying to locate the exact cell(s) that it is referring to. Therefore a H.4.1 [Information Systems Applications]: Office Automation – number of visualization techniques have been proposed in various Spreadsheets research papers over the years. However there are some questions about these techniques that still need to be explored and they form General Terms the core of our investigation. They are listed in subsection 1.3. Design, Experimentation, Human Factors 1.2 Motivation Keywords It is our opinion that visualization based dependence tracing End-user computing, Dependence tracing, Spreadsheet techniques, as found in research literature, are not making across visualizations to the industry of spreadsheet users. In a study conducted by Hermans et al. [3] with spreadsheet users working in a large Dutch financial company, it was found that “the most important 1. INTRODUCTION information needs of professional spreadsheet users concern the 1.1 Background structure of the formula dependencies”. This study also mentions Spreadsheets offer the end-users an interface that is incomparable the feeling of inadequacy felt by the users while using the only in its simplicity and flexibility. However it is mostly beneficial for available dependence tracing tool within their reach the Excel performing rapid calculations and quick simple analyses. This Audit toolbar [Fig.1]; a feature of MS (Microsoft) Excel which is interface is not helpful at all in understanding the design logic by far the most popular [1] spreadsheet application in the market. behind a spreadsheet, especially the type of understanding that is This feature demonstrates cell inter-dependencies with an overlaid necessary in order to make modifications to existing spreadsheets. dependency graph over a worksheet, with graph edges shown as Modification becomes harder in the case where it is done by a blue arrows; the edges however are generated on a cell-by-cell user different from the creator. This situation is fairly common in basis which has to be interactively activated by the user. Findings the industry as the average lifespan of spreadsheets have been of another informal survey conducted in October 2013 at the found to be 5 years [3] which can often prove too long for the offices of the UK based financial modeling company F1F91 also possibility that the original creator will be always available point repeatedly at the direction of the sense of inadequacy the whenever some modifications are required. When understanding spreadsheet users are suffering from when depending heavily on this Excel Audit tracing feature. These findings lead us to the question why there are no better tools available to spreadsheet users? Nevertheless, as will be shown in this paper, there is considerable amount of research already done on this topic. This gives rise to the question why implementations of such research are not making it to the industry? Only a handful of highly 1 F1F9: A financial modeling company http://www.f1f9.com/ Figure 1: Tracing dependents with Excel Audit toolbar: blue connecting arrows and coloring of precedent cells customized tools are existing today and that also are mostly used internally by organizations; they are not compared against each 1.4 Approach To ascertain answers to the research questions, as a first step, we other based on any well accepted metrics framework. Their did a critical review of the existing research literature on this efficacy in actually helping in the end-user experience is not specific topic of visualization based dependence tracing measured. Our investigation is therefore dedicated to evaluating techniques for spreadsheets. This paper summarizes in brief the the effectiveness of these proposed techniques in the context of findings of the review and the conclusions drawn from it. It real world spreadsheet users. Such an evaluation might also open essentially presents preliminary results and indicators related to up specific areas in which to improve upon or come up with the research questions. In order to illustrate our findings for this newer techniques that are not only innovative but viable in terms paper, we chose a number of research papers relevant on this topic of practically realizable implementations that can be adopted by and revisited their contents from the following aspects: spreadsheet users in the industry. I. The basic technique/principle/strategy 1.3 Hypothesis and Research Questions II. Characteristic features related to dependents tracing Hypothesis: Proposals thus far described and demonstrated in research literature about visualization based techniques for III. Tools or prototypes developed if any spreadsheet dependence tracing have not adequately made it IV. Comments or details available on testing, performance, across to the industry in forms of reliable, user-friendly, wide- and limitations spread, multi-platform, and standardized software tools of both V. Current status of the research and its implementation, stand-alone and plug-in type. and its perceived relevance or influence in the industrial On the basis of the premise established in Subsection 1.2 and the scene above mentioned hypothesis, we arrive at the following three research questions. 2. THE SELECTED RESEARCH PAPERS Research Questions: 2.1 Fluid Visualization of Spreadsheet R1. Why the proposals thus far described and Structures [4] demonstrated in research literature have not reached the In this paper Igarashi et al. provide the description of a industry as implementations? spreadsheet visualization technique mainly based on An attempt to study what may be the key causes of the perceived superimposition of visual enhancement and animations on top of bottleneck between research and industrial implementations. the regular tabular structure of spreadsheets. The strategy is R2. Is there any well-accepted metrics framework with primarily the use of graphical variation (color, shading, outlining, which such implementations as above (R1) can be compared to etc.), animation, and lightweight interaction that allows the user to each other? directly perceive the spreadsheet dataflow structure, keeping the tabular spreadsheet view unchanged. The transient local view If and when implementations are made available to the industry, it feature is a visual enhancement based on outlining and shading is necessary to measure their usefulness in actually helping the that allows a user to view the dataflow associated with a particular end-user computing experience. If such a framework is not there, cell. There is a static global view that visually enhances the entire then it can be devised and made into an industrial standard. spreadsheet by overlaying the complete dataflow graph of all the R3. Is there any well-defined opportunity for cells. Animated global explanation plays an animation to illustrate improvement in the dependence tracing context? the dataflow of the entire spreadsheet. Visual editing techniques is a graphical manipulation technique that allows the user to directly Improvement not just from the aspect of innovativeness of idea edit the generated dataflow graph in global static view by but also from the angle of how well the idea can be translated into dragging and its effect is then reflected in the spreadsheet a user-friendly and reliable implementation; the efficacy being structure as the textual formulas are updated automatically. A measured against metrics as mentioned in R2. prototype for UNIX was developed using Pad++ and Python. Pad++ was a visualization platform developed and maintained by University of Maryland. A video demonstration of the tool in reveal the deep structure of spreadsheets to the users. A set of visual methods is described followed by strategies on how to best use those visual tools for different purposes of checking. The functional identification feature demarcates cells with different colors according to whether they behave as input, output, processing or standalone and this classification is based on whether a cell is having dependents, precedents, both or none. Multi-precedents and dependents tool, block-precedents tool, and the in-block-precedents-dependents tool are all tools that illustrate various types of inter-cell dependencies with pointed arrow-heads similar to the Excel feature. The difference here being that arrows not only connect individual cells but also have the capability of offering the visual perception that they are connecting a set of Figure 2: Recursive lifting-up operation related cells that are visually grouped together by shading or action is available. It is mentioned that the smoothness of coloring; such group of cells are termed in the paper as cell block. animation is limited to spreadsheets of 400 cells2 or lesser. Three debugging strategies each for global and local context were Performance of the tool radically degrades with increase in size of described to illustrate the use of these tools. The tools were the spreadsheets. There is no information if the efficacy of the implemented using VBA (Visual Basic for Applications) and prototype was tested with real spreadsheet users. No future plan is authors claimed that they can be plugged in to any Excel provided on how this tool can be implemented or scaled up for installation. In spite of claims that the tools increase usability of use in the industry of spreadsheet users. Pad++ and its support has spreadsheets, no details were given about user acceptance or any been long discontinued and the project is closed by UMD. measurement of by how much they increased usability. However, an extension of the idea of “transient local view” as proposed in this paper can be observed in MS Excel version 2007 2.4 Spreadsheet Visualisation3 to Improve onwards. In Excel 2007 the precedent cells of a cell are outlined End User Understanding [1] in different colors. In Excel 2013 the precedent cells are actually In this paper Ballinger et al. provide description of a visualization shaded fully in different colors [Fig.1]. toolkit that could ease understanding of spreadsheets by introducing visual abstraction with types of images that emphasize 2.2 3D Interactive Visualization for Inter-cell on layout and dependency rather than values of cells. In order to Dependencies of Spreadsheets [5] achieve this, their idea was to extract all the information contained In this paper Shiozawa et al. propose a technique of cell in a spreadsheet and utilize that in a more versatile programming dependence visualization in 3D based on an interactive lifting up environment to quickly generate visualizations. They chose Java operation. The technique utilizes the fact that spreadsheets are two for this purpose and since Excel is the most popular spreadsheet dimensional tabular structures and therefore the third dimension application, their toolkit was designed to operate on can be used to depict complementary information like cell inter- dependencies. A spreadsheet is first graphically re-rendered in a 3D space. Next, users are allowed to select a cell and drag it upwards level-wise along the z-axis. The selected cell’s dependent cells are pointed with arrows [Fig.2] and they themselves are also lifted up but kept one level below the selected cell. However in this case the advantage is in the fact that unlike in Excel, arrows connecting dependent cells lying on the same row would never overlap with each other to generate visual ambiguity. The lifting up operation is recursively repeated on the dependent cells as well to generate a leveled tree structure in 3D. This provides the user a clear idea of which cells in the sheet are more important by looking at the levels of dependents lying below them. A prototype Figure 3: (a) Data dependency unit vector map (b) Spring view for UNIX was developed by modifying the spreadsheet program graph structure SLSC. The 3D graphics were implemented with OpenGL APIs. Excel spreadsheets. The toolkit is capable of extracting low level No information regarding the performance of the prototype is structural information and data from spreadsheet files, analyze provided. For an application such as this, making heavy use of that information, and produce visualization. The data dependency computer graphics, it is presumable that performance and scaling flow feature is capable of generating 2D and 3D maps that could be a concern. Unfortunately the paper does not throw any illustrate the general drift of dataflow in a spreadsheet with arrows light on this matter. Neither was given any detail about how of unit magnitude [Fig.3 (a)]. This helps reduce the visual clutter beneficial or acceptable the tool proved for spreadsheet users. which normally occurs with arrows of different lengths due to different distances between cells. The graph structure feature 2.3 Visual Checking of Spreadsheets [2] provides the spring view [Fig.3 (b)] which is a generated graph of In this paper Chen et al. propose a set of strategies aimed at cells stripped of their values. The detailed inspection of formula checking and debugging of spreadsheets using visual methods to feature provides visualizations that are similar to Excel Audit and 2 This is a much smaller number of cells than what is observed in 3 typical real life spreadsheets Paper is in New Zealand English block precedents tool (subsection 2.3) but they are not overlaid on 3. CONCLUSIONS spreadsheets; the images are generated on spreadsheet-like matrix Our study indicates that each of the five research papers proposes structures and the cells are reduced to row-column intersection unique and innovative visualization techniques based on different points, their values wiped out to reduce visual overhead on the strategies. All of them offer rich set of features intended to help user’s understanding. The toolkit was run successfully on a corpus spreadsheet users from different angles. Only two of them have of 259 workbooks. User-studies were not conducted and no prototypes running on UNIX, both of which, to the best of our details were given on whether real users found it convenient beliefs will prove incompatible for current use on any popular enough to understand the various types of images. platform. One has Excel based VBA implementation which supposedly should work as plug-in to any Excel version but is 2.5 Supporting Professional Spreadsheet subject to be tested against version incompatibility. Two of them Users by Generating Leveled Dataflow have full-fledged standalone implementations based on Java and Diagrams [3] C#, both accepting Excel spreadsheets as inputs, but only one of In this paper Hermans et al. propose a spreadsheet visualization them has found practical exposure in the industry. This reinforces technique and the description of an implementation along with the the need to explore our research question “R1. Why the findings of a user study. The work in this paper extends that of proposals thus far described and demonstrated in research previous work by the authors about extraction of class diagrams literature have not reached the industry as implementations?” from spreadsheets. The basic principle depends upon classifying Only one of the research ideas has been properly validated against all cells in a spreadsheet as either of type data, formula, label, or a set of real world professional spreadsheet users. The efficacies empty. Diagrams similar to ER (Entity-Relationship) diagrams are of the rest of the research ideas have only been claimed in writing next created by representing data cells as entities and formula but not demonstrated by user studies. This further reinforces the cells as method (operation) + entity (result). The interconnections need to explore our second research question “R2. Is there any are illustrated as relationships. Next these elements are grouped well-accepted metrics framework with which such together based on the presence of label type cells to form larger implementations as above (R1) can be compared to each entities that represent cell blocks. These are then assembled other?” The above findings also lead us towards the general conclusion that our third research question “R3. Is there any well-defined opportunity for improvement in the dependence tracing context?” is an open question indeed. In that light we therefore judge that a suitable next step would be to do a more exhaustive search of available spreadsheet visualization tools and 1) actually test them on industrially used spreadsheets such as those available in the EUSES corpus and if the tools are found to be performing in a reliable manner then 2) test them on an adequately large and well represented spreadsheet users group to measure usability. 4. REFERENCES Figure 4: Global view (L) and Worksheet view (R) [1] Ballinger, D., Biddle, R., Noble, J. 2003. Spreadsheet Visualisation to Improve End-user Understanding. In proceedings of the Asia- inside entities that represent their respective worksheets. In this Pacific Symposium on Information Visualisation - Volume 24 manner the hierarchical leveled dataflow diagrams are generated. (APVIS 2003), Adelaide, Australia, pp. 99–109. The global view [Fig.4] feature offers the users a high level interactive visualization of the whole workbook showing the [2] Chen, Y., Chan, H. C. 2000. Visual Checking of Spreadsheets. In proceedings of the European Spreadsheet Risks Interest Group 1st dependencies between worksheets. The worksheet view shows the Annual Conference (EuSpRIG 2000), London, United Kingdom. dependencies between blocks in the same sheet and the low level formula view shows in details how individual cells are inter- [3] Hermans, F., Pinzger, M., Deursen, A. van. 2011. Supporting connected via formulas. A tool was developed called GyroSAT Professional Spreadsheet Users by Generating Leveled Dataflow (Gyro Spreadsheet Analysis Toolkit) in C# 4.0. The output Diagrams. In proceedings of the 33rd International Conference on Software Engineering (ICSE 2011), Waikiki, Honolulu, HI, USA, dataflow diagram is produced in DGML (Directed Graph Markup pp. 451–460. Language) which can be viewed and navigated in Microsoft Visual Studio 2010 Ultimate’s built-in DGML browser. This tool [4] Igarashi, T., Mackinlay, J., Chang, B.-W., Zellweger, P. 1998. Fluid was extensively evaluated with a user group consisting of 27 Visualization of Spreadsheet Structures. In proceedings of the IEEE Symposium on Visual Languages (VL 1998), Halifax, NS, Canada, professional spreadsheet users working in a large Dutch financial pp. 118–125. management company. A set of 9 spreadsheets that were used for testing in 9 case studies had number of worksheets ranging from 4 [5] Shiozawa, H., Okada, K., Matsushita, Y. 1999. 3D Interactive to 42, and number of cells ranging from 1048 to 503050. Visualization for Inter-Cell Dependencies of Spreadsheets. In Subsequently this tool and its features have been integrated into proceedings of the IEEE Symposium on Information Visualization (Info Vis 1999), San Francisco, CA, USA, pp. 79–82, 148. the set of services offered by the spreadsheet solutions company Infotron.4 4 Infotron is a spreadsheet solution company offering web based spreadsheet analysis services http://www.infotron.nl/