Enron versus EUSES: A Comparison of Two Spreadsheet Corpora Bas Jansen Delft University of Technology Email: b.jansen@tudelft.nl Abstract—Spreadsheets are widely used within companies and Previously, attempts have been made to compose a dataset often form the basis for business decisions. Numerous cases of spreadsheets that are used within companies, but it is very are known where incorrect information in spreadsheets lead to difficult to convince companies to share their spreadsheets for incorrect decisions. Such cases underline the relevance of research on the professional use of spreadsheets. research purposes. And if they are willing to do so, it is even Recently a new dataset became available for research, contain- more difficult to get permission to make these spreadsheets ing over 15.000 business spreadsheets that were extracted from available for the research community. the Enron E-mail Archive. With this dataset, we 1) aim to obtain Recently Hermans and Murphy-Hill introduced a new set of a thorough understanding of the characteristics of spreadsheets spreadsheets [7], consisting of more than 15,000 spreadsheets used within companies, and 2) compare the characteristics of the that were extracted from the already existing Enron Email Enron spreadsheets with the EUSES corpus which is the existing state of the art set of spreadsheets that is frequently used in archive that was made public during the legal investigation spreadsheet studies. following the Enron bankruptcy [8]. The size of this dataset Our analysis shows that 1) the majority of spreadsheets are is significantly larger than the EUSES corpus and maybe even not large in terms of worksheets and formulas, do not have a high more important: we know that these spreadsheets were used degree of coupling, and their formulas are relatively simple; 2) in industry. This dataset is publicly available2 . the spreadsheets from the EUSES corpus are, with respect to the measured characteristics, quite similar to the Enron spreadsheets. Hermans and Murphy-Hill performed a preliminary analysis of some basic characteristics (like number of worksheets, number of cells, etc.) of these spreadsheets. To obtain a more I. I NTRODUCTION thorough understanding of the characteristics of industry grade spreadsheets, we extend the analysis with additional metrics on Spreadsheets are used widely within companies. It was, for the degree of coupling and the use of functions. Furthermore, example, estimated that 95% of U.S. firms use spreadsheets we compare this dataset extensively with the EUSES corpus for financial reporting [1]. The information in spreadsheets and, by using statistical tests, answer the question of its often forms the basis for significant business decisions [2]. representativeness with respect to spreadsheets that are used However, from previous research we know that spreadsheets by companies. are error-prone [3]. This poses the risk for companies of taking The contributions of this paper are: important decisions on inaccurate information. Eventually, this • A detailed analysis of a large dataset of industrial spread- can lead to incorrect decisions and loss of money. The growing sheets on dimensions of size and coupling list of Horror Stories of the European Spreadsheet Risk Interest • A detailed analysis of how functions are used in spread- Group1 illustrates this and stresses the importance of spread- sheets in a business environment sheet research. It is important to understand why spreadsheet • A comparison of the characteristics of the spreadsheets errors are easily made and how this can be improved. in the Enron corpus with the EUSES corpus. One way to address this, is to study a large set of real world II. R ESEARCH Q UESTIONS spreadsheets. The EUSES corpus, a set of approximately 4,500 The Enron dataset allows us to understand and quantify the spreadsheets from different domains, is used most frequently characteristics of industry grade spreadsheets. In this study we for this purpose. Although the EUSES corpus has proven to focus on the level of coupling and the size of spreadsheets. The be very useful and is used in many research projects [4]– reason for this is that we know from software engineering that [6], there are also some drawbacks. First of all the majority source code tends to be more error-prone when the coupling of these spreadsheets were obtained from the public world- between the different units of a program is high or the program wide-web. We do not know if these spreadsheets are similar consists of large parts [9]. Previous research indicates that this to spreadsheets that are used within companies. Furthermore also applies to spreadsheets [10]. the size of the corpus is relatively small. Finally we do not Modern spreadsheet systems contain over 300 different know anything about the context in which the spreadsheets functions and with these functions users can create very com- were created. plex formulas. In previous work, we introduced the concept 1 www.eusprigorg/horror-stories.htm 2 www.felienne.com/enron of a visual language for spreadsheets [11] for which functions form the building blocks. In this study we therefore analyze, in addition to the metrics on size and coupling, how functions are used. To provide context, we compare the characteristics of spreadsheets that are used in industry with the spreadsheets of the EUSES corpus. It will also allow us to evaluate the representativeness of the EUSES corpus. Hence we aim to answer the following research questions in this paper: Fig. 1. Precedents, Transitive Precedents, and Path Depth • What are the characteristics, in terms of size and cou- pling, of spreadsheets in a business environment? • How are spreadsheet functions used? the level of the worksheet. Two identical formulas on • Is there a difference with respect to these metrics between different worksheets will count as two unique formulas. spreadsheets that are used in a business context and the We determine the unique formulas by looking at the EUSES corpus? relative R1C1 notation of the formula. As indicated by III. A PPROACH Sajaniemi, this notation stays the same even if you copy it down or right [13]. For this paper, we use two datasets: Enron and EUSES. We • Length of formula (s5) From software engineering, we analyzed the spreadsheets in the two datasets with the Spread- know that a large number of lines of code increase the sheet Scantool, developed at Delft University of Technology. chance on errors. In spreadsheets, we could consider the The tool runs on the previously developed Breviz core, that individual formulas to be comparable to lines of code. So, was made for spreadsheet visualization and smell detection if more lines of source code lead to a higher error rate, it [12]. The Scantool collects several metrics on spreadsheet, is reasonable to assume that the longer the formula, the worksheet and cell level. The metrics that were used for this more likely it is that it contains an error. Therefore, we paper can be found in Table I. measure the length of formulas in characters. • Path depth (c6), transitive precedents (c7), and prece- TABLE I OVERVIEW OF USED METRICS dents (f3) Most formulas receive input from other cells. These input cells are the so called precedents. These Dimension Metric precedents themselves could also receive input from other Size s1 # non-empty cells per spreadsheet cells. If you trace along these precedents until you reach s2 # worksheets per spreadsheet a cell without one, you have the number of transitive s3 # formulas per spreadsheet precedents. The path depth is the longest calculation s4 # unique formulas per spreadsheet chain, see also Figure 1. s5 length of formula in characters • Parse tree depth (f4) This is a measure of how nested a Coupling c1 % spreadsheets linked to other spreadsheets formula is. The formula D9 + E9 has a parse tree dept of c2 # external links per spreadsheet 2, the formula (B5 - T5) / (B6 * SQRT(4)) a parse tree c3 % spreadsheets with intraworksheets connections c4 # intraworksheet connections per spreadsheet depth of 5. c5 # passing cells as a percentage of total cells To determine if there is any difference in terms of these c6 path depth per formula metrics between the Enron and EUSES spreadsheets, we have c7 transitive precedents per formula calculated the difference between the distributions of the two Functions f1 usage of built-in functions datasets using a Wilcoxon-Mann-Whitney test and the Cliff’s f2 usage of functions per category Delta d effect size. We use this test to analyze whether there f3 # precedents per formula f4 parse tree depth is a significant difference between the the distribution of the f5 # unique functions per formula metrics of the Enron and EUSES spreadsheets. If a significant difference is found, we use the Cliff’s Delta effect size to Some of these metrics (marked bold in Table I) deserve measure the magnitude of the difference. some further explanation. • Number of unique formulas per spreadsheet (s4) In IV. R ESULTS spreadsheets, it is very common to define a formula in one cell and then copy it down or right to other cells. For In this section we present the results of our analysis of this reason, many of the formula cells in a spreadsheet the Enron and EUSES spreadsheets. First we will discuss the contain the same formula and only the references to metrics for size and coupling. Next we have a closer look other cells differ. Therefore, we also measure the number at the use of functions and we conclude this section with a of unique formulas in the spreadsheet. We do this on comparison between the Enron and the EUSES spreadsheets. A. Size 3) Cells: On the cell level, we have analyzed how formulas are linked to another via cell references. A formula receives Previous research has shown that lengthy source code will values from other cells (precedents / cell fan in), and the result increase the error-rate [9]. Therefore we analyse both datasets of the formula can be used in other formulas (dependents / on the dimension of size. The results of this analysis can be cell fan out). To gain a better understanding of the degree found in Table II. Two measures for the size of a spreadsheet of coupling on the cell level, we look at the path depth are the number of non-empty cells (s1) and the number (longest path of precedents) and the total number of transitive of worksheets (s2)3 We notice that on average the Enron precedents, see Table III. spreadsheets contain more non-empty cells than the EUSES spreadsheets. The number of worksheets per spreadsheet is Before we can calculate the number of transitive precedents, almost identical for both datasets. For context consider a study there is one adjustment that we have to make. A pattern that of Hermans et. al. [10]. They asked financial professionals we witness frequently in spreadsheets is a formula that is just to select one of their large and complex spreadsheets. These passing data (eg. ‘=A1’). Within the Enron set this is true spreadsheets contained tens of thousands of non-empty cells for 35% of the formulas and for EUSES this is 22% (c5). and have on average 10 worksheets. Based on this we can con- Most of these passing cells have a path depth and number of clude that both number of cells and the number of worksheets precedents of one.4 To get a better insight of the ‘real’ links in the Enron and EUSES spreadsheets are relatively small. we have excluded these passing formulas in our analysis of path depth and number of precedents. Formulas process and manipulate the data in the spread- sheet. So another way to look at the size of a spreadsheet is The path depth (c6) of Enron is almost identical to EUSES. to look at 1) the number of formulas per spreadsheet (s3), 2) Both data sets have a median path depth of one, which is the number of unique formulas per spreadsheet (s4), and 3) small. If we look at the total number of cells that influence the length of the formulas in characters (s5). Within the Enron a formula (c7), we observe that the median for Enron and set, formulas are used more frequently than in the EUSES set, EUSES are relatively low and almost the same (respectively but the latter tend to be slightly longer. three and four). Noteworthy is the maximum of more than 22,000 transitive precedents for a single formula in the Enron B. Coupling set. Besides size, a high degree of coupling between the units C. Use of Functions of a program could also lead to more errors. We therefore Modern spreadsheet software contains over 300 different analyze the degree of coupling in spreadsheets. This can be functions that users can combine to create complex formulas. done on several levels within a spreadsheet: Furthermore, the functions form the building blocks for the • Coupling between spreadsheets spreadsheet model. Research on learning APIs has shown that • Coupling between worksheets analyzing usage of source code can reveal interesting patterns • Coupling between cells [14]. Therefore, we have analyzed the use of functions in the The values of the Enron and EUSES spreadsheets for the Enron dataset and compared it with EUSES. different measures of coupling are summarized in Table III. From Hermans and Murphy-Hill, we know that there is 1) Spreadsheets: Starting with coupling between spread- little diversity in the functions that are used in the Enron sheets (c1), we can observe that only a minority of spread- spreadsheets [7]. In this analysis, we observe the same. We sheets links to another spreadsheet (Enron 11%, EUSES 1%). have selected all unique formulas with at least one function. For both sets, the majority of these spreadsheets only link to For this metric we focussed on the use of build-in Excel one or two other spreadsheets. The use of external links (c2) functions (F1). Hence, we excluded operators (+, -, *, /, ˆ) is slightly higher in the Enron set than the EUSES set. and user-defined functions. In the resulting formulas the top 2) Worksheets: On the worksheet level we observe that, 15 functions cover 69% of the spreadsheets. For EUSES, this although there are more spreadsheets with one or more in- is almost the same: 70%. traworksheet connections (c3) than spreadsheets with external Table IV gives an overview of the fifteen most frequently links, it is still a minority (Enron 20%, EUSES 10%). If we used functions in the two datasets. look at the number of connections (c4), we find that the median Although it is true that a small set of functions covers the for Enron and EUSES is respectively four and three, see Table majority of spreadsheets, this set of functions is not the same III. in each dataset. We see that only six functions (marked in It is fair to state that the level of coupling on both the level bold) are present in both datasets. This probably indicates that of spreadsheets and the level of worksheets is low. The vast the set of functions that are used in a spreadsheet depends on majority of spreadsheets contain no links to other spreadsheets either the company, department, user or business domain. or other worksheets within the same spreadsheet. In the previous paragraphs we have looked at the individual functions. However there are several functions that can be 3 Up to Excel 2010 every spreadsheet started by default with 3 workbooks. Because of this there are a lot of spreadsheets with 3 workbooks of which 2 4 The number of precedents could be higher than one if the formula is are empty. For this reason we excluded empty worksheets to calculate s2 passing data from merged cells. TABLE II OVERVIEW OF SIZE METRICS Metric Dataset Min Q1 Median Q3 Max s1 Number of non-empty cells per spreadsheet Enron 7 204.0 701.0 3,672.0 889,952.0 EUSES 14 237.0 573.5 1,354.3 113,134.0 s2 Number of worksheets per spreadsheet Enron 1.0 1.0 1.0 4.0 175.0 EUSES 1.0 1.0 1.0 3.0 106.0 s3 Number of formulas per spreadsheet Enron 1.0 16.0 128.0 758.0 175,568.0 EUSES 1.0 19.0 73.0 253.3 26,434.0 s4 Number of unique formulas per spreadsheet Enron 1.0 4.0 14.0 50.0 6,862.0 EUSES 1.0 3.0 10.0 32.0 961,0 s5 Length of formula in characters Enron 1.0 12.0 18.0 29.0 1,111.0 EUSES 1.0 14.0 20.0 35.0 1,129.0 TABLE III OVERVIEW OF COUPLING METRICS Metric Dataset Min Q1 Median Q3 Max c2 Number of external links per spreadsheet Enron 1.0 1.0 2.0 5.0 155.0 for spreadsheets with at least one external link EUSES 1.0 1.0 1.0 2.0 88.0 c4 Number of intraworksheet connections per spreadsheet Enron 1.0 1.0 4.0 13.0 735.0 for spreadsheets with at least one intraworksheet connection EUSES 1.0 1.0 3.0 7.0 121.0 c6 Path depth per formula Enron 0 1.0 1.0 2.0 1,205.0 EUSES 0 1.0 1.0 2.0 273.0 c7 Total number of transitive precedents per formula Enron 0 1.0 3.0 18.0 22,702.0 EUSES 0 1.0 4.0 15.0 6,536.0 TABLE IV have used the classification that was defined by Microsoft5 . T OP 15 OF MOST FREQUENTLY USED FUNCTIONSS TABLE V Enron EUSES R ELATIVE USE OF FUNCTIONS BY CATEGORY 1 SUM SUM 2 IF IF Category Enron EUSES 3 AVERAGE ROUND Operator 71.4% 58.5% 4 VLOOKUP HYPERLINK Math and trigonometry 16.4% 19.2% 5 ROUND CONCATENATE Logical 3.6% 9.9% 6 SUBTOTAL AND Lookup and reference 2.8% 3.6% 7 OFFSET COUNTIF Statistical 2.5% 4.0% 8 CONCATENATE AVERAGE Date and time 1.6% 0.5% 9 NOW OR Text 0.7% 1.8% 10 DAVERAGE INDIRECT Information 0.5% 1.0% 11 SUMIF MIN Database 0.3% 0.1% 12 INDEX ISNUMBER Financial 0.1% 0.4% 13 MATCH MAX Engineering 0.0% 0.0% 14 LOOKUP VLOOKUP Compatibility 0.0% 0.7% 15 MONTH ISBLANK Udf 0.0% 0.4% User defined add-ins 0.0% 0.0% In Table V, we see a list of these categories and the percentage of formulas that contains a function within this category. It is clear that operators are used the most frequently. used for the same kind of problem. For example, VLOOKUP, If we shift our focus to the pure use of functions we see that HLOOKUP, MATCH, and SEARCH all have the purpose the majority of functions belong to the category Math and of looking up data. To better understand what users try to accomplish with functions, it would be useful to analyze the 5 https://support.office.com/en-AU/Article/Excel-functions-by-category- use of functions on the level of these groups (f2). To do so, we 5f91f4e9-7b42-46d2-9bd1-63f26a86c0eb trigonometry. The functions SUM, ROUND and SUBTOTAL majority of formulas, only one function is used. Formulas with are responsible for 95% of this category. We conclude that more than three different functions are rare. Within the Enron functions are used mostly to perform arithmetic calculations. set only 1.5% of the formulas uses more than three different Another common category is Logical. The most frequently functions. For EUSES this percentages is even smaller (0.4%). used functions within this category are: IF, AND, and OR To summarize our findings with respect to the use of (used in 99,8% of the formulas of the Logical category). functions, we can confirm the finding of Hermans and Murphy- Users use these functions not to calculate, but to reason with Hill that users only use a very small set of functions [7]. The formulas, often with the goal to check or prevent for errors. composition of this set differs between the different dataset, The category Lookup and reference also catches the eye. It indicating that the kind of functions are user-, company- indicates the need of users to lookup or refer to data in their or domain-dependent. Functions are mostly used to perform spreadsheets. The functions VLOOKUP, OFFSET, INDEX, arithmetic calculations, add logic to spreadsheets, and lookup LOOKUP, and MATCH, together, are responsible for 81% of or reference data. We also observed that formulas are not that the formulas in this category. complex if you compare them to formulas found in [10]. Functions within the category Compatibility are those that Microsoft has replaced with new functions, but because of V. E NRON VERSUS EUSES compatibility reasons they are still available. Although the We used the Wilcoxon-Mann-Whitney test to determine if Enron dataset stems from 2000 and 2001, we observe that the characteristics of the Enron spreadsheet differ from the only a very small fraction (<0.1%) of formulas use a function EUSES spreadsheets. The test calculates a p-value. The results that in the meantime has been replaced by Microsoft. This can be found in Table VII. means that the majority of functions used in the Enron set are still valid and unchanged functions in the latest version of TABLE VII Excel. E NRON COMPARED TO EUSES Above we discussed the kind of functions that are used and Metric p-value d the purpose of these functions. But what about complexity? How complex are formulas? A high number of references that s1 number of cells <0.01 0.090 s2 number of worksheets <0.01 0.087 are made within the formula to other cells (preceding cells) s3 number of formulas <0.01 0.136 can make a formula difficult to understand. The same is true s4 number of unique formulas <0.01 0.114 for the degree of nestedness of a formula, that we can measure s5 length of formula >0.05 with the parse tree depth of the formula. The results for both c2 number of external links <0.01 0.243 metrics can be found in Table VI. c4 number of intraworksheet connections <0.01 0.142 The number of preceding cells (f3) is almost identical for c6 path depth >0.05 both Enron and EUSES. The differences are in the extremities. c7 number of transitive precedents <0.05 0.006 The high maxima for the number of precedents in both datasets f3 number of preceding cells >0.05 are caused by references to a large range. For example the f4 parse tree depth >0.05 formula for the cell with the maximum number of preceding f5 number of unique functions >0.05 cells in the Enron set is: =VLOOKUP(B51,[50]jan98!$A$34:$IV$84,3,0) The results of the Wilcoxon-Mann-Whitney test indicate that for some metrics (e.g. number of cells) there is a signif- It is a range of 256 columns (which was the maximum icant difference in the distributions of the Enron and EUSES number of columns within a worksheet in Excel 2000) and 51 datasets. For other metrics (e.g. path depth) they are equal. For rows, which makes a total of 13,056 cells. The lookup value in all metrics with a significant difference the effect (calculated B51 gives us the maximum of 13,057 preceding cells. With a with the Cliff’s Delta d) is negligible (d < 0.147), except for median of two preceding cells for all datasets, it is reasonable number of external links, where the effect is small (0.147 < d to assume that the number of preceding cells is not causing < 0.33). So, although there is a statistical significant difference complexity. in the distribution for some metrics between the Enron and In Table VI we observe that only the maximum values for EUSES dataset, the effect of this difference is negligible or the parse tree depth (f4) differ between Enron and EUSES. small. Based on this we can conclude that the spreadsheets in One could also argue that a formula with a parse tree depth of the EUSES corpus are comparable with the Enron spreadsheets two is not really complex. We were intrigued by the formula with respect to the metrics used in this paper. The full dataset with the maximum parse tree depth of 106 in the Enron and R scripts are available online6 . set. One would expect that such a formula is quite complex. However in this specific example the formula consisted of 105 VI. D ISCUSSION hard coded numbers that were added with the + operand. So In the previous section, we have 1) described the results definitely a large formula, but not complex. of an analysis of the spreadsheets in the Enron dataset with Another factor that could cause complexity is the number of different functions that is used in a formula (f5). In the 6 http://figshare.com/articles/Enron versus EUSES/1298994 TABLE VI OVERVIEW OF FORMULA CHARACTERISTICS Metric Dataset Min Q1 Median Q3 Max f3 Number of preceding cells per formula Enron 0.0 1.0 2.0 4.0 13,057.0 EUSES 0.0 1.0 2.0 4.0 3,698.0 f4 Parse tree depth per formula Enron 1.0 2.0 2.0 3.0 106.0 EUSES 1.0 2.0 2.0 3.0 89.0 f5 Number of unique functions per formula Enron 1.0 1.0 1.0 1.0 8.0 EUSES 1.0 1.0 1.0 1.0 8.0 respect to the dimensions size, coupling and use of functions VII. R ELATED W ORK and 2) compared the Enron spreadsheets with the EUSES Most related to our efforts is of course the work of Hermans spreadsheets. In this section, we discuss some issues that could and Murphy-Hill [7] which initially presented the Enron affect the applicability and suitability of the approach used. spreadsheet corpus. The authors performed a preliminary anal- A. User-defined functions ysis of some basic characteristics (like number of worksheets, cells, and formulas) of these spreadsheets. While Hermans Our analysis focused on the Microsof Excel built-in func- and Murphy-Hill present a first overview of the spreadsheets, tions. We did not analyze the use of user-defined functions, we, in this paper, dive deeper and added additional metrics to because the use of it is rather limited. Only in 0.5% of measure the degree of coupling and gain more insight in the the Enron and 1.2% of the EUSES spreadsheets that contain actual use of functions. Also, for every metric we compared formulas with functions, user-defined functions were used. the Enron spreadsheets with the EUSES corpus. By using the Wilcoxon-Mann-Whitney test and calculating the Cliff’s B. Pivot tables, charts and VBA code delta for the different metrics we answer the question of In this first detailed analysis of the Enron spreadsheets, representativeness of the EUSES corpus. we limited ourselves to metrics for size, coupling and use of Secondly, there is the EUSES corpus, which was introduced functions. More elaborate constructs like Pivot tables, charts by Fisher and Rothermel in 2005 [16]. Besides EUSES, there and VBA code could also have an impact on the complexity of are a few other smaller corpora [17], [18]. Unfortunately, none spreadsheets. In future research, we plan to extend the current of these corpora were publicly available to include in the analysis with these constructs. analyses of this paper. In addition to work on spreadsheet corpora, there are papers C. Path analysis on spreadsheet metrics, which have in common with our work In this paper, we looked at the use of formula and functions that they too measure properties of spreadsheets [19], [20]. on the level of the individual cell. However, formulas receive There is an overlap between the metrics defined in these papers input from precedent cells. These precedent cells, in turn, can and the metrics we use in this analysis (e.g. number of cells, also contain formulas. One could argue that all the formulas number of formulas). We have added metrics to obtain a better in the calculation chain form a small program or method. In understanding of how functions are used within spreadsheets. future work, we will analyze these programs to see if there We have also used the metrics to compare two spreadsheet are certain common patterns that can be found in the majority corpora. of spreadsheets. VIII. C ONCLUDING R EMARKS D. Threats to validity The overall goal of this paper is to understand and quantify 1) Cleaned dataset: Before the Enron dataset was made the characteristics of industry grade spreadsheets and secondly publicly available, it has been cleaned. In total over 10,000 to see if these spreadsheets differ from the EUSES corpus e-mail messages with potential sensitive personal information, that was used frequently in previous research. To do so, we like credit card numbers, identity numbers, personal contact formulated three research questions: details, etc., were deleted from the dataset [15]. We do not believe that this cleaning affected the analysis of the A. What are the characteristics, in terms of size and coupling, spreadsheets. The spreadsheets themselves were most likely of spreadsheets in a business environment? not deemed personal by the researchers cleaning the dataset. • The results of the analysis indicate that the size of 2) Age of dataset: The spreadsheets in the Enron dataset the majority of the Enron spreadsheets is small. On are more than ten years old. However, we believe that the average they consists of 701 non-empty cells, contain way users create spreadsheets has not change much over this three worksheets and have about 128 unique formulas. period. This is supported by the fact that in the meanwhile the • For coupling, we see a similar picture. The majority of user interface of Microsoft Excel has not changed significantly. spreadsheets (89%) is not linked to other spreadsheets and also within the spreadsheet only 20% of them have In this paper we focused on formulas at the cell level. The links between worksheets within the spreadsheet. On the formula in a spreadsheet could be considered as source code. cell level, we see the same. The median path depth However, by looking at a single cell you are actual looking at a (calculation chain) is one and a cell with a formula has single line of source code. In future research, we are planning a median of three transitive precedent cells. to extract and combine all formulas in the same calculation • Based on the results of the analysis, we conclude that chain to obtain a better and completer understanding of the in a business environment large spreadsheets with a high use of functions within spreadsheets. This will allow us to degree of coupling do exist, but are not very common. search for commonly used patterns. B. How are spreadsheet functions used and combined? R EFERENCES • With respect to the use of functions, we see that users use [1] R. R. Panko and N. Ordway, “Sarbanes-oxley: What about all the spreadsheets?” CoRR, vol. abs/0804.0797, 2008. [Online]. Available: only a small subset of the available built-in functions. http://arxiv.org/abs/0804.0797 Surprising is that this subset of functions is not the [2] F. Hermans, M. Pinzger, and A. van Deursen, “Supporting professional same between the different datasets. This could be an spreadsheet users by generating leveled dataflow diagrams,” in Proceed- ings of the 33rd International Conference on Software Engineering. indication that the kind of functions that are used in a ACM, 2011, pp. 451–460. spreadsheet are depending on the individual user or the [3] R. R. Panko, “What we know about spreadsheet errors,” Journal of specific business domain. Organizational and End User Computing (JOEUC), vol. 10, no. 2, pp. 15–21, 1998. • If we look at the level of function categories, we notice [4] R. Abraham and M. Erwig, “Goaldebug: A spreadsheet debugger for end that mainly the general purpose categories like Opera- users,” in Proceedings of the 29th international conference on Software tors, Math and trigonometry, Lookup and reference and Engineering. IEEE Computer Society, 2007, pp. 251–260. [5] F. Hermans, M. Pinzger, and A. v. Deursen, “Detecting and visualizing Logical are used. Users tend to use functions to either inter-worksheet smells in spreadsheets,” in Proceedings of the 2012 calculate something, look up something or a combination International Conference on Software Engineering. IEEE Press, 2012, of the two. Functions from the Logical category are used pp. 441–451. [6] W. Dou, S.-C. Cheung, and J. Wei, “Is spreadsheet ambiguity harmful? to reason with formulas, probably with the goal to check detecting and repairing spreadsheet smells due to ambiguous computa- or prevent for errors. tion,” in Proceedings of the 36th International Conference on Software • Regarding the complexity, the results show us that the Engineering. ACM, 2014, pp. 848–858. [7] F. Hermans and E. Murphy-Hill, “Enron’s spreadsheets and related majority of formulas are simple. They only make a direct emails: A dataset and analysis,” in Proceedings of the 37th International reference to a few other cells (median: two), are hardly Conference on Software Engineering, 2015, to appear. nested (median for the parse tree depth: two) and most [8] B. Klimt and Y. Yang, “Introducing the Enron corpus.” in CEAS, 2004. [9] M. Fowler, Refactoring : improving the design of existing code. Read- of the times only contain one function. ing, MA: Addison-Wesley, 1999. [10] F. Hermans, M. Pinzger, and A. van Deursen, “Detecting and refactoring C. Is there a difference with respect to these metrics between code smells in spreadsheet formulas,” Empirical Software Engineering, spreadsheets that are used in a business context and the pp. 1–27, 2014. [11] B. Jansen and F. Hermans, “Using a visual language to create better EUSES corpus? spreadsheets,” Software Engineering Methods in Spreadsheets, 2014. • With respect to size, coupling and use of functions [12] F. Hermans, “Analyzing and visualizing spreadsheets,” Ph.D. disserta- tion, PhD thesis, Software Engineering Research Group, Delft University the Enron spreadsheets are quite similar to the EUSES of Technology, Netherlands, 2012. corpus. Although we find for most metrics a statistical [13] J. Sajaniemi, “Modeling spreadsheet audit: A rigorous approach to significant difference, the effect size is either negligible automatic visualization,” Journal of Visual Languages & Computing, vol. 11, no. 1, pp. 49–82, 2000. (6 metrics) or small (1 metric). Based on this we can [14] J. E. Montandon, H. Borges, D. Felix, and M. T. Valente, “Documenting conclude that for these characteristics the EUSES corpus APIs with examples: Lessons learned with the apiminer platform,” is representative for spreadsheets that are used in a in Reverse Engineering (WCRE), 2013 20th Working Conference on. IEEE, 2013, pp. 401–408. business context. [15] A. Cassidy and M. Westwood-Hill. Removing pii from the EDRM Enron data set: Investigating the prevalence of unsecured financial, IX. F UTURE WORK health and personally identifiable information in corporate data. [Online]. Available: http://www.nuix.com/images/resources/case study This paper gives rise to several directions for future work. nuix edrm enron data set.pdf The results of our analysis show that users in general create [16] M. Fisher and G. Rothermel, “The EUSES spreadsheet corpus: a shared small spreadsheets, with a relatively low degree of coupling resource for supporting experimentation with spreadsheet dependability mechanisms,” ACM SIGSOFT Software Engineering Notes, vol. 30, and that they use simple formulas. This makes it less likely no. 4, pp. 1–5, 2005. that the complexity of a spreadsheet is causing the high error [17] R. R. Panko, “Two corpuses of spreadsheet errors,” in System Sciences, rate of spreadsheets. It still leaves us with the question, what 2000. Proceedings of the 33rd Annual Hawaii International Conference on. IEEE, 2000, pp. 8–pp. is causing errors in spreadsheets? To answer this question, one [18] S. G. Powell, K. R. Baker, and B. Lawson, “A critical review of the of the directions for future research could be to focus on the literature on spreadsheet errors,” Decision Support Systems, vol. 46, user interface of the spreadsheet software. In previous work no. 1, pp. 128–138, 2008. [19] A. Bregar, “Complexity metrics for spreadsheet models,” in Proc. of [11], we introduced the concept of a visual language to model EuSpRIG ’04, 2004, p. 9. spreadsheets and in future research we are planning to develop [20] K. Hodnigg and R. Mittermeir, “Metrics-based spreadsheet visualization: and validate such a language. Support for focused maintenance,” in Proc. of EuSpRIG ’08, 2008, p. 16.