=Paper=
{{Paper
|id=Vol-1355/paper11
|storemode=property
|title=Enron versus EUSES: A Comparison of Two Spreadsheet Corpora
|pdfUrl=https://ceur-ws.org/Vol-1355/paper15.pdf
|volume=Vol-1355
|dblpUrl=https://dblp.org/rec/conf/icse/Jansen15
}}
==Enron versus EUSES: A Comparison of Two Spreadsheet Corpora==
Enron versus EUSES:
A Comparison of Two Spreadsheet Corpora
Bas Jansen
Delft University of Technology
Email: b.jansen@tudelft.nl
Abstract—Spreadsheets are widely used within companies and Previously, attempts have been made to compose a dataset
often form the basis for business decisions. Numerous cases of spreadsheets that are used within companies, but it is very
are known where incorrect information in spreadsheets lead to difficult to convince companies to share their spreadsheets for
incorrect decisions. Such cases underline the relevance of research
on the professional use of spreadsheets. research purposes. And if they are willing to do so, it is even
Recently a new dataset became available for research, contain- more difficult to get permission to make these spreadsheets
ing over 15.000 business spreadsheets that were extracted from available for the research community.
the Enron E-mail Archive. With this dataset, we 1) aim to obtain Recently Hermans and Murphy-Hill introduced a new set of
a thorough understanding of the characteristics of spreadsheets spreadsheets [7], consisting of more than 15,000 spreadsheets
used within companies, and 2) compare the characteristics of the
that were extracted from the already existing Enron Email
Enron spreadsheets with the EUSES corpus which is the existing
state of the art set of spreadsheets that is frequently used in archive that was made public during the legal investigation
spreadsheet studies. following the Enron bankruptcy [8]. The size of this dataset
Our analysis shows that 1) the majority of spreadsheets are is significantly larger than the EUSES corpus and maybe even
not large in terms of worksheets and formulas, do not have a high more important: we know that these spreadsheets were used
degree of coupling, and their formulas are relatively simple; 2) in industry. This dataset is publicly available2 .
the spreadsheets from the EUSES corpus are, with respect to the
measured characteristics, quite similar to the Enron spreadsheets.
Hermans and Murphy-Hill performed a preliminary analysis
of some basic characteristics (like number of worksheets,
number of cells, etc.) of these spreadsheets. To obtain a more
I. I NTRODUCTION thorough understanding of the characteristics of industry grade
spreadsheets, we extend the analysis with additional metrics on
Spreadsheets are used widely within companies. It was, for the degree of coupling and the use of functions. Furthermore,
example, estimated that 95% of U.S. firms use spreadsheets we compare this dataset extensively with the EUSES corpus
for financial reporting [1]. The information in spreadsheets and, by using statistical tests, answer the question of its
often forms the basis for significant business decisions [2]. representativeness with respect to spreadsheets that are used
However, from previous research we know that spreadsheets by companies.
are error-prone [3]. This poses the risk for companies of taking The contributions of this paper are:
important decisions on inaccurate information. Eventually, this • A detailed analysis of a large dataset of industrial spread-
can lead to incorrect decisions and loss of money. The growing sheets on dimensions of size and coupling
list of Horror Stories of the European Spreadsheet Risk Interest • A detailed analysis of how functions are used in spread-
Group1 illustrates this and stresses the importance of spread- sheets in a business environment
sheet research. It is important to understand why spreadsheet • A comparison of the characteristics of the spreadsheets
errors are easily made and how this can be improved. in the Enron corpus with the EUSES corpus.
One way to address this, is to study a large set of real world II. R ESEARCH Q UESTIONS
spreadsheets. The EUSES corpus, a set of approximately 4,500
The Enron dataset allows us to understand and quantify the
spreadsheets from different domains, is used most frequently
characteristics of industry grade spreadsheets. In this study we
for this purpose. Although the EUSES corpus has proven to
focus on the level of coupling and the size of spreadsheets. The
be very useful and is used in many research projects [4]–
reason for this is that we know from software engineering that
[6], there are also some drawbacks. First of all the majority
source code tends to be more error-prone when the coupling
of these spreadsheets were obtained from the public world-
between the different units of a program is high or the program
wide-web. We do not know if these spreadsheets are similar
consists of large parts [9]. Previous research indicates that this
to spreadsheets that are used within companies. Furthermore
also applies to spreadsheets [10].
the size of the corpus is relatively small. Finally we do not Modern spreadsheet systems contain over 300 different
know anything about the context in which the spreadsheets functions and with these functions users can create very com-
were created. plex formulas. In previous work, we introduced the concept
1 www.eusprigorg/horror-stories.htm 2 www.felienne.com/enron
of a visual language for spreadsheets [11] for which functions
form the building blocks. In this study we therefore analyze,
in addition to the metrics on size and coupling, how functions
are used.
To provide context, we compare the characteristics of
spreadsheets that are used in industry with the spreadsheets
of the EUSES corpus. It will also allow us to evaluate the
representativeness of the EUSES corpus.
Hence we aim to answer the following research questions
in this paper: Fig. 1. Precedents, Transitive Precedents, and Path Depth
• What are the characteristics, in terms of size and cou-
pling, of spreadsheets in a business environment?
• How are spreadsheet functions used? the level of the worksheet. Two identical formulas on
• Is there a difference with respect to these metrics between different worksheets will count as two unique formulas.
spreadsheets that are used in a business context and the We determine the unique formulas by looking at the
EUSES corpus? relative R1C1 notation of the formula. As indicated by
III. A PPROACH Sajaniemi, this notation stays the same even if you copy
it down or right [13].
For this paper, we use two datasets: Enron and EUSES. We
• Length of formula (s5) From software engineering, we
analyzed the spreadsheets in the two datasets with the Spread-
know that a large number of lines of code increase the
sheet Scantool, developed at Delft University of Technology.
chance on errors. In spreadsheets, we could consider the
The tool runs on the previously developed Breviz core, that
individual formulas to be comparable to lines of code. So,
was made for spreadsheet visualization and smell detection
if more lines of source code lead to a higher error rate, it
[12]. The Scantool collects several metrics on spreadsheet,
is reasonable to assume that the longer the formula, the
worksheet and cell level. The metrics that were used for this
more likely it is that it contains an error. Therefore, we
paper can be found in Table I.
measure the length of formulas in characters.
• Path depth (c6), transitive precedents (c7), and prece-
TABLE I
OVERVIEW OF USED METRICS dents (f3) Most formulas receive input from other cells.
These input cells are the so called precedents. These
Dimension Metric precedents themselves could also receive input from other
Size s1 # non-empty cells per spreadsheet cells. If you trace along these precedents until you reach
s2 # worksheets per spreadsheet a cell without one, you have the number of transitive
s3 # formulas per spreadsheet precedents. The path depth is the longest calculation
s4 # unique formulas per spreadsheet chain, see also Figure 1.
s5 length of formula in characters
• Parse tree depth (f4) This is a measure of how nested a
Coupling c1 % spreadsheets linked to other spreadsheets formula is. The formula D9 + E9 has a parse tree dept of
c2 # external links per spreadsheet 2, the formula (B5 - T5) / (B6 * SQRT(4)) a parse tree
c3 % spreadsheets with intraworksheets connections
c4 # intraworksheet connections per spreadsheet depth of 5.
c5 # passing cells as a percentage of total cells To determine if there is any difference in terms of these
c6 path depth per formula metrics between the Enron and EUSES spreadsheets, we have
c7 transitive precedents per formula
calculated the difference between the distributions of the two
Functions f1 usage of built-in functions datasets using a Wilcoxon-Mann-Whitney test and the Cliff’s
f2 usage of functions per category
Delta d effect size. We use this test to analyze whether there
f3 # precedents per formula
f4 parse tree depth is a significant difference between the the distribution of the
f5 # unique functions per formula metrics of the Enron and EUSES spreadsheets. If a significant
difference is found, we use the Cliff’s Delta effect size to
Some of these metrics (marked bold in Table I) deserve measure the magnitude of the difference.
some further explanation.
• Number of unique formulas per spreadsheet (s4) In IV. R ESULTS
spreadsheets, it is very common to define a formula in
one cell and then copy it down or right to other cells. For In this section we present the results of our analysis of
this reason, many of the formula cells in a spreadsheet the Enron and EUSES spreadsheets. First we will discuss the
contain the same formula and only the references to metrics for size and coupling. Next we have a closer look
other cells differ. Therefore, we also measure the number at the use of functions and we conclude this section with a
of unique formulas in the spreadsheet. We do this on comparison between the Enron and the EUSES spreadsheets.
A. Size 3) Cells: On the cell level, we have analyzed how formulas
are linked to another via cell references. A formula receives
Previous research has shown that lengthy source code will
values from other cells (precedents / cell fan in), and the result
increase the error-rate [9]. Therefore we analyse both datasets
of the formula can be used in other formulas (dependents /
on the dimension of size. The results of this analysis can be
cell fan out). To gain a better understanding of the degree
found in Table II. Two measures for the size of a spreadsheet
of coupling on the cell level, we look at the path depth
are the number of non-empty cells (s1) and the number
(longest path of precedents) and the total number of transitive
of worksheets (s2)3 We notice that on average the Enron
precedents, see Table III.
spreadsheets contain more non-empty cells than the EUSES
spreadsheets. The number of worksheets per spreadsheet is Before we can calculate the number of transitive precedents,
almost identical for both datasets. For context consider a study there is one adjustment that we have to make. A pattern that
of Hermans et. al. [10]. They asked financial professionals we witness frequently in spreadsheets is a formula that is just
to select one of their large and complex spreadsheets. These passing data (eg. ‘=A1’). Within the Enron set this is true
spreadsheets contained tens of thousands of non-empty cells for 35% of the formulas and for EUSES this is 22% (c5).
and have on average 10 worksheets. Based on this we can con- Most of these passing cells have a path depth and number of
clude that both number of cells and the number of worksheets precedents of one.4 To get a better insight of the ‘real’ links
in the Enron and EUSES spreadsheets are relatively small. we have excluded these passing formulas in our analysis of
path depth and number of precedents.
Formulas process and manipulate the data in the spread-
sheet. So another way to look at the size of a spreadsheet is The path depth (c6) of Enron is almost identical to EUSES.
to look at 1) the number of formulas per spreadsheet (s3), 2) Both data sets have a median path depth of one, which is
the number of unique formulas per spreadsheet (s4), and 3) small. If we look at the total number of cells that influence
the length of the formulas in characters (s5). Within the Enron a formula (c7), we observe that the median for Enron and
set, formulas are used more frequently than in the EUSES set, EUSES are relatively low and almost the same (respectively
but the latter tend to be slightly longer. three and four). Noteworthy is the maximum of more than
22,000 transitive precedents for a single formula in the Enron
B. Coupling set.
Besides size, a high degree of coupling between the units C. Use of Functions
of a program could also lead to more errors. We therefore
Modern spreadsheet software contains over 300 different
analyze the degree of coupling in spreadsheets. This can be
functions that users can combine to create complex formulas.
done on several levels within a spreadsheet:
Furthermore, the functions form the building blocks for the
• Coupling between spreadsheets spreadsheet model. Research on learning APIs has shown that
• Coupling between worksheets analyzing usage of source code can reveal interesting patterns
• Coupling between cells [14]. Therefore, we have analyzed the use of functions in the
The values of the Enron and EUSES spreadsheets for the Enron dataset and compared it with EUSES.
different measures of coupling are summarized in Table III. From Hermans and Murphy-Hill, we know that there is
1) Spreadsheets: Starting with coupling between spread- little diversity in the functions that are used in the Enron
sheets (c1), we can observe that only a minority of spread- spreadsheets [7]. In this analysis, we observe the same. We
sheets links to another spreadsheet (Enron 11%, EUSES 1%). have selected all unique formulas with at least one function.
For both sets, the majority of these spreadsheets only link to For this metric we focussed on the use of build-in Excel
one or two other spreadsheets. The use of external links (c2) functions (F1). Hence, we excluded operators (+, -, *, /, ˆ)
is slightly higher in the Enron set than the EUSES set. and user-defined functions. In the resulting formulas the top
2) Worksheets: On the worksheet level we observe that, 15 functions cover 69% of the spreadsheets. For EUSES, this
although there are more spreadsheets with one or more in- is almost the same: 70%.
traworksheet connections (c3) than spreadsheets with external Table IV gives an overview of the fifteen most frequently
links, it is still a minority (Enron 20%, EUSES 10%). If we used functions in the two datasets.
look at the number of connections (c4), we find that the median Although it is true that a small set of functions covers the
for Enron and EUSES is respectively four and three, see Table majority of spreadsheets, this set of functions is not the same
III. in each dataset. We see that only six functions (marked in
It is fair to state that the level of coupling on both the level bold) are present in both datasets. This probably indicates that
of spreadsheets and the level of worksheets is low. The vast the set of functions that are used in a spreadsheet depends on
majority of spreadsheets contain no links to other spreadsheets either the company, department, user or business domain.
or other worksheets within the same spreadsheet. In the previous paragraphs we have looked at the individual
functions. However there are several functions that can be
3 Up to Excel 2010 every spreadsheet started by default with 3 workbooks.
Because of this there are a lot of spreadsheets with 3 workbooks of which 2 4 The number of precedents could be higher than one if the formula is
are empty. For this reason we excluded empty worksheets to calculate s2 passing data from merged cells.
TABLE II
OVERVIEW OF SIZE METRICS
Metric Dataset Min Q1 Median Q3 Max
s1 Number of non-empty cells per spreadsheet Enron 7 204.0 701.0 3,672.0 889,952.0
EUSES 14 237.0 573.5 1,354.3 113,134.0
s2 Number of worksheets per spreadsheet Enron 1.0 1.0 1.0 4.0 175.0
EUSES 1.0 1.0 1.0 3.0 106.0
s3 Number of formulas per spreadsheet Enron 1.0 16.0 128.0 758.0 175,568.0
EUSES 1.0 19.0 73.0 253.3 26,434.0
s4 Number of unique formulas per spreadsheet Enron 1.0 4.0 14.0 50.0 6,862.0
EUSES 1.0 3.0 10.0 32.0 961,0
s5 Length of formula in characters Enron 1.0 12.0 18.0 29.0 1,111.0
EUSES 1.0 14.0 20.0 35.0 1,129.0
TABLE III
OVERVIEW OF COUPLING METRICS
Metric Dataset Min Q1 Median Q3 Max
c2 Number of external links per spreadsheet Enron 1.0 1.0 2.0 5.0 155.0
for spreadsheets with at least one external link EUSES 1.0 1.0 1.0 2.0 88.0
c4 Number of intraworksheet connections per spreadsheet Enron 1.0 1.0 4.0 13.0 735.0
for spreadsheets with at least one intraworksheet connection EUSES 1.0 1.0 3.0 7.0 121.0
c6 Path depth per formula Enron 0 1.0 1.0 2.0 1,205.0
EUSES 0 1.0 1.0 2.0 273.0
c7 Total number of transitive precedents per formula Enron 0 1.0 3.0 18.0 22,702.0
EUSES 0 1.0 4.0 15.0 6,536.0
TABLE IV have used the classification that was defined by Microsoft5 .
T OP 15 OF MOST FREQUENTLY USED FUNCTIONSS
TABLE V
Enron EUSES R ELATIVE USE OF FUNCTIONS BY CATEGORY
1 SUM SUM
2 IF IF Category Enron EUSES
3 AVERAGE ROUND
Operator 71.4% 58.5%
4 VLOOKUP HYPERLINK
Math and trigonometry 16.4% 19.2%
5 ROUND CONCATENATE
Logical 3.6% 9.9%
6 SUBTOTAL AND
Lookup and reference 2.8% 3.6%
7 OFFSET COUNTIF
Statistical 2.5% 4.0%
8 CONCATENATE AVERAGE
Date and time 1.6% 0.5%
9 NOW OR
Text 0.7% 1.8%
10 DAVERAGE INDIRECT
Information 0.5% 1.0%
11 SUMIF MIN
Database 0.3% 0.1%
12 INDEX ISNUMBER
Financial 0.1% 0.4%
13 MATCH MAX
Engineering 0.0% 0.0%
14 LOOKUP VLOOKUP
Compatibility 0.0% 0.7%
15 MONTH ISBLANK
Udf 0.0% 0.4%
User defined add-ins 0.0% 0.0%
In Table V, we see a list of these categories and the
percentage of formulas that contains a function within this
category. It is clear that operators are used the most frequently.
used for the same kind of problem. For example, VLOOKUP, If we shift our focus to the pure use of functions we see that
HLOOKUP, MATCH, and SEARCH all have the purpose the majority of functions belong to the category Math and
of looking up data. To better understand what users try to
accomplish with functions, it would be useful to analyze the 5 https://support.office.com/en-AU/Article/Excel-functions-by-category-
use of functions on the level of these groups (f2). To do so, we 5f91f4e9-7b42-46d2-9bd1-63f26a86c0eb
trigonometry. The functions SUM, ROUND and SUBTOTAL majority of formulas, only one function is used. Formulas with
are responsible for 95% of this category. We conclude that more than three different functions are rare. Within the Enron
functions are used mostly to perform arithmetic calculations. set only 1.5% of the formulas uses more than three different
Another common category is Logical. The most frequently functions. For EUSES this percentages is even smaller (0.4%).
used functions within this category are: IF, AND, and OR To summarize our findings with respect to the use of
(used in 99,8% of the formulas of the Logical category). functions, we can confirm the finding of Hermans and Murphy-
Users use these functions not to calculate, but to reason with Hill that users only use a very small set of functions [7]. The
formulas, often with the goal to check or prevent for errors. composition of this set differs between the different dataset,
The category Lookup and reference also catches the eye. It indicating that the kind of functions are user-, company-
indicates the need of users to lookup or refer to data in their or domain-dependent. Functions are mostly used to perform
spreadsheets. The functions VLOOKUP, OFFSET, INDEX, arithmetic calculations, add logic to spreadsheets, and lookup
LOOKUP, and MATCH, together, are responsible for 81% of or reference data. We also observed that formulas are not that
the formulas in this category. complex if you compare them to formulas found in [10].
Functions within the category Compatibility are those that
Microsoft has replaced with new functions, but because of V. E NRON VERSUS EUSES
compatibility reasons they are still available. Although the We used the Wilcoxon-Mann-Whitney test to determine if
Enron dataset stems from 2000 and 2001, we observe that the characteristics of the Enron spreadsheet differ from the
only a very small fraction (<0.1%) of formulas use a function EUSES spreadsheets. The test calculates a p-value. The results
that in the meantime has been replaced by Microsoft. This can be found in Table VII.
means that the majority of functions used in the Enron set
are still valid and unchanged functions in the latest version of TABLE VII
Excel. E NRON COMPARED TO EUSES
Above we discussed the kind of functions that are used and
Metric p-value d
the purpose of these functions. But what about complexity?
How complex are formulas? A high number of references that s1 number of cells <0.01 0.090
s2 number of worksheets <0.01 0.087
are made within the formula to other cells (preceding cells)
s3 number of formulas <0.01 0.136
can make a formula difficult to understand. The same is true s4 number of unique formulas <0.01 0.114
for the degree of nestedness of a formula, that we can measure s5 length of formula >0.05
with the parse tree depth of the formula. The results for both c2 number of external links <0.01 0.243
metrics can be found in Table VI. c4 number of intraworksheet connections <0.01 0.142
The number of preceding cells (f3) is almost identical for c6 path depth >0.05
both Enron and EUSES. The differences are in the extremities. c7 number of transitive precedents <0.05 0.006
The high maxima for the number of precedents in both datasets f3 number of preceding cells >0.05
are caused by references to a large range. For example the f4 parse tree depth >0.05
formula for the cell with the maximum number of preceding f5 number of unique functions >0.05
cells in the Enron set is:
=VLOOKUP(B51,[50]jan98!$A$34:$IV$84,3,0) The results of the Wilcoxon-Mann-Whitney test indicate
that for some metrics (e.g. number of cells) there is a signif-
It is a range of 256 columns (which was the maximum icant difference in the distributions of the Enron and EUSES
number of columns within a worksheet in Excel 2000) and 51 datasets. For other metrics (e.g. path depth) they are equal. For
rows, which makes a total of 13,056 cells. The lookup value in all metrics with a significant difference the effect (calculated
B51 gives us the maximum of 13,057 preceding cells. With a with the Cliff’s Delta d) is negligible (d < 0.147), except for
median of two preceding cells for all datasets, it is reasonable number of external links, where the effect is small (0.147 < d
to assume that the number of preceding cells is not causing < 0.33). So, although there is a statistical significant difference
complexity. in the distribution for some metrics between the Enron and
In Table VI we observe that only the maximum values for EUSES dataset, the effect of this difference is negligible or
the parse tree depth (f4) differ between Enron and EUSES. small. Based on this we can conclude that the spreadsheets in
One could also argue that a formula with a parse tree depth of the EUSES corpus are comparable with the Enron spreadsheets
two is not really complex. We were intrigued by the formula with respect to the metrics used in this paper. The full dataset
with the maximum parse tree depth of 106 in the Enron and R scripts are available online6 .
set. One would expect that such a formula is quite complex.
However in this specific example the formula consisted of 105 VI. D ISCUSSION
hard coded numbers that were added with the + operand. So In the previous section, we have 1) described the results
definitely a large formula, but not complex. of an analysis of the spreadsheets in the Enron dataset with
Another factor that could cause complexity is the number
of different functions that is used in a formula (f5). In the 6 http://figshare.com/articles/Enron versus EUSES/1298994
TABLE VI
OVERVIEW OF FORMULA CHARACTERISTICS
Metric Dataset Min Q1 Median Q3 Max
f3 Number of preceding cells per formula Enron 0.0 1.0 2.0 4.0 13,057.0
EUSES 0.0 1.0 2.0 4.0 3,698.0
f4 Parse tree depth per formula Enron 1.0 2.0 2.0 3.0 106.0
EUSES 1.0 2.0 2.0 3.0 89.0
f5 Number of unique functions per formula Enron 1.0 1.0 1.0 1.0 8.0
EUSES 1.0 1.0 1.0 1.0 8.0
respect to the dimensions size, coupling and use of functions VII. R ELATED W ORK
and 2) compared the Enron spreadsheets with the EUSES Most related to our efforts is of course the work of Hermans
spreadsheets. In this section, we discuss some issues that could and Murphy-Hill [7] which initially presented the Enron
affect the applicability and suitability of the approach used. spreadsheet corpus. The authors performed a preliminary anal-
A. User-defined functions ysis of some basic characteristics (like number of worksheets,
cells, and formulas) of these spreadsheets. While Hermans
Our analysis focused on the Microsof Excel built-in func- and Murphy-Hill present a first overview of the spreadsheets,
tions. We did not analyze the use of user-defined functions, we, in this paper, dive deeper and added additional metrics to
because the use of it is rather limited. Only in 0.5% of measure the degree of coupling and gain more insight in the
the Enron and 1.2% of the EUSES spreadsheets that contain actual use of functions. Also, for every metric we compared
formulas with functions, user-defined functions were used. the Enron spreadsheets with the EUSES corpus. By using
the Wilcoxon-Mann-Whitney test and calculating the Cliff’s
B. Pivot tables, charts and VBA code
delta for the different metrics we answer the question of
In this first detailed analysis of the Enron spreadsheets, representativeness of the EUSES corpus.
we limited ourselves to metrics for size, coupling and use of Secondly, there is the EUSES corpus, which was introduced
functions. More elaborate constructs like Pivot tables, charts by Fisher and Rothermel in 2005 [16]. Besides EUSES, there
and VBA code could also have an impact on the complexity of are a few other smaller corpora [17], [18]. Unfortunately, none
spreadsheets. In future research, we plan to extend the current of these corpora were publicly available to include in the
analysis with these constructs. analyses of this paper.
In addition to work on spreadsheet corpora, there are papers
C. Path analysis
on spreadsheet metrics, which have in common with our work
In this paper, we looked at the use of formula and functions that they too measure properties of spreadsheets [19], [20].
on the level of the individual cell. However, formulas receive There is an overlap between the metrics defined in these papers
input from precedent cells. These precedent cells, in turn, can and the metrics we use in this analysis (e.g. number of cells,
also contain formulas. One could argue that all the formulas number of formulas). We have added metrics to obtain a better
in the calculation chain form a small program or method. In understanding of how functions are used within spreadsheets.
future work, we will analyze these programs to see if there We have also used the metrics to compare two spreadsheet
are certain common patterns that can be found in the majority corpora.
of spreadsheets.
VIII. C ONCLUDING R EMARKS
D. Threats to validity
The overall goal of this paper is to understand and quantify
1) Cleaned dataset: Before the Enron dataset was made the characteristics of industry grade spreadsheets and secondly
publicly available, it has been cleaned. In total over 10,000 to see if these spreadsheets differ from the EUSES corpus
e-mail messages with potential sensitive personal information, that was used frequently in previous research. To do so, we
like credit card numbers, identity numbers, personal contact formulated three research questions:
details, etc., were deleted from the dataset [15]. We do
not believe that this cleaning affected the analysis of the A. What are the characteristics, in terms of size and coupling,
spreadsheets. The spreadsheets themselves were most likely of spreadsheets in a business environment?
not deemed personal by the researchers cleaning the dataset. • The results of the analysis indicate that the size of
2) Age of dataset: The spreadsheets in the Enron dataset the majority of the Enron spreadsheets is small. On
are more than ten years old. However, we believe that the average they consists of 701 non-empty cells, contain
way users create spreadsheets has not change much over this three worksheets and have about 128 unique formulas.
period. This is supported by the fact that in the meanwhile the • For coupling, we see a similar picture. The majority of
user interface of Microsoft Excel has not changed significantly. spreadsheets (89%) is not linked to other spreadsheets
and also within the spreadsheet only 20% of them have In this paper we focused on formulas at the cell level. The
links between worksheets within the spreadsheet. On the formula in a spreadsheet could be considered as source code.
cell level, we see the same. The median path depth However, by looking at a single cell you are actual looking at a
(calculation chain) is one and a cell with a formula has single line of source code. In future research, we are planning
a median of three transitive precedent cells. to extract and combine all formulas in the same calculation
• Based on the results of the analysis, we conclude that chain to obtain a better and completer understanding of the
in a business environment large spreadsheets with a high use of functions within spreadsheets. This will allow us to
degree of coupling do exist, but are not very common. search for commonly used patterns.
B. How are spreadsheet functions used and combined? R EFERENCES
• With respect to the use of functions, we see that users use [1] R. R. Panko and N. Ordway, “Sarbanes-oxley: What about all the
spreadsheets?” CoRR, vol. abs/0804.0797, 2008. [Online]. Available:
only a small subset of the available built-in functions. http://arxiv.org/abs/0804.0797
Surprising is that this subset of functions is not the [2] F. Hermans, M. Pinzger, and A. van Deursen, “Supporting professional
same between the different datasets. This could be an spreadsheet users by generating leveled dataflow diagrams,” in Proceed-
ings of the 33rd International Conference on Software Engineering.
indication that the kind of functions that are used in a ACM, 2011, pp. 451–460.
spreadsheet are depending on the individual user or the [3] R. R. Panko, “What we know about spreadsheet errors,” Journal of
specific business domain. Organizational and End User Computing (JOEUC), vol. 10, no. 2, pp.
15–21, 1998.
• If we look at the level of function categories, we notice [4] R. Abraham and M. Erwig, “Goaldebug: A spreadsheet debugger for end
that mainly the general purpose categories like Opera- users,” in Proceedings of the 29th international conference on Software
tors, Math and trigonometry, Lookup and reference and Engineering. IEEE Computer Society, 2007, pp. 251–260.
[5] F. Hermans, M. Pinzger, and A. v. Deursen, “Detecting and visualizing
Logical are used. Users tend to use functions to either inter-worksheet smells in spreadsheets,” in Proceedings of the 2012
calculate something, look up something or a combination International Conference on Software Engineering. IEEE Press, 2012,
of the two. Functions from the Logical category are used pp. 441–451.
[6] W. Dou, S.-C. Cheung, and J. Wei, “Is spreadsheet ambiguity harmful?
to reason with formulas, probably with the goal to check detecting and repairing spreadsheet smells due to ambiguous computa-
or prevent for errors. tion,” in Proceedings of the 36th International Conference on Software
• Regarding the complexity, the results show us that the Engineering. ACM, 2014, pp. 848–858.
[7] F. Hermans and E. Murphy-Hill, “Enron’s spreadsheets and related
majority of formulas are simple. They only make a direct emails: A dataset and analysis,” in Proceedings of the 37th International
reference to a few other cells (median: two), are hardly Conference on Software Engineering, 2015, to appear.
nested (median for the parse tree depth: two) and most [8] B. Klimt and Y. Yang, “Introducing the Enron corpus.” in CEAS, 2004.
[9] M. Fowler, Refactoring : improving the design of existing code. Read-
of the times only contain one function. ing, MA: Addison-Wesley, 1999.
[10] F. Hermans, M. Pinzger, and A. van Deursen, “Detecting and refactoring
C. Is there a difference with respect to these metrics between code smells in spreadsheet formulas,” Empirical Software Engineering,
spreadsheets that are used in a business context and the pp. 1–27, 2014.
[11] B. Jansen and F. Hermans, “Using a visual language to create better
EUSES corpus? spreadsheets,” Software Engineering Methods in Spreadsheets, 2014.
• With respect to size, coupling and use of functions [12] F. Hermans, “Analyzing and visualizing spreadsheets,” Ph.D. disserta-
tion, PhD thesis, Software Engineering Research Group, Delft University
the Enron spreadsheets are quite similar to the EUSES of Technology, Netherlands, 2012.
corpus. Although we find for most metrics a statistical [13] J. Sajaniemi, “Modeling spreadsheet audit: A rigorous approach to
significant difference, the effect size is either negligible automatic visualization,” Journal of Visual Languages & Computing,
vol. 11, no. 1, pp. 49–82, 2000.
(6 metrics) or small (1 metric). Based on this we can [14] J. E. Montandon, H. Borges, D. Felix, and M. T. Valente, “Documenting
conclude that for these characteristics the EUSES corpus APIs with examples: Lessons learned with the apiminer platform,”
is representative for spreadsheets that are used in a in Reverse Engineering (WCRE), 2013 20th Working Conference on.
IEEE, 2013, pp. 401–408.
business context. [15] A. Cassidy and M. Westwood-Hill. Removing pii from the EDRM
Enron data set: Investigating the prevalence of unsecured financial,
IX. F UTURE WORK health and personally identifiable information in corporate data.
[Online]. Available: http://www.nuix.com/images/resources/case study
This paper gives rise to several directions for future work. nuix edrm enron data set.pdf
The results of our analysis show that users in general create [16] M. Fisher and G. Rothermel, “The EUSES spreadsheet corpus: a shared
small spreadsheets, with a relatively low degree of coupling resource for supporting experimentation with spreadsheet dependability
mechanisms,” ACM SIGSOFT Software Engineering Notes, vol. 30,
and that they use simple formulas. This makes it less likely no. 4, pp. 1–5, 2005.
that the complexity of a spreadsheet is causing the high error [17] R. R. Panko, “Two corpuses of spreadsheet errors,” in System Sciences,
rate of spreadsheets. It still leaves us with the question, what 2000. Proceedings of the 33rd Annual Hawaii International Conference
on. IEEE, 2000, pp. 8–pp.
is causing errors in spreadsheets? To answer this question, one [18] S. G. Powell, K. R. Baker, and B. Lawson, “A critical review of the
of the directions for future research could be to focus on the literature on spreadsheet errors,” Decision Support Systems, vol. 46,
user interface of the spreadsheet software. In previous work no. 1, pp. 128–138, 2008.
[19] A. Bregar, “Complexity metrics for spreadsheet models,” in Proc. of
[11], we introduced the concept of a visual language to model EuSpRIG ’04, 2004, p. 9.
spreadsheets and in future research we are planning to develop [20] K. Hodnigg and R. Mittermeir, “Metrics-based spreadsheet visualization:
and validate such a language. Support for focused maintenance,” in Proc. of EuSpRIG ’08, 2008, p. 16.