Enron versus EUSES:
            A Comparison of Two Spreadsheet Corpora
                                                             Bas Jansen
                                                   Delft University of Technology
                                                     Email: b.jansen@tudelft.nl


   Abstract—Spreadsheets are widely used within companies and            Previously, attempts have been made to compose a dataset
often form the basis for business decisions. Numerous cases           of spreadsheets that are used within companies, but it is very
are known where incorrect information in spreadsheets lead to         difficult to convince companies to share their spreadsheets for
incorrect decisions. Such cases underline the relevance of research
on the professional use of spreadsheets.                              research purposes. And if they are willing to do so, it is even
   Recently a new dataset became available for research, contain-     more difficult to get permission to make these spreadsheets
ing over 15.000 business spreadsheets that were extracted from        available for the research community.
the Enron E-mail Archive. With this dataset, we 1) aim to obtain         Recently Hermans and Murphy-Hill introduced a new set of
a thorough understanding of the characteristics of spreadsheets       spreadsheets [7], consisting of more than 15,000 spreadsheets
used within companies, and 2) compare the characteristics of the
                                                                      that were extracted from the already existing Enron Email
Enron spreadsheets with the EUSES corpus which is the existing
state of the art set of spreadsheets that is frequently used in       archive that was made public during the legal investigation
spreadsheet studies.                                                  following the Enron bankruptcy [8]. The size of this dataset
   Our analysis shows that 1) the majority of spreadsheets are        is significantly larger than the EUSES corpus and maybe even
not large in terms of worksheets and formulas, do not have a high     more important: we know that these spreadsheets were used
degree of coupling, and their formulas are relatively simple; 2)      in industry. This dataset is publicly available2 .
the spreadsheets from the EUSES corpus are, with respect to the
measured characteristics, quite similar to the Enron spreadsheets.
                                                                         Hermans and Murphy-Hill performed a preliminary analysis
                                                                      of some basic characteristics (like number of worksheets,
                                                                      number of cells, etc.) of these spreadsheets. To obtain a more
                         I. I NTRODUCTION                             thorough understanding of the characteristics of industry grade
                                                                      spreadsheets, we extend the analysis with additional metrics on
   Spreadsheets are used widely within companies. It was, for         the degree of coupling and the use of functions. Furthermore,
example, estimated that 95% of U.S. firms use spreadsheets            we compare this dataset extensively with the EUSES corpus
for financial reporting [1]. The information in spreadsheets          and, by using statistical tests, answer the question of its
often forms the basis for significant business decisions [2].         representativeness with respect to spreadsheets that are used
However, from previous research we know that spreadsheets             by companies.
are error-prone [3]. This poses the risk for companies of taking         The contributions of this paper are:
important decisions on inaccurate information. Eventually, this          • A detailed analysis of a large dataset of industrial spread-
can lead to incorrect decisions and loss of money. The growing              sheets on dimensions of size and coupling
list of Horror Stories of the European Spreadsheet Risk Interest         • A detailed analysis of how functions are used in spread-
Group1 illustrates this and stresses the importance of spread-              sheets in a business environment
sheet research. It is important to understand why spreadsheet            • A comparison of the characteristics of the spreadsheets
errors are easily made and how this can be improved.                        in the Enron corpus with the EUSES corpus.
   One way to address this, is to study a large set of real world                       II. R ESEARCH Q UESTIONS
spreadsheets. The EUSES corpus, a set of approximately 4,500
                                                                         The Enron dataset allows us to understand and quantify the
spreadsheets from different domains, is used most frequently
                                                                      characteristics of industry grade spreadsheets. In this study we
for this purpose. Although the EUSES corpus has proven to
                                                                      focus on the level of coupling and the size of spreadsheets. The
be very useful and is used in many research projects [4]–
                                                                      reason for this is that we know from software engineering that
[6], there are also some drawbacks. First of all the majority
                                                                      source code tends to be more error-prone when the coupling
of these spreadsheets were obtained from the public world-
                                                                      between the different units of a program is high or the program
wide-web. We do not know if these spreadsheets are similar
                                                                      consists of large parts [9]. Previous research indicates that this
to spreadsheets that are used within companies. Furthermore
                                                                      also applies to spreadsheets [10].
the size of the corpus is relatively small. Finally we do not            Modern spreadsheet systems contain over 300 different
know anything about the context in which the spreadsheets             functions and with these functions users can create very com-
were created.                                                         plex formulas. In previous work, we introduced the concept
  1 www.eusprigorg/horror-stories.htm                                   2 www.felienne.com/enron
of a visual language for spreadsheets [11] for which functions
form the building blocks. In this study we therefore analyze,
in addition to the metrics on size and coupling, how functions
are used.
   To provide context, we compare the characteristics of
spreadsheets that are used in industry with the spreadsheets
of the EUSES corpus. It will also allow us to evaluate the
representativeness of the EUSES corpus.
   Hence we aim to answer the following research questions
in this paper:                                                            Fig. 1. Precedents, Transitive Precedents, and Path Depth
   • What are the characteristics, in terms of size and cou-
     pling, of spreadsheets in a business environment?
   • How are spreadsheet functions used?                               the level of the worksheet. Two identical formulas on
   • Is there a difference with respect to these metrics between       different worksheets will count as two unique formulas.
     spreadsheets that are used in a business context and the          We determine the unique formulas by looking at the
     EUSES corpus?                                                     relative R1C1 notation of the formula. As indicated by
                       III. A PPROACH                                  Sajaniemi, this notation stays the same even if you copy
                                                                       it down or right [13].
  For this paper, we use two datasets: Enron and EUSES. We
                                                                     • Length of formula (s5) From software engineering, we
analyzed the spreadsheets in the two datasets with the Spread-
                                                                       know that a large number of lines of code increase the
sheet Scantool, developed at Delft University of Technology.
                                                                       chance on errors. In spreadsheets, we could consider the
The tool runs on the previously developed Breviz core, that
                                                                       individual formulas to be comparable to lines of code. So,
was made for spreadsheet visualization and smell detection
                                                                       if more lines of source code lead to a higher error rate, it
[12]. The Scantool collects several metrics on spreadsheet,
                                                                       is reasonable to assume that the longer the formula, the
worksheet and cell level. The metrics that were used for this
                                                                       more likely it is that it contains an error. Therefore, we
paper can be found in Table I.
                                                                       measure the length of formulas in characters.
                                                                     • Path depth (c6), transitive precedents (c7), and prece-
                           TABLE I
                   OVERVIEW OF USED METRICS                            dents (f3) Most formulas receive input from other cells.
                                                                       These input cells are the so called precedents. These
 Dimension    Metric                                                   precedents themselves could also receive input from other
 Size         s1 # non-empty cells per spreadsheet                     cells. If you trace along these precedents until you reach
              s2 # worksheets per spreadsheet                          a cell without one, you have the number of transitive
              s3 # formulas per spreadsheet                            precedents. The path depth is the longest calculation
              s4 # unique formulas per spreadsheet                     chain, see also Figure 1.
              s5 length of formula in characters
                                                                     • Parse tree depth (f4) This is a measure of how nested a
 Coupling     c1 % spreadsheets linked to other spreadsheets           formula is. The formula D9 + E9 has a parse tree dept of
              c2 # external links per spreadsheet                      2, the formula (B5 - T5) / (B6 * SQRT(4)) a parse tree
              c3 % spreadsheets with intraworksheets connections
              c4 # intraworksheet connections per spreadsheet          depth of 5.
              c5 # passing cells as a percentage of total cells       To determine if there is any difference in terms of these
              c6 path depth per formula                            metrics between the Enron and EUSES spreadsheets, we have
              c7 transitive precedents per formula
                                                                   calculated the difference between the distributions of the two
 Functions    f1 usage of built-in functions                       datasets using a Wilcoxon-Mann-Whitney test and the Cliff’s
              f2 usage of functions per category
                                                                   Delta d effect size. We use this test to analyze whether there
              f3 # precedents per formula
              f4 parse tree depth                                  is a significant difference between the the distribution of the
              f5 # unique functions per formula                    metrics of the Enron and EUSES spreadsheets. If a significant
                                                                   difference is found, we use the Cliff’s Delta effect size to
  Some of these metrics (marked bold in Table I) deserve           measure the magnitude of the difference.
some further explanation.
  • Number of unique formulas per spreadsheet (s4) In                                         IV. R ESULTS
    spreadsheets, it is very common to define a formula in
    one cell and then copy it down or right to other cells. For       In this section we present the results of our analysis of
    this reason, many of the formula cells in a spreadsheet        the Enron and EUSES spreadsheets. First we will discuss the
    contain the same formula and only the references to            metrics for size and coupling. Next we have a closer look
    other cells differ. Therefore, we also measure the number      at the use of functions and we conclude this section with a
    of unique formulas in the spreadsheet. We do this on           comparison between the Enron and the EUSES spreadsheets.
A. Size                                                                           3) Cells: On the cell level, we have analyzed how formulas
                                                                               are linked to another via cell references. A formula receives
   Previous research has shown that lengthy source code will
                                                                               values from other cells (precedents / cell fan in), and the result
increase the error-rate [9]. Therefore we analyse both datasets
                                                                               of the formula can be used in other formulas (dependents /
on the dimension of size. The results of this analysis can be
                                                                               cell fan out). To gain a better understanding of the degree
found in Table II. Two measures for the size of a spreadsheet
                                                                               of coupling on the cell level, we look at the path depth
are the number of non-empty cells (s1) and the number
                                                                               (longest path of precedents) and the total number of transitive
of worksheets (s2)3 We notice that on average the Enron
                                                                               precedents, see Table III.
spreadsheets contain more non-empty cells than the EUSES
spreadsheets. The number of worksheets per spreadsheet is                         Before we can calculate the number of transitive precedents,
almost identical for both datasets. For context consider a study               there is one adjustment that we have to make. A pattern that
of Hermans et. al. [10]. They asked financial professionals                    we witness frequently in spreadsheets is a formula that is just
to select one of their large and complex spreadsheets. These                   passing data (eg. ‘=A1’). Within the Enron set this is true
spreadsheets contained tens of thousands of non-empty cells                    for 35% of the formulas and for EUSES this is 22% (c5).
and have on average 10 worksheets. Based on this we can con-                   Most of these passing cells have a path depth and number of
clude that both number of cells and the number of worksheets                   precedents of one.4 To get a better insight of the ‘real’ links
in the Enron and EUSES spreadsheets are relatively small.                      we have excluded these passing formulas in our analysis of
                                                                               path depth and number of precedents.
   Formulas process and manipulate the data in the spread-
sheet. So another way to look at the size of a spreadsheet is                     The path depth (c6) of Enron is almost identical to EUSES.
to look at 1) the number of formulas per spreadsheet (s3), 2)                  Both data sets have a median path depth of one, which is
the number of unique formulas per spreadsheet (s4), and 3)                     small. If we look at the total number of cells that influence
the length of the formulas in characters (s5). Within the Enron                a formula (c7), we observe that the median for Enron and
set, formulas are used more frequently than in the EUSES set,                  EUSES are relatively low and almost the same (respectively
but the latter tend to be slightly longer.                                     three and four). Noteworthy is the maximum of more than
                                                                               22,000 transitive precedents for a single formula in the Enron
B. Coupling                                                                    set.

  Besides size, a high degree of coupling between the units                    C. Use of Functions
of a program could also lead to more errors. We therefore
                                                                                   Modern spreadsheet software contains over 300 different
analyze the degree of coupling in spreadsheets. This can be
                                                                               functions that users can combine to create complex formulas.
done on several levels within a spreadsheet:
                                                                               Furthermore, the functions form the building blocks for the
   • Coupling between spreadsheets                                             spreadsheet model. Research on learning APIs has shown that
   • Coupling between worksheets                                               analyzing usage of source code can reveal interesting patterns
   • Coupling between cells                                                    [14]. Therefore, we have analyzed the use of functions in the
   The values of the Enron and EUSES spreadsheets for the                      Enron dataset and compared it with EUSES.
different measures of coupling are summarized in Table III.                        From Hermans and Murphy-Hill, we know that there is
   1) Spreadsheets: Starting with coupling between spread-                     little diversity in the functions that are used in the Enron
sheets (c1), we can observe that only a minority of spread-                    spreadsheets [7]. In this analysis, we observe the same. We
sheets links to another spreadsheet (Enron 11%, EUSES 1%).                     have selected all unique formulas with at least one function.
For both sets, the majority of these spreadsheets only link to                 For this metric we focussed on the use of build-in Excel
one or two other spreadsheets. The use of external links (c2)                  functions (F1). Hence, we excluded operators (+, -, *, /, ˆ)
is slightly higher in the Enron set than the EUSES set.                        and user-defined functions. In the resulting formulas the top
   2) Worksheets: On the worksheet level we observe that,                      15 functions cover 69% of the spreadsheets. For EUSES, this
although there are more spreadsheets with one or more in-                      is almost the same: 70%.
traworksheet connections (c3) than spreadsheets with external                      Table IV gives an overview of the fifteen most frequently
links, it is still a minority (Enron 20%, EUSES 10%). If we                    used functions in the two datasets.
look at the number of connections (c4), we find that the median                    Although it is true that a small set of functions covers the
for Enron and EUSES is respectively four and three, see Table                  majority of spreadsheets, this set of functions is not the same
III.                                                                           in each dataset. We see that only six functions (marked in
   It is fair to state that the level of coupling on both the level            bold) are present in both datasets. This probably indicates that
of spreadsheets and the level of worksheets is low. The vast                   the set of functions that are used in a spreadsheet depends on
majority of spreadsheets contain no links to other spreadsheets                either the company, department, user or business domain.
or other worksheets within the same spreadsheet.                                   In the previous paragraphs we have looked at the individual
                                                                               functions. However there are several functions that can be
   3 Up to Excel 2010 every spreadsheet started by default with 3 workbooks.
Because of this there are a lot of spreadsheets with 3 workbooks of which 2      4 The number of precedents could be higher than one if the formula is
are empty. For this reason we excluded empty worksheets to calculate s2        passing data from merged cells.
                                                                            TABLE II
                                                                     OVERVIEW OF SIZE METRICS

                 Metric                                                       Dataset    Min        Q1       Median             Q3            Max
                 s1 Number of non-empty cells per spreadsheet                 Enron         7     204.0       701.0       3,672.0       889,952.0
                                                                              EUSES        14     237.0       573.5       1,354.3       113,134.0
                 s2 Number of worksheets per spreadsheet                      Enron       1.0       1.0             1.0         4.0          175.0
                                                                              EUSES       1.0       1.0             1.0         3.0          106.0
                 s3 Number of formulas per spreadsheet                        Enron       1.0      16.0       128.0        758.0        175,568.0
                                                                              EUSES       1.0      19.0        73.0        253.3         26,434.0
                 s4 Number of unique formulas per spreadsheet                 Enron       1.0       4.0            14.0     50.0            6,862.0
                                                                              EUSES       1.0       3.0            10.0     32.0              961,0
                 s5 Length of formula in characters                           Enron       1.0      12.0            18.0     29.0            1,111.0
                                                                              EUSES       1.0      14.0            20.0     35.0            1,129.0


                                                                           TABLE III
                                                                  OVERVIEW OF COUPLING METRICS

               Metric                                                                   Dataset    Min       Q1      Median           Q3        Max
               c2 Number of external links per spreadsheet                              Enron       1.0      1.0          2.0         5.0      155.0
               for spreadsheets with at least one external link                         EUSES       1.0      1.0          1.0         2.0       88.0
               c4 Number of intraworksheet connections per spreadsheet                  Enron       1.0      1.0          4.0     13.0         735.0
               for spreadsheets with at least one intraworksheet connection             EUSES       1.0      1.0          3.0      7.0         121.0
               c6 Path depth per formula                                                Enron         0      1.0          1.0         2.0     1,205.0
                                                                                        EUSES         0      1.0          1.0         2.0       273.0
               c7 Total number of transitive precedents per formula                     Enron         0      1.0          3.0     18.0      22,702.0
                                                                                        EUSES         0      1.0          4.0     15.0       6,536.0


                            TABLE IV                                               have used the classification that was defined by Microsoft5 .
          T OP 15 OF MOST FREQUENTLY USED FUNCTIONSS

                                                                                                                  TABLE V
                   Enron                       EUSES                                              R ELATIVE USE OF FUNCTIONS BY CATEGORY
          1        SUM                         SUM
          2        IF                          IF                                                 Category                        Enron        EUSES
          3        AVERAGE                     ROUND
                                                                                                  Operator                        71.4%         58.5%
          4        VLOOKUP                     HYPERLINK
                                                                                                  Math and trigonometry           16.4%         19.2%
          5        ROUND                       CONCATENATE
                                                                                                  Logical                          3.6%          9.9%
          6        SUBTOTAL                    AND
                                                                                                  Lookup and reference             2.8%          3.6%
          7        OFFSET                      COUNTIF
                                                                                                  Statistical                      2.5%          4.0%
          8        CONCATENATE                 AVERAGE
                                                                                                  Date and time                    1.6%          0.5%
          9        NOW                         OR
                                                                                                  Text                             0.7%          1.8%
          10       DAVERAGE                    INDIRECT
                                                                                                  Information                      0.5%          1.0%
          11       SUMIF                       MIN
                                                                                                  Database                         0.3%          0.1%
          12       INDEX                       ISNUMBER
                                                                                                  Financial                        0.1%          0.4%
          13       MATCH                       MAX
                                                                                                  Engineering                      0.0%          0.0%
          14       LOOKUP                      VLOOKUP
                                                                                                  Compatibility                    0.0%          0.7%
          15       MONTH                       ISBLANK
                                                                                                  Udf                              0.0%          0.4%
                                                                                                  User defined add-ins             0.0%          0.0%


                                                                                      In Table V, we see a list of these categories and the
                                                                                   percentage of formulas that contains a function within this
                                                                                   category. It is clear that operators are used the most frequently.
used for the same kind of problem. For example, VLOOKUP,                           If we shift our focus to the pure use of functions we see that
HLOOKUP, MATCH, and SEARCH all have the purpose                                    the majority of functions belong to the category Math and
of looking up data. To better understand what users try to
accomplish with functions, it would be useful to analyze the                         5 https://support.office.com/en-AU/Article/Excel-functions-by-category-
use of functions on the level of these groups (f2). To do so, we                   5f91f4e9-7b42-46d2-9bd1-63f26a86c0eb
trigonometry. The functions SUM, ROUND and SUBTOTAL                majority of formulas, only one function is used. Formulas with
are responsible for 95% of this category. We conclude that         more than three different functions are rare. Within the Enron
functions are used mostly to perform arithmetic calculations.      set only 1.5% of the formulas uses more than three different
Another common category is Logical. The most frequently            functions. For EUSES this percentages is even smaller (0.4%).
used functions within this category are: IF, AND, and OR              To summarize our findings with respect to the use of
(used in 99,8% of the formulas of the Logical category).           functions, we can confirm the finding of Hermans and Murphy-
Users use these functions not to calculate, but to reason with     Hill that users only use a very small set of functions [7]. The
formulas, often with the goal to check or prevent for errors.      composition of this set differs between the different dataset,
   The category Lookup and reference also catches the eye. It      indicating that the kind of functions are user-, company-
indicates the need of users to lookup or refer to data in their    or domain-dependent. Functions are mostly used to perform
spreadsheets. The functions VLOOKUP, OFFSET, INDEX,                arithmetic calculations, add logic to spreadsheets, and lookup
LOOKUP, and MATCH, together, are responsible for 81% of            or reference data. We also observed that formulas are not that
the formulas in this category.                                     complex if you compare them to formulas found in [10].
   Functions within the category Compatibility are those that
Microsoft has replaced with new functions, but because of                             V. E NRON VERSUS EUSES
compatibility reasons they are still available. Although the          We used the Wilcoxon-Mann-Whitney test to determine if
Enron dataset stems from 2000 and 2001, we observe that            the characteristics of the Enron spreadsheet differ from the
only a very small fraction (<0.1%) of formulas use a function      EUSES spreadsheets. The test calculates a p-value. The results
that in the meantime has been replaced by Microsoft. This          can be found in Table VII.
means that the majority of functions used in the Enron set
are still valid and unchanged functions in the latest version of                                TABLE VII
Excel.                                                                                  E NRON COMPARED TO EUSES
   Above we discussed the kind of functions that are used and
                                                                      Metric                                           p-value      d
the purpose of these functions. But what about complexity?
How complex are formulas? A high number of references that            s1 number of cells                                <0.01    0.090
                                                                      s2 number of worksheets                           <0.01    0.087
are made within the formula to other cells (preceding cells)
                                                                      s3 number of formulas                             <0.01    0.136
can make a formula difficult to understand. The same is true          s4 number of unique formulas                      <0.01    0.114
for the degree of nestedness of a formula, that we can measure        s5 length of formula                              >0.05
with the parse tree depth of the formula. The results for both        c2 number of external links                       <0.01    0.243
metrics can be found in Table VI.                                     c4 number of intraworksheet connections           <0.01    0.142
   The number of preceding cells (f3) is almost identical for         c6 path depth                                     >0.05
both Enron and EUSES. The differences are in the extremities.         c7 number of transitive precedents                <0.05    0.006
The high maxima for the number of precedents in both datasets         f3 number of preceding cells                      >0.05
are caused by references to a large range. For example the            f4 parse tree depth                               >0.05
formula for the cell with the maximum number of preceding             f5 number of unique functions                     >0.05
cells in the Enron set is:
=VLOOKUP(B51,[50]jan98!$A$34:$IV$84,3,0)                              The results of the Wilcoxon-Mann-Whitney test indicate
                                                                   that for some metrics (e.g. number of cells) there is a signif-
   It is a range of 256 columns (which was the maximum             icant difference in the distributions of the Enron and EUSES
number of columns within a worksheet in Excel 2000) and 51         datasets. For other metrics (e.g. path depth) they are equal. For
rows, which makes a total of 13,056 cells. The lookup value in     all metrics with a significant difference the effect (calculated
B51 gives us the maximum of 13,057 preceding cells. With a         with the Cliff’s Delta d) is negligible (d < 0.147), except for
median of two preceding cells for all datasets, it is reasonable   number of external links, where the effect is small (0.147 < d
to assume that the number of preceding cells is not causing        < 0.33). So, although there is a statistical significant difference
complexity.                                                        in the distribution for some metrics between the Enron and
   In Table VI we observe that only the maximum values for         EUSES dataset, the effect of this difference is negligible or
the parse tree depth (f4) differ between Enron and EUSES.          small. Based on this we can conclude that the spreadsheets in
One could also argue that a formula with a parse tree depth of     the EUSES corpus are comparable with the Enron spreadsheets
two is not really complex. We were intrigued by the formula        with respect to the metrics used in this paper. The full dataset
with the maximum parse tree depth of 106 in the Enron              and R scripts are available online6 .
set. One would expect that such a formula is quite complex.
However in this specific example the formula consisted of 105                                VI. D ISCUSSION
hard coded numbers that were added with the + operand. So            In the previous section, we have 1) described the results
definitely a large formula, but not complex.                       of an analysis of the spreadsheets in the Enron dataset with
   Another factor that could cause complexity is the number
of different functions that is used in a formula (f5). In the        6 http://figshare.com/articles/Enron versus EUSES/1298994
                                                            TABLE VI
                                               OVERVIEW OF FORMULA CHARACTERISTICS

                    Metric                                         Dataset   Min   Q1    Median   Q3        Max
                    f3 Number of preceding cells per formula       Enron     0.0   1.0      2.0   4.0   13,057.0
                                                                   EUSES     0.0   1.0      2.0   4.0    3,698.0
                    f4 Parse tree depth per formula                Enron     1.0   2.0      2.0   3.0      106.0
                                                                   EUSES     1.0   2.0      2.0   3.0       89.0
                    f5 Number of unique functions per formula      Enron     1.0   1.0      1.0   1.0        8.0
                                                                   EUSES     1.0   1.0      1.0   1.0        8.0


respect to the dimensions size, coupling and use of functions                            VII. R ELATED W ORK
and 2) compared the Enron spreadsheets with the EUSES                   Most related to our efforts is of course the work of Hermans
spreadsheets. In this section, we discuss some issues that could     and Murphy-Hill [7] which initially presented the Enron
affect the applicability and suitability of the approach used.       spreadsheet corpus. The authors performed a preliminary anal-
A. User-defined functions                                            ysis of some basic characteristics (like number of worksheets,
                                                                     cells, and formulas) of these spreadsheets. While Hermans
   Our analysis focused on the Microsof Excel built-in func-         and Murphy-Hill present a first overview of the spreadsheets,
tions. We did not analyze the use of user-defined functions,         we, in this paper, dive deeper and added additional metrics to
because the use of it is rather limited. Only in 0.5% of             measure the degree of coupling and gain more insight in the
the Enron and 1.2% of the EUSES spreadsheets that contain            actual use of functions. Also, for every metric we compared
formulas with functions, user-defined functions were used.           the Enron spreadsheets with the EUSES corpus. By using
                                                                     the Wilcoxon-Mann-Whitney test and calculating the Cliff’s
B. Pivot tables, charts and VBA code
                                                                     delta for the different metrics we answer the question of
  In this first detailed analysis of the Enron spreadsheets,         representativeness of the EUSES corpus.
we limited ourselves to metrics for size, coupling and use of           Secondly, there is the EUSES corpus, which was introduced
functions. More elaborate constructs like Pivot tables, charts       by Fisher and Rothermel in 2005 [16]. Besides EUSES, there
and VBA code could also have an impact on the complexity of          are a few other smaller corpora [17], [18]. Unfortunately, none
spreadsheets. In future research, we plan to extend the current      of these corpora were publicly available to include in the
analysis with these constructs.                                      analyses of this paper.
                                                                        In addition to work on spreadsheet corpora, there are papers
C. Path analysis
                                                                     on spreadsheet metrics, which have in common with our work
   In this paper, we looked at the use of formula and functions      that they too measure properties of spreadsheets [19], [20].
on the level of the individual cell. However, formulas receive       There is an overlap between the metrics defined in these papers
input from precedent cells. These precedent cells, in turn, can      and the metrics we use in this analysis (e.g. number of cells,
also contain formulas. One could argue that all the formulas         number of formulas). We have added metrics to obtain a better
in the calculation chain form a small program or method. In          understanding of how functions are used within spreadsheets.
future work, we will analyze these programs to see if there          We have also used the metrics to compare two spreadsheet
are certain common patterns that can be found in the majority        corpora.
of spreadsheets.
                                                                                    VIII. C ONCLUDING R EMARKS
D. Threats to validity
                                                                        The overall goal of this paper is to understand and quantify
   1) Cleaned dataset: Before the Enron dataset was made             the characteristics of industry grade spreadsheets and secondly
publicly available, it has been cleaned. In total over 10,000        to see if these spreadsheets differ from the EUSES corpus
e-mail messages with potential sensitive personal information,       that was used frequently in previous research. To do so, we
like credit card numbers, identity numbers, personal contact         formulated three research questions:
details, etc., were deleted from the dataset [15]. We do
not believe that this cleaning affected the analysis of the          A. What are the characteristics, in terms of size and coupling,
spreadsheets. The spreadsheets themselves were most likely           of spreadsheets in a business environment?
not deemed personal by the researchers cleaning the dataset.            • The results of the analysis indicate that the size of
   2) Age of dataset: The spreadsheets in the Enron dataset               the majority of the Enron spreadsheets is small. On
are more than ten years old. However, we believe that the                 average they consists of 701 non-empty cells, contain
way users create spreadsheets has not change much over this               three worksheets and have about 128 unique formulas.
period. This is supported by the fact that in the meanwhile the         • For coupling, we see a similar picture. The majority of
user interface of Microsoft Excel has not changed significantly.          spreadsheets (89%) is not linked to other spreadsheets
    and also within the spreadsheet only 20% of them have               In this paper we focused on formulas at the cell level. The
    links between worksheets within the spreadsheet. On the          formula in a spreadsheet could be considered as source code.
    cell level, we see the same. The median path depth               However, by looking at a single cell you are actual looking at a
    (calculation chain) is one and a cell with a formula has         single line of source code. In future research, we are planning
    a median of three transitive precedent cells.                    to extract and combine all formulas in the same calculation
  • Based on the results of the analysis, we conclude that           chain to obtain a better and completer understanding of the
    in a business environment large spreadsheets with a high         use of functions within spreadsheets. This will allow us to
    degree of coupling do exist, but are not very common.            search for commonly used patterns.

B. How are spreadsheet functions used and combined?                                                R EFERENCES
  • With respect to the use of functions, we see that users use       [1] R. R. Panko and N. Ordway, “Sarbanes-oxley: What about all the
                                                                          spreadsheets?” CoRR, vol. abs/0804.0797, 2008. [Online]. Available:
    only a small subset of the available built-in functions.              http://arxiv.org/abs/0804.0797
    Surprising is that this subset of functions is not the            [2] F. Hermans, M. Pinzger, and A. van Deursen, “Supporting professional
    same between the different datasets. This could be an                 spreadsheet users by generating leveled dataflow diagrams,” in Proceed-
                                                                          ings of the 33rd International Conference on Software Engineering.
    indication that the kind of functions that are used in a              ACM, 2011, pp. 451–460.
    spreadsheet are depending on the individual user or the           [3] R. R. Panko, “What we know about spreadsheet errors,” Journal of
    specific business domain.                                             Organizational and End User Computing (JOEUC), vol. 10, no. 2, pp.
                                                                          15–21, 1998.
  • If we look at the level of function categories, we notice         [4] R. Abraham and M. Erwig, “Goaldebug: A spreadsheet debugger for end
    that mainly the general purpose categories like Opera-                users,” in Proceedings of the 29th international conference on Software
    tors, Math and trigonometry, Lookup and reference and                 Engineering. IEEE Computer Society, 2007, pp. 251–260.
                                                                      [5] F. Hermans, M. Pinzger, and A. v. Deursen, “Detecting and visualizing
    Logical are used. Users tend to use functions to either               inter-worksheet smells in spreadsheets,” in Proceedings of the 2012
    calculate something, look up something or a combination               International Conference on Software Engineering. IEEE Press, 2012,
    of the two. Functions from the Logical category are used              pp. 441–451.
                                                                      [6] W. Dou, S.-C. Cheung, and J. Wei, “Is spreadsheet ambiguity harmful?
    to reason with formulas, probably with the goal to check              detecting and repairing spreadsheet smells due to ambiguous computa-
    or prevent for errors.                                                tion,” in Proceedings of the 36th International Conference on Software
  • Regarding the complexity, the results show us that the                Engineering. ACM, 2014, pp. 848–858.
                                                                      [7] F. Hermans and E. Murphy-Hill, “Enron’s spreadsheets and related
    majority of formulas are simple. They only make a direct              emails: A dataset and analysis,” in Proceedings of the 37th International
    reference to a few other cells (median: two), are hardly              Conference on Software Engineering, 2015, to appear.
    nested (median for the parse tree depth: two) and most            [8] B. Klimt and Y. Yang, “Introducing the Enron corpus.” in CEAS, 2004.
                                                                      [9] M. Fowler, Refactoring : improving the design of existing code. Read-
    of the times only contain one function.                               ing, MA: Addison-Wesley, 1999.
                                                                     [10] F. Hermans, M. Pinzger, and A. van Deursen, “Detecting and refactoring
C. Is there a difference with respect to these metrics between            code smells in spreadsheet formulas,” Empirical Software Engineering,
spreadsheets that are used in a business context and the                  pp. 1–27, 2014.
                                                                     [11] B. Jansen and F. Hermans, “Using a visual language to create better
EUSES corpus?                                                             spreadsheets,” Software Engineering Methods in Spreadsheets, 2014.
  •   With respect to size, coupling and use of functions            [12] F. Hermans, “Analyzing and visualizing spreadsheets,” Ph.D. disserta-
                                                                          tion, PhD thesis, Software Engineering Research Group, Delft University
      the Enron spreadsheets are quite similar to the EUSES               of Technology, Netherlands, 2012.
      corpus. Although we find for most metrics a statistical        [13] J. Sajaniemi, “Modeling spreadsheet audit: A rigorous approach to
      significant difference, the effect size is either negligible        automatic visualization,” Journal of Visual Languages & Computing,
                                                                          vol. 11, no. 1, pp. 49–82, 2000.
      (6 metrics) or small (1 metric). Based on this we can          [14] J. E. Montandon, H. Borges, D. Felix, and M. T. Valente, “Documenting
      conclude that for these characteristics the EUSES corpus            APIs with examples: Lessons learned with the apiminer platform,”
      is representative for spreadsheets that are used in a               in Reverse Engineering (WCRE), 2013 20th Working Conference on.
                                                                          IEEE, 2013, pp. 401–408.
      business context.                                              [15] A. Cassidy and M. Westwood-Hill. Removing pii from the EDRM
                                                                          Enron data set: Investigating the prevalence of unsecured financial,
                      IX. F UTURE WORK                                    health and personally identifiable information in corporate data.
                                                                          [Online]. Available: http://www.nuix.com/images/resources/case study
   This paper gives rise to several directions for future work.           nuix edrm enron data set.pdf
The results of our analysis show that users in general create        [16] M. Fisher and G. Rothermel, “The EUSES spreadsheet corpus: a shared
small spreadsheets, with a relatively low degree of coupling              resource for supporting experimentation with spreadsheet dependability
                                                                          mechanisms,” ACM SIGSOFT Software Engineering Notes, vol. 30,
and that they use simple formulas. This makes it less likely              no. 4, pp. 1–5, 2005.
that the complexity of a spreadsheet is causing the high error       [17] R. R. Panko, “Two corpuses of spreadsheet errors,” in System Sciences,
rate of spreadsheets. It still leaves us with the question, what          2000. Proceedings of the 33rd Annual Hawaii International Conference
                                                                          on. IEEE, 2000, pp. 8–pp.
is causing errors in spreadsheets? To answer this question, one      [18] S. G. Powell, K. R. Baker, and B. Lawson, “A critical review of the
of the directions for future research could be to focus on the            literature on spreadsheet errors,” Decision Support Systems, vol. 46,
user interface of the spreadsheet software. In previous work              no. 1, pp. 128–138, 2008.
                                                                     [19] A. Bregar, “Complexity metrics for spreadsheet models,” in Proc. of
[11], we introduced the concept of a visual language to model             EuSpRIG ’04, 2004, p. 9.
spreadsheets and in future research we are planning to develop       [20] K. Hodnigg and R. Mittermeir, “Metrics-based spreadsheet visualization:
and validate such a language.                                             Support for focused maintenance,” in Proc. of EuSpRIG ’08, 2008, p. 16.