=Paper= {{Paper |id=Vol-1209/paper14 |storemode=property |title=Anonymizing Spreadsheet Data and Metadata with AnonymousXL |pdfUrl=https://ceur-ws.org/Vol-1209/paper_15.pdf |volume=Vol-1209 }} ==Anonymizing Spreadsheet Data and Metadata with AnonymousXL== https://ceur-ws.org/Vol-1209/paper_15.pdf
         Anonymizing Spreadsheet Data and Metadata with
                        AnonymousXL

                         Joeri van Veen                                       Felienne Hermans
                              Infotron                                   Delft University of Technology
                      Delft, the Netherlands                                 Delft, the Netherlands
                        joeri@infotron.nl                                 f.f.j.hermans@tudelft.nl


ABSTRACT                                                        of a date type is set to the day of anonymization: “last print
In spreadsheet risk analysis, we often encounter spreadsheets   date”, “creation date” and “last save time”.
that are confidential. This might hinder adoption of spread-
sheet analysis tools, especially web-based ones, as users do
                                                                2.2    Textual Metadata
not want to have their confidential spreadsheets analyzed.        The following textual metadata are set to the text string
To address this problem, we have developed AnonymousXL,         “anonymous”: title, subject, author, keywords, comments,
an Excel plugin that makes spreadsheets anonymous with          template, last author, application name, security, category,
two actions: 1) remove all sensitive metadata and 2) obfus-     format, manager, company.
cate all spreadsheet data within the Excel worksheets such
that it resembles, untraceably, the original values.            3.    DATA OBFUSCATION
                                                                  Data obfuscation is the alteration of data to make it anony-
1.    INTRODUCTION                                              mous. This happens linearly, from the first sheet to the last
                                                                sheet, from the first to the last cell of the used range of cells
   When commercializing our Breviz analysis toolkit [2, 3,
                                                                in each worksheet. We use different techniques for differ-
4] as an online tool called PerfectXL, we ran into the prob-
                                                                ent types of data in the spreadsheet: numeric data, dates,
lem that customers often do not want to upload, share or
                                                                textual data, formulas and other types of data.
even show us confidential spreadsheets. Therefore, we have
developed a tool that obfuscates [1] both the data and the      3.1    Numeric Data
metadata in a spreadsheet, while the values still resemble
the original ones. By construction, we guarantee that our          The basic step for anonymizing a number is to randomly
anonymization does not create or resolve Excel errors. This     add or subtract up to 60% of its original value. Or, math-
enables us to run our smell detection tool on the anonymized    ematically, for any number N in a cell, N is replaced by
spreadsheets as if we were analyzing the original. This pa-     N ± N × 0.6 × r where r is a random value in the range [0,
per describes the capabilities, limitations and applications    1]. We treat integers and real numbers differently: Integer
of AnonymousXL.                                                 values remain integer, real numbers keep their decimals.
                                                                   There is one exception in the anonymization: In Perfec-
                                                                tXL, one of the analyses that is performed is the occurrence
2.    METADATA REMOVAL                                          of so-called ‘magic numbers’, numbers of which the meaning
  AnonymousXL removes spreadsheet metadata: the au-             might be unclear to the user. There are some numbers, how-
thor, the date the file was last opened and the total edit      ever, that are not considered to be magical, because of their
time, in order to remove any ties with the company that the     frequent occurrence: 0, 1, 2, 12, 100, 365, 1000. Therefore,
spreadsheet originally came from. In addition, worksheet        these numbers remain as is in our anonymization process.
names within the spreadsheet are replaced with anonymous        Since all text fields (including column names) get changed,
names.                                                          we believe that leaving the non-magic numbers intact does
                                                                not pose a threat to the anonymity of the spreadsheet, since
2.1   Numerical and Date Related Metadata                       labels give numbers semantics.
  All numerical metadata information is converted to 0. At
this time, only the numerical metadata “revision number”
                                                                3.2    Dates
and “total editing time” are converted. All metadata that is      Dates are converted into random dates in the range of
                                                                representable dates in VBA, in contrast with metadata, in
                                                                which all date values are set to the day of anonymization.
                                                                This randomness is introduced as to maintain data variation.

                                                                3.3    Textual Data
                                                                   For textual data, it does not suffice to simply change all
                                                                textual values to “text”, since in many situations, it matters
                                                                to keep equal strings equal. An example of such a situation
                                                                is a pivot table, as shown in Figure 1. Should we change all
                                                                categories to “text”, the spreadsheet would not work any-
more, as pivot tables cannot contain two fields of the same     number). A special note on booleans TRUE and FALSE: as
name. If we would replace all textual values by unique ones,    booleans are interpreted by Excel as 0 and 1, they are not
such as “text1”, “text2”, “text3”, as shown in Figure 2, it     changed. However, booleans are seldom present as literal
does work, pivot tables however are often based on textual      values. They are often the result of formulas, in which case
data (which denote categories, for example). This means         they only change in accordance with modifications to the
that where there once were three categories (“a”, “b” and       data they depend on.
“c”, in Figure 1), now there are many (eight different ones
in Figure 2). Pivot tables calculate their size based on the    4.    INTRODUCING EXCEL ERRORS
number of unique values they find for a category, so pivot
                                                                   By changing data in Excel cells, errors might be induced
tables become larger than they were originally. This can
                                                                that were not present in the original spreadsheet. For in-
lead to problems, since multiple pivot tables are often situ-
                                                                stance, in the formula =A1/(3-A2), division by zero might
ated close to each other on the same worksheet. If the pivot
                                                                occur (and thus be reported after analysis) if A2 becomes
tables grow because of the anonymization, they can start to
                                                                3, which could happen because of the anonymization step in
overlap and unfortunately, this causes Excel to crash.
                                                                which data in cells is decreased or increased by 60% of their
                                                                original value.
                                                                   To resolve this, we save the list of all formulas that re-
                                                                sult in an error before the anonymization. Then, after we
                                                                anonymize each data cell, we verify that we have not changed
                                                                this list. For this, we do not have to analyze all formulas
                                                                in the spreadsheet, we only analyze the recursive precedents
                                                                of the cell, plus all formulas that contain the INDIRECT
                                                                function.
Figure 1: Original spreadsheet with three categories
                                                                5.    LIMITATIONS
                                                                5.1    Confidential formulas
                                                                  Every so often, spreadsheets contain confidential formu-
                                                                las. All formulas, including those confidential ones, are left
                                                                unaltered to preserve analysis results. This might not be
                                                                sufficient for some users.

                                                                5.2    Embedded constants
                                                                  In the current implementation, we only change numeric
                                                                values in cells and not within formulas, such as in =SUM(A1:
                                                                A10)*1.2. This is a limitation because these constants too
          Figure 2: Simple text replacement                     can be of importance to the spreadsheet owner and thus
                                                                confidential.

                                                                5.3    Analysis Types
                                                                   Different kinds of spreadsheet analyses scan for different
                                                                kinds of patterns. Developed to complement PerfectXL,
                                                                AnonymousXL leaves intact formulas, boolean literals and
                                                                certain numbers for they are key to mimicking analysis of
                                                                the original spreadsheet. Nevertheless, AnonymousXL or a
                                                                slight variation of it could carry great potential for alterna-
                                                                tive analysis types.

      Figure 3: AnonymousXL applied to table                    6.    REFERENCES
                                                                [1] D. E. Bakken, R. Parameswaran, D. M. Blough, A. A.
                                                                    Franz, and T. J. Palmer. Data obfuscation: Anonymity
  Therefore, we anonymize all textual values while keep-
                                                                    and desensitization of usable data sets. IEEE Security
ing intact cell uniqueness by replacing texts with “unique1”,
                                                                    & Privacy, 2(6):34–41, 2004.
“unique2”, “unique3”, etc. (for example, “unique6” repre-
sents the textual value “a” in Figure 3).                       [2] F. Hermans, M. Pinzger, and A. van Deursen.
                                                                    Supporting professional spreadsheet users by generating
3.4    Formulas                                                     leveled dataflow diagrams. In Proc. of ICSE ’11, pages
  Formulas are basially left alone. The only modification           451–460, 2011.
made to formulas are sheet references, since sheet names        [3] F. Hermans, M. Pinzger, and A. van Deursen.
are made anonymous as well.                                         Detecting and visualizing inter-worksheet smells in
                                                                    spreadsheets. In Proc of ICSE ’12, pages 441–451, 2012.
3.5    Other Types                                              [4] F. Hermans, M. Pinzger, and A. van Deursen.
   Other data types usually fall under either categories men-       Detecting code smells in spreadsheet formulas. In Proc
tioned (for instance, a currency type is simply considered a        of ICSM ’12, pages 409–418, 2012.