=Paper=
{{Paper
|id=Vol-1209/paper14
|storemode=property
|title=Anonymizing Spreadsheet Data and Metadata with AnonymousXL
|pdfUrl=https://ceur-ws.org/Vol-1209/paper_15.pdf
|volume=Vol-1209
}}
==Anonymizing Spreadsheet Data and Metadata with AnonymousXL==
Anonymizing Spreadsheet Data and Metadata with
AnonymousXL
Joeri van Veen Felienne Hermans
Infotron Delft University of Technology
Delft, the Netherlands Delft, the Netherlands
joeri@infotron.nl f.f.j.hermans@tudelft.nl
ABSTRACT of a date type is set to the day of anonymization: “last print
In spreadsheet risk analysis, we often encounter spreadsheets date”, “creation date” and “last save time”.
that are confidential. This might hinder adoption of spread-
sheet analysis tools, especially web-based ones, as users do
2.2 Textual Metadata
not want to have their confidential spreadsheets analyzed. The following textual metadata are set to the text string
To address this problem, we have developed AnonymousXL, “anonymous”: title, subject, author, keywords, comments,
an Excel plugin that makes spreadsheets anonymous with template, last author, application name, security, category,
two actions: 1) remove all sensitive metadata and 2) obfus- format, manager, company.
cate all spreadsheet data within the Excel worksheets such
that it resembles, untraceably, the original values. 3. DATA OBFUSCATION
Data obfuscation is the alteration of data to make it anony-
1. INTRODUCTION mous. This happens linearly, from the first sheet to the last
sheet, from the first to the last cell of the used range of cells
When commercializing our Breviz analysis toolkit [2, 3,
in each worksheet. We use different techniques for differ-
4] as an online tool called PerfectXL, we ran into the prob-
ent types of data in the spreadsheet: numeric data, dates,
lem that customers often do not want to upload, share or
textual data, formulas and other types of data.
even show us confidential spreadsheets. Therefore, we have
developed a tool that obfuscates [1] both the data and the 3.1 Numeric Data
metadata in a spreadsheet, while the values still resemble
the original ones. By construction, we guarantee that our The basic step for anonymizing a number is to randomly
anonymization does not create or resolve Excel errors. This add or subtract up to 60% of its original value. Or, math-
enables us to run our smell detection tool on the anonymized ematically, for any number N in a cell, N is replaced by
spreadsheets as if we were analyzing the original. This pa- N ± N × 0.6 × r where r is a random value in the range [0,
per describes the capabilities, limitations and applications 1]. We treat integers and real numbers differently: Integer
of AnonymousXL. values remain integer, real numbers keep their decimals.
There is one exception in the anonymization: In Perfec-
tXL, one of the analyses that is performed is the occurrence
2. METADATA REMOVAL of so-called ‘magic numbers’, numbers of which the meaning
AnonymousXL removes spreadsheet metadata: the au- might be unclear to the user. There are some numbers, how-
thor, the date the file was last opened and the total edit ever, that are not considered to be magical, because of their
time, in order to remove any ties with the company that the frequent occurrence: 0, 1, 2, 12, 100, 365, 1000. Therefore,
spreadsheet originally came from. In addition, worksheet these numbers remain as is in our anonymization process.
names within the spreadsheet are replaced with anonymous Since all text fields (including column names) get changed,
names. we believe that leaving the non-magic numbers intact does
not pose a threat to the anonymity of the spreadsheet, since
2.1 Numerical and Date Related Metadata labels give numbers semantics.
All numerical metadata information is converted to 0. At
this time, only the numerical metadata “revision number”
3.2 Dates
and “total editing time” are converted. All metadata that is Dates are converted into random dates in the range of
representable dates in VBA, in contrast with metadata, in
which all date values are set to the day of anonymization.
This randomness is introduced as to maintain data variation.
3.3 Textual Data
For textual data, it does not suffice to simply change all
textual values to “text”, since in many situations, it matters
to keep equal strings equal. An example of such a situation
is a pivot table, as shown in Figure 1. Should we change all
categories to “text”, the spreadsheet would not work any-
more, as pivot tables cannot contain two fields of the same number). A special note on booleans TRUE and FALSE: as
name. If we would replace all textual values by unique ones, booleans are interpreted by Excel as 0 and 1, they are not
such as “text1”, “text2”, “text3”, as shown in Figure 2, it changed. However, booleans are seldom present as literal
does work, pivot tables however are often based on textual values. They are often the result of formulas, in which case
data (which denote categories, for example). This means they only change in accordance with modifications to the
that where there once were three categories (“a”, “b” and data they depend on.
“c”, in Figure 1), now there are many (eight different ones
in Figure 2). Pivot tables calculate their size based on the 4. INTRODUCING EXCEL ERRORS
number of unique values they find for a category, so pivot
By changing data in Excel cells, errors might be induced
tables become larger than they were originally. This can
that were not present in the original spreadsheet. For in-
lead to problems, since multiple pivot tables are often situ-
stance, in the formula =A1/(3-A2), division by zero might
ated close to each other on the same worksheet. If the pivot
occur (and thus be reported after analysis) if A2 becomes
tables grow because of the anonymization, they can start to
3, which could happen because of the anonymization step in
overlap and unfortunately, this causes Excel to crash.
which data in cells is decreased or increased by 60% of their
original value.
To resolve this, we save the list of all formulas that re-
sult in an error before the anonymization. Then, after we
anonymize each data cell, we verify that we have not changed
this list. For this, we do not have to analyze all formulas
in the spreadsheet, we only analyze the recursive precedents
of the cell, plus all formulas that contain the INDIRECT
function.
Figure 1: Original spreadsheet with three categories
5. LIMITATIONS
5.1 Confidential formulas
Every so often, spreadsheets contain confidential formu-
las. All formulas, including those confidential ones, are left
unaltered to preserve analysis results. This might not be
sufficient for some users.
5.2 Embedded constants
In the current implementation, we only change numeric
values in cells and not within formulas, such as in =SUM(A1:
A10)*1.2. This is a limitation because these constants too
Figure 2: Simple text replacement can be of importance to the spreadsheet owner and thus
confidential.
5.3 Analysis Types
Different kinds of spreadsheet analyses scan for different
kinds of patterns. Developed to complement PerfectXL,
AnonymousXL leaves intact formulas, boolean literals and
certain numbers for they are key to mimicking analysis of
the original spreadsheet. Nevertheless, AnonymousXL or a
slight variation of it could carry great potential for alterna-
tive analysis types.
Figure 3: AnonymousXL applied to table 6. REFERENCES
[1] D. E. Bakken, R. Parameswaran, D. M. Blough, A. A.
Franz, and T. J. Palmer. Data obfuscation: Anonymity
Therefore, we anonymize all textual values while keep-
and desensitization of usable data sets. IEEE Security
ing intact cell uniqueness by replacing texts with “unique1”,
& Privacy, 2(6):34–41, 2004.
“unique2”, “unique3”, etc. (for example, “unique6” repre-
sents the textual value “a” in Figure 3). [2] F. Hermans, M. Pinzger, and A. van Deursen.
Supporting professional spreadsheet users by generating
3.4 Formulas leveled dataflow diagrams. In Proc. of ICSE ’11, pages
Formulas are basially left alone. The only modification 451–460, 2011.
made to formulas are sheet references, since sheet names [3] F. Hermans, M. Pinzger, and A. van Deursen.
are made anonymous as well. Detecting and visualizing inter-worksheet smells in
spreadsheets. In Proc of ICSE ’12, pages 441–451, 2012.
3.5 Other Types [4] F. Hermans, M. Pinzger, and A. van Deursen.
Other data types usually fall under either categories men- Detecting code smells in spreadsheet formulas. In Proc
tioned (for instance, a currency type is simply considered a of ICSM ’12, pages 409–418, 2012.