<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anonymizing Spreadsheet Data and Metadata with AnonymousXL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joeri van Veen</string-name>
          <email>joeri@infotron.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felienne Hermans</string-name>
          <email>f.f.j.hermans@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          ,
          <addr-line>Delft</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infotron</institution>
          ,
          <addr-line>Delft</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In spreadsheet risk analysis, we often encounter spreadsheets that are con dential. This might hinder adoption of spreadsheet analysis tools, especially web-based ones, as users do not want to have their con dential spreadsheets analyzed. To address this problem, we have developed AnonymousXL, an Excel plugin that makes spreadsheets anonymous with two actions: 1) remove all sensitive metadata and 2) obfuscate all spreadsheet data within the Excel worksheets such that it resembles, untraceably, the original values.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        When commercializing our Breviz analysis toolkit [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3,
4</xref>
        ] as an online tool called PerfectXL, we ran into the
problem that customers often do not want to upload, share or
even show us con dential spreadsheets. Therefore, we have
developed a tool that obfuscates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] both the data and the
metadata in a spreadsheet, while the values still resemble
the original ones. By construction, we guarantee that our
anonymization does not create or resolve Excel errors. This
enables us to run our smell detection tool on the anonymized
spreadsheets as if we were analyzing the original. This
paper describes the capabilities, limitations and applications
of AnonymousXL.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METADATA REMOVAL</title>
      <p>AnonymousXL removes spreadsheet metadata: the
author, the date the le was last opened and the total edit
time, in order to remove any ties with the company that the
spreadsheet originally came from. In addition, worksheet
names within the spreadsheet are replaced with anonymous
names.</p>
    </sec>
    <sec id="sec-3">
      <title>Numerical and Date Related Metadata</title>
      <p>All numerical metadata information is converted to 0. At
this time, only the numerical metadata \revision number"
and \total editing time" are converted. All metadata that is
of a date type is set to the day of anonymization: \last print
date", \creation date" and \last save time".
2.2</p>
    </sec>
    <sec id="sec-4">
      <title>Textual Metadata</title>
      <p>The following textual metadata are set to the text string
\anonymous": title, subject, author, keywords, comments,
template, last author, application name, security, category,
format, manager, company.
3.</p>
    </sec>
    <sec id="sec-5">
      <title>DATA OBFUSCATION</title>
      <p>Data obfuscation is the alteration of data to make it
anonymous. This happens linearly, from the rst sheet to the last
sheet, from the rst to the last cell of the used range of cells
in each worksheet. We use di erent techniques for di
erent types of data in the spreadsheet: numeric data, dates,
textual data, formulas and other types of data.
3.1</p>
    </sec>
    <sec id="sec-6">
      <title>Numeric Data</title>
      <p>
        The basic step for anonymizing a number is to randomly
add or subtract up to 60% of its original value. Or,
mathematically, for any number N in a cell, N is replaced by
N N 0:6 r where r is a random value in the range [
        <xref ref-type="bibr" rid="ref1">0,
1</xref>
        ]. We treat integers and real numbers di erently: Integer
values remain integer, real numbers keep their decimals.
      </p>
      <p>There is one exception in the anonymization: In
PerfectXL, one of the analyses that is performed is the occurrence
of so-called `magic numbers', numbers of which the meaning
might be unclear to the user. There are some numbers,
however, that are not considered to be magical, because of their
frequent occurrence: 0, 1, 2, 12, 100, 365, 1000. Therefore,
these numbers remain as is in our anonymization process.
Since all text elds (including column names) get changed,
we believe that leaving the non-magic numbers intact does
not pose a threat to the anonymity of the spreadsheet, since
labels give numbers semantics.
3.2</p>
    </sec>
    <sec id="sec-7">
      <title>Dates</title>
      <p>Dates are converted into random dates in the range of
representable dates in VBA, in contrast with metadata, in
which all date values are set to the day of anonymization.
This randomness is introduced as to maintain data variation.
3.3</p>
    </sec>
    <sec id="sec-8">
      <title>Textual Data</title>
      <p>For textual data, it does not su ce to simply change all
textual values to \text", since in many situations, it matters
to keep equal strings equal. An example of such a situation
is a pivot table, as shown in Figure 1. Should we change all
categories to \text", the spreadsheet would not work
anymore, as pivot tables cannot contain two elds of the same
name. If we would replace all textual values by unique ones,
such as \text1", \text2", \text3", as shown in Figure 2, it
does work, pivot tables however are often based on textual
data (which denote categories, for example). This means
that where there once were three categories (\a", \b" and
\c", in Figure 1), now there are many (eight di erent ones
in Figure 2). Pivot tables calculate their size based on the
number of unique values they nd for a category, so pivot
tables become larger than they were originally. This can
lead to problems, since multiple pivot tables are often
situated close to each other on the same worksheet. If the pivot
tables grow because of the anonymization, they can start to
overlap and unfortunately, this causes Excel to crash.</p>
      <p>Therefore, we anonymize all textual values while
keeping intact cell uniqueness by replacing texts with \unique1",
\unique2", \unique3", etc. (for example, \unique6"
represents the textual value \a" in Figure 3).
3.4</p>
    </sec>
    <sec id="sec-9">
      <title>Formulas</title>
      <p>Formulas are basially left alone. The only modi cation
made to formulas are sheet references, since sheet names
are made anonymous as well.
3.5</p>
    </sec>
    <sec id="sec-10">
      <title>Other Types</title>
      <p>Other data types usually fall under either categories
mentioned (for instance, a currency type is simply considered a
number). A special note on booleans TRUE and FALSE: as
booleans are interpreted by Excel as 0 and 1, they are not
changed. However, booleans are seldom present as literal
values. They are often the result of formulas, in which case
they only change in accordance with modi cations to the
data they depend on.</p>
    </sec>
    <sec id="sec-11">
      <title>4. INTRODUCING EXCEL ERRORS</title>
      <p>By changing data in Excel cells, errors might be induced
that were not present in the original spreadsheet. For
instance, in the formula =A1/(3-A2), division by zero might
occur (and thus be reported after analysis) if A2 becomes
3, which could happen because of the anonymization step in
which data in cells is decreased or increased by 60% of their
original value.</p>
      <p>To resolve this, we save the list of all formulas that
result in an error before the anonymization. Then, after we
anonymize each data cell, we verify that we have not changed
this list. For this, we do not have to analyze all formulas
in the spreadsheet, we only analyze the recursive precedents
of the cell, plus all formulas that contain the INDIRECT
function.
5.
5.1</p>
    </sec>
    <sec id="sec-12">
      <title>LIMITATIONS</title>
    </sec>
    <sec id="sec-13">
      <title>Confidential formulas</title>
      <p>Every so often, spreadsheets contain con dential
formulas. All formulas, including those con dential ones, are left
unaltered to preserve analysis results. This might not be
su cient for some users.
5.2</p>
    </sec>
    <sec id="sec-14">
      <title>Embedded constants</title>
      <p>In the current implementation, we only change numeric
values in cells and not within formulas, such as in =SUM(A1:
A10)*1.2. This is a limitation because these constants too
can be of importance to the spreadsheet owner and thus
con dential.
5.3</p>
    </sec>
    <sec id="sec-15">
      <title>Analysis Types</title>
      <p>Di erent kinds of spreadsheet analyses scan for di erent
kinds of patterns. Developed to complement PerfectXL,
AnonymousXL leaves intact formulas, boolean literals and
certain numbers for they are key to mimicking analysis of
the original spreadsheet. Nevertheless, AnonymousXL or a
slight variation of it could carry great potential for
alternative analysis types.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Bakken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Franz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <article-title>Data obfuscation: Anonymity and desensitization of usable data sets</article-title>
          .
          <source>IEEE Security &amp; Privacy</source>
          ,
          <volume>2</volume>
          (
          <issue>6</issue>
          ):
          <volume>34</volume>
          {
          <fpage>41</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinzger</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A. van Deursen.</surname>
          </string-name>
          <article-title>Supporting professional spreadsheet users by generating leveled data ow diagrams</article-title>
          .
          <source>In Proc. of ICSE '11</source>
          , pages
          <fpage>451</fpage>
          {
          <fpage>460</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinzger</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A. van Deursen. Detecting</surname>
          </string-name>
          <article-title>and visualizing inter-worksheet smells in spreadsheets</article-title>
          .
          <source>In Proc of ICSE '12</source>
          , pages
          <fpage>441</fpage>
          {
          <fpage>451</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinzger</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>A. van Deursen.</surname>
          </string-name>
          <article-title>Detecting code smells in spreadsheet formulas</article-title>
          .
          <source>In Proc of ICSM '12</source>
          , pages
          <fpage>409</fpage>
          {
          <fpage>418</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>