<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Preserving Privacy in Analyses of Textual Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Borja Balle</string-name>
          <email>borja.balle@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oluwaseyi Feyisetan</string-name>
          <email>sey@amazon.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Drake</string-name>
          <email>draket@amazon.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tom Diethe</string-name>
          <email>tdiethe@amazon.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DeepMind</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Amazon</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Amazon prides itself on being the most customer-centric company on earth. That means maintaining the highest possible standards of both security and privacy when dealing with customer data. This month, at the ACM Web Search and Data Mining (WSDM) Conference, my colleagues and I will describe a way to protect privacy during large-scale analyses of textual data supplied by customers. Our method works by, essentially, re-phrasing the customer-supplied text and basing analysis on the new phrasing, rather than on the customers' own language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>DIFFERENTIAL PRIVACY</title>
      <p>Questions about data privacy are frequently met with the answer
‘It’s anonymized! Identifying features have been scrubbed!’
However, studies such as this one from MIT show that attackers can
de-anonymize data by correlating it with ‘side information’ from
other data sources.</p>
      <p>
        Differential privacy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a way to calculate the probability that
analysis of a data set will leak information about any individual in
that data set. Within the differential-privacy framework, protecting
privacy usually means adding noise to a data set, to make data
related to specific individuals more difficult to trace. Adding noise
often means a loss of accuracy in data analyses, and differential
privacy also provides a way to quantify the trade-off between privacy
and accuracy.
      </p>
      <p>Let’s say that you have a data set of cell phone location traces for
a particular city, and you want to estimate the residents’ average
commute time. The data set contains (anonymized) information
about specific individuals, but the analyst is interested only in an
aggregate figure - 37 minutes, say.</p>
      <p>Differential privacy provides a statistical assurance that the
aggregate figure will not leak information about which individuals
are in the data set. Say there are two data sets that are identical,
except that one includes Alice’s data and one doesn’t. Differential
privacy says that, given the result of an analysis - the aggregate
figure - the probabilities that either of the two data sets was the
basis of the analysis should be virtually identical.</p>
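      <p>In symbols: if M denotes the randomized analysis, D and D′ are two data sets that differ only in one person’s records (Alice’s, say), and S is any set of possible outcomes, then ε-differential privacy requires</p>

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S]
```

      <p>with the same bound holding when D and D′ are swapped; the smaller ε is, the less the result can reveal about whether any one person’s data was used.</p>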
      <p>Of course, the smaller the data set, the more difficult this
standard is to meet. If the data set contains nine people with 15-minute
commutes and one person, Bob, with a two-hour commute, the
average commute time is very different for data sets that do and
do not contain Bob. Someone with side information - that Bob
frequently posts Instagram photos from a location two hours outside
the city - could easily determine whether Bob is included in the
data set.</p>
      <p>Adding noise to the data can blur the distinctions between
analyses performed on slightly different data sets, but it can also reduce
the utility of the analyses. A very small data set might require
the addition of so much noise that analyses become essentially
meaningless. But the expectation is that as the size of the data set
grows, the trade-off between utility and privacy becomes more
manageable.</p>
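      <p>To make this trade-off concrete, here is a minimal sketch of the standard Laplace mechanism applied to the commute-time example. This is an illustration of the general principle, not the mechanism from our paper: commute times are clipped to an assumed plausible range, and the sensitivity of the mean - how much any one person can shift it, and hence how much noise is needed - shrinks as the data set grows.</p>

```python
import numpy as np

def private_mean(values, epsilon, lo=0.0, hi=120.0, rng=None):
    """Differentially private mean via the Laplace mechanism.

    Each value is clipped to [lo, hi], so the sensitivity of the mean
    (the most any single person can change it) is (hi - lo) / n.
    Laplace noise with scale sensitivity / epsilon then gives
    epsilon-differential privacy for the released mean.
    """
    if rng is None:
        rng = np.random.default_rng()
    values = np.clip(values, lo, hi)
    n = len(values)
    sensitivity = (hi - lo) / n
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return float(np.mean(values)) + noise

# Nine 15-minute commutes plus Bob's two-hour commute:
commutes = [15] * 9 + [120]
noisy_average = private_mean(commutes, epsilon=1.0, rng=np.random.default_rng(0))
```

      <p>With ten people, one of whom is Bob, a single person can shift the mean by up to 12 minutes, so substantial noise is required; with ten thousand people, the same privacy guarantee needs a thousand times less noise.</p>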
    </sec>
    <sec id="sec-2">
      <title>PRIVACY IN THE SPACE OF WORD EMBEDDINGS</title>
      <p>In the field of natural-language processing, a word embedding is a
mapping from the space of words into a vector space - that is, each
word is represented as a vector of real numbers. Often, this mapping depends on the frequency
with which words co-occur with each other, so that related words
tend to cluster near each other in the space.</p>
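      <p>A toy sketch of that clustering, with made-up three-dimensional vectors rather than embeddings from a trained model: semantically related words have nearby vectors, which cosine similarity makes measurable.</p>

```python
import numpy as np

# Toy 3-dimensional embeddings (illustrative values, not from a real model).
# Words that co-occur in similar contexts end up with nearby vectors.
embeddings = {
    "ibuprofen":  np.array([0.9, 0.8, 0.1]),
    "medication": np.array([0.8, 0.9, 0.2]),
    "guitar":     np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 for related words."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine_similarity(embeddings["ibuprofen"], embeddings["medication"])
sim_unrelated = cosine_similarity(embeddings["ibuprofen"], embeddings["guitar"])
```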
      <p>So how can we go about preserving privacy in such spaces?
One possibility is to modify the original text such that its author
cannot be identified, but the semantics are preserved. This means
adding noise in the space of word embeddings. The result is sort of
like a game of Mad Libs, where certain words are removed from a
sentence and replaced with others.</p>
      <p>
        While we can apply standard differential privacy in the space
of word embeddings, doing so would lead to poor performance.
Differential privacy requires that any data point in a data set can be
replaced by any other, without an appreciable effect on the results of
aggregate analyses. But we want to cast a narrower net, replacing a
given data point only with one that lies near it in the semantic space.
Hence we consider a more general definition known as ‘metric’
differential privacy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>METRIC DIFFERENTIAL PRIVACY</title>
      <p>I said that differential privacy requires that the probabilities that a
statistic is derived from either of two data sets be virtually
identical. But what does ‘virtually’ mean? With differential privacy, the
allowable difference between the probabilities is controlled by a
parameter, epsilon, which the analyst must determine in advance.
With metric differential privacy, the parameter is epsilon times the
distance between the two data sets, according to some distance
metric: the more similar the data sets are, the harder they must be
to distinguish.</p>
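      <p>In symbols: for any two inputs x and x′ and any output y, metric differential privacy requires</p>

```latex
\frac{\Pr[M(x) = y]}{\Pr[M(x') = y]} \;\le\; e^{\varepsilon\, d(x, x')}
```

      <p>so when d(x, x′) is small, the two inputs must be nearly indistinguishable from the output; taking d(x, x′) = 1 for all distinct pairs recovers the standard definition.</p>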
      <p>Initially, metric differential privacy was an attempt to extend the
principle of differential privacy to location data. Protecting privacy
means adding noise, but ideally, the noise should be added in a way
that preserves aggregate statistics. With location data, that means
overwriting particular locations with locations that aren’t too far
away. Hence the need for a distance metric.</p>
      <p>The application to embedded linguistic data should be clear. But
there’s a subtle difference. With location data, adding noise to a
location always produces a valid location - a point somewhere on
the earth’s surface. Adding noise to a word embedding produces a
new point in the representational space, but it’s probably not the
location of a valid word embedding. So once we’ve identified such
a point, we perform a search to find the nearest valid embedding.
Sometimes the nearest valid embedding will be the original word
itself; in that case, the original word is not overwritten.</p>
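      <p>The overwrite step can be sketched as follows, with a toy two-dimensional vocabulary and, for simplicity, spherical Gaussian noise rather than the multivariate noise our paper calibrates to the metric-differential-privacy guarantee. The noisy point is snapped to the nearest word in the vocabulary, which is sometimes the original word itself.</p>

```python
import numpy as np

# Toy vocabulary of 2-D embeddings (illustrative values, not a trained model).
vocab = {
    "ibuprofen":  np.array([1.0, 1.0]),
    "medication": np.array([1.2, 0.9]),
    "drug":       np.array([1.1, 1.3]),
    "guitar":     np.array([-1.0, -1.0]),
}

def perturb_word(word, epsilon, rng):
    """Sketch of the perturb-then-snap step: add noise to the word's
    embedding, then return the vocabulary word whose embedding lies
    nearest the noisy point.

    Gaussian noise with scale 1/epsilon is used purely for illustration;
    a larger epsilon means less noise and more frequent return of the
    original word.
    """
    noisy = vocab[word] + rng.normal(0.0, 1.0 / epsilon, size=2)
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - noisy))

rng = np.random.default_rng(0)
samples = [perturb_word("ibuprofen", epsilon=2.0, rng=rng) for _ in range(5)]
```

      <p>Running the sketch repeatedly yields a mix of ‘ibuprofen’ and its semantic neighbours, which is the Mad Libs effect described above.</p>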
      <p>In our paper, we analyze the privacy implications of different
choices of epsilon value. In particular, we consider, for a given
epsilon value, the likelihood that any given word in a string of words
will be overwritten and the number of semantically related words
that fall within a fixed distance of each word in the embedding
space. This enables us to make some initial arguments about what
practical epsilon values might be.</p>
    </sec>
    <sec id="sec-5">
      <title>HYPERBOLIC SPACE</title>
      <p>
        In November 2019, at the IEEE International Conference on Data
Mining (ICDM), we presented a paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that, although it appeared
first, is in fact a follow-up to our WSDM paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the ICDM paper,
we describe an extension of our work on metric differential privacy
to hyperbolic space.
      </p>
      <p>The word-embedding space we describe in the WSDM paper is
the standard Euclidean space. A two-dimensional Euclidean space is
a plane. A two-dimensional hyperbolic space, by contrast, is curved.</p>
      <p>
        In hyperbolic space, as in Euclidean space, distance between
embeddings indicates semantic similarity. But hyperbolic spaces
have an additional degree of representational capacity: the different
curvature of the space at different locations can indicate where
embeddings fall in a semantic hierarchy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
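      <p>A common model of hyperbolic space is the Poincaré ball used in [5], where general concepts sit near the origin and specific ones near the boundary. A minimal sketch of its distance function, with toy two-dimensional points chosen for illustration:</p>

```python
import numpy as np

def poincare_distance(u, v):
    """Distance in the Poincare ball model of hyperbolic space:

        d(u, v) = arccosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))

    Distances blow up near the boundary (norm close to 1), which gives
    the space its extra capacity for representing hierarchies.
    """
    num = 2.0 * np.linalg.norm(u - v) ** 2
    den = (1.0 - np.linalg.norm(u) ** 2) * (1.0 - np.linalg.norm(v) ** 2)
    return float(np.arccosh(1.0 + num / den))

general = np.array([0.1, 0.0])    # e.g. 'drug': near the origin
specific = np.array([0.85, 0.0])  # e.g. 'ibuprofen': near the boundary
d = poincare_distance(general, specific)
```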
      <p>So, for instance, the embeddings of the words ’ibuprofen’,
’medication’, and ’drug’ may lie near each other in the space, but their
positions along the curve indicate which of them are more specific
terms and which more general. This allows us to ensure that we
are substituting more general terms for more specific ones, which
makes personal data harder to extract.</p>
      <p>In experiments, we applied the same metric-differential-privacy
framework to hyperbolic spaces that we had applied to Euclidean
space and observed 20-fold greater guarantees on expected privacy
in the worst case.</p>
    </sec>
    <sec id="sec-6">
      <title>BIOGRAPHY</title>
      <p>Dr. Tom Diethe is an Applied Science Manager in Amazon Research,
Cambridge UK. Tom is also an Honorary Research Fellow at the
University of Bristol. Tom was formerly a Research Fellow for
the “SPHERE” Interdisciplinary Research Collaboration, which is
designing a platform for eHealth in a smart-home context. This
platform is currently being deployed into homes throughout Bristol.</p>
      <p>Tom specializes in probabilistic methods for machine learning,
applications to digital healthcare, and privacy enhancing
technologies. He has a Ph.D. in Machine Learning applied to multivariate
signal processing from UCL, and was employed by Microsoft
Research Cambridge where he co-authored a book titled ‘Model-Based
Machine Learning.’ He also has significant industrial experience,
with positions at QinetiQ and the British Medical Journal. He is a
fellow of the Royal Statistical Society and a member of the IEEE
Signal Processing Society.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Konstantinos</given-names>
            <surname>Chatzikokolakis</surname>
          </string-name>
          , Miguel E Andrés, Nicolás Emilio Bordenabe, and
          <string-name>
            <given-names>Catuscia</given-names>
            <surname>Palamidessi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Broadening the scope of differential privacy using metrics</article-title>
          .
          <source>In International Symposium on Privacy Enhancing Technologies Symposium</source>
          . Springer,
          <fpage>82</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Dwork</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Differential privacy: A survey of results</article-title>
          .
          <source>In International conference on theory and applications of models of computation</source>
          . Springer,
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Oluwaseyi</given-names>
            <surname>Feyisetan</surname>
          </string-name>
          , Borja Balle, Thomas Drake, and Tom Diethe.
          <year>2020</year>
          .
          <article-title>Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations</article-title>
          .
          <source>In Proceedings of the 13th International Conference on Web Search and Data Mining.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Oluwaseyi</given-names>
            <surname>Feyisetan</surname>
          </string-name>
          , Tom Diethe, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Drake</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text</article-title>
          .
          <source>In IEEE International Conference on Data Mining (ICDM).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Maximillian</given-names>
            <surname>Nickel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Douwe</given-names>
            <surname>Kiela</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Poincaré embeddings for learning hierarchical representations</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>6338</fpage>
          -
          <lpage>6347</lpage>
          .
          A version of this article first appeared on the Amazon Science blog at: https://www.amazon.science/blog/preserving-privacy-in-analyses-of-textual-data
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>