<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PIVOT: Privacy-preserving Outsourcing of Text Data for Word Embedding Against Frequency Analysis Attack (Poster submission)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yanying Li and Wendy Hui Wang</string-name>
          <email>Hui.Wangg@stevens.edu</email>
          <email>fyli158, Hui.Wangg@stevens.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boxiang Dong</string-name>
          <email>dongb@montclair.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Montclair State University</institution>
          ,
          <addr-line>Montclair, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Stevens Institute of Technology</institution>
          ,
          <addr-line>Hoboken, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we design PIVOT, a new privacy-preserving method that supports outsourcing of text data for word embedding. PIVOT includes a 1-to-many mapping function for text documents that can defend against the frequency analysis attack with provable guarantee, while preserving the word context during transformation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>There has been a significant growth in the volume and
variety of unstructured text data. Technologies such as text
analytics and Natural Language Processing (NLP) are
developed to process and analyze those massively collected
data. However, the high complexity of these techniques
hinder common users to perform sophisticated text analysis for
their own business and research use.</p>
      <p>
        In this paper, we consider an outsourcing paradigm.
Alice, a data owner who has a private text document D,
intends to outsource D to a third-party service provider for
NLP analysis without leaking sensitive information in D.
Word embedding has been used commonly in many NLP
tasks. It enables machine learning techniques to process raw
text data. In this paper, we assume that the server performs
a prediction-based word embedding method (e.g., word2vec
        <xref ref-type="bibr" rid="ref2">(Mikolov et al. 2013)</xref>
        ) to generate the vectors of D. An
important property of word embedding for NLP analysis is the
context of words (i.e., the co-occurrence of words).
      </p>
      <p>Apparently, a simple 1-to-1 word mapping function (i.e.,
each distinct, individual word is transformed to a
different word) can preserve the word context in D. However,
the 1-to-1 transformation function is weak against the
frequency analysis attack, which is a common attack that can
break the transformation by mapping the original and
transformed words based on their frequency distribution. To
defend against the frequency analysis attack, in this paper, we
consider 1-to-many word mapping transformation function:
multiple instances of the same word that appear at different
places of D are mapped to different words in D0.</p>
      <p>Although the 1-to-many word mapping function can
defend against the frequency analysis attack, using 1-to-many
word mapping for data transformation raises the challenge
of how to preserve data utility. In this paper, we consider
the data utility function as the accuracy of word embedding.
Finding a 1-to-many mapping that preserves the semantics
of the original document D for accurate word embedding is
non-trivial. As an example, consider the document in which
w1 and w2 are of frequency 4. Out of these four occurrences,
w1 and w2 co-occur three times. Assume w1 (w2 resp.) is
to be replaced with s11 (s12 reps.) and s21 (s22 resp. ), each of
frequency 2. There are several possible schemes to replace
the three (w1; w2) pairs. Below we we show two possible
schemes:</p>
      <p>Scheme 1: The three (w1; w2) pairs are mapped to two
(s11; s12) pairs and one (s21; s22) pair;
Scheme 2: The three (w1; w2) pairs are mapped to one
(s11; s12) pair, one (s21; s1) pair, and one (s11; s22) pair.</p>
      <p>2
Apparently, due to the fact that the context in (w1; w2) pairs
is preserved in two (s11; s1) pairs, Scheme 1 preserves the
2
context of w1 and w2 better than Scheme 2.</p>
      <p>To addresses the trade-off between privacy and utility, we
design PIVOT, an efficient word transformation method for
PrIVacy-preserving Outsourcing of Text data for word
embedding. To our best knowledge, this is the first work on
privacy-preserving outsourcing of text data that considers
the accuracy of word embedding as the utility goal.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In this paper, we use the notations shown in Table 1.</p>
      <p>Symbol
D / D0
wi
Si
sip
Data Utility. In this paper, we consider the quality of word
embedding as the utility of the outsourced data. We
measure the accuracy loss of word embedding as the change of
distance between the vectors that are generated from D and
D0. Formally, given two vectors v and v0, the distance of v
and v0 is measured as following, where cos(v; v0) measures
the cosine similarity of v and v0:
dist(v; v0) = 1
cos(v; v0) = 1</p>
      <p>v v0
jjvjj jjv0jj
all vector pairs in V and V 0.
dist(D; D0) =</p>
      <p>
        X
wi;wj2W;sip2Si;sjq2Sj
jdist(wi; wj ) dist(sip; sjq)j
Our utility goal is to minimize the distance between the
original document D and its transformed document D0.
Attack Model. In this paper, we only consider the
frequency analysis attack. We assume the attacker may
obtain the prior knowledge of word frequency distribution of
D. Then by collecting the word frequency information in
the transformed document D0, the attacker can perform the
frequency analysis attack, a well-known attack, to map the
words in the transformed document D0 back to the original
words in D using their frequency distribution
        <xref ref-type="bibr" rid="ref4">(Naveed,
Kamara, and Wright 2015)</xref>
        .
      </p>
      <p>
        Privacy Model. To quantify the privacy guarantee against
the frequency analysis attack, we define the notation of
`-privacy, which is adapted from a well-known notion
`diversity
        <xref ref-type="bibr" rid="ref1">(Machanavajjhala et al. 2006)</xref>
        . Formally, given D
and D0, D0 satisfies `-privacy if for each replacement word
s 2 D0, the attacker’s attack probability on s must satisfy:
1
`
where w 2 D is the original word of s , and ` &gt; 1 is a
userspecified integer. Intuitively, the higher ` is, the stronger the
transformation is against the frequency analysis attack.
      </p>
      <p>P (s ! w)
;</p>
    </sec>
    <sec id="sec-3">
      <title>Grouping-based One-to-Many Word Mapping</title>
      <p>We design a grouping-based 1-to-many word mapping
(GOMM) method that maps multiple instances of a single
word to different words. To defend against the frequency
analysis attack, the frequency distribution of the transformed
document is always (almost) uniform. Figure 1 shows an
example of before and after mapping.</p>
      <p>10
w1 w2 w3</p>
      <p>Word
(a) Freq. of orig. words
(before mapping)
4
3.5
3
cyn 2.5
equ 2
re 1.5
F 1
0.5
0 s11 s21 s22 s32 s13 s23 s33</p>
      <p>Replacement Word
(b) Freq. of transformed words
(after mapping)</p>
      <p>Case 3 [Overlap replacement]: the (wi; wj ) pairs are
replaced with multiple pairs that overlap (i.e., a unique
replacement word sip is paired with at least two different
replacement words).
(a) Case 1
(b) Case 2
(c) Case 3</p>
      <p>Apparently, for Case 1, the context of (wi; wj ) is well
preserved. Thus the utility loss of embedding of the
replacement words sip (sjq, resp.) is minimized. For Case 2,
since the disjoint replacement pairs are considered as unique
word pairs in the transformed document D0, the context of
(wi; wj ) in D is split into multiple non-overlapping new
contexts in D0. Thus the utility loss of Case 2 is worse than
Case 1. Case 3 has the worst utility loss, as the overlapped
replacement pairs make the same replacement word to be
included in different contexts.</p>
      <p>To address the trade-off between privacy and utility, we
design the GOMM algorithm that consists of two steps:
Grouping: all words in D are partitioned into several
disjoint groups based on their frequency, with each group of
size at least `.
1-to-many mapping within groups: for each group,
pairs of words are replaced by a 1-to-many mapping
function, requiring that all replacement words are of similar
frequency.</p>
      <p>The key idea of grouping is to avoid a large number of
distinct replacement words, as too many distinct replacement
words will create new contexts (as in Case 3) in the
transformed document; the old context thus is ruined.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we present our design of PIVOT, a
privacypreserving transformation method on raw text data that can
defend against the frequency analysis attack. Next, we will
implement the algorithm and perform empirical study on the
real-world datasets to evaluate the performance of PIVOT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Machanavajjhala</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kifer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and Venkitasubramaniam,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2006</year>
          . `
          <article-title>-diversity: privacy beyond - anonymity</article-title>
          .
          <source>In 22nd International Conference on Data Engineering (ICDE'06)</source>
          ,
          <fpage>24</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Corrado</surname>
          </string-name>
          , G.; and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>arXiv preprint arXiv:1301</source>
          .
          <fpage>3781</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Naveed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kamara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>C. V.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Inference attacks on property-preserving encrypted databases</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security</source>
          ,
          <fpage>644</fpage>
          -
          <lpage>655</lpage>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>