<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sensitive-aware Privacy Index for Sanitized Text Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Carpineto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Romano</string-name>
          <email>romanog@fub.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Ugo Bordoni</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Although a number of sanitization methods for text databases have been recently proposed, there has been so far little work on general ways to measure what can be learned about the user from the sanitized database relative to what can be learned from the original database. In this paper we propose a new privacy index, termed Sensitive-aware Privacy Index (SPI), that extends the common approach of comparing the global information content of the two databases by taking into account the relative importance of the individual documents in each database. This is achieved through a form of weighted Jensen-Shannon divergence, in which the weights reflect each document's sensitivity, as determined by an ad hoc classifier. Using search queries as an illustration, we show that SPI provides more reliable and consistent indications than sensitive-unaware information theoretic measures, both under a generic anonymization technique and with privacy-controlled query logs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
The amount of text data formed by `documents' associated with users (e.g., social network posts, search queries, medical notes, tweets, reviews, email messages) has been growing exponentially in recent years, giving rise to new opportunities and challenges. While the value of analyzing these data is widely recognized, their publication raises privacy concerns, both in terms of identity disclosure (i.e., when an attacker is able to match a document in a database to an individual) and attribute disclosure (i.e., when an attacker is able to find the sensitive documents, with or without reidentification). Even if we remove explicit identifiers (such as personal name, social security number, address), quasi-identifiers (such as zip code, gender, birthdate), and directly-sensitive items (such as marital status, national origin, salary, religion, sexual orientation, diseases) from a user's documents by using natural language processing techniques, it may still be possible to infer some of these attributes by combining other, seemingly irrelevant parts of the documents with external databases [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>To better protect the user's privacy while at the same time preserving the utility of the shared data, a number of sanitization methods for text data have been made available,<fn id="fn1"><label>1</label><p>In this paper, by sanitization we mean both the protection of identity (usually referred to as anonymization) and of private information.</p></fn> often as an extension of earlier privacy models for structured data. These include, among others, k-anonymity [<xref ref-type="bibr" rid="ref1">1</xref>], differential privacy [<xref ref-type="bibr" rid="ref12">12</xref>], and user clustering [<xref ref-type="bibr" rid="ref9">9</xref>]. The availability of several sanitization strategies, each enforcing certain privacy guarantees for a specific set of parameters and type of output, raises the question of their evaluation and comparison.</p>
      <p>
Previous work in the microdata field has focused on global ways to measure the changes in information content that follow data sanitization, without making specific assumptions about an attacker. Several information theoretic measures have been proposed, including mutual information [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Kullback-Leibler divergence [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Although a straightforward application of this approach to text data is possible [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we argue that it is not appropriate because the two domains are fundamentally different. In structured databases, the quasi-identifiers and sensitive attributes are known a priori, so that we can neglect the other attributes. In text databases, we have documents instead of attributes, and each document may be sensitive or act as a quasi-identifier. This suggests that, besides considering the changes in their probability distributions, we should estimate how individual documents affect the user's privacy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Sensitive-aware Privacy Index</title>
      <p>Given a database X containing text documents associated with a set of users N, and a sanitized version of X denoted by Y, let X_u = {d_u,1, d_u,2, ..., d_u,j} and Y_u = {d_u,1, d_u,2, ..., d_u,k} be the sets of documents associated with user u in, respectively, X and Y. We want to measure some difference between the set of a user's documents before and after sanitization, relating such a difference to the gain of privacy.</p>
      <p>
        One natural starting point [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is to compute the Kullback-Leibler divergence (KLD) of Y_u from X_u:
      </p>
      <disp-formula id="eq1">
        <tex-math><![CDATA[ \mathrm{KLD}(X_u \,\|\, Y_u) \;=\; \sum_{d} P(d \mid X_u) \,\log \frac{P(d \mid X_u)}{P(d \mid Y_u)} \qquad (1) ]]></tex-math>
      </disp-formula>
      <p>Intuitively, the larger the KLD value, the more difficult it is to find useful information to break the user's privacy. However, the use of KLD in our scenario does not come without problems. In order to compute Equation 1, we need to estimate P(d|X_u) and P(d|Y_u), i.e., the probability of the document d given the original and sanitized datasets. This problem is made difficult by the fact that there may be documents in X_u not present in Y_u (e.g., due to document suppression), as well as documents in Y_u not present in X_u (e.g., due to document perturbation). In particular, we cannot set P(d|Y_u) to zero for a document d that is in X_u but not in Y_u, because KLD(X_u||Y_u) is not defined in this (very common) case.</p>
      <p>To overcome this difficulty, rather than applying some smoothing procedure to KLD, we use the Jensen-Shannon divergence (JSD) between X_u and Y_u:</p>
      <disp-formula id="eq2">
        <tex-math><![CDATA[ \mathrm{JSD}(X_u \,\|\, Y_u) \;=\; \tfrac{1}{2}\,\mathrm{KLD}(X_u \,\|\, M_u) \;+\; \tfrac{1}{2}\,\mathrm{KLD}(Y_u \,\|\, M_u) \qquad (2) ]]></tex-math>
      </disp-formula>
      <p>where M_u = (X_u + Y_u)/2. Unlike KLD, JSD is always defined and (with base-2 logarithms) is bounded by 1.</p>
      <p>We next observe that in Equations 1 and 2 every document is treated in the same manner, whereas some documents are clearly more important than others for user identification or disclosure of confidential information. We assume that it is possible to estimate the sensitivity of a document d automatically by an ad hoc classifier, and denote by s_d the class-membership probability of the document.</p>
      <p>We can now use s_d to weight the contribution made by the individual documents to KLD (and JSD). A released document with a large s_d affects privacy negatively (i.e., the chances of a privacy breach increase), which means that the divergence should become smaller (compared to the value obtained by releasing a document with a lower s_d). The contribution of a single document should thus be inversely related to its s_d value. We set the weight of d (w_d) to one minus the probability that d belongs to the sensitive class: w_d = 1 - s_d. The weighted KLD (WKLD) is given by:</p>
      <disp-formula id="eq3">
        <tex-math><![CDATA[ \mathrm{WKLD}(w_d, X_u \,\|\, Y_u) \;=\; \sum_{d} w_d\, P(d \mid X_u) \,\log \frac{P(d \mid X_u)}{P(d \mid Y_u)} \qquad (3) ]]></tex-math>
      </disp-formula>
      <p>The Sensitive-aware Privacy Index (SPI) is then obtained by plugging WKLD into Equation 2, i.e., by computing the weighted Jensen-Shannon divergence. Note that when all the weights w_d are equal to 1 (i.e., if the probability that any document belongs to the sensitive class is equal to zero), SPI coincides with JSD. In general, the value of SPI will be smaller than that of JSD. Like JSD, SPI is bounded by 0 and 1. In particular, if X = Y, then SPI = 0.</p>
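      <p>For concreteness, the following minimal Python sketch (ours, not part of the original evaluation code) computes SPI for one user, assuming that P(d|X_u) and P(d|Y_u) are estimated by per-user frequency counts (as discussed in Section 4) and that a sensitivity score s_d is available for each document; with all scores at zero it reduces to plain JSD, which is therefore an upper bound on SPI.</p>
      <preformat>
import math
from collections import Counter

def spi(original, released, sensitivity, base=2.0):
    """Weighted Jensen-Shannon divergence between a user's original and
    released documents (Equations 2 and 3). `sensitivity` maps a document
    to its score s_d; documents not in the map default to s_d = 0."""
    px, py = Counter(original), Counter(released)
    nx, ny = sum(px.values()), sum(py.values())
    P = {d: c / nx for d, c in px.items()}
    Q = {d: c / ny for d, c in py.items()}
    M = {d: 0.5 * (P.get(d, 0.0) + Q.get(d, 0.0)) for d in set(P) | set(Q)}

    def wkld(p):
        # Equation 3, computed against the mixture M so that it is
        # always defined, even under document suppression.
        return sum((1.0 - sensitivity.get(d, 0.0))       # w_d = 1 - s_d
                   * pd * math.log(pd / M[d], base)
                   for d, pd in p.items() if pd > 0)

    return 0.5 * wkld(P) + 0.5 * wkld(Q)

queries = ["flu symptoms", "rome weather", "rome weather"]
scores = {"flu symptoms": 0.9, "rome weather": 0.1}
print(spi(queries, ["rome weather"], scores))  # SPI
print(spi(queries, ["rome weather"], {}))      # plain JSD, larger
      </preformat>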
    </sec>
    <sec id="sec-3">
      <title>Experiments with query logs</title>
      <p>
        For our experiments, we consider search query logs, a specific but important type of text data that has been the focus of much privacy research in the last ten years [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In all our experiments we used a subset of the well-known AOL dataset. It contained the queries associated with 10,000 `heavy' users, who were randomly selected among those who entered more than 44 queries, i.e., the average number of queries per AOL user. In this way we removed the users with very short profiles, who are not very interesting from the point of view of privacy. The number of random users (i.e., 10,000) was decided by experimenting with increasing samples, until the results stabilized.
      </p>
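      <p>As an illustration of this selection step, here is a minimal sketch (ours; the log is assumed to be an iterable of (user, query) pairs):</p>
      <preformat>
import random
from collections import Counter

def sample_heavy_users(log, sample_size=10000, min_queries=44, seed=0):
    """Randomly sample users who entered more than `min_queries` queries
    (44 being the average number of queries per AOL user)."""
    counts = Counter(user for user, _query in log)
    heavy = sorted(u for u, c in counts.items() if c > min_queries)
    return random.Random(seed).sample(heavy, min(sample_size, len(heavy)))
      </preformat>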
      <sec id="sec-3-1">
        <title>Training and testing SPI's sensitivity weights</title>
        <p>
In order to compute SPI for the query log data, we need to find s_d, i.e., the probability that any given search query is sensitive. As search queries are usually very short and do not contain repetitions, we trained a naive Bayes classifier using a bag-of-words model with binary features. As a training set, we used a subset of AOL queries that were manually labeled as sensitive or not sensitive, first introduced in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The classifier achieved 78% 10-fold cross-validation accuracy on the labeled data set. When we ran the classifier on the heavy AOL users, we found that about half of the queries were labeled as sensitive. Compared to an earlier experiment on the whole population of AOL users using a small training set [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], we achieved better classification accuracy and found fewer sensitive queries.
        </p>
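        <p>A minimal scikit-learn sketch of such a classifier follows (our reconstruction, not the original code; the toy queries and labels are purely illustrative stand-ins for the labeled set of [4]):</p>
        <preformat>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Illustrative stand-in for the manually labeled AOL subset:
# 1 = sensitive, 0 = not sensitive.
queries = ["cheap flights to rome", "divorce lawyer near me",
           "weather today", "hiv test anonymous",
           "pizza dough recipe", "bankruptcy filing help"]
labels = [0, 1, 0, 1, 0, 1]

clf = make_pipeline(
    CountVectorizer(binary=True),  # bag of words, binary features
    BernoulliNB(),                 # naive Bayes for binary features
)
clf.fit(queries, labels)

# s_d: the class-membership probability of the sensitive class,
# later turned into a weight via w_d = 1 - s_d (Equation 3).
print(clf.predict_proba(["std symptoms treatment"])[:, 1])
        </preformat>
        <p>On real data, the reported 78% figure would correspond to running, e.g., scikit-learn's cross_val_score(clf, queries, labels, cv=10) on the labeled set rather than on toy examples.</p>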
      </sec>
      <sec id="sec-3-2">
        <title>Evaluating SPI under k-anonymity</title>
        <p>
To evaluate the performance of SPI on sanitized databases, we generated a number of query logs under k-anonymization [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a simple privacy policy which requires that, for each released query, there are at least k-1 identical queries entered by other distinct users. Using k-anonymization, the level of privacy protection can be increased by choosing larger values of k, which results in the suppression of rare as well as relatively frequent queries, with only the most common queries being released.<fn id="fn2"><label>2</label><p>Note that this behavior is not specific to k-anonymization; it can be observed in most privacy-protection methods, including differential privacy and user clustering.</p></fn> We let k vary from 1 to 1000, thus progressively strengthening the privacy requirements, and generated the corresponding released logs for the heavy AOL users. Then, for each released log, we computed four measures: (i) SPI, (ii) JSD (i.e., the unweighted version of SPI), (iii) the number of released queries (also known as impressions), and (iv) the Profile Exposure Level (PEL), one of the few earlier privacy measures for text data of which we are aware, presented in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The definition of PEL is the following:
        </p>
        <disp-formula id="eq4">
          <tex-math><![CDATA[ \mathrm{PEL} \;=\; \frac{I(X;Y)}{H(X)} \cdot 100, \qquad I(X;Y) \;=\; \sum_{x,y} p(x \mid y)\, p(y) \,\log \frac{p(x \mid y)}{p(x)} \qquad (4) ]]></tex-math>
        </disp-formula>
        <p>The ratio between mutual information I and entropy H (known in statistics as the uncertainty coefficient) gives a measure of the information that Y provides about X, normalized with respect to the information content of X. To estimate p(x), p(y), and p(x|y) in Equation 4 we used the method described by the authors.</p>
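        <p>Since we do not reproduce the authors' estimation procedure here, the following sketch (ours) computes Equation 4 from an already-estimated table of (original, released) query pairs; how that joint distribution is obtained is method-specific and left to the caller.</p>
        <preformat>
import math
from collections import Counter

def pel(pairs, base=2.0):
    """Profile Exposure Level (Equation 4). `pairs` is an iterable of
    (x, y) co-occurrences of original query x and released query y."""
    joint = Counter(pairs)
    n = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    # I(X;Y) = sum_xy p(x,y) log [ p(x,y) / (p(x) p(y)) ],
    # equivalent to the p(x|y) p(y) form of Equation 4.
    mi = sum(c / n * math.log((c / n) / (px[x] / n * py[y] / n), base)
             for (x, y), c in joint.items())
    hx = -sum(c / n * math.log(c / n, base) for c in px.values())
    return 100.0 * mi / hx if hx > 0 else 0.0

# If Y fully determines X, PEL = 100 (maximum exposure).
print(pel([("a", "a"), ("b", "b"), ("a", "a")]))
        </preformat>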
        <p>In Figure 1 we show how the four measures varied as a function of k for the heavy AOL users. JSD grew monotonically as k increased, because larger values of k are associated with smaller subsets of released queries, thus increasing the divergence from the original log. We checked that the percentage of impressions, the value of k, and JSD were all highly correlated with one another, with pairwise Pearson's correlation coefficients greater than 0.85 in absolute value. Although JSD seems a more powerful privacy index than impressions and k, in the particular setting of our experiment they provided similar indications.</p>
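        <p>For reference, the release policy used to generate these logs can be sketched as a simple filter on the number of distinct users per query (a minimal interpretation of the requirement stated above; the actual anonymizer may differ in details):</p>
        <preformat>
from collections import defaultdict

def k_anonymize(log, k):
    """Release only the queries entered by at least k distinct users.
    `log` is an iterable of (user, query) pairs."""
    users_per_query = defaultdict(set)
    for user, query in log:
        users_per_query[query].add(user)
    return [(u, q) for u, q in log if len(users_per_query[q]) >= k]
        </preformat>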
        <p>The function SPI exhibited a different pattern and was weakly correlated with the former measures. While it generally grew as k increased, it also showed some notable oscillations, pointing to problems with the privacy content of the queries suppressed at step k. In fact, unlike JSD, SPI may decrease when we remove some documents from the released database. This happens if we remove documents with high weights (i.e., with low sensitivity) while keeping documents with low weights (i.e., with high sensitivity). Intuitively, in this case the privacy decreases because it may be easier to find harmful documents in the released database. Figure 1 also shows that the values of SPI were always lower than the corresponding values of JSD, consistent with the observations made in the preceding section.</p>
        <p>Turning to the behavior of PEL, we see that it remained nearly stable despite the large variations in the size of the released logs. In general, it slightly decreased as k grew, but it occasionally increased. It can be proved that the latter phenomenon happens when we remove queries that are more frequent in the user population than those released. In practice, this situation may be common. Think of a user who frequently enters some unpopular query of interest and less frequently a popular query: under most sanitization techniques, the less popular query has a greater chance of being removed because it may be more harmful to the user's privacy, but according to PEL this removal will instead increase the user's level of exposure.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluating SPI using privacy-specific data</title>
        <p>The second round of experiments leverages controlled privacy-related features. We focused on queries containing person or place names or private information (denoted as PPP queries), because this kind of information may be useful to breach the user's privacy. In order to extract such queries from the original search log we used three lists of proper nouns available on the web: one of about 150,000 surnames, one of 5,000 female and male names, and one of 200,000 populated places. We also collected a vocabulary of sensitive words from a search engine, using the Google sensitive ad categories as seed queries and extracting the most informative terms from the search results. The search log associated with the heavy AOL users was partitioned into two groups of queries: those matching our lists of words (i.e., the PPP queries, covering about 50% of the sample) and those that did not (i.e., the remaining no-PPP queries). Then we considered two query selection strategies: releasing only PPP queries and releasing only no-PPP queries. For each strategy, we generated nested sets of queries of increasing size, letting the number of released queries vary from 10% to 50% of the size of the original search log. We also used a third query selection strategy, namely k-anonymization. This can be seen as reverse-engineering the chart in Figure 1: for each value of impressions (in the range from 10% to 50%), we found the corresponding value of k and computed the set of queries associated with it. Finally, for each strategy we computed the SPI values for the sets of released queries having the desired sizes. The results are shown in Figure 2.</p>
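        <p>The PPP/no-PPP partition described above amounts to exact token matching against the collected word lists; a minimal sketch (ours) follows, with the lists assumed to be preloaded as lowercase sets:</p>
        <preformat>
def partition_log(queries, surnames, given_names, places, sensitive_vocab):
    """Split queries into PPP (person/place names or private terms)
    and no-PPP, by exact token matching (no lexical variations)."""
    vocab = surnames | given_names | places | sensitive_vocab
    ppp, no_ppp = [], []
    for q in queries:
        bucket = ppp if any(t in vocab for t in q.lower().split()) else no_ppp
        bucket.append(q)
    return ppp, no_ppp
        </preformat>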
        <p>The no-PPP strategy was a clear winner. Our simple method of removing queries matching lists of sensitive and identifying words, even without considering lexical variations and context, was better than using k-anonymization. This finding confirms that k-anonymization, in general, does not guarantee a good level of privacy protection, because there may be relatively popular sensitive queries that are released even for high values of k. The only-PPP strategy, as expected, clearly acted as a lower bound. Our privacy evaluation measure thus seemed to provide reliable and consistent indications.</p>
        <p>We also computed the analogous values of PEL. Its behavior was affected neither by the size of the released log nor by the type of query selection strategy, thus confirming that PEL is not very suitable as a privacy evaluation measure, at least when the released search log is a strict subset of the original search log. PEL is limited not only by its inability to find and score sensitive queries, but also by the difficulty of estimating the joint (or conditional) probabilities of queries in the original and released search logs in Equation 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and limitations</title>
      <p>
        One question raised by this approach is the scope of applicability of SPI. We have estimated the probabilities in Equation 1 using frequency counts at the document level (i.e., search queries) for each user. In this way, in addition to requiring that the association between documents and users should be preserved in the sanitized database (which holds for most sanitization methods), we have implicitly assumed that many original documents were left unchanged by sanitization. This assumption is met by various sanitization methods, not only by the k-anonymity family but also by methods based on classification [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and clustering [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and by recent forms of differential privacy [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, there are other privacy policies where this assumption would not always hold, e.g., due to systematic document perturbation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or generalization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In such cases, it seems that SPI can still be applied provided that we use more flexible methods to compare a user's documents before and after sanitization, e.g., by partial matching at the document level or by exact matching at the word level. This is left for future work.
      </p>
      <p>
        Another practical difficulty concerns the paucity of annotated natural language datasets for training the classifier of document sensitivity for the type of text data of interest. These datasets are difficult to acquire, although there are recent works that automate this process to some extent, e.g., for Quora posts [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        A final issue concerns the attack model. As SPI is intended to evaluate the level of protection offered by distinct privacy models, it does not make any specific assumption about the attacker. Its basic tenet is that any sanitization method for text data will result in the suppression or modification of leaked sensitive documents, and it measures the extent to which this has been achieved. This is a generic assumption that holds for most privacy models, usually implicitly but sometimes also explicitly, e.g., when an attacker's background knowledge is modeled in terms of machine learning [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or information retrieval techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We introduced SPI, a novel privacy index for text data that extends the classical information theoretic approach to include the sensitivity of individual documents. Initial experiments with query logs suggest that its indications are more reliable and consistent than those provided by existing measures. Future research directions include assessing the generalizability of SPI to other types of text data and sanitization methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. E. Adar. User 4xxxxx9: Anonymizing query logs. In WWW Workshop on Query Log Analysis, 2007.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. M. Bezzi. An information theoretic approach for privacy metrics. TDP, 3:199-215, 2010.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. J. A. Biega, K. P. Gummadi, I. Mele, D. Milchevski, C. Tryfonopoulos, and G. Weikum. R-Susceptibility: An IR-Centric Approach to Assessing Privacy Risks for Users in Online Communities. In SIGIR'16, pages 365-374, 2016.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. C. Carpineto and G. Romano. K-Affinity Privacy: Releasing Infrequent Query Refinements Safely. IP&amp;M, 51:74-88, 2015.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. C. Carpineto and G. Romano. A Review of Ten Year Research on Query Log Privacy. In Proceedings of the 7th Italian Information Retrieval Workshop, 2016.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Empirical privacy and empirical utility of anonymized data. In 29th ICDEW, pages 77-82, 2013.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. A. Erola, J. Castella-Roca, A. Viejo, and J. Mateo-Sanz. Exploiting social networks to provide privacy in personalized web search. J. Syst. Software, 84(10):1734-1745, 2012.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Y. Fang, A. Godavarthy, and H. Lu. A Utility Maximization Framework for Privacy Preservation of User Generated Content. In ICTIR'16, pages 281-290, 2016.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Y. He and J. F. Naughton. Anonymization of Set-Valued Data via Top-Down, Local Generalization. In VLDB'09, pages 934-945, 2009.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Y. Hong, J. Vaidya, H. Lu, and M. Wu. Differentially private search log sanitization with optimal output utility. In 15th EDBT, pages 50-61, 2012.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. R. Jones, R. Kumar, B. Pang, and A. Tomkins. `I know what you did last summer': query logs and user privacy. In CIKM'07, pages 909-914, 2007.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. Releasing search queries and clicks privately. In WWW'09, pages 171-180, 2009.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. B. Li, Y. Vorobeychik, M. Li, and B. Malin. Iterative Classification for Sanitizing Large-Scale Datasets. In ICDM'15, pages 841-846, 2015.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. G. Navarro-Arribas, V. Torra, A. Erola, and J. Castella-Roca. User k-anonymity for privacy preserving data mining of query logs. IP&amp;M, 48(3):476-487, 2012.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. S. T. Peddinti, A. Korolova, E. Bursztein, and G. Sampemane. Cloak and Swagger: Understanding Data Sensitivity Through the Lens of User Anonymity. In S&amp;P'14, pages 493-508, 2014.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. S. T. Peddinti and N. Saxena. On the Privacy of Web Search Based on Query Obfuscation: A Case Study of TrackMeNot. In PETS'10, pages 19-37, 2010.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>