=Paper= {{Paper |id=Vol-2335/paper1 |storemode=property |title=Privacy-Preserving Active Learning on Sensitive Data for User Intent Classification |pdfUrl=https://ceur-ws.org/Vol-2335/1st_PAL_paper_1.pdf |volume=Vol-2335 |authors=Oluwaseyi Feyisetan,Thomas Drake,Borja Balle,Tom Diethe }} ==Privacy-Preserving Active Learning on Sensitive Data for User Intent Classification== https://ceur-ws.org/Vol-2335/1st_PAL_paper_1.pdf
Privacy-preserving Active Learning on Sensitive Data for User Intent Classification

Oluwaseyi Feyisetan (Amazon Research, sey@amazon.com)
Thomas Drake (Amazon Research, draket@amazon.com)
Borja Balle (Amazon Research, pigem@amazon.co.uk)
Tom Diethe (Amazon Research, tdiethe@amazon.co.uk)

Abstract

Active learning holds the promise of significantly reducing data annotation costs while maintaining reasonable model performance. However, it requires sending data to annotators for labeling. This presents a possible privacy leak when the training set includes sensitive user data. In this paper, we describe an approach for carrying out privacy-preserving active learning with quantifiable guarantees. We evaluate our approach by showing the tradeoff between privacy, utility and annotation budget on a binary classification task in an active learning setting.

Introduction

Preserving data privacy is an essential tenet required to maintain the bond of trust between consumers and corporations. Consumers expect their data to remain secure while being used to design better services for them without compromising their identities – especially while carrying out sensitive transactions and interactions. We define these potentially compromising and personally identifiable data as sensitive data. Annotated data drives the machine learning economy, and sensitive data holds the key to building richer experiences for users interacting with modern AI interfaces. However, in a bid to get annotations, sensitive data in the wrong hands could lead to irreparable damage to the reputation of, and trust between, data holders and their users.

This potential data transfer deserves greater monitoring in the era of human-powered crowdsourcing and active learning. As niche classification tasks arise to power new applications, they often lack an abundance of pre-annotated datasets. With active learning, the learner can select a subset of the available data points to be annotated. This can, in some cases, exponentially reduce the number of training queries required (Settles 2010). However, the cost of labelling machine learning datasets (Dasgupta 2011; Arora, Nyberg, and Rosé 2009; Settles 2010) is traditionally viewed as a function of the expert, time or price.

In this paper, we argue that, for non-public datasets, the cost of learning the true labels should also factor in the privacy of the information contributed by the data owners to the data custodians. As a result, the active learning condition (beyond simply selecting the best examples) becomes two-fold when submitting a batch of data for annotation: (1) labeling this selected subset leads to the greatest increase in our machine learning model's performance; (2) the probability of revealing any query that can uniquely identify a specific user is very small (and quantifiable by a privacy parameter).

Contributions  Recent studies into privacy and machine learning have focused on preventing model parameters from leaking training data (Papernot et al. 2016; Hamm, Cao, and Belkin 2016); see (Ji, Lipton, and Elkan 2014) for a recent survey. In this paper, however, we address the privacy-preserving requirement from the point of view of the training samples that are sent to annotators from an active learning model. To the best of our knowledge, this is the first paper that views preserving privacy in machine learning from this angle. We also describe how techniques such as k-anonymity do not provide sufficient privacy guarantees and how this can be improved using differential privacy (DP). We describe how to do this by providing experimental results after discussing an approach that leverages a DP algorithm from the literature.

Accepted at Privacy-Enhancing Artificial Intelligence and Language Technologies, PAL 2019, March 2019, Palo Alto, CA, USA.

Background

In this section, we present an introduction to active learning and the privacy challenge of outsourcing queries to the crowd. We then describe k-anonymity, its shortcoming in providing an adequate privacy model for active learning, and how this can be improved with differential privacy.

Active Learning

The central premise of active learning is that a model is able to perform as well with less data if a learner can select the training examples that provide the highest information (Settles 2010). Formally described, using a classification task: let D be a distribution over X × Y, where the goal is to output a label y from the label space Y = {±1} given an input from the feature space X. The learner receives a batch of i.i.d. draws (x1, y1), ..., (xn, yn) from the unknown underlying distribution D. The value of yi is unknown unless an annotation request is made by the learner. The objective is to select a hypothesis function h : X → Y, where err(h) = P(x,y)∼D[h(x) ≠ y] is small. Given that H is the space of
all hypotheses, and h* = argmin{err(h) : h ∈ H} is the hypothesis with minimum error, the aim of active learning is to select a hypothesis h ∈ H with error err(h) within reasonable bounds of err(h*) by using few annotation requests (i.e., few compared to a passive learner).

Various strategies have been proposed to implement an active learner. One is uncertainty sampling (Lewis and Gale 1994), which attempts to select the query x′ that the model is least convinced about; i.e., x′ = argmax_x 1 − P_θ(ŷ | x), where ŷ is the label with the highest posterior for model θ and x is maximized over the range of all the unlabeled examples in the training pool. Other approaches to uncertainty sampling use either the margin between the two most probable classes ŷ1 and ŷ2, i.e., x′ = argmin_x P_θ(ŷ1 | x) − P_θ(ŷ2 | x), or a general entropy-based uncertainty over all the possible ŷi classes, i.e., x′ = argmax_x −Σ_i P_θ(ŷi | x) log P_θ(ŷi | x).

The main privacy issue with active learning stems from the need to scale the annotation process by crowdsourcing the labels via an open call (Howe 2006). Whenever you make a request to an external resource, you pay a privacy cost by transmitting the information to be annotated. This problem is compounded when there is only one oracle (Avidan and Butman 2007) or collusion among crowd workers. In this paper, we describe privacy notions that can be used to address these concerns along the privacy–utility tradeoff spectrum.

Privacy-preserving machine learning

k-Anonymity  At first glance, a straightforward approach for addressing the privacy concerns of active learning could be through k-anonymity (Sweeney 2002; Di Castro et al. 2016); i.e., ensuring each query that is sent out for crowdsourcing occurs at least k times. In deploying k-anonymity, the first step involves identifying a set of quasi-identifiers. In our context, these are user queries which can potentially be combined with an externally available dataset to uniquely identify a user. The frequency set of these quasi-identifiers represents the number of occurrences in the dataset. We therefore say that a dataset satisfies k-anonymity relative to the quasi-identifiers if, when it is projected on an external dataset, the frequency set occurs greater than or equal to k times.

To achieve k-anonymity when the size of the frequency set is less than the desired k, the attributes are anonymized by either generalizing or suppressing the information. For example, marital status attributes listed as married, divorced or widowed are generalized as once married, while ethnicity is redacted as *****.

Despite its promise, k-anonymity has fundamental challenges, some of which are exacerbated by our unstructured data domain. First, (Aggarwal 2005) demonstrated that k-anonymity suffers from the curse of dimensionality: generalization (such as with traditional database columns) requires co-occurrence of words across different examples, but unstructured data such as text phrases tend to follow a heavy-tailed distribution with low co-occurrence of words. Secondly, the choice of quasi-identifiers might exclude some useful sensitive attributes which could then be used for re-identification attacks. This led to other approaches such as l-diversity (Machanavajjhala et al. 2006) and t-closeness (Li, Li, and Venkatasubramanian 2007) to handle sensitive attributes. In our implementation, we subsume the quasi-identifiers to include the entire user query.

Therefore, by 'hiding in the crowd' of k, a user has received some assurance from k-anonymity that their sensitive query will not be outsourced from the active learning model unless it passes a meaningful threshold. However, stronger formal privacy guarantees are required to demonstrate that, given the user's query, an attacker cannot decide where it came from with certainty. With k-anonymity, we are unable to directly quantify a privacy loss value, nor state the bounds of the guarantee of this loss. These two quantities are obtainable from a differential privacy model, which we now describe.

Differential Privacy  To motivate our discourse on why we need stronger privacy guarantees than k-anonymity provides, we consider a hypothetical scenario: would a user be comfortable asking an AI agent a sensitive question, knowing that the question may be used to further train the agent's learning model? We denote the training data available to the model before the user submission as D, and the data after the user question as D′. These are adjacent datasets differing in only one record. We posit that a user c will be comfortable if (1) Q(D) = Q(D′), where Q is a query over the dataset; and (2) P(S(c) | D′) = P(S(c)), where S is a user secret. These two points are articulated in Dalenius's Desideratum (Dwork 2011) that:

    Anything that can be learned about a respondent from the statistical database should be learnable without access to the database.

However, we cannot make these exact guarantees, because datasets are meant to convey information, and they would have no utility if these points were true.

What differential privacy (Dwork 2011; Dwork and Roth 2014) offers is a strong privacy guarantee on adjacent datasets (taking our AI agent example): the example selected for active learning will be very similar whether or not the user added their sensitive question. This means an adversarial annotator receiving a random training query cannot guess with certainty whether the query was from dataset D (which does not include the user's query) or D′ (which includes it).

With this, we state that a randomized algorithm M : N^X → C that receives as input a dataset D with records from a universe X and outputs an element from C is (ε, δ)-differentially private if for every pair of databases D and D′ differing in one record, and every possible set of outputs C ⊆ C, we have

    Pr[M(D) ∈ C] ≤ e^ε Pr[M(D′) ∈ C] + δ        (1)

The parameter δ accounts for a δ < 1/‖D‖₁ relaxed chance of the ε guarantee not holding – otherwise, it would be equivalent to just selecting a random sample on the order of the size of the dataset. One benefit of the differential privacy model is that it has a quantifiable, non-binary value for privacy loss, which helps in deciding to comparatively select one algorithm over another. Consider an output of the randomized algorithm C ∼ M(D).
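To make the guarantee of Definition (1) concrete, the following minimal sketch checks it numerically for randomized response on a single sensitive bit. This is a standard textbook mechanism used purely as an illustration, not the mechanism used in this paper; the function names are our own.

```python
import math

def rr_probs(bit, eps):
    """Randomized response: output distribution over {0, 1} for one
    sensitive bit; reports the truth with probability e^eps / (1 + e^eps),
    which satisfies (eps, 0)-differential privacy by construction."""
    p_true = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_true, 1 - bit: 1.0 - p_true}

def max_privacy_loss(eps):
    """Worst-case log-ratio of output probabilities between the two
    adjacent inputs (bit = 0 vs bit = 1); per Eq. (1) with delta = 0
    this should not exceed eps."""
    d0, d1 = rr_probs(0, eps), rr_probs(1, eps)
    return max(abs(math.log(d0[c] / d1[c])) for c in (0, 1))

# The realized worst-case loss exactly meets the epsilon budget.
print(max_privacy_loss(1.0))  # → 1.0 (up to floating-point error)
```

For this mechanism the bound in (1) is tight: the attacker's best-case evidence ratio equals e^ε, and no more.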
If we believe that C was more likely produced by D than by D′, then the privacy loss from the query that yields C on an auxiliary input x is defined as:

    L(C; M, x, D, D′) = ln( Pr[M(D) = C] / Pr[M(D′) = C] )        (2)

So we surmise that differential privacy promises to prevent a user from sustaining additional damage by including their data in a dataset; and the privacy loss obtained is ε with probability 1 − δ.

A common method for making the results of a statistical query differentially private involves adding Laplacian noise proportional to either the query's global sensitivity (Dwork 2008; Dwork et al. 2006) or the smooth bound of the local sensitivity (Nissim, Raskhodnikova, and Smith 2007) (where the sensitivity is Δf = max ‖f(D) − f(D′)‖). However, for non-continuous domains, adding noise can have unintended consequences that completely wipe out the utility of the results; e.g., (Dwork and Roth 2014) describe how attempting to add noise to the query for the optimal price of an auction could drive the revenue to zero.

Research has, however, shown that apart from providing reasonable and well-understood protection from inadvertent exposure (Di Castro et al. 2016), k-anonymity can also be used as a launchpad for achieving quantifiable differential privacy without the utility loss that comes from applying noise (Li, Qardaji, and Su 2012; Soria-Comas et al. 2014).

Privacy Preserving Active Learning Framework

This section introduces our proposed framework for carrying out active learning with privacy guarantees on queries that are sent to an external oracle. It presents the task we try our approach on, highlights the considerations that drive our choices, and lays out high-level pseudo-code of our approach.

Task model

Our task consists of a very large dataset of user queries U = {u1, ..., un} that represent the user intent (we map the queries to the intent and do not extract specific quasi-identifiers, in order to prevent leakages from un-captured sensitive attributes). Our pipeline consists of an active learning model which learns a binary classifier, predicting whether a user intent belongs to a specified class or not. The model is bootstrapped with a golden set of user queries and their associated intents. Subsequent queries from a fixed pool are added to a RANKED EXAMPLE POOL, where they are ordered by confidence/uncertainty (Gal and Ghahramani 2016) from our deep learning model.

To train the model, it first draws on the golden set; then we make a next-example call to draw an uncertain query from the pool, with the criterion that knowing the accurate intent of this query gives the best performance increase to the model while preserving privacy. The query is then outsourced to external annotators and the annotated labels are re-incorporated into the model training process.

Considerations

Given the size and projected scale of our dataset (≈ 10^9 queries), we decided to employ randomized probabilistic algorithms to estimate whether a query satisfies k-anonymity. Compute and memory resources are thus freed up for training and retraining the model rather than maintaining the frequency and cardinality of incoming queries. Each algorithm (detailed below) is adjusted to prevent over-estimations, which could erode the privacy guarantees. Furthermore, after a query is presumed to satisfy k-anonymity, only 1 of the k queries is sent to the n external annotators, to prevent an aggregation of privacy losses.

Approach

In this paper, we adopt the differential privacy algorithm from (Li, Qardaji, and Su 2012), but we utilize it in an active learning setting to select a subset of training examples to send for crowdsourcing. We also note that other DP methods designed for search logs that include a form of k parameter aggregation, such as (Korolova et al. 2009), ZEALOUS (Gotz et al. 2012) and SafeLog (Zhang et al. 2016), could be implemented to obtain similar results.

We take a two-step approach to extend k-anonymity to yield a quantifiable differentially private active learning model, taking a cue from how (Li, Qardaji, and Su 2012) demonstrated the use of pre-sampling to achieve differential privacy with k-anonymity. This is predicated on THEOREM 1 from (Li, Qardaji, and Su 2012), which states that: given an algorithm M which satisfies (β1, ε1, δ1)-differential privacy under sampling, then M also satisfies (β2, ε2, δ2)-differential privacy under sampling for any β2 < β1, where

    ε2 = ln( 1 + (β2/β1)(e^ε1 − 1) ) ;    δ2 = (β2/β1) δ1        (3)

Therefore, k-anonymity on our full dataset (i.e., β1 = 1) can instead be preceded by a mechanism that samples each row of its input with probability β2, with k-anonymity then applied to the resulting sub-sample to yield (ε2, δ2)-differential privacy for ε2 = ln(1 + β2(e^ε1 − 1)) within the bounds δ2 = β2 δ1. Thus the effect of sampling serves to amplify pre-existing privacy guarantees (Balle, Barthe, and Gaboardi 2018).

Furthermore, we harden our k-anonymity to offer 'safe' k-anonymization by aggregating the queries by frequency rather than using a distance-based measure (LeFevre, DeWitt, and Ramakrishnan 2006). The benefit we get from this is that no query within our set of k contains any extraneous sensitive text which could be used as a source of re-identification or to carry out reconstruction attacks.

The next sections describe how we carry out our sampling to ensure we select useful candidates in an efficient manner, and how we estimate k-anonymity using the queries.

Efficient sub-sampling for active learning

Given a multiset of query sets M = {U1, ..., Us} with repetitions, where a given Ui is a tuple ⟨u1, ..., uk⟩, and a sampling rate β, our objective is to return a sub-sample from which to carry out k-anonymization before training our active learner.
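The pre-sampling step and the amplification bound of Eq. (3) can be sketched as follows. This is a minimal illustration under the assumption β1 = 1 and with our own function names, not the paper's implementation:

```python
import math
import random

def amplified_eps(eps1, beta2, beta1=1.0):
    """Privacy amplification by subsampling, per Eq. (3):
    eps2 = ln(1 + (beta2/beta1) * (e^eps1 - 1)); smaller beta2
    yields a strictly tighter epsilon."""
    return math.log(1.0 + (beta2 / beta1) * (math.exp(eps1) - 1.0))

def bernoulli_subsample(rows, beta, seed=0):
    """Keep each row independently with probability beta -- the
    row-wise pre-sampling applied before the k-anonymity check."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < beta]

# Subsampling strictly tightens the guarantee for beta2 < 1.
print(amplified_eps(1.0, 0.1))  # ≈ 0.159, well below the original eps1 = 1.0
```

Note that for small β2 the amplified budget behaves like ε2 ≈ β2(e^ε1 − 1), so aggressive subsampling buys privacy roughly linearly, at the cost of the utility explored in the experiments below.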
Let n be the number of distinct query sets in {U1, ..., Us}, with elements {e1, ..., en}. For a very large dataset size s, we seek to estimate n̂ using only m registers, where m << n. The number of distinct queries in our sample set therefore becomes n̂.

To estimate the cardinality n̂, we utilize the HYPERLOGLOG algorithm of (Flajolet et al. 2007). HYPERLOGLOG is a probabilistic cardinality estimator that uses a very small memory footprint (≈ 12 kB per key) for a low standard error (≈ 0.81%) while scaling up to dataset sizes as large as 2^64 items.¹

For each incoming Ui : ⟨u1, ..., uk⟩, a hash h(Ui) is computed and converted to base 2. The b least significant bits are used to identify the register location to modify, where 2^b = m (i.e., b = log2 m). With the remaining bits w, a count p(w) is made of the number of running 0s up to the leftmost 1. For a very large, uniformly distributed multiset of random numbers, 2 raised to the maximum value of p(w) gives a wide approximate of the cardinality. To correct this, HYPERLOGLOG breaks the multiset into subsets and uses the harmonic mean of the subsets.

¹ Values taken from the Redis implementation of HYPERLOGLOG – http://antirez.com/news/75

After determining our sample size, the next step is to draw a random set of unique samples without replacement, up to n̂. We keep each element in the dataset with probability β. The ensuing sub-sample represents the new dataset in our RANKED EXAMPLE POOL, from which we will carry out our k-anonymization.

Algorithm 1: Privacy-preserving Active Learning

 1  Let β be the sampling rate            // β = β2 from (3)
 2  Let k be the anonymity parameter
 3  Let l be the number of samples for the variance calculation
    Data: input multiset of samples x with unknown labels y as U = {x1, y1}, ..., {xn, yn}
    Result: U′, the filtered private multiset
 4
 5  Bootstrap AL probabilistic model Pθ with labeled utterances L
 6  Retrieve random sub-sample U′ of size n̂
 7  Pool creation: add queries to the RANKED EXAMPLE POOL
 8  for {x, y} ∈ U′ do
 9      retrieve freq(x);
10      if freq(x) ≥ k then
11          retrieve variance σx = var(Pθ(ŷ|x) : 1..l) on x;   // computed using l draws
12          add ⟨x, σx⟩ to the RANKED EXAMPLE POOL;
13      else
14          remove {x, y} from U′;
15      end
16  end
17
18  Acquire labels for the top examples sorted by σx descending as {ŷ1, ..., ŷn̂}
19  Set U′ to be (x1, ŷ1), ..., (xn̂, ŷn̂)
20  Update model: draw the next training example from U′ over σx
21      get x̂ = argmax_x 1 − Pθ(ŷ|x);   // the sample we are least confident of
22      retrain the learning model using {x̂, ŷ}
23
24  return U′

Estimating k-anonymity using query frequency

Given a multiset of query sets M = {U1, ..., Us} with repetitions, such that the frequency of Ui is f_Ui and Ui is a tuple ⟨u1, ..., uk⟩, for a very large dataset size s we seek to estimate f̂_Ui using sub-linear space. To estimate the query frequency, we use the COUNT-MEAN-MIN sketch algorithm with conservative update (Goyal, Daumé III, and Cormode 2012), which is an improvement on the COUNT-MIN sketch algorithm proposed by (Cormode and Muthukrishnan 2005). For each incoming Ui : ⟨q1, ..., qk⟩, d different hashes of the queries are computed, and a counter indexed by each hashed result is incremented. To return the frequency, the minimum over all d index locations for Ui is returned. To further reduce the potential for error from over-estimation, conservative updates are employed to increment only the minimum counter among the d indexes, and an estimated noise term is further deducted from the result.

Therefore, after the initial pre-sampling step, we select only queries which occur at least k times. These queries are then added to the RANKED EXAMPLE POOL, where the next example is drawn based on the element with the highest uncertainty measure. The benefit of using the frequency to satisfy k-anonymity, rather than using partitioning, clustering and recoding, or distance-based algorithms, is to prevent attacks that arise from an attacker's a-priori knowledge of a dataset. For example, a cluster of k with one sensitive or extreme outlier (e.g., a cluster of incomes within a zip code with one ultra-high-net-worth outlier) becomes easily identifiable by an attacker, even though the aggregation was based on nearest neighbors.

Experiments

Our work seeks to demonstrate quantifiable privacy-preserving guarantees in an active learning setting by taking a pre-sampling approach before carrying out k-anonymization. We evaluate our approach on an internal dataset used for intent classification on voice devices.

Datasets

The Intent Classifier dataset consists of a subset of queries from February 2018. The dataset is used to train a model which determines a binary intent for a user. The dataset consists of 2.5M queries comprising 58K distinct data points. Each record contains a user query and a label indicating whether it is categorized as a POSITIVE or NEGATIVE intent query. Part of the dataset has also been previously discussed and described by (Yang et al. 2018).
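As an illustration of the frequency test from the previous section, here is a minimal count-min-style sketch with conservative update. The depth d, the width, the example queries, and the threshold k = 20 are all illustrative choices, not the paper's configuration, and for brevity this sketch omits the estimated-noise deduction of the full COUNT-MEAN-MIN variant:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch with conservative update; estimates never
    under-count, and the conservative update curbs over-counting."""

    def __init__(self, d=4, width=2048):
        self.d, self.width = d, width
        self.tables = [[0] * width for _ in range(d)]

    def _indexes(self, query):
        # One independent-ish hash per row, derived by salting SHA-256.
        for i in range(self.d):
            h = hashlib.sha256(f"{i}:{query}".encode()).hexdigest()
            yield i, int(h, 16) % self.width

    def add(self, query):
        cells = list(self._indexes(query))
        low = min(self.tables[i][j] for i, j in cells)
        # Conservative update: only bump counters currently at the minimum,
        # which reduces over-estimation caused by hash collisions.
        for i, j in cells:
            if self.tables[i][j] == low:
                self.tables[i][j] = low + 1

    def estimate(self, query):
        return min(self.tables[i][j] for i, j in self._indexes(query))

cms = CountMinSketch()
for _ in range(25):
    cms.add("play some music")        # a frequent, anonymous query
cms.add("rare sensitive query")       # a tail query seen once

k = 20                                # illustrative anonymity threshold
print(cms.estimate("play some music") >= k)      # True: eligible to leave
print(cms.estimate("rare sensitive query") >= k) # False: withheld
```

Only queries whose estimated frequency clears k would be forwarded to the ranked example pool; tail queries stay inside the data custodian's boundary.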
nature of the dataset with a histogram and plot of the fre-
quency distribution of the queries. As expected with textual
data, there is a long tail of queries which were observed just
once (making up ⇡ 60% of the dataset). The dataset consists
of 63% of queries labelled as P OSITIVE intents vs 27% being
N EGATIVE.

Experiment setup
                                                                      (a) Distribution of confidence   (b) Distribution of uncertainty
The experiment task was binary intent classification in an            scores for each unique query     scores for each unique query
active learning setting. We created a new baseline model
which predicts P OSITIVE and N EGATIVE intents. For the
experiments, the model was initially bootstrapped with 1, 000
labeled examples. The active learner then queries a data pool
to get a batch of additional training examples to improve the
model. The active learning strategy was uncertainty sampling
based on confidence scores.
   The confidence and uncertainty scores for the active
learning model were obtained from a Bayesian deep learning
model described in (Yang et al. 2018), where model
uncertainty, quantified by Shannon entropy, is U(x) =
−Σ_c ((1/T) Σ_t p̂_c) log((1/T) Σ_t p̂_c), and (1/T) Σ_t p̂_c is the averaged
predicted probability of class c for x, sampled T times by
Monte Carlo dropout. A histogram of the confidence and
uncertainty scores can be seen in Figures 1a and 1b.
   We simulated the probability of the crowd annotators returning
the correct answers to the requested queries by drawing
from a normal distribution with mean 0.65 and standard
deviation 0.01 (see Figure 2(a) of (Yang et al. 2018) for more).

Figure 1: A view into the training dataset: (a) confidence scores; (b) uncertainty scores; (c) histogram of frequency distribution; (d) log frequency distribution of queries

Evaluation metrics
To evaluate our results, we compared the annotation accuracy
between the baseline model and the models trained with
active learning and our privacy-preserving model. We vary
the sub-sampling parameter β and the anonymization factor
k while training our model and recording its accuracy. We
set the evaluation data at 5,000 samples (i.e., about 10%
of the dataset). We also provide privacy guarantee values
from numerical computations of ε and δ and highlight in the
appendix what values of k and β provide those levels of
guarantees.

Baseline condition Train a standard classification model.
Sub-sampling parameter β = 1, anonymization factor k = 1,
i.e., using the entire dataset.

Experiment conditions Train the classification model using
privacy-preserving active learning. Sub-sampling parameter
varied at β = {0.1, 0.3, 0.6, 0.9}, anonymization factor
varied at k = {1, 20, 100, 200, 500}.

Results
The results of our experiments are presented in Figures 2, 3
and 4. Our findings provide insight into the tradeoffs between
privacy, utility and our annotation budget.

Privacy vs Utility Tradeoff
Figure 2 highlights the privacy–utility tradeoff which occurs
as a result of varying β and k. As expected, as the value
of k gets smaller, i.e., by selecting more items in the tail
of the dataset, we are able to improve the accuracy of our
model. This, however, has the effect of degrading our privacy
guarantees. Similarly, by providing privacy amplification by
subsampling, the utility of our model suffers. Figure 2 paints a
holistic picture of this by showing how, by tuning the values
of β and k, we can arrive at the same values of accuracy.

Figure 2: Accuracy at different k values

Annotation budget
Figure 3 describes how our annotation budget changes for
different privacy settings. With a stronger privacy model,
we incur less cost as a result of fewer annotation requests.
By reading across the graph, we also discover that the same
budget can be realized from different privacy configurations:
e.g., by subsampling with β = 0.1 and selecting k = 20, we
incur the same budget as β = 0.3 and k = 100 and, therefore,
the same accuracy (from Figure 2 above).

Figure 3: Budget at different k values

Budget vs Accuracy
We established from Figure 2 and Figure 3 the relationship
between privacy and accuracy, and between privacy and our
annotation budget. Since we can obtain the same level of
accuracy and budget requirements from different parameter
values, Figure 4 highlights how an increase in budget affects
our overall model accuracy. Increasing the budget initially
accelerates the improvement of our model; however, the utility
gains quickly slow down. For example, after 30,000 labels,
we do not see any significant increase in model accuracy.

Figure 4: Budget vs Accuracy

   These results can serve as a guideline in selecting appropriate
privacy parameters for different annotation budgets
in a way that is more representative of the dataset. For example,
for a fixed annotation budget, you can reduce β to
select more data points from the tail of the dataset (i.e., a
smaller k). This variation can also be done by starting with a
target accuracy score and varying β and k. The results also
demonstrate that by sacrificing some utility gains, we can
make stronger privacy guarantees and reduce our annotation
budget when carrying out active learning.

Conclusion
We now briefly revisit our results in the light of our hypothesis.
We also discuss the limitations of our process, its implications
for the broader discourse on privacy and machine
learning, and conclude with future work.
   We apply the approach from (Li, Qardaji, and Su 2012) to
offer privacy guarantees when training models with active
learning, which requires sending unlabelled examples to an
external oracle. Our results join the conversation on differential
privacy and machine learning (Ji, Lipton, and Elkan
2014), with particular reference to preserving the privacy of
users.
   Our results show that by taking a small performance hit,
we can achieve similar accuracy scores with a smaller annotation
budget and a stronger privacy guarantee. One limitation,
however, is that we have only reported results on a binary
classification task. We are currently expanding our approach
by designing a new algorithm for differential privacy on text.
We show that the accuracy loss increases as task complexity
increases. Therefore, if we were to apply the approach in this
work to other NLP tasks, e.g., multi-class classification or
question answering, we would expect the accuracy loss to be
greater.
   Another limitation of our approach, and potentially of other
k-parameter-based approaches ((Korolova et al. 2009); (Gotz
et al. 2012); (Zhang et al. 2016)) to differential privacy for
text, is that it will not work for tasks where almost all the data
is unique, i.e., k is essentially 1 (e.g., datasets of emails or
movie reviews). Therefore, a different approach is needed to
provide quantifiable privacy guarantees without resorting to
k-anonymity.
   We believe that this is an area worthy of further research
in order to further quantify the true cost of privacy in crowdsourcing
and machine learning. We have already begun further
work to address two of the limitations reported in this
current paper.

Appendix
Table 1 lays out a grid of ε, δ scores and the corresponding
sampling parameter β and anonymization factor k required
to satisfy that level of (ε, δ)-differential privacy. The shaded
region presents a β × k high-level view of how to achieve a
desired level of privacy.
   A few insights can be gleaned from the results, the most
obvious being that strong privacy requirements (indicated
by small ε and δ scores as we traverse the table towards the
bottom left corner) require a higher anonymization factor
k and a smaller sampling rate β. This is also observed by the
fact that lowering the k factor only preserves privacy at the
highest displayed ε value of 1.0 and δ = 1 × 10⁻⁶ (at the top
right corner of the table).
   Observing δ individually, we also note that, when k and
ε are at fixed values, as δ decreases, i.e., to obtain stronger
privacy guarantees, we need to lower the sampling rate β.
This indicates that as β decreases, privacy guarantees increase.
Similarly, observing k by fixing the values of δ and ε demonstrates
that increasing k improves our privacy guarantees.
Table 1: Selected values of ε against δ: entries show the accuracy scores from our experiments for the β = {0.1, 0.3, 0.6, 0.9} and k = {20, 100, 200, 500} settings that provide the (ε, δ)-differential privacy guarantee. In each cell, accuracies are listed for β = 0.1, 0.3, 0.6, 0.9 in order; a cell is blank ("—") where no β satisfies the guarantee.

                            ε = 0.25            ε = 0.5               ε = 0.75              ε = 1.0
δ = 1×10⁻⁶    k = 20        —                   68.3                  68.3                  68.3  70.3
              k = 100       64.7  68.7          64.7  68.7            64.7  68.7  69.8      64.7  68.7  69.8
              k = 200       61.6  66.8          61.6  66.8  68.6      61.6  66.8  68.6      61.6  66.8  68.6
              k = 500       61.5  61.6  67.3    61.5  61.6  67.3      61.5  61.6  67.3      61.5  61.6  67.3  67.6
δ = 1×10⁻⁹    k = 20        —                   68.3                  68.3                  68.3
              k = 100       64.7                64.7  68.7            64.7  68.7            64.7  68.7  69.8
              k = 200       61.6  66.8          61.6  66.8            61.6  66.8  68.6      61.6  66.8  68.6
              k = 500       61.5  61.6          61.5  61.6  67.3      61.5  61.6  67.3      61.5  61.6  67.3
δ = 1×10⁻¹²   k = 20        —                   —                     68.3                  68.3
              k = 100       64.7                64.7  68.7            64.7  68.7            64.7  68.7
              k = 200       61.6                61.6  66.8            61.6  66.8  68.6      61.6  66.8  68.6
              k = 500       61.5  61.6          61.5  61.6  67.3      61.5  61.6  67.3      61.5  61.6  67.3
δ = 1×10⁻¹⁵   k = 20        —                   —                     —                     —
              k = 100       64.7                64.7                  64.7  68.7            64.7  68.7
              k = 200       61.6                61.6  66.8            61.6  66.8            61.6  66.8  68.6
              k = 500       61.5  61.6          61.5  61.6  67.3      61.5  61.6  67.3      61.5  61.6  67.3
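The sample-then-prune selection evaluated in Table 1 can be sketched as follows. This is a minimal illustration under our own assumptions (the function name and the frequency-based pruning rule are ours; the actual system additionally scores the surviving queries by model uncertainty before requesting annotations): first Bernoulli-sample the pool at rate β, then enforce k-anonymity by discarding every query whose frequency in the sample falls below k.

```python
import random
from collections import Counter

def sample_then_prune(queries, beta, k, seed=None):
    """Bernoulli-sample the query pool at rate beta, then drop every
    query whose frequency in the sampled pool is below k, so each
    released query string is shared by at least k records."""
    rng = random.Random(seed)
    sampled = [q for q in queries if rng.random() < beta]
    counts = Counter(sampled)
    return [q for q in sampled if counts[q] >= k]
```

This also makes the limitation discussed in the conclusion concrete: on a pool where every query is unique (emails, movie reviews), any k ≥ 2 prunes everything, so no data survives to be annotated.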
References
[Aggarwal 2005] Aggarwal, C. C. 2005. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, 901–909. VLDB Endowment.
[Arora, Nyberg, and Rosé 2009] Arora, S.; Nyberg, E.; and Rosé, C. P. 2009. Estimating annotation cost for active learning in a multi-annotator environment. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, 18–26. Association for Computational Linguistics.
[Avidan and Butman 2007] Avidan, S., and Butman, M. 2007. Efficient methods for privacy preserving face detection. In Advances in Neural Information Processing Systems, 57–64.
[Balle, Barthe, and Gaboardi 2018] Balle, B.; Barthe, G.; and Gaboardi, M. 2018. Privacy amplification by subsampling: Tight analyses via couplings and divergences. arXiv preprint arXiv:1807.01647.
[Cormode and Muthukrishnan 2005] Cormode, G., and Muthukrishnan, S. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55(1):58–75.
[Dasgupta 2011] Dasgupta, S. 2011. Two faces of active learning. Theoretical Computer Science 412(19):1767–1781.
[Di Castro et al. 2016] Di Castro, D.; Lewin-Eytan, L.; Maarek, Y.; Wolff, R.; and Zohar, E. 2016. Enforcing k-anonymity in web mail auditing. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 327–336. ACM.
[Dwork and Roth 2014] Dwork, C., and Roth, A. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.
[Dwork et al. 2006] Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, 265–284. Springer.
[Dwork 2008] Dwork, C. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, 1–19. Springer.
[Dwork 2011] Dwork, C. 2011. A firm foundation for private data analysis. Communications of the ACM 54(1):86–95.
[Flajolet et al. 2007] Flajolet, P.; Fusy, É.; Gandouet, O.; and Meunier, F. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms, 137–156. Discrete Mathematics and Theoretical Computer Science.
[Gal and Ghahramani 2016] Gal, Y., and Ghahramani, Z. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059.
[Gotz et al. 2012] Gotz, M.; Machanavajjhala, A.; Wang, G.; Xiao, X.; and Gehrke, J. 2012. Publishing search logs – a comparative study of privacy guarantees. IEEE Transactions on Knowledge and Data Engineering 24(3):520–532.
[Goyal, Daumé III, and Cormode 2012] Goyal, A.; Daumé III, H.; and Cormode, G. 2012. Sketch algorithms for estimating point queries in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1093–1103. Association for Computational Linguistics.
[Hamm, Cao, and Belkin 2016] Hamm, J.; Cao, Y.; and Belkin, M. 2016. Learning privately from multiparty data. In International Conference on Machine Learning, 555–563.
[Howe 2006] Howe, J. 2006. The rise of crowdsourcing. Wired Magazine 14(6):1–4.
[Ji, Lipton, and Elkan 2014] Ji, Z.; Lipton, Z. C.; and Elkan, C. 2014. Differential privacy and machine learning: a survey and review. arXiv preprint arXiv:1412.7584.
[Korolova et al. 2009] Korolova, A.; Kenthapadi, K.; Mishra, N.; and Ntoulas, A. 2009. Releasing search queries and clicks privately. In Proceedings of the 18th International Conference on World Wide Web, 171–180. ACM.
[LeFevre, DeWitt, and Ramakrishnan 2006] LeFevre, K.; DeWitt, D. J.; and Ramakrishnan, R. 2006. Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), 25–25. IEEE.
[Lewis and Gale 1994] Lewis, D. D., and Gale, W. A. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 3–12. Springer-Verlag New York, Inc.
[Li, Li, and Venkatasubramanian 2007] Li, N.; Li, T.; and Venkatasubramanian, S. 2007. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE 2007), 106–115. IEEE.
[Li, Qardaji, and Su 2012] Li, N.; Qardaji, W.; and Su, D. 2012. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, 32–33. ACM.
[Machanavajjhala et al. 2006] Machanavajjhala, A.; Gehrke, J.; Kifer, D.; and Venkitasubramaniam, M. 2006. l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), 24–24. IEEE.
[Nissim, Raskhodnikova, and Smith 2007] Nissim, K.; Raskhodnikova, S.; and Smith, A. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 75–84. ACM.
[Papernot et al. 2016] Papernot, N.; Abadi, M.; Erlingsson, Ú.; Goodfellow, I.; and Talwar, K. 2016. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755.
[Settles 2010] Settles, B. 2010. Active learning literature survey. Computer Sciences Technical Report 1648.
[Soria-Comas et al. 2014] Soria-Comas, J.; Domingo-Ferrer, J.; Sánchez, D.; and Martínez, S. 2014. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. The VLDB Journal 23(5):771–794.
[Sweeney 2002] Sweeney, L. 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(05):557–570.
[Yang et al. 2018] Yang, J.; Drake, T.; Damianou, A.; and Maarek, Y. 2018. Leveraging crowdsourcing data for deep active learning an application: Learning intents in Alexa. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, 23–32. International World Wide Web Conferences Steering Committee.
[Zhang et al. 2016] Zhang, S.; Yang, G. H.; Singh, L.; and Xiong, L. 2016. SafeLog: Supporting web search and mining by differentially-private query logs. In 2016 AAAI Fall Symposium Series.
Privacy-preserving Active Learning on Sensitive Data for User Intent Classification
Oluwaseyi Feyisetan, Thomas Drake, Borja Balle, Tom Diethe
Amazon Research

INTRODUCTION
Active learning [1] holds promise. But it requires sending data to annotators for labeling. This presents a possible privacy leak when the training set includes sensitive data.
We want to answer the question: which of these queries should be manually annotated and added to the ML model training data while satisfying two conditions:
1. we do not reveal any compromising or personally identifiable data to annotators, and
2. labeling this selected subset leads to the greatest increase in model performance.

PRIVACY VIA RANDOM SAMPLING & PRUNING
A randomized algorithm M is (ε, δ)-differentially private [2] if for every pair of databases D and D′ differing in one record and every possible set of outputs C we have:
Pr[M(D) ∈ C] ≤ e^ε · Pr[M(D′) ∈ C] + δ
By preceding k-anonymity with a randomized sampling step with parameter β, [3] introduces (β, ε, δ)-differential privacy under sampling, which links the parameters through privacy amplification by subsampling.

PRIVACY PRESERVING ACTIVE LEARNING

RESULTS
The task was binary intent classification in an active learning setting. The model was bootstrapped with 1,000 examples. The active learner then queries a data pool to get a batch of additional training examples to improve the model.

CONCLUSIONS
Our results show that by taking a small performance hit, we can achieve accuracy scores close to the baseline with a smaller annotation budget and stronger privacy guarantees.

REFERENCES
[1] Burr Settles. Active learning literature survey. 2010. Computer Sciences Technical Report 1648.
[2] Dwork, Cynthia, et al. "Calibrating noise to sensitivity in private data analysis." Theory of Cryptography Conference. 2006.
[3] Li et al. "On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy." 2012.