Calibrating Mechanisms for Privacy Preserving Text Analysis

Oluwaseyi Feyisetan (Amazon), sey@amazon.com
Borja Balle (DeepMind), borja.balle@gmail.com
Tom Diethe (Amazon), tdiethe@amazon.com
Thomas Drake (Amazon), draket@amazon.com

ABSTRACT
This talk presents a formal approach to carrying out privacy preserving text perturbation using a variant of Differential Privacy (DP) known as Metric DP (mDP). Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space defined by word embedding models. We present a privacy proof that satisfies mDP, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible deniability statistics, backed up by a large-scale analysis of GloVe and fastText embeddings. We also conduct experiments on well-known datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results provide insights into carrying out practical privatization of text-based applications for a broad range of tasks.

CCS CONCEPTS
• Security and privacy → Privacy protections;

ACM Reference Format:
Oluwaseyi Feyisetan, Borja Balle, Tom Diethe, and Thomas Drake. 2020. Calibrating Mechanisms for Privacy Preserving Text Analysis. In Proceedings of Workshop on Privacy in Natural Language Processing (PrivateNLP ’20). Houston, TX, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the PrivateNLP 2020 Workshop on Privacy in Natural Language Processing, colocated with the 13th ACM International WSDM Conference, 2020, in Houston, Texas, USA. PrivateNLP ’20, February 7, 2020, Houston, TX, USA.

1 METRIC DIFFERENTIAL PRIVACY
Over the last decade, Differential Privacy (DP) [5] has emerged as a de facto standard for privacy-preserving data analysis algorithms. One reason for this success is the robustness of DP against critical pitfalls exhibited by previous attempts to formalize privacy in the context of data analysis algorithms. Several variants of DP have been proposed in the literature to address a variety of settings, depending on whether, for example, privacy is defined with respect to aggregate statistics and Machine Learning (ML) models (curator DP) [5], or with respect to the data points contributed by each individual (local DP) [8].

Since our application involves privatizing individual sentences submitted by each user, Local DP (LDP) would be the ideal privacy model to consider. However, LDP has a requirement that renders it impractical for our application: it requires that the secret sentence x_s have a non-negligible probability of being transformed into any other sentence x̂_s, no matter how unrelated x_s and x̂_s are. Unfortunately, this constraint makes it virtually impossible to enforce that the semantics of x_s are approximately captured by the privatized sentence x̂_s, since the space of sentences is exponentially large in the length |x_s|, and the sentences semantically related to x_s will have vanishingly small probability under LDP.

To address this limitation we adopt metric DP (mDP) [1, 4], a relaxation of local DP that originated in the context of location privacy to address precisely the limitation described above. In particular, mDP allows a mechanism to report a user's location in a privacy-preserving manner while giving higher probability to locations which are close to the current location, and negligible probability to locations in a completely different part of the planet. Metric DP was originally developed as an abstraction of the privacy model proposed in [2] to address the privacy-utility trade-off in location privacy. To the best of our knowledge, our paper [6, 7] was the first to use mDP in the context of language data.

Figure 1: Contour plots of different metrics [11]. Left to right: L1 Manhattan distance, L2 Euclidean distance, L∞ Chebyshev distance, Lp Minkowski distance (L3 shown here).

Formally, mDP is defined for mechanisms whose inputs come from a set X equipped with a distance function d : X × X → R+ satisfying the axioms of a metric (i.e. identity of indiscernibles, symmetry, and triangle inequality).
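For intuition about the metrics shown in Figure 1, each distance can be computed directly from its definition. The following is a minimal sketch (not from the paper; the helper name `minkowski` and the sample vectors are illustrative), using NumPy:

```python
import numpy as np

def minkowski(u, v, p):
    """L_p (Minkowski) distance: p = 1 gives the Manhattan distance,
    p = 2 the Euclidean distance, and p -> infinity approaches the
    Chebyshev (L_inf) distance."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

u, v = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(u, v, 1))            # Manhattan: 7.0
print(minkowski(u, v, 2))            # Euclidean: 5.0
print(float(np.max(np.abs(u - v))))  # Chebyshev: 4.0
```

Each of these satisfies the metric axioms (identity of indiscernibles, symmetry, triangle inequality), so each induces its own notion of distance-scaled indistinguishability, and hence a different interpretation of ε.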
The definition of mDP depends on the particular distance function d being used, and it is parametrized by a privacy parameter ε > 0. We say that a randomized mechanism M : X → Y satisfies ε-mDP if for any x, x′ ∈ X the distributions over outputs of M(x) and M(x′) satisfy the following bound: for all y ∈ Y we have

    Pr[M(x) = y] / Pr[M(x′) = y] ≤ exp(ε · d(x, x′)) .    (1)

We note that mDP exhibits the same desirable properties as DP (e.g. composition, post-processing, robustness against side knowledge, etc.), but we shall not be using these properties explicitly in our analysis; we refer the reader to [4] for further details.

The type of probabilistic guarantee described by (1) is characteristic of DP: it says that the log-likelihood ratio of observing any particular output y given two possible inputs x and x′ is bounded by ε · d(x, x′). The key difference between mDP and local DP is that the latter corresponds to a particular instance of the former in which the distance function is given by d(x, x′) = 1 for every x ≠ x′. Unfortunately, this Hamming metric does not provide a way to classify some pairs of points in X as being closer than others. This indicates that local DP implies a strong notion of indistinguishability of the input, thus providing very strong privacy by "remembering almost nothing" about the input. In contrast, metric DP is less restrictive and allows the indistinguishability of the output distributions to be scaled by the distance between the respective inputs. In particular, the further away a pair of inputs are, the more distinguishable the output distributions can be, thus allowing these distributions to remember more about their inputs than under the strictly stronger definition of local DP.

A point of consideration for mDP is that the meaning of the privacy parameter ε changes if one considers different metrics, and it is in general incomparable with the ε parameter used in standard (local) DP. Thus, in order to understand the privacy consequences of a given ε in mDP, one needs to understand the structure of the underlying metric d.

2 PRIVACY MECHANISM
Our privacy mechanism is described over the metric space induced by word embeddings as follows:

Algorithm 1: Privacy Preserving Mechanism
Input: string x = w1 w2 · · · wℓ, privacy parameter ε > 0
for i ∈ {1, . . . , ℓ} do
    Compute embedding ϕi = ϕ(wi)
    Perturb embedding to obtain ϕ̂i = ϕi + N with noise density pN(z) ∝ exp(−ε ∥z∥)
    Obtain perturbed word ŵi = argmin_{u ∈ W} ∥ϕ(u) − ϕ̂i∥
    Insert ŵi in the i-th position of x̂
release x̂

3 STATISTICS FOR PRIVACY CALIBRATION
We now present a methodology for calibrating the ε parameter of our mDP mechanism M based on the geometric structure of the word embedding ϕ used to define the metric d. Our strategy boils down to identifying a small number of statistics associated with the output distributions of M, and finding a range of parameters ε where these statistics behave as one would expect from a mechanism providing a prescribed level of plausible deniability. We recall that the main reason this is necessary, and why the usual rules of thumb for calibrating ε in traditional (i.e. non-metric) DP cannot be applied here, is that the meaning of ε in mDP depends on the particular metric being used and is not transferable across metrics.

We start by making some qualitative observations about how ε affects the behavior of mechanism M. For the sake of simplicity we focus the discussion on the case where x is a single word x = w, but all our observations can be directly generalized to the case |x| > 1. We note these observations are essentially heuristic, although it is not hard to turn them into precise mathematical statements.

A key difference between LDP and mDP is that the former provides a stronger form of plausible deniability by insisting that almost every outcome is possible when a word is perturbed, while the latter only requires that we give enough probability mass to words close to the original one to ensure that the output does not reveal what the original word was, although it still releases information about the neighborhood of the original word.

More formally, the statistics we look at are the probability Nw = Pr[M(w) = w] of not modifying the input word w, and the (effective) support Sw of the output distribution (i.e. the number of possible output words) for an input w. In particular, given a small probability parameter η > 0, we define Sw as the size of the smallest set of words that accumulates probability at least 1 − η on input w:

    Sw = min { |S| : S ⊆ X, Pr[M(w) ∉ S] ≤ η } .

Intuitively, a setting of ε providing plausible deniability should have Nw small and Sw large for (almost) all words w ∈ W.

These statistics can also be related to the two extremes of the Rényi entropy [10], thus providing an additional information-theoretic justification for the settings of ε that provide plausible deniability, in terms of large entropy. Recall that for a distribution p over W with pw = Pr_{W∼p}[W = w], the Rényi entropy of order α ≥ 0 is given by

    H_α(p) = (1 / (1 − α)) · log ( Σ_{w ∈ W} pw^α ) .

The Hartley entropy H0 is the special case of Rényi entropy with α = 0. It depends only on the vocabulary size |W| and is therefore a best-case scenario, representing perfect privacy for a user as the number of words grows. It is given by H0 = log2 |W|. The min-entropy H∞ is the special case with α = ∞, a worst-case scenario because it depends on the adversary attaching the highest probability to a specific word. It is given by H∞ = −log2 max_{w ∈ W} p(w).

This implies that we can see the quantities Sw and Nw as proxies for the two extreme Rényi entropies through the approximate identities H0(M(w)) ≈ log Sw and H∞(M(w)) ≈ log(1/Nw), where the last approximation relies on the fact that (at least for small enough ε) w should be the most likely word under the distribution of M(w).

4 WORD EMBEDDINGS
A word embedding ϕ : W → R^n maps each word in some vocabulary to a vector of real numbers. The geometry of the resulting embedding model has a direct impact on the output distribution of our redaction mechanism. To get an intuition for the structure of these metric spaces (i.e., how words cluster together and the distances between words and their neighbors) we ran several analytical experiments on two widely available word embedding models: GloVe [9] and fastText [3]. We selected 319,000 words that were present in both the GloVe and fastText embeddings. Though we present findings only for the common 319,000 words in the embedding vocabularies, we carried out experiments over the entire vector space (i.e., 400,000 words for GloVe and 2,519,370 for fastText).
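To make the above concrete, here is a minimal end-to-end sketch of Algorithm 1 together with Monte Carlo estimates of the plausible deniability statistics Nw and Sw. It is not the paper's implementation: the five-word vocabulary and the random two-dimensional "embeddings" are toy stand-ins for GloVe/fastText vectors, and the noise with density ∝ exp(−ε∥z∥) is drawn in the standard way for that density, as a uniformly random direction scaled by a Gamma(n, 1/ε) radius.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding: a small vocabulary mapped to random points in R^n
# (illustrative stand-in for GloVe/fastText vectors).
n = 2
vocab = ["cat", "dog", "car", "truck", "apple"]
emb = {w: rng.normal(size=n) for w in vocab}

def sample_noise(eps, n):
    """Sample z with density p(z) ∝ exp(-eps * ||z||): uniform
    direction on the sphere, Gamma(shape=n, scale=1/eps) magnitude."""
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)
    radius = rng.gamma(shape=n, scale=1.0 / eps)
    return radius * direction

def mechanism(word, eps):
    """Algorithm 1 for one word: perturb its embedding, then decode
    to the nearest word in the vocabulary."""
    noisy = emb[word] + sample_noise(eps, n)
    return min(vocab, key=lambda u: np.linalg.norm(emb[u] - noisy))

def deniability_stats(word, eps, trials=2000, eta=0.01):
    """Estimate N_w = Pr[M(w) = w] and the effective support S_w,
    the smallest output set carrying probability mass >= 1 - eta."""
    counts = {}
    for _ in range(trials):
        out = mechanism(word, eps)
        counts[out] = counts.get(out, 0) + 1
    n_w = counts.get(word, 0) / trials
    mass, s_w = 0.0, 0
    for p in sorted((c / trials for c in counts.values()), reverse=True):
        mass += p
        s_w += 1
        if mass >= 1 - eta:
            break
    return n_w, s_w

n_w, s_w = deniability_stats("cat", eps=1.0)
print(f"N_w = {n_w:.2f}, S_w = {s_w}")  # small eps: N_w small, S_w large
```

Reading the estimates through the entropy proxies above, H0(M(w)) ≈ log Sw and H∞(M(w)) ≈ log(1/Nw), indicates whether a candidate ε provides plausible deniability: small Nw together with large Sw is the desirable regime.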
Figure 2: Distribution of distances between a given vector and its k closest neighbors for GloVe and fastText (distance quantiles shown as a function of k, for the 5th, 20th, 50th, 80th, and 95th percentiles).

Our experiments provide: (i) insights into the distance d(x, x′) that controls the privacy guarantees of our mechanism for different embedding models; and (ii) an empirical evaluation of the plausible deniability statistics Sw and Nw for the mechanisms obtained using different embeddings.

5 CHOOSING A WORD EMBEDDING MODEL
Our analysis gives a reasonable approach to selecting ε by means of the proxies provided by the plausible deniability statistics. In general, tuning privacy parameters in (metric) differential privacy is still a topic under active research, especially with respect to what ε means for different applications. Task owners ultimately need to determine what is best for their users based on the available descriptive statistics.

Figure 3: Average Sw and Nw statistics for GloVe and fastText.

In Fig. 3, we present the average values of the Sw and Nw statistics for GloVe and fastText. We observe that similar values over fastText cover a broader range of ε values when compared to GloVe. However, the average values are not sufficient to make a conclusive comparison between embedding models. The shape of the distribution needs to be further considered when selecting ε for similar values of Nw or Sw. For example, the average value of Sw for GloVe at ε = 29 is about the same as that for fastText at ε = 33 (both have an average Sw value of ≈ 750). However, from Fig. 4, we discover that they have slightly different distributions, with GloVe being slightly flatter and fastText more skewed to the right. Similar results for matching average values of Nw for GloVe at ε = 29 and fastText at ε = 36 are also presented in Fig. 4.

Figure 4: Comparing the distribution of similar average Sw and Nw results for GloVe and fastText. Panels: GloVe at ε = 29 and fastText at ε = 33 (both avg. Sw ≈ 750); GloVe at ε = 29 and fastText at ε = 36 (both avg. Nw ≈ 136).

6 MORE CALIBRATION STATISTICS
In this section, we highlight the results of empirical Sw and Nw plausible deniability statistics for 300-dimensional fastText and 50-dimensional GloVe embeddings. The experiments were designed to yield numbers comparable to those presented for the 300-dimensional GloVe embeddings.

Figure 5: Empirical Sw and Nw statistics for fastText word embeddings as a function of ε (ε ∈ {18, 24, 30, 36, 42}).

Figure 6: Empirical Sw and Nw statistics for 50-dimensional GloVe word embeddings as a function of ε (ε ∈ {5, 7, 9, 11, 12}).

7 BIOGRAPHY
Dr. Oluwaseyi Feyisetan is an Applied Scientist at Amazon Alexa, where he works on Differential Privacy and privacy auditing mechanisms within the context of Natural Language Processing. He holds 2 pending patents with Amazon on preserving privacy in NLP systems. He completed his PhD at the University of Southampton in the UK and has published in top-tier conferences and journals on crowdsourcing, homomorphic encryption, and privacy in the context of Active Learning and NLP. He has served as a reviewer at top NLP conferences including ACL and EMNLP. He is the lead organizer of the Workshop on Privacy and Natural Language Processing (PrivateNLP) at WSDM, with an upcoming event scheduled for EMNLP. Prior to working at Amazon in the US, he spent 7 years in the UK, where he worked at different startups and institutions focusing on regulatory compliance, machine learning, and NLP within the finance sector, most recently at the Bank of America.

REFERENCES
[1] Mário Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Anna Pazii. 2018. Local Differential Privacy on Metric Spaces: Optimizing the Trade-off with Utility. In Computer Security Foundations Symposium (CSF).
[2] Miguel E. Andrés, Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2013. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC CCS. ACM, 901–914.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. TACL 5 (2017).
[4] Konstantinos Chatzikokolakis, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. 2013. Broadening the scope of differential privacy using metrics. In Intl. Symposium on Privacy Enhancing Technologies.
[5] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In TCC. Springer, 265–284.
[6] Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations. In Proceedings of the 13th International Conference on Web Search and Data Mining. 178–186.
[7] Oluwaseyi Feyisetan, Tom Diethe, and Thomas Drake. 2019. Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text. In IEEE International Conference on Data Mining (ICDM).
[8] Shiva Kasiviswanathan, Homin Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What can we learn privately? SIAM J. Comput. 40, 3 (2011).
[9] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532–1543.
[10] Alfréd Rényi. 1961. On measures of entropy and information. Technical Report, Hungarian Academy of Sciences, Budapest, Hungary.
[11] Christoph Ruegg, Marcus Cuda, and Jurgen Van Gael. 2009. Distance Metrics. https://numerics.mathdotnet.com/Distance.html