Privacy-Preserving Textual Analysis via Calibrated Perturbations

Oluwaseyi Feyisetan, Amazon, sey@amazon.com
Borja Balle, Amazon, pigem@amazon.co.uk
Thomas Drake, Amazon, draket@amazon.com
Tom Diethe, Amazon, tdiethe@amazon.co.uk

ABSTRACT
Accurately learning from user data while providing quantifiable privacy guarantees offers an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of dχ-privacy, designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space as defined by word embedding models. We present a privacy proof that satisfies dχ-privacy, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible deniability statistics, backed up by large-scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results demonstrate practical utility (< 2% utility loss for training binary classifiers) while providing better privacy guarantees than the baseline models.

ACM Reference Format:
Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy-Preserving Textual Analysis via Calibrated Perturbations. In Proceedings of the Workshop on Privacy and Natural Language Processing (PrivateNLP '20), February 7, 2020, Houston, TX, USA. 1 page. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the PrivateNLP 2020 Workshop on Privacy in Natural Language Processing, colocated with the 13th ACM International WSDM Conference, 2020, in Houston, Texas, USA.

POSTER: Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations
Oluwaseyi Feyisetan (sey@amazon.com), Borja Balle (borja.balle@gmail.com), Thomas Drake (draket@amazon.com), Tom Diethe (tdiethe@amazon.com)

Summary
• User's goal: meet some specific need with respect to an issued query x.
• Agent's goal: satisfy the user's request.
• Question: what occurs when x is used to make other inferences about the user?
• Mechanism: modify the query to protect privacy whilst preserving semantics.
• Our approach: Generalized Metric Differential Privacy (formalized below).

Introduction
What makes privacy difficult?
• High-dimensional data: big and richer datasets lead to users generating uniquely identifiable information.
• Side knowledge: innocuous data reveals customer information when joined with side knowledge.

A viable solution: Differential Privacy
ε-Differential Privacy (DP) bounds the influence of any single input on the output of a computation.
[Diagram: two DP analyses over Bob's data yield Query Result 1 and Query Result 2; to an attacker, Result 1 is approximately equal to Result 2.]

Experiment Results
Scores measure privacy loss across settings of the privacy parameter ε (lower is better):

Metric      6      12     17     23     29     35     41     47
Precision   0.00   0.00   0.00   0.00   0.67   0.90   0.93   1.00
Recall      0.00   0.00   0.00   0.00   0.02   0.09   0.14   0.30
Accuracy    0.50   0.50   0.50   0.50   0.51   0.55   0.57   0.65
AUC         0.06   0.04   0.11   0.36   0.61   0.85   0.88   0.93

[Table: utility of the downstream machine learning model on the data (higher is better).]
[Panels "Mechanism Overview" and "Mechanism Details": see Sampling and Calibration below.]

Generalized Metric Differential Privacy
The mechanism rests on dχ-privacy, a generalization of differential privacy to arbitrary distance metrics; here the metric is distance in the word embedding space.
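Since the poster names the definition without restating it, here for reference is the standard dχ-privacy (metric differential privacy) condition from the literature, which the paper's privacy proof establishes for its mechanism. A randomized mechanism M satisfies ε·dχ-privacy if, for all inputs x and x' and every possible output y,

    Pr[M(x) = y] ≤ exp(ε · d(x, x')) · Pr[M(x') = y],

where d is the distance metric on inputs, here the Euclidean distance between word vectors in the embedding space. Taking d(x, x') = 1 for every pair x ≠ x' recovers standard ε-differential privacy, so the guarantee interpolates: words that are close in the embedding space are nearly indistinguishable, while distant words receive a weaker bound.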
Privacy in textual data
Individually innocuous queries can still identify a user once combined, as in this search log excerpt reported in a New York Times article:

User      Text
441779    dog that urinates on everything
441779    safest place to live
...
441779    the best season to visit Italy
441779    landscapers in Lilburn, GA

Most of the queries do not contain PII, yet joined with side knowledge they single out one user.

Sampling and Calibration
Noise calibrated to ε is added to each word's embedding vector, and the noisy vector is then mapped back to a nearby word in the vocabulary; a sketch follows below.
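To make the Sampling and Calibration step concrete, here is a minimal sketch under stated assumptions, not the paper's own code: noise with density p(z) proportional to exp(-ε·||z||), the standard multivariate extension of the Laplace mechanism, is sampled as a uniformly random direction scaled by a Gamma(dim, 1/ε) magnitude, and the noisy vector is projected to the nearest vocabulary word by Euclidean distance. The names sample_noise and perturb_word and the toy vocabulary are illustrative.

import numpy as np

def sample_noise(dim, epsilon, rng):
    """Draw z with density p(z) proportional to exp(-epsilon * ||z||_2):
    a uniform direction on the unit sphere, scaled by a magnitude drawn
    from Gamma(shape=dim, scale=1/epsilon)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return magnitude * direction

def perturb_word(word, vocab, matrix, word_to_row, epsilon, rng):
    """Add calibrated noise to one word's embedding, then return the
    vocabulary word whose embedding is nearest to the noisy vector."""
    noisy = matrix[word_to_row[word]] + sample_noise(matrix.shape[1], epsilon, rng)
    nearest = int(np.argmin(np.linalg.norm(matrix - noisy, axis=1)))
    return vocab[nearest]

# Usage sketch on stand-in embeddings; with GloVe or fastText, `matrix`
# would hold the real 300-dimensional vectors for the whole vocabulary.
rng = np.random.default_rng(0)
vocab = ["dog", "cat", "city", "town"]
matrix = rng.normal(size=(len(vocab), 300))
word_to_row = {w: i for i, w in enumerate(vocab)}
print(perturb_word("dog", vocab, matrix, word_to_row, epsilon=23.0, rng=rng))

The nearest-word projection is what produces plausible deniability: at small ε the noise is large and the returned word frequently differs from the input, while at large ε the original word is usually returned, matching the trend in the privacy-loss table above.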