Privacy-Preserving Textual Analysis via Calibrated Perturbations

Oluwaseyi Feyisetan, Amazon, sey@amazon.com
Borja Balle, Amazon, pigem@amazon.co.uk
Thomas Drake, Amazon, draket@amazon.com
Tom Diethe, Amazon, tdiethe@amazon.co.uk

ABSTRACT
Accurately learning from user data while providing quantifiable privacy guarantees offers an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of dχ-privacy, designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representations of words in a high-dimensional space as defined by word embedding models. We present a privacy proof that satisfies dχ-privacy, where the privacy parameter ε provides guarantees with respect to a distance metric defined by the word embedding space. We demonstrate how ε can be selected by analyzing plausible deniability statistics, backed up by large-scale analysis on GloVe and fastText embeddings. We conduct privacy audit experiments against 2 baseline models and utility experiments on 3 datasets to demonstrate the tradeoff between privacy and utility for varying values of ε on different task types. Our results demonstrate practical utility (< 2% utility loss for training binary classifiers) while providing better privacy guarantees than the baseline models.

ACM Reference Format:
Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy-Preserving Textual Analysis via Calibrated Perturbations. In Proceedings of the Workshop on Privacy and Natural Language Processing (PrivateNLP '20), February 7, 2020, Houston, TX, USA. 1 page. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the PrivateNLP 2020 Workshop on Privacy in Natural Language Processing, colocated with the 13th ACM International WSDM Conference, 2020, in Houston, Texas, USA.

POSTER: Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations
Oluwaseyi Feyisetan (sey@amazon.com), Borja Balle (borja.balle@gmail.com), Thomas Drake (draket@amazon.com), Tom Diethe (tdiethe@amazon.com)

Summary
• User's goal: meet some specific need with respect to an issued query x.
• Agent's goal: satisfy the user's request.
• Question: what occurs when x is used to make other inferences about the user?
• Mechanism: modify the query to protect privacy whilst preserving semantics.
• Our approach: Generalized Metric Differential Privacy (formalized below).

Introduction
What makes privacy difficult?
• High-dimensional data: big and richer datasets lead to users generating uniquely identifiable information.
• Side knowledge: innocuous data reveals customer information when joined with side knowledge.

A viable solution: Differential Privacy
ε-Differential Privacy (DP) bounds the influence of any single input on the output of a computation.
[Diagram: two DP analyses over Bob's data yield Query Result 1 and Query Result 2; to an attacker, Result 1 is approximately equal to Result 2.]

Experiment Results
Scores measure privacy loss across settings of the privacy parameter ε (lower is better):

Metric      6      12     17     23     29     35     41     47
Precision   0.00   0.00   0.00   0.00   0.67   0.90   0.93   1.00
Recall      0.00   0.00   0.00   0.00   0.02   0.09   0.14   0.30
Accuracy    0.50   0.50   0.50   0.50   0.51   0.55   0.57   0.65
AUC         0.06   0.04   0.11   0.36   0.61   0.85   0.88   0.93

[Table: utility of the downstream machine learning model on the data (higher is better).]
[Panels "Mechanism Overview" and "Mechanism Details": see Sampling and Calibration below.]

Generalized Metric Differential Privacy
The mechanism rests on dχ-privacy, a generalization of differential privacy to arbitrary distance metrics; here the metric is distance in the word embedding space.
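Since the poster names the definition without restating it, here for reference is the standard dχ-privacy (metric differential privacy) condition from the literature, which the paper's privacy proof establishes for its mechanism. A randomized mechanism M satisfies ε·dχ-privacy if, for all inputs x and x' and every possible output y,

    Pr[M(x) = y] ≤ exp(ε · d(x, x')) · Pr[M(x') = y],

where d is the distance metric on inputs, here the Euclidean distance between word vectors in the embedding space. Taking d(x, x') = 1 for every pair x ≠ x' recovers standard ε-differential privacy, so the guarantee interpolates: words that are close in the embedding space are nearly indistinguishable, while distant words receive a weaker bound.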
Privacy in textual data
Individually innocuous queries can still identify a user once combined, as in this search log excerpt reported in a New York Times article:

User      Text
441779    dog that urinates on everything
441779    safest place to live
...
441779    the best season to visit Italy
441779    landscapers in Lilburn, GA

Most of the queries do not contain PII, yet joined with side knowledge they single out one user.

Sampling and Calibration
Noise calibrated to ε is added to each word's embedding vector, and the noisy vector is then mapped back to a nearby word in the vocabulary; a sketch follows below.
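To make the Sampling and Calibration step concrete, here is a minimal sketch under stated assumptions, not the paper's own code: noise with density p(z) proportional to exp(-ε·||z||), the standard multivariate extension of the Laplace mechanism, is sampled as a uniformly random direction scaled by a Gamma(dim, 1/ε) magnitude, and the noisy vector is projected to the nearest vocabulary word by Euclidean distance. The names sample_noise and perturb_word and the toy vocabulary are illustrative.

import numpy as np

def sample_noise(dim, epsilon, rng):
    """Draw z with density p(z) proportional to exp(-epsilon * ||z||_2):
    a uniform direction on the unit sphere, scaled by a magnitude drawn
    from Gamma(shape=dim, scale=1/epsilon)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return magnitude * direction

def perturb_word(word, vocab, matrix, word_to_row, epsilon, rng):
    """Add calibrated noise to one word's embedding, then return the
    vocabulary word whose embedding is nearest to the noisy vector."""
    noisy = matrix[word_to_row[word]] + sample_noise(matrix.shape[1], epsilon, rng)
    nearest = int(np.argmin(np.linalg.norm(matrix - noisy, axis=1)))
    return vocab[nearest]

# Usage sketch on stand-in embeddings; with GloVe or fastText, `matrix`
# would hold the real 300-dimensional vectors for the whole vocabulary.
rng = np.random.default_rng(0)
vocab = ["dog", "cat", "city", "town"]
matrix = rng.normal(size=(len(vocab), 300))
word_to_row = {w: i for i, w in enumerate(vocab)}
print(perturb_word("dog", vocab, matrix, word_to_row, epsilon=23.0, rng=rng))

The nearest-word projection is what produces plausible deniability: at small ε the noise is large and the returned word frequently differs from the input, while at large ε the original word is usually returned, matching the trend in the privacy-loss table above.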