Workshop on Privacy in Natural Language Processing (PrivateNLP at WSDM 2020)

Oluwaseyi Feyisetan∗ (Amazon, Seattle, Washington, USA; sey@amazon.com)
Sepideh Ghanavati (University of Maine, Orono, Maine, USA; sepideh.ghanavati@maine.edu)
Patricia Thaine (University of Toronto, Toronto, Ontario, Canada; pthaine@cs.toronto.edu)

ABSTRACT
Privacy-preserving data analysis has become essential in Machine Learning (ML), where access to vast amounts of data can provide large gains in the accuracies of tuned models. A large proportion of user-contributed data comes from natural language, e.g., text transcriptions from voice assistants. It is therefore important for curated natural language datasets to preserve the privacy of the users whose data is collected, and for the models trained on sensitive data to retain only non-identifying (i.e., generalizable) information. The workshop aims to bring together researchers and practitioners from academia and industry to discuss the challenges and approaches to designing, building, verifying, and testing privacy-preserving systems in the context of Natural Language Processing (NLP).

CCS CONCEPTS
• Security and privacy → Privacy protections.

ACM Reference Format:
Oluwaseyi Feyisetan, Sepideh Ghanavati, and Patricia Thaine. 2020. Workshop on Privacy in Natural Language Processing (PrivateNLP at WSDM 2020). In Proceedings of Workshop on Privacy in Natural Language Processing (PrivateNLP '20). Houston, TX, USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
The collection of user data has grown dramatically in recent years, raising concerns about the aggregation of sensitive data and the high risk of personally identifiable information leaks. In response, methods for privacy-preserving data analysis have been proposed to protect individual information while maintaining the utility of large quantities of aggregated data.

Preserving the privacy of training data has become essential to guaranteeing data security and to maintaining user trust for continuous access to the vast amounts of data that can provide significant model performance gains. As a result, significant research has been done to provide quantifiable guarantees that a user's contribution to a system cannot be linked back to their existence within the underlying dataset. In statistical data analysis, the theoretical framework of Differential Privacy (DP) has primarily been used. While methods such as DP focus on numeric data, a large proportion of user contributions comes not in the form of statistical queries, but natural language, e.g., search queries, emails, reviews, comments, or text transcriptions from the increasingly ubiquitous voice assistants.
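To make the numeric setting concrete, the sketch below shows the standard Laplace mechanism, the textbook way such a quantifiable (ε-differential privacy) guarantee is obtained for a numeric query. It is a minimal illustration of the general technique, not a mechanism proposed by this workshop; the function name and parameter values are our own assumptions.

    import numpy as np

    def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
        # For a query whose output changes by at most `sensitivity` when one
        # user's record is added or removed, adding Laplace noise at this
        # scale satisfies epsilon-differential privacy.
        scale = sensitivity / epsilon
        return true_value + np.random.laplace(loc=0.0, scale=scale)

    # Example: release a count query (sensitivity 1) at epsilon = 0.5.
    private_count = laplace_mechanism(true_value=1042.0, sensitivity=1.0, epsilon=0.5)

Smaller values of epsilon give stronger privacy at the cost of noisier answers; the guarantee holds for any single numeric statistic, which is precisely why the mechanism does not carry over directly to free-form text.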
User-generated data can be sensitive both because of the explicit and the implicit information it contains. For example, in web search systems, a user can disclose their identity or a personal preference during their query interactions either explicitly (e.g., by issuing vanity queries) or implicitly (e.g., age, gender, and nationality can be determined by the way a query is written). Explicit personally identifiable information (PII), such as an individual's PIN or SSN, can potentially be filtered out via rules or pattern matching. However, more subtle privacy attacks occur when seemingly innocuous information (combined in aggregate and in the presence of side knowledge) is used to discern the private details of an individual. A classical example can be seen in the privacy breach in the 'anonymized' AOL search logs of 2006.¹

In addition to keeping data secure and maintaining user trust, privacy-preserving techniques now need to be more robust than ever before for companies and researchers to comply with regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Failure to comply has led to a number of hefty fines. As a result, questions which were mainly of interest to the research community are receiving increasing attention from companies: (1) How can we create and curate NLP datasets while preserving the privacy of the users whose data is collected? (2) How do we train ML models that retain only pseudonymized user data?

To address these challenges, researchers have started exploring different NLP-based methods to remove PII. These methods range from simple pattern matching techniques to advanced uses of deep learning on embedded representations. These tend to fall short because they are trained on, and operate only on, known types of sensitive entities, signifying that more research is required. This has led to a rapidly growing research field at the intersection of NLP and Privacy.
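The pattern-matching end of this spectrum can be illustrated with a single redaction rule. The sketch below is a deliberately simplified example of ours, not a system discussed at the workshop; it removes US Social Security numbers and exposes the core limitation just noted: a rule catches only the entity types it was written for.

    import re

    # One illustrative rule: a US SSN written as 123-45-6789.
    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def redact_ssn(text: str) -> str:
        # Replace every match with a fixed placeholder token; any PII not
        # covered by an explicit rule passes through untouched.
        return SSN_PATTERN.sub("[SSN]", text)

    print(redact_ssn("my ssn is 078-05-1120"))  # -> my ssn is [SSN]

Implicit signals such as writing style, which reveal attributes like age or nationality, have no such surface pattern to match, which is why purely rule-based redaction falls short.

¹A Face Is Exposed for AOL Searcher No. 4417749. https://www.nytimes.com/2006/08/09/technology/09aol.html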
The topic's growth has been accompanied by the creation of privacy workshop series such as PPML (Privacy Preserving Machine Learning) and TPDP (Theory and Practice of Differential Privacy) at NeurIPS and CCS. However, there is no privacy workshop focusing specifically on NLP, which has its own set of specific privacy problems.

To this end, we are holding the PrivateNLP workshop to focus on this sub-field and its emerging NLP-specific challenges. The workshop aims to consolidate privacy research within the NLP community. It will also help foster greater collaboration within the community and strengthen the bond between academic and industry researchers. Our primary motivation is to advance the sub-field of privacy in text data, which is fundamental to NLP research. Additionally, the workshop can also raise awareness of privacy-related issues within NLP and can be beneficial to those not actively working in the area.

2 OBJECTIVES AND SCOPE
The workshop covers aspects of text-based privacy research including, but not limited to:
• Generating privacy preserving test sets
• Inference and identification attacks
• Generating differentially private derived data
• NLP, privacy and regulatory compliance
• Private Generative Adversarial Networks
• Privacy in Active Learning and Crowdsourcing
• Privacy and Federated Learning in NLP
• User perceptions on privatized personal data
• Auditing provenance in language models
• Continual learning under privacy constraints
• NLP and summarization of privacy policies
• Ethical ramifications of AI/NLP in support of usable privacy

3 ORGANIZATION AND PROGRAM

3.1 Organizers
Oluwaseyi Feyisetan (Amazon, USA). Oluwaseyi Feyisetan is an Applied Scientist at Amazon Alexa, where he works on Differential Privacy and Privacy Auditing mechanisms within the context of Natural Language Processing. He holds two pending patents with Amazon on preserving privacy in NLP systems. He completed his PhD at the University of Southampton in the UK and has published in top-tier conferences and journals on crowdsourcing, homomorphic encryption, and privacy in the context of Active Learning and NLP. He has served as a reviewer at top NLP conferences including ACL and EMNLP. He is the lead organizer of the Workshop on Privacy and Natural Language Processing (PrivateNLP) at WSDM, with an upcoming event scheduled for EMNLP. Prior to working at Amazon in the US, he spent seven years in the UK, where he worked at different startups and institutions focusing on regulatory compliance, machine learning, and NLP within the finance sector, most recently at the Bank of America.

Sepideh Ghanavati (University of Maine, USA). Sepideh Ghanavati is an Assistant Professor in Computer Science at the University of Maine. She is also the director of the Privacy Engineering - Regulatory Compliance Lab (PERC_Lab). Her research interests are in the areas of information privacy and security, software engineering, machine learning, and the Internet of Things (IoT). Previously, she worked as an assistant professor at Texas Tech University, a visiting assistant professor at Radboud University in Nijmegen, the Netherlands, and as visiting faculty at the Institute for Software Research at Carnegie Mellon University. She is the recipient of a Google Faculty Research Award in 2018. She has more than 10 years of academic and industry experience in the area of privacy and regulatory compliance, especially in the healthcare domain, and has published more than 25 peer-reviewed publications. She was a co-organizer of 'Privacy and Language Technologies' at the 2019 AAAI Spring Symposium and has been part of the organizing committee of several workshops and conferences in the past.

Patricia Thaine (University of Toronto, Canada). Patricia Thaine is a PhD Candidate at the Department of Computer Science (University of Toronto) doing research on Privacy-Preserving Natural Language Processing, with a special focus on Applied Cryptography. She also does research on computational methods for lost language decipherment. Patricia is a recipient of the NSERC Postgraduate Scholarship, the RBC Graduate Fellowship, the Beatrice 'Trixie' Worsley Graduate Scholarship in Computer Science, and the Ontario Graduate Scholarship. She has eight years of research and software development experience, including at the McGill Language Development Lab, the University of Toronto's Computational Linguistics Lab, the University of Toronto's Department of Linguistics, and the Public Health Agency of Canada. She is the Co-Founder and CEO of Private AI, the former President of the Computer Science Graduate Student Union at the University of Toronto, and a member of the Board of Directors of Equity Showcase, one of Canada's oldest not-for-profit charitable organizations.

3.2 Program Committee
Aleksei Triastcyn (Ecole Polytechnique Federale de Lausanne), Andreas Nautsch (EURECOM), Arne Kahn (Saarland University), Avi Arampatzis (Democritus University of Thrace), Asma Eidhah Aloufi (Rochester Institute of Technology), Benjamin Zi Hao Zhao (University of New South Wales), Borja Balle (DeepMind), Claire McKay Bowen (Los Alamos National Laboratory), Congzheng Song (Cornell), Dinusha Vatsalan (Data61-CSIRO), Elette Boyle (IDC Herzliya), Fang Liu (University of Notre Dame), Isar Nejadgholi (National Research Council Canada), Jamie Hayes (University College London), Jason Xue (University of Adelaide), Julius Adebayo (MIT), Kambiz Ghazinour (State University of New York), Liwei Song (Princeton), Luca Melis (Amazon USA), Mark Dras (Macquarie University), Maximin Coavoux (University of Edinburgh), Mitra Bokaei Hosseini (St. Mary's University), Natasha Fernandes (Macquarie University), Nedelina Teneva (Amazon USA), Olya Ohrimenko (Microsoft Research), Pauline Anthonysamy (Google), Sai Teja Peddinti (Google), Shomir Wilson (Pennsylvania State University), Tom Diethe (Amazon UK), Travis Breaux (Carnegie Mellon University)

3.3 Keynote Speaker
Tom Diethe (Amazon UK). Tom Diethe is an Applied Science Manager at Amazon Research, Cambridge UK. Tom is also an Honorary Research Fellow at the University of Bristol. Tom was formerly a Research Fellow for the 'SPHERE' Interdisciplinary Research Collaboration, which is designing a platform for eHealth in a smart-home context. This platform is currently being deployed into homes throughout Bristol. Tom specializes in probabilistic methods for machine learning, applications to digital healthcare, and privacy-enhancing technologies. He has a PhD in Machine Learning applied to multivariate signal processing from UCL, and was employed by Microsoft Research Cambridge, where he co-authored a book titled 'Model-Based Machine Learning.' He also has significant industrial experience, with positions at QinetiQ and the British Medical Journal. He is a fellow of the Royal Statistical Society and a member of the IEEE Signal Processing Society.
4 ACKNOWLEDGMENTS
The organizers would like to thank the program committee for their service, as well as the authors who submitted their research to the workshop.