Workshop on Privacy in Natural Language Processing (PrivateNLP at WSDM 2020)

Oluwaseyi Feyisetan∗ (Amazon, Seattle, Washington, USA; sey@amazon.com)
Sepideh Ghanavati (University of Maine, Orono, Maine, USA; sepideh.ghanavati@maine.edu)
Patricia Thaine (University of Toronto, Toronto, Ontario, Canada; pthaine@cs.toronto.edu)

ABSTRACT
Privacy-preserving data analysis has become essential in Machine Learning (ML), where access to vast amounts of data can provide large gains in the accuracies of tuned models. A large proportion of user-contributed data comes from natural language, e.g., text transcriptions from voice assistants. It is therefore important for curated natural language datasets to preserve the privacy of the users whose data is collected, and for the models trained on sensitive data to retain only non-identifying (i.e., generalizable) information. The workshop aims to bring together researchers and practitioners from academia and industry to discuss the challenges and approaches to designing, building, verifying, and testing privacy-preserving systems in the context of Natural Language Processing (NLP).

CCS CONCEPTS
• Security and privacy → Privacy protections.

ACM Reference Format:
Oluwaseyi Feyisetan, Sepideh Ghanavati, and Patricia Thaine. 2020. Workshop on Privacy in Natural Language Processing (PrivateNLP at WSDM 2020). In Proceedings of Workshop on Privacy in Natural Language Processing (PrivateNLP '20). Houston, TX, USA, 2 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
The collection of user data has grown dramatically in recent years, raising concerns about the aggregation of sensitive data and the high risk of personally identifiable information leaks. In response, methods for privacy-preserving data analysis have been proposed to protect individual information while maintaining the utility of large quantities of aggregated data.

Preserving the privacy of training data has become essential to guaranteeing data security and to maintaining user trust for continuous access to the vast amounts of data that can provide significant model performance gains. As a result, significant research has been done to provide quantifiable guarantees that a user's contribution to a system cannot be linked back to their existence within the underlying dataset. In statistical data analysis, the theoretical framework of Differential Privacy (DP) has primarily been used. While methods such as DP focus on numeric data, a large proportion of user contributions comes not in the form of statistical queries, but natural language, e.g., search queries, emails, reviews, comments, or text transcriptions from the increasingly ubiquitous voice assistants.
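To make the numeric setting concrete, the sketch below shows the standard Laplace mechanism, the textbook way such a quantifiable (ε-differential privacy) guarantee is obtained for a numeric query. It is a minimal illustration of the general technique, not a mechanism proposed by this workshop; the function name and parameter values are our own assumptions.

    import numpy as np

    def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
        # For a query whose output changes by at most `sensitivity` when one
        # user's record is added or removed, adding Laplace noise at this
        # scale satisfies epsilon-differential privacy.
        scale = sensitivity / epsilon
        return true_value + np.random.laplace(loc=0.0, scale=scale)

    # Example: release a count query (sensitivity 1) at epsilon = 0.5.
    private_count = laplace_mechanism(true_value=1042.0, sensitivity=1.0, epsilon=0.5)

Smaller values of epsilon give stronger privacy at the cost of noisier answers; the guarantee holds for any single numeric statistic, which is precisely why the mechanism does not carry over directly to free-form text.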
User-generated data can be sensitive both because of the explicit and the implicit information it contains. For example, in web search systems, a user can disclose their identity or a personal preference during their query interactions either explicitly (e.g., by issuing vanity queries) or implicitly (e.g., age, gender, and nationality can be determined by the way a query is written). Explicit personally identifiable information (PII), such as an individual's PIN or SSN, can potentially be filtered out via rules or pattern matching. However, more subtle privacy attacks occur when seemingly innocuous information (combined in aggregate and in the presence of side knowledge) is used to discern the private details of an individual. A classical example can be seen in the privacy breach in the 'anonymized' AOL search logs of 2006.¹

In addition to keeping data secure and maintaining user trust, privacy-preserving techniques now need to be more robust than ever before for companies and researchers to comply with regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Failure to comply has led to a number of hefty fines. As a result, questions which were mainly of interest to the research community are receiving increasing attention from companies: (1) How can we create and curate NLP datasets while preserving the privacy of the users whose data is collected? (2) How do we train ML models that retain only pseudonymized user data?

To address these challenges, researchers have started exploring different NLP-based methods to remove PII. These methods range from simple pattern matching techniques to advanced uses of deep learning on embedded representations. These tend to fall short because they are trained on, and operate only on, known types of sensitive entities, signifying that more research is required. This has led to a rapidly growing research field at the intersection of NLP and Privacy.
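The pattern-matching end of this spectrum can be illustrated with a single redaction rule. The sketch below is a deliberately simplified example of ours, not a system discussed at the workshop; it removes US Social Security numbers and exposes the core limitation just noted: a rule catches only the entity types it was written for.

    import re

    # One illustrative rule: a US SSN written as 123-45-6789.
    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def redact_ssn(text: str) -> str:
        # Replace every match with a fixed placeholder token; any PII not
        # covered by an explicit rule passes through untouched.
        return SSN_PATTERN.sub("[SSN]", text)

    print(redact_ssn("my ssn is 078-05-1120"))  # -> my ssn is [SSN]

Implicit signals such as writing style, which reveal attributes like age or nationality, have no such surface pattern to match, which is why purely rule-based redaction falls short.

¹A Face Is Exposed for AOL Searcher No. 4417749. https://www.nytimes.com/2006/08/09/technology/09aol.html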
The topic's growth has been accompanied by the creation of privacy workshop series such as PPML (Privacy Preserving Machine Learning) and TPDP (Theory and Practice of Differential Privacy) at NeurIPS and CCS. However, there is no privacy workshop focusing specifically on NLP, which has its own set of specific privacy problems.

To this end, we are holding the PrivateNLP workshop to focus on this sub-field and its emerging NLP-specific challenges. The workshop aims to consolidate privacy research within the NLP community. It will also help foster greater collaboration within the community and strengthen the bond between academic and industry researchers. Our primary motivation is to advance the sub-field of privacy in text data, which is fundamental to NLP research. Additionally, the workshop can also raise awareness of privacy-related issues within NLP and can be beneficial to those not actively working in the area.

2 OBJECTIVES AND SCOPE
The workshop covers aspects of text-based privacy research including, but not limited to:
• Generating privacy preserving test sets
• Inference and identification attacks
• Generating differentially private derived data
• NLP, privacy and regulatory compliance
• Private Generative Adversarial Networks
• Privacy in Active Learning and Crowdsourcing
• Privacy and Federated Learning in NLP
• User perceptions on privatized personal data
• Auditing provenance in language models
• Continual learning under privacy constraints
• NLP and summarization of privacy policies
• Ethical ramifications of AI/NLP in support of usable privacy

3 ORGANIZATION AND PROGRAM

3.1 Organizers
Oluwaseyi Feyisetan (Amazon, USA). Oluwaseyi Feyisetan is an Applied Scientist at Amazon Alexa, where he works on Differential Privacy and Privacy Auditing mechanisms within the context of Natural Language Processing. He holds two pending patents with Amazon on preserving privacy in NLP systems. He completed his PhD at the University of Southampton in the UK and has published in top-tier conferences and journals on crowdsourcing, homomorphic encryption, and privacy in the context of Active Learning and NLP. He has served as a reviewer at top NLP conferences including ACL and EMNLP. He is the lead organizer of the Workshop on Privacy and Natural Language Processing (PrivateNLP) at WSDM, with an upcoming event scheduled for EMNLP. Prior to working at Amazon in the US, he spent seven years in the UK, where he worked at different startups and institutions focusing on regulatory compliance, machine learning, and NLP within the finance sector, most recently at the Bank of America.

Sepideh Ghanavati (University of Maine, USA). Sepideh Ghanavati is an Assistant Professor in Computer Science at the University of Maine. She is also the director of the Privacy Engineering - Regulatory Compliance Lab (PERC_Lab). Her research interests are in the areas of information privacy and security, software engineering, machine learning, and the Internet of Things (IoT). Previously, she worked as an assistant professor at Texas Tech University, a visiting assistant professor at Radboud University in Nijmegen, the Netherlands, and as visiting faculty at the Institute for Software Research at Carnegie Mellon University. She is the recipient of a Google Faculty Research Award in 2018. She has more than 10 years of academic and industry experience in the area of privacy and regulatory compliance, especially in the healthcare domain, and has published more than 25 peer-reviewed publications. She was a co-organizer of 'Privacy and Language Technologies' at the 2019 AAAI Spring Symposium and has been part of the organizing committee of several workshops and conferences in the past.

Patricia Thaine (University of Toronto, Canada). Patricia Thaine is a PhD Candidate at the Department of Computer Science (University of Toronto) doing research on Privacy-Preserving Natural Language Processing, with a special focus on Applied Cryptography. She also does research on computational methods for lost language decipherment. Patricia is a recipient of the NSERC Postgraduate Scholarship, the RBC Graduate Fellowship, the Beatrice 'Trixie' Worsley Graduate Scholarship in Computer Science, and the Ontario Graduate Scholarship. She has eight years of research and software development experience, including at the McGill Language Development Lab, the University of Toronto's Computational Linguistics Lab, the University of Toronto's Department of Linguistics, and the Public Health Agency of Canada. She is the Co-Founder and CEO of Private AI, the former President of the Computer Science Graduate Student Union at the University of Toronto, and a member of the Board of Directors of Equity Showcase, one of Canada's oldest not-for-profit charitable organizations.

3.2 Program Committee
Aleksei Triastcyn (Ecole Polytechnique Federale de Lausanne), Andreas Nautsch (EURECOM), Arne Kahn (Saarland University), Avi Arampatzis (Democritus University of Thrace), Asma Eidhah Aloufi (Rochester Institute of Technology), Benjamin Zi Hao Zhao (University of New South Wales), Borja Balle (DeepMind), Claire McKay Bowen (Los Alamos National Laboratory), Congzheng Song (Cornell), Dinusha Vatsalan (Data61-CSIRO), Elette Boyle (IDC Herzliya), Fang Liu (University of Notre Dame), Isar Nejadgholi (National Research Council Canada), Jamie Hayes (University College London), Jason Xue (University of Adelaide), Julius Adebayo (MIT), Kambiz Ghazinour (State University of New York), Liwei Song (Princeton), Luca Melis (Amazon USA), Mark Dras (Macquarie University), Maximin Coavoux (University of Edinburgh), Mitra Bokaei Hosseini (St. Mary's University), Natasha Fernandes (Macquarie University), Nedelina Teneva (Amazon USA), Olya Ohrimenko (Microsoft Research), Pauline Anthonysamy (Google), Sai Teja Peddinti (Google), Shomir Wilson (Pennsylvania State University), Tom Diethe (Amazon UK), Travis Breaux (Carnegie Mellon University)

3.3 Keynote Speaker
Tom Diethe (Amazon UK). Tom Diethe is an Applied Science Manager at Amazon Research, Cambridge UK. Tom is also an Honorary Research Fellow at the University of Bristol. Tom was formerly a Research Fellow for the 'SPHERE' Interdisciplinary Research Collaboration, which is designing a platform for eHealth in a smart-home context. This platform is currently being deployed into homes throughout Bristol. Tom specializes in probabilistic methods for machine learning, applications to digital healthcare, and privacy-enhancing technologies. He has a PhD in Machine Learning applied to multivariate signal processing from UCL, and was employed by Microsoft Research Cambridge, where he co-authored a book titled 'Model-Based Machine Learning.' He also has significant industrial experience, with positions at QinetiQ and the British Medical Journal. He is a fellow of the Royal Statistical Society and a member of the IEEE Signal Processing Society.
4 ACKNOWLEDGMENTS
The organizers would like to thank the program committee for their service, as well as the authors who submitted their research to the workshop.