Is It Possible to Preserve Privacy in the Age of AI?

Vijayanta Jain, Sepideh Ghanavati
University of Maine, Orono, Maine, USA
vijayanta.jain@maine.edu, sepideh.ghanavati@maine.edu

ABSTRACT
Artificial Intelligence (AI) promises a positive paradigm shift in technology by bringing new features and personalized experiences to our digital and physical world. In the future, almost all our digital services and physical devices will be enhanced by AI to provide us with better features. However, because training artificially intelligent models requires a large amount of data, AI poses a threat to user privacy: its increasing prevalence promotes data collection and consequently threatens privacy. To address these concerns, some research efforts have been directed towards developing techniques that train AI systems while preserving privacy and towards helping users preserve their own privacy. In this paper, we survey the literature and identify privacy-preserving approaches in both directions. We also suggest some future directions based on our analysis. We find that privacy-preserving research, specifically for AI, is in its early stage and requires more effort to address the current challenges and research gaps.

CCS CONCEPTS
• Privacy → Privacy protections.

KEYWORDS
Artificial Intelligence, Privacy, Machine Learning, Survey

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). WSDM '20, February 3–7, 2020, Houston, TX, USA. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
Artificial Intelligence (AI) is increasingly becoming ubiquitous in our lives through its growing presence in the digital services we use and the physical devices we own. AI already powers our most commonly used digital services, such as search (Google, Bing), music (Spotify, YouTube Music), entertainment (Netflix, YouTube), and social media (Facebook, Instagram, Twitter). These services rely heavily on AI or Machine Learning (ML)¹ to provide users with personalized content and better features, such as relevant search results, content the users would like, and people they might know. AI/ML also enhances several physical devices that we own (or can own), for example smart speakers, such as Google Hub and Amazon Echo, which rely on natural language processing to detect voice and to understand and execute commands such as controlling the lights, changing the temperature, or adding groceries to a shopping list. Using AI to provide a highly personalized experience benefits users as well as providers: users get positive engagement with these platforms, and providers get engaged users who spend more time on their services. The number of applications and devices that use AI will also increase in the near future. This is evident from the increasing number of smartphones with dedicated chips for machine learning [1–3, 27] and devices that come integrated with personal assistants.²,³

The proliferation of AI poses direct and indirect threats to user privacy. The direct threat is the inference of personal information; the indirect threat is the promotion of data collection. Movies such as Her accurately portray the utopian AI future some companies hope to provide users as they increase the ubiquity of ML in their digital and physical products. However, because training AI systems such as deep neural networks requires a large amount of data, companies collect usage data from users whenever they interact with any of their services. There are two major problems with this collection: first, the collected usage data is used to infer information such as personal interests, habits, and behavior patterns, thus invading privacy; and second, to improve the personalization, intelligent features, and AI capabilities of their services, companies continuously collect and increase the data collected from users, leading to an endless loop of data collection that threatens user privacy (see Figure 2). Moreover, the collected data is often used for ad personalization or shared with third parties, which does not meet users' expectations and thus violates user privacy [23]. For example, when you interact with Google's Home Mini, the text of these voice recordings may be used for ad personalization (see Figure 1), which does not meet the privacy expectations of users [23].

Privacy violations in recent times have motivated research efforts to develop techniques and methodologies that preserve privacy. Previous research has developed tools that provide users with more effective notice and choice [9, 18, 19, 31]. With increasing concerns about privacy because of AI, some efforts have also been directed towards training machine learning models while preserving privacy [4, 29]. User-focused techniques give users the necessary tools to preserve their privacy, whereas privacy-preserving machine learning helps companies use machine learning for their services while still preserving user privacy. In this work, we survey these methods to understand the methodologies that can be employed when users are surrounded by digital services and physical devices that use AI. The contributions of this paper are two-fold:
• We survey the machine learning based methodologies and techniques for preserving privacy.
• We identify research gaps and suggest future directions.

The rest of the paper is organized as follows: in Section 2, we report the results of our survey. In Section 3, we discuss related work, whereas Section 4 identifies the challenges and suggests future directions. Finally, in Section 5, we conclude our work.

¹ AI and ML are used interchangeably in this paper.
² https://www.amazon.com/Amazon-Smart-Oven/dp/B07PB21SRV
³ https://www.amazon.com/Echo-Frames/dp/B01G62GWS4
2 ANALYSIS OF THE CURRENT LITERATURE
In this section, we report on our survey of machine-learning based techniques that have been developed to preserve user privacy. We divide this section into two groups: i) privacy-preserving machine learning approaches and ii) techniques that provide users with notice and give them choices.

Figure 1: The Text of Voice Recordings Can be Used for Ad-personalization

Figure 2: Cycle of Eternal Increase in Data Collection

2.1 Privacy Preserving Machine Learning Approaches
Recent research efforts have been directed towards developing privacy-preserving machine learning techniques [4, 24]. Prior to machine learning, differential privacy already provided a strong standard for preserving privacy in statistical analysis of public datasets. In this technique, whenever a statistical query is made to a database containing sensitive information, a randomized function k adds noise to the query result, which preserves privacy while also ensuring the usability of the database [13]. Some work has used differential privacy for training machine learning models [4, 7]. Chaudhuri and Monteleoni [7] use this technique to develop a privacy-preserving algorithm for logistic regression. Abadi et al. [4] also use this technique to train deep neural networks by developing a noisy Stochastic Gradient Descent (SGD) algorithm. However, a key problem with differential privacy is that repeated queries to the database can average out the noise and thus reveal the underlying sensitive information in the database [13]. To solve this, Dwork proposes a privacy budget that treats each query to the database as a privacy cost, with a fixed budget per session [11, 13]. Once the privacy budget for a session has been spent, no further query results are returned.
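As a concrete illustration of the noisy-query mechanism and the per-session privacy budget described above, the following minimal Python sketch is our own; the counting query, the epsilon values, and the toy dataset are hypothetical, and real deployments require careful sensitivity and composition analysis.

import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with Laplace noise calibrated to sensitivity 1
    (adding or removing one record changes the true count by at most 1)."""
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

class BudgetedDatabase:
    """Stops answering once the per-session privacy budget is exhausted,
    so repeated queries cannot simply average the noise away."""
    def __init__(self, data, total_budget):
        self.data = data
        self.remaining = total_budget

    def query(self, predicate, epsilon):
        if epsilon > self.remaining:
            return None  # budget exhausted: no result is returned
        self.remaining -= epsilon
        return laplace_count(self.data, predicate, epsilon)

# Hypothetical usage: ages of users, each query costs epsilon = 0.1
db = BudgetedDatabase(data=[23, 35, 41, 29, 52], total_budget=0.5)
print(db.query(lambda age: age > 30, epsilon=0.1))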
Other work in this area develops methods to train neural networks on the device itself, without sending the data back to servers [24, 25, 29]. Shokri and Shmatikov [29] present a system for jointly training models without sharing each individual's input dataset. In their work, they develop a system that allows several participants to train similar neural networks on their own input data without sharing the data, while selectively sharing parameters with each other to avoid local minima. Similarly, in line with Shokri and Shmatikov's goal of not sharing data, McMahan et al. [24] propose Federated Learning, which allows developers to train neural networks in a decentralized and privacy-preserving manner. The idea behind their work is that the neural network models to be trained are sent to the mobile devices that contain the users' sensitive data, and SGD is run locally to update the parameters. The updated models are then sent back to a central server, which "averages" the updates from all the models to obtain a better model. They term this algorithm FederatedAveraging.
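A minimal sketch of the FederatedAveraging idea, assuming a simple linear model trained with squared error; the client datasets, learning rate, and number of rounds below are placeholders, and the client sampling and communication details of [24] are omitted.

import numpy as np

def local_sgd(weights, X, y, lr=0.01, epochs=5):
    """One client's local update: plain SGD on its own data (the data never leaves the device)."""
    w = weights.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = 2 * (xi @ w - yi) * xi  # squared-error gradient for a linear model
            w -= lr * grad
    return w

def federated_averaging(global_w, clients, rounds=10):
    """Server loop: send the model out, collect updated weights,
    and average them weighted by each client's data size."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:
            updates.append(local_sgd(global_w, X, y))
            sizes.append(len(y))
        global_w = np.average(np.stack(updates), axis=0, weights=np.array(sizes, dtype=float))
    return global_w

# Hypothetical clients, each holding its own small dataset
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
w = federated_averaging(np.zeros(3), clients)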
Similarly, Papernot et al. [25] propose Private Aggregation of Teacher Ensembles (PATE), a method to train machine learning models while preserving privacy. In their approach, several "teacher" models are trained on disjoint subsets of the dataset, and a "student" model is then trained on the aggregated output of the "teachers" so that it accurately "mimics the ensemble". The goal of this work is to address the information leakage problem [15].

The goal of the work outlined above is to develop new algorithms and methods that train neural networks on a device or that use differentially private algorithms. However, information leakage still poses a threat to users' privacy. Information leakage refers to the fact that a neural network implicitly contains sensitive information from the data it was trained on, as demonstrated in [15, 30]. This is an active research topic, and newer methods, such as PATE, aim to resolve this issue by not exposing the dataset to the machine learning model.
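A sketch of the noisy vote aggregation at the heart of PATE, assuming the teachers have already been trained on disjoint partitions of the private data; the number of teachers, the noise scale gamma, and the example votes are hypothetical, and the student-training loop is omitted.

import numpy as np

def noisy_aggregate(teacher_predictions, num_classes, gamma=0.5):
    """Label one public, unlabeled student example from teacher votes.
    teacher_predictions: array of class labels, one per teacher model."""
    votes = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    votes += np.random.laplace(scale=1.0 / gamma, size=num_classes)  # noise hides any single teacher's vote
    return int(np.argmax(votes))

# Hypothetical votes from 10 teachers on a 3-class task
votes = np.array([0, 0, 1, 0, 2, 0, 0, 1, 0, 0])
student_label = noisy_aggregate(votes, num_classes=3)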
2.2 Mechanisms to Control User's Data
The primary goal in this field of research has been to provide users with better notice, give them choices, and provide them with the means to control their personal information. Notice and Choice is one of the fundamental methods for preserving privacy and is based on the Openness principle of the OECD Fair Information Principles [16]. Within the Notice and Choice mechanism, the primary goal has been to improve privacy policies and extract the information relevant to users, because privacy policies are lengthy and it is infeasible for users to read them for all the digital and physical services they use or own [10]. Therefore, research has focused on providing users with better notice and choice, such as in [20, 22, 28]. Other work has achieved similar results by applying machine learning techniques. Harkous et al. [18] develop PriBot, a Q&A chatbot that analyzes a privacy policy and then presents users with the sections of the policy that answer their question. Some work has focused on assessing the quality of a privacy policy. For example, Costante et al. [8] use text categorization and machine learning to categorize paragraphs of privacy policies and assess their completeness with a grade. The grade is calculated from the weight the user assigns to each category and the coverage of that category in a selected section. This method helps users inspect a privacy policy in a structured way and read only the paragraphs that interest them. Zimmeck et al. introduce Privee [36], which integrates Costante's classification method with Sadeh's crowdsourcing. In Privee, if privacy analysis results are already available in the repository, they are returned to the user; otherwise, the privacy policy is automatically classified and the result is then returned. PrivacyGuide [31] uses classification techniques, such as Naïve Bayes and Support Vector Machines (SVM), to categorize privacy policies based on the EU GDPR [14], summarize them, and then assign risk factors. These works certainly improve on the previous "state-of-the-art" notice-and-choice artifact, the privacy policy, by giving users a succinct form of the same information. However, privacy policies often contain ambiguities that are difficult for technology to resolve, for example, the number of third parties the data is shared with or how long the data will be stored by the companies.
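The classification step these tools share can be illustrated with a small sketch: bag-of-words features and a standard classifier (Naïve Bayes here, one of the classifiers PrivacyGuide reports) applied to policy paragraphs. The paragraphs and category labels below are invented for illustration and are not drawn from any of the cited systems.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training paragraphs and GDPR-style categories
paragraphs = [
    "We share your data with advertising partners.",
    "You may request deletion of your account data at any time.",
    "We retain usage logs for twelve months.",
]
categories = ["third_party_sharing", "user_rights", "data_retention"]

# Vectorize paragraphs and train a simple paragraph-level categorizer
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(paragraphs, categories)

# Categorize a new, unseen policy paragraph
print(clf.predict(["Your voice recordings are kept for two years."]))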
Another active research topic in giving users control of their privacy is modeling privacy preferences. The goal of this line of research is to give users more control over what information mobile applications or other users can access. Lin et al. [21] use clustering to create a small number of profiles of users' privacy preferences and then, based on those profiles, predict whether a user belonging to a profile allows certain permissions or not. Similar to their work, Wijesekera et al. [32] develop a contextually-aware permission system that dynamically permits Android applications access to private data based on the user's preferences. They argue that their permission system is better than Android's default Ask-On-First-Use (AOFU) model because context, "what [users] were doing on their mobile devices at the time that data was requested" [32], affects users' privacy preferences. In their system, they use an SVM classifier, trained on contextual information and user behavior, to make permission decisions. They also conduct a usability study to model the preferences of 37 users and test their system [33]. Other work uses contextual information to model privacy preferences for web-based services as well. Yuan et al. [34] propose a model that uses contextual information to share images with other users at different levels of granularity. In their work, based on semantic image features and contextual features of a requester, they train logistic regression, SVM, and Random Forest classifiers to predict whether the user would share, not share, or partially share the requested image. Similarly, Bilogrevic et al. [6] develop the Smart Privacy-aware Information Sharing Mechanism, a system that shares personal information with other users, third parties, online services, or mobile apps based on the user's privacy preferences and contextual information. They use Naïve Bayes, SVM, and Logistic Regression to model preferences, and they conduct a user study to understand users' preferences and the factors influencing their decisions. Using contextual information and providing different levels of information access is a great step towards giving users greater control of their data, but certain challenges remain. Primarily, most of these systems have not conducted usability studies to examine the users' view, which inhibits translating such research into the real world.

Overall, we find that this line of work has focused on giving users mechanisms to understand privacy practices and control their data. Giving users control of their data is important; however, this approach places the burden of preserving privacy on the users, which may be difficult for less tech-savvy users since privacy settings for websites are often hidden under layers of settings.
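As an illustration of how such a contextual classifier might be wired together, the sketch below trains an SVM over one-hot encoded contextual features to predict allow/deny decisions. The feature set, example rows, and labels are invented; the systems above rely on richer behavioral signals and per-user models.

from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Each row: (foreground app category, requesting app visibility, permission requested)
X = [
    ["navigation", "visible", "location"],
    ["game", "background", "location"],
    ["messaging", "visible", "contacts"],
    ["game", "background", "contacts"],
]
y = ["allow", "deny", "allow", "deny"]  # the user's past decisions in those contexts

# One-hot encode categorical context and fit a linear SVM on the decisions
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), SVC(kernel="linear"))
model.fit(X, y)

# Predict the decision for a new permission request in context
print(model.predict([["game", "background", "location"]]))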
3 RELATED WORK
Papernot et al. [26] provide a Systematization of Knowledge (SoK) of security and privacy challenges in machine learning. Their work surveys the existing literature to identify security and privacy threats as well as the defenses that have been developed to mitigate them. Based on this analysis, they also argue for developing a framework for understanding the sensitivity of ML algorithms to their training data, in order to better understand the security and privacy implications of ML algorithms. Our analysis is similar in that it evaluates the privacy implications of these machine learning algorithms, but our work provides a more detailed discussion of the privacy challenges compared to [26]. Zhu et al. [35] survey different methods developed to publish and analyze differentially private data. Their work analyzes differentially private data publishing based on the type of input data, the number of queries, accuracy, and efficiency, and it evaluates differentially private data analysis based on the Laplace/Exponential framework, such as [7], and the Private Learning framework, such as [4]. The paper also presents some future directions for differential privacy, such as making greater use of local differential privacy. This work is the closest to ours as it surveys a privacy-preserving analysis technique and suggests future work. However, in our analysis, we also incorporate the technologies that help users preserve their privacy. Overall, our work differs from [26, 35] in that we look at the big picture of privacy-preserving technologies, specifically in light of the increasing use of AI.

4 DISCUSSION
In this paper, we discussed techniques and methodologies developed to preserve user privacy. Primarily, we identified two groups of work: (1) privacy-preserving machine learning, such as noisy SGD and federated learning, and (2) techniques that give users tools to protect their own privacy. In this section, we discuss the advantages of each category of approaches, their existing challenges, and the research gaps, and we suggest some potential future work to address the challenges and gaps identified here. We summarize our analysis in Table 1.

Table 1: Summary of Privacy-Preserving Approaches

Differential Privacy and Machine Learning
  Advantages: Simple and efficient; easy to employ.
  Disadvantages: Requires large noise for effective privacy, at the cost of utility.

Federated Learning
  Advantages: Prevents sharing and profiling and thus provides better privacy.
  Disadvantages: More suitable for large-scale applications.

User-Focused Privacy Tools
  Advantages: Gives users control of their privacy.
  Disadvantages: Puts the burden of preserving privacy on the user; limited tools for controlling privacy.

Differential Privacy and Machine Learning Approaches: Differential privacy provides a strong state of the art for data analysis by introducing noise into query results [12], and this method has also been used to train deep neural networks [4]. One of the biggest advantages of these approaches is the simplicity and efficiency of the methodology. Some companies have even started to use differential privacy in some of their applications.⁴ Using differential privacy for deep learning holds great potential for researchers and developers. However, understanding the trade-offs between privacy and utility for specific tasks, models, optimizers, and similar factors can further help developers adopt differentially-private machine learning. Some initial work has been done in this area [5], but future work can explore this in detail.

⁴ https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
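The privacy/utility trade-off discussed above is largely governed by two knobs in the noisy SGD approach of [4]: the per-example gradient clipping norm and the noise multiplier. The following is a stripped-down sketch of one such update step, assuming per-example gradients are already computed; the real algorithm additionally tracks the cumulative privacy loss with a moments accountant.

import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One noisy SGD step: clip each example's gradient, average,
    then add Gaussian noise scaled to the clipping norm."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))  # bound each example's influence
    avg = np.mean(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads), size=avg.shape)
    return w - lr * (avg + noise)

# Hypothetical batch of per-example gradients for a 3-parameter model
grads = np.random.default_rng(1).normal(size=(8, 3))
w = dp_sgd_step(np.zeros(3), grads)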
Federated Learning: Federated learning provides a unique approach to machine learning by training models on devices instead of on a central server [24]. By keeping the data on the device, it prevents sharing with third parties and even profiling of user data for ad personalization. A key challenge with federated learning is its complexity; small-scale companies and developers might find differential privacy easier to optimize and employ at a smaller scale. Another challenge with this approach is information leakage from the gradients of the neural network [15, 30]. There has been some effort to address this issue by developing different privacy-preserving machine learning methodologies [25]. However, a critical gap in this area of research is that few efforts have looked into providing users with mechanisms to control the data being used for federated learning; future work can address this gap. Another future direction for federated learning is to combine differentially-private data with federated learning. Initial work has been done in this direction, such as [17], but future work could expand the analysis by evaluating different differential privacy algorithms for privatizing data.
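One way such a combination could look, loosely inspired by the client-level perspective of [17], is to clip each client's model update and add noise on the server before averaging. This is our own simplified sketch, not the algorithm of [17]; the clipping bound and noise scale are placeholders.

import numpy as np

def private_aggregate(global_w, client_weights, clip=1.0, noise_std=0.1):
    """Average client updates with per-client clipping and server-side Gaussian noise,
    so no single client's contribution dominates or is exposed."""
    updates = []
    for w in client_weights:
        delta = w - global_w
        norm = np.linalg.norm(delta)
        updates.append(delta * min(1.0, clip / max(norm, 1e-12)))  # bound each client's influence
    avg = np.mean(updates, axis=0)
    avg += np.random.normal(0.0, noise_std * clip / len(updates), size=avg.shape)
    return global_w + avg

# Hypothetical weights returned by four clients after local training
clients = [np.random.default_rng(i).normal(size=3) for i in range(4)]
new_global = private_aggregate(np.zeros(3), clients)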
User-Focused Privacy Preserving: Several methods have been proposed that use machine learning to preserve user privacy [6, 18, 32] by providing users with the necessary notices and control mechanisms over their data. Some of these methods [18] employ Natural Language Processing (NLP) to understand privacy text. Future work in this direction can employ more advanced architectures for this task to improve accuracy and relevance. Another future direction is to help companies and developers create applications and systems that preserve users' privacy in the first place.

Based on our analysis of current data practices and research developments, we believe that it will be difficult to preserve privacy in the age of AI. As the ubiquity of AI and the economic incentives to use it grow, it will passively promote data collection and thus pose a threat to user privacy. The techniques developed to preserve user privacy are not yet as effective as the current data practices that violate it. Increased research effort, along with legal action, will be required to preserve privacy in the age of AI.

5 CONCLUSION
In this work, we provide a brief survey of machine learning based techniques to preserve user privacy, identify the challenges with these techniques, and suggest future work to address those challenges. We argue that privacy-preserving technologies specifically for AI are in their early stages and that it will be difficult to preserve privacy in the age of AI. We identify research gaps and suggest future work that can address some of these gaps and lead to more effective privacy-preserving technologies for AI. In the future, we plan to expand this work into a more critical analysis of the different algorithms and to evaluate their efficacy for different use cases.

REFERENCES
[1] [n.d.]. iPhone 11 Pro. https://www.apple.com/iphone-11-pro/.
[2] [n.d.]. OnePlus 7 Pro. https://www.oneplus.com/7pro#/specs.
[3] [n.d.]. Samsung Galaxy S10 Intelligence - Virtual Assistant & AR Photo. https://www.samsung.com/us/mobile/galaxy-s10/intelligence/.
[4] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
[5] Brendan Avent, Javier Gonzalez, Tom Diethe, Andrei Paleyes, and Borja Balle. 2019. Automatic Discovery of Privacy-Utility Pareto Fronts. arXiv preprint arXiv:1905.10862 (2019).
[6] Igor Bilogrevic, Kévin Huguenin, Berker Agir, Murtuza Jadliwala, Maria Gazaki, and Jean-Pierre Hubaux. 2016. A machine-learning based approach to privacy-aware information-sharing in mobile social networks. Pervasive and Mobile Computing 25 (2016), 125–142.
[7] Kamalika Chaudhuri and Claire Monteleoni. 2009. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems. 289–296.
[8] Elisa Costante, Yuanhao Sun, Milan Petković, and Jerry den Hartog. 2012. A machine learning solution to assess privacy policy completeness (short paper). In Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society. ACM, 91–96.
[9] Lorrie Faith Cranor. 2003. P3P: Making privacy policies more useful. IEEE Security & Privacy 1, 6 (2003), 50–55.
[10] Lorrie Faith Cranor. 2012. Necessary but not sufficient: Standardized mechanisms for privacy notice and choice. J. on Telecomm. & High Tech. L. 10 (2012), 273.
[11] Cynthia Dwork. 2011. Differential privacy. Encyclopedia of Cryptography and Security (2011), 338–340.
[12] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3–4 (2014), 211–407.
[13] Aaruran Elamurugaiyan. 2018. A Brief Introduction to Differential Privacy. https://medium.com/georgian-impact-blog/a-brief-introduction-to-differential-privacy-eacf8722283b
[14] EU GDPR. [n.d.]. The EU General Data Protection Regulation (GDPR). https://eugdpr.org.
[15] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1322–1333.
[16] Ben Gerber. [n.d.]. OECDprivacy.org. http://www.oecdprivacy.org/.
[17] Robin C. Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557 (2017).
[18] Hamza Harkous, Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G. Shin, and Karl Aberer. 2018. Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning. In USENIX Security Symposium.
[19] Patrick Gage Kelley, Joanna Bresee, Lorrie Faith Cranor, and Robert W. Reeder. 2009. A nutrition label for privacy. In Proceedings of the 5th Symposium on Usable Privacy and Security. ACM, 4.
[20] Marc Langheinrich. 2002. A privacy awareness system for ubiquitous computing environments. In International Conference on Ubiquitous Computing. Springer, 237–245.
[21] Jialiu Lin, Bin Liu, Norman Sadeh, and Jason I. Hong. 2014. Modeling users' mobile app privacy preferences: Restoring usability in a sea of permission settings. In 10th Symposium On Usable Privacy and Security (SOUPS 2014). 199–212.
[22] Fei Liu, Rohan Ramanath, Norman Sadeh, and Noah A. Smith. 2014. A step towards usable privacy policy: Automatic alignment of privacy statements. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 884–894.
[23] Nathan Malkin, Joe Deatrick, Allen Tong, Primal Wijesekera, Serge Egelman, and David Wagner. 2019. Privacy Attitudes of Smart Speaker Users. Proceedings on Privacy Enhancing Technologies 2019, 4 (2019), 250–271.
[24] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
[25] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. 2016. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755 (2016).
[26] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman. 2018. SoK: Security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 399–414.
[27] Brian Rakowski. 2019. Pixel 4 is here to help. https://blog.google/products/pixel/pixel-4/.
[28] Joel R. Reidenberg, N. Cameron Russell, Alexander J. Callen, Sophia Qasir, and Thomas B. Norton. 2015. Privacy harms and the effectiveness of the notice and choice framework. ISJLP 11 (2015), 485.
[29] Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1310–1321.
[30] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 3–18.
[31] Welderufael B. Tesfay, Peter Hofmann, Toru Nakamura, Shinsaku Kiyomoto, and Jetzabel Serna. 2018. PrivacyGuide: Towards an Implementation of the EU GDPR on Internet Privacy Policy Evaluation. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics (IWSPA '18). ACM, New York, NY, USA, 15–21. https://doi.org/10.1145/3180445.3180447
[32] Lynn Tsai, Primal Wijesekera, Joel Reardon, Irwin Reyes, Serge Egelman, David Wagner, Nathan Good, and Jung-Wei Chen. 2017. Turtle Guard: Helping Android users apply contextual privacy preferences. In Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017). 145–162.
[33] Primal Wijesekera, Joel Reardon, Irwin Reyes, Lynn Tsai, Jung-Wei Chen, Nathan Good, David Wagner, Konstantin Beznosov, and Serge Egelman. 2018. Contextualizing privacy decisions for better prediction (and protection). In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 268.
[34] Lin Yuan, Joël Theytaz, and Touradj Ebrahimi. 2017. Context-dependent privacy-aware photo sharing based on machine learning. In IFIP International Conference on ICT Systems Security and Privacy Protection. Springer, 93–107.
[35] Tianqing Zhu, Gang Li, Wanlei Zhou, and S. Yu Philip. 2017. Differentially private data publishing and analysis: A survey. IEEE Transactions on Knowledge and Data Engineering 29, 8 (2017), 1619–1638.
[36] Sebastian Zimmeck and Steven M. Bellovin. 2014. Privee: An Architecture for Automatically Analyzing Web Privacy Policies. In USENIX Security, Vol. 14.