Is It Possible to Preserve Privacy in the Age of AI?

Vijayanta Jain, University of Maine, Orono, Maine, USA
Sepideh Ghanavati, University of Maine, Orono, Maine, USA

WSDM ’20, February 3–7, 2020, Houston, TX, USA

ABSTRACT
Artificial Intelligence (AI) hopes to provide a positive paradigm shift in technology by providing new features and personalized experience to our digital and physical world. In the future, almost all our digital services and physical devices will be enhanced by AI to provide us with better features. However, as training artificially intelligent models requires a large amount of data, it poses a threat to user privacy. The increasing prevalence of AI promotes data collection and consequently poses a threat to privacy. To address these concerns, some research efforts have been directed towards developing techniques to train AI systems while preserving privacy and towards helping users preserve their privacy. In this paper, we survey the literature and identify the privacy-preserving approaches that can be employed to preserve privacy. We also suggest some future directions based on our analysis. We find that privacy-preserving research, specifically for AI, is in its early stage and requires more effort to address the current challenges and research gaps.

CCS CONCEPTS
• Privacy → Privacy protections.

KEYWORDS
Artificial Intelligence, Privacy, Machine Learning, Survey

1 INTRODUCTION
Artificial Intelligence (AI) is increasingly becoming ubiquitous in our lives through its growing presence in the digital services we use and the physical devices we own. AI already powers our most commonly used digital services, such as search (Google, Bing), music (Spotify, YouTube Music), entertainment (Netflix, YouTube), and social media (Facebook, Instagram, Twitter). These services heavily rely on AI or Machine Learning (ML)¹ to provide users with personalized content and better features, such as relevant search results, the content the users would like, and the people they might know. AI/ML also enhances several physical devices that we own (or can own), for example, smart speakers, such as Google Hub and Amazon Echo, that rely on natural language processing to detect voice, understand, and execute commands such as to control lights, [...]

[...] time on their services. The number of applications and devices that use AI will also increase in the near future. This is evident from the increasing number of smartphones with dedicated chips for machine learning (ML) [1–3, 27] and devices that come integrated with personal assistants.²,³

The proliferation of AI poses direct and indirect threats to user privacy. The direct threat is the inference of personal information and the indirect threat is the promotion of data collection. Movies such as Her accurately portray the Utopian-AI future some companies hope to provide users as they increase the ubiquity of ML in their digital and physical products. However, as training AI systems, such as deep neural networks, requires a large amount of data, companies collect usage data from users whenever they interact with any of their services. There are two major problems with this collection: first, the usage data collected is used to infer information such as personal interests, habits, and behavior patterns, thus invading privacy; and second, to improve the personalization, intelligent features, and AI capabilities of the services, companies will continuously collect and increase the data collected from users, thus leading to an endless loop of data collection which threatens user privacy (see Figure 2). Moreover, the collected data is often used for ad personalization or shared with third parties, which does not meet users' expectations and thus violates user privacy [23]. For example, when you interact with Google’s Home Mini, the text from these recordings may be used for ad personalization (see Figure 1), which does not meet the privacy expectations of the users [23].

Privacy violations in recent times have motivated research efforts to develop techniques and methodologies to preserve privacy. Previous research work has developed tools that provide users with more effective notice and choice [9, 18, 19, 31]. With increasing concerns about privacy because of AI, some efforts have also been directed towards training machine learning models while preserving privacy [4, 29]. User-focused techniques provide users with the necessary tools to preserve privacy, whereas privacy-preserving machine learning helps companies use machine learning for their services while still preserving user privacy. In this work, we survey these methods to understand the methodologies that can be employed when users are surrounded by digital services and physical devices that use AI.

2 ANALYSIS OF THE CURRENT LITERATURE
In this section, we report on our survey of machine-learning based techniques that have been developed to preserve user privacy. We divide this section into two groups: i) privacy preserving machine learning approaches and ii) techniques to provide users with notice and give them choices.

Figure 2: Cycle of Eternal Increase in Data Collection

2.1 Privacy Preserving Machine Learning Approaches
Recent research efforts have been directed to develop privacy-preserving machine learning techniques [4, 24]. Prior to machine learning, differential privacy provided a strong standard to preserve privacy for statistical analysis on public datasets. In this technique, whenever a statistical query is made to a database containing sensitive information, a randomized function k adds noise to the query result, which preserves privacy while also ensuring the usability of the database [13]. Some work has used differential privacy for training machine learning models [4, 7]. Chaudhuri and Monteleoni [7] use this technique to develop a privacy-preserving algorithm for logistic regression. Abadi et al. [4] also use this technique to train deep neural networks by developing a noisy Stochastic Gradient Descent (SGD) algorithm. However, a key problem with differential [...]

[...] In their work, they develop a system that allows several participants to train similar neural networks on their input data without sharing the data but selectively sharing the parameters with each other to avoid local minima. Similarly, in line with Shokri and Shmatikov's goal of not sharing data, McMahan et al. [24] propose Federated Learning, which allows developers to train neural networks in a decentralized and privacy-preserving manner. The idea behind their work is that the neural network models to be trained are sent to the mobile devices which contain the sensitive user data, and SGD is used locally to update the parameters. The models are then sent back to a central server which "averages" the updates from all the models to achieve a better model. They term this algorithm FederatedAveraging. Similarly, Papernot et al. [25] propose Private Aggregation of Teacher Ensembles (PATE), a method to train machine learning models while preserving privacy. In their approach, several "teacher" models are trained on disjoint subsets of the dataset, then the "student" model is trained by the aggregation of the "teachers" to accurately "mimic the ensemble". The goal of this work is to address the information leakage problem [15].

The goal of the work outlined above is to develop new algorithms and methods to train neural networks on a device or use differentially private algorithms. However, information leakage still poses a threat to the user’s privacy. Information leakage is the concept in which the neural network implicitly contains sensitive information it was trained on. This is demonstrated in [15, 30]. This is an active research topic and new methods, such as PATE, aim to resolve this issue by not exposing the dataset to the machine learning model.

2.2 Mechanisms to Control User’s Data
The primary goal in this field of research has been to provide users with better notice, give them choices, and provide them with the means to control their personal information. Notice and Choice is one of the fundamental methods to preserve privacy and is based on the Openness principle of the OECD Fair Information Principle [16]. In the Notice and Choice mechanism, the primary goal has been to improve and extract relevant information from privacy policies for the users. This is because privacy policies are lengthy and it is infeasible for users to read the privacy policies for all the digital and physical services they use or own [10]. Therefore, research has focused on providing them with better notice and choice, such as in [20, 22, 28]. Other work has achieved similar results by applying machine learning techniques. Harkous et al. [18] develop PriBot, a Q&A chatbot that analyzes a privacy policy and then provides users with the sections of the privacy policy that answer their question.

Some work has focused on identifying the quality of the privacy policy. For example, Costante et al. [8] use text categorization and machine learning to categorize paragraphs of privacy policies and assess their completeness with a grade. The grade is calculated from the weight assigned by the user to each category and the coverage of the category in a selected section. This method helps users inspect a privacy policy in a structured way and read only the paragraphs that interest them. Zimmeck et al. introduce Privee [36], which integrates Costante’s classification method with Sadeh’s crowdsourcing. In Privee, if privacy analysis results are available in the repository, the result is returned to the user. Otherwise, the privacy policy is automatically classified and then returned. PrivacyGuide [31] uses classification techniques, such as Naïve Bayes and Support Vector Machines (SVM), to categorize privacy policies based on the EU GDPR [14], summarize them, and then allocate risk factors. The work above certainly improves on the previous "state-of-the-art" method of notice and choice, the privacy policy, by giving users a succinct form of information. However, privacy policies often contain ambiguities that are difficult for technology to answer, for example, the number of third parties the data is shared with or how long the data will be stored by the companies.

Another active topic of research in providing users with control of their privacy is to model privacy preferences. The goal of this topic of research is to provide users with more control over what information mobile applications or other users can access. Lin et al. [21] create a small number of profiles of users' privacy preferences using clustering and then, based on those profiles, analyze whether a user from a profile allows certain permissions or not. Similar to their work, Wijesekera et al. [32] develop a contextually-aware permission system that dynamically permits access to private data of Android applications based on the user’s preferences. They argue that their permission system is better than the default Android permission system of Ask-On-First-Use (AOFU) because context, "what [users] were doing on their mobile devices at the time that data was requested" [32], affects users' privacy preferences. In their system, they use an SVM classifier, trained over contextual information and the user’s behavior, to make permission decisions. They also conduct a usability study to model the preferences of 37 users and test their system [33]. Similarly, other work that uses contextual information to model privacy preferences has been done for applications in web-based services as well. Yuan et al. [34] propose a model that uses contextual information to share images, with different granularity, with other users. In their work, based on the semantic image features and contextual features of a requester, they train logistic regression, SVM, and Random Forest to predict whether the user would share, would not share, or partially share the image requested. Similarly, Bilogrevic et al. [6] develop Smart Privacy-aware Information Sharing Mechanism, a system that shares personal information with users, third parties, online services, or mobile apps based on the user’s privacy preferences and the contextual information. They use Naïve Bayes, SVM, and Logistic Regression to model preferences. They also conduct a user study to understand users' preferences and the factors influencing their decisions. Using contextual information and providing different levels of information access is a great step towards providing the user with greater control of their data, but certain challenges still remain. [...]

[...] have not conducted usability studies to examine the user’s view. This inhibits deploying such research in the real world.

Overall, we find that this line of work has focused on giving users the mechanisms to understand privacy practices and control their data. Giving users control of their data is important; however, this approach puts the burden on the users to preserve their privacy, which might be difficult for less tech-savvy users, as the privacy settings for websites are often hidden under layers of settings.

3 RELATED WORK
Papernot et al. [26] provide a Systematization of Knowledge (SoK) of security and privacy challenges in machine learning. This work surveys the existing literature to identify the security and privacy threats as well as the defenses that have been developed to mitigate the threats. Based on its analysis, the work also argues for developing a framework for understanding the sensitivity of ML algorithms to their training data, in order to foster understanding of the security and privacy implications of ML algorithms. Our analysis is similar as it evaluates privacy implications of these machine learning algorithms, but our work provides a more detailed discussion on the privacy challenges as compared to [26]. Zhu et al. [35] survey different methods developed to publish and analyze differentially private data. The work analyzes differentially private data publishing based on the type of input data, the number of queries, accuracy, and efficiency, and evaluates differentially private data analysis based on the Laplace/Exponential Framework, such as [7], and the Private Learning Framework, such as [4]. The paper also presents some future directions for differential privacy, such as executing more local differential privacy. This work is the closest to our work as it surveys a privacy-preserving analysis technique and suggests future work. However, in our analysis, we also incorporate the technologies that help users preserve their privacy. Overall, our work differs from [26, 35] as we look at the big picture of privacy-preserving technologies, specifically with the increase in use of AI.

4 DISCUSSION
In this paper, we discussed techniques and methodologies developed to preserve user privacy. Primarily, we identified two groups of work: (1) privacy-preserving machine learning, such as noisy SGD and federated learning, and (2) techniques to provide users with the tools to protect their own privacy. In this section, we discuss the advantages of each category of approaches, their existing challenges, and the research gaps, and suggest some potential future work to address the challenges and gaps identified here. We summarize our analysis in Table 1.

Differential Privacy and Machine Learning Approaches: Differential privacy provides a strong state-of-the-art approach for data analysis by introducing noise to query results [12], and this method has also been used to train deep neural networks [4]. One of the biggest advantages of these approaches is the simplicity and efficiency of the methodology. Some companies have even started to use differential privacy in some of their applications.⁴ Using differential privacy for deep learning provides great potential for researchers and developers. However, understanding the trade-offs between
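As a rough illustration of the noise-addition idea behind differential privacy that this survey describes, the sketch below answers a counting query with Laplace noise. It is a minimal example under stated assumptions, not the method of any cited work: the choice of the Laplace mechanism, the function names `laplace_noise` and `private_count`, the toy `ages` data, and the value of `epsilon` are all hypothetical and chosen only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one record changes
    the true count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy database: ages of ten users (illustrative values only).
ages = [23, 35, 41, 29, 52, 61, 19, 33, 47, 38]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
print(f"noisy count of users aged 40 or older: {noisy:.2f}")
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy but a less accurate answer, which is the kind of privacy and utility tension the discussion refers to.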