A User-Centric and Sentiment Aware Privacy-Disclosure Detection Framework based on Multi-input Neural Network

A K M Nuhil Mehdy (akmnuhilmehdy@u.boisestate.edu), Boise State University, Boise, Idaho, USA
Hoda Mehrpouyan (hodamehrpouyan@boisestate.edu), Boise State University, Boise, Idaho, USA

Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the PrivateNLP 2020 Workshop on Privacy and Natural Language Processing, colocated with the 13th ACM International WSDM Conference, 2020, in Houston, Texas, USA.
PrivateNLP 2020, Feb 7, 2020, Houston, Texas

ABSTRACT
Data and information privacy is a major concern in today's world. More specifically, users' digital privacy has become one of the most important issues to deal with as information sharing technology advances. An increasing number of users share information through text messages, emails, and social media without proper awareness of privacy threats and their consequences. One approach to preventing the disclosure of private information is to identify it in a conversation and warn the sender before the message is conveyed to the receiver. Another way of preventing the loss of sensitive information is to analyze and sanitize a batch of offline documents once the data has already been accumulated somewhere. However, automating the process of identifying user-centric privacy disclosure in textual data is challenging, because natural language has an extremely rich form and structure with different levels of ambiguity. Therefore, we investigate a framework that could bring this challenge within reach by precisely recognizing users' privacy disclosures in a piece of text, taking into account the authorship and sentiment (tone) of the content alongside linguistic features and techniques. The proposed framework is intended as a supporting plugin that helps text classification systems more accurately identify text that might disclose the author's personal or private information.

CCS CONCEPTS
• Security and privacy → Privacy protections.

KEYWORDS
Privacy, Natural Language Processing, Neural Network

1 INTRODUCTION
Privacy is an ancient concept concerning human values that could be "intruded upon", "invaded", "violated", "breached", "lost", and "diminished" [29]. Each of these analogies reflects a conception of privacy that can be found in one or more standard models or theories of privacy. Users' privacy has been defined as "the right to be left alone", or being free from intrusion, by the seclusion and non-intrusion theories [8, 32]. Even though privacy varies from individual to individual and each user may have different views of privacy, there is an imperfect societal consensus that certain information (e.g. personal information, situation, condition, circumstance, etc.) is more private than other information (e.g. public statements, opinions, comments, etc.) [4].

Recent advances in communication technologies such as messaging applications and social media [29] have resulted in privacy concerns [21] about such information among users. In this era of digital communication, an increasing number of users share information through text messages, emails, and social media without proper awareness of privacy threats and their consequences. Moreover, in the context of the information society, historical documents of entities (e.g. people, organizations) need to be made public and shared among authorities every day [23]. In such cases, improper disclosure¹ of a user's information could increase his/her security and privacy vulnerabilities, and the negative consequences of disclosing such information could be immense [7].

¹ In this work, disclosure is defined as revealing personally identifiable information (e.g., name, address, age) or sensitive data (e.g., health, finance, and mental status) to others.

A recent data scandal involving Facebook and Cambridge Analytica revealed how the personally identifiable information of up to 87 million Facebook users was used to influence voters' opinions [10, 25]. Likewise, millions of data breach incidents are reported all over the world, and unfortunately most of them expose users' personal data [27]. Therefore, user-centric targeted attacks that exploit the victim's Personally Identifiable Information (PII) have become a new kind of privacy threat [31]. It is worth mentioning that, based on recent statistics, the United States is the number one destination for such user-centric targeted attacks [28]. That being the case, users' data privacy has become one of the major concerns of today's world, and the requirements for privacy measures to protect sensitive information about individuals have been researched extensively [3, 12–14, 19, 24].

As part of these efforts, researchers in the area of Natural Language Processing (NLP) have focused on developing techniques and methodologies to detect, classify, and sanitize private information in textual data. However, most of these works tend to solve these tasks by just detecting sets of keywords, leveraging dictionaries of terms, or applying regular expression patterns. These types of detection do not consider the context and the relationships of the keywords in the text, and therefore result in a high number of false positives. For example, a doctor's article about a disease is public and not private; however, the same information is sensitive and private when it is associated with other entities (e.g., a patient himself) in ways that yield a different meaning and actually reveal someone's private information. Therefore, it is equally necessary to look into the keywords, data subject (i.e. users), authorship, tone, and overall meaning of the content before classifying it as a privacy disclosure (refer to Figure 1). While a few recent works address disclosure detection by considering user-centric factors, most of them still omit other important decision-making factors such as the sentiment and authorship of the content. Therefore, this paper reviews the existing methodologies and techniques from the area of NLP and proposes a novel disclosure identification framework that keeps the following factors in mind:
• Considering user-centric circumstances, tone, and authorship of content: content that has sensitive information but no data subject, sensitive keywords but a public ambience, or an analytical tone should not be classified as disclosure.
• Checking sentence coherence and grammatical structure: the appearance of random keywords, ambiguous and meaningless information, or invalid utterances should not be classified as disclosure.

The rest of the paper is organized as follows: Section 2 reviews the related research and some of the limitations we have observed in it. Section 3 describes the dataset used in this paper. The methodology is described in detail in Section 4. In addition, the details of the deep neural network architecture, data cleaning, pre-processing, featurization, and the experiment are presented in Section 5. Lastly, Section 6 presents the experimental results, followed by the conclusion.

Figure 1: Example of a disclosure post, a non-disclosure post, and a post that is highly similar to a disclosure but is actually a non-disclosure (from top to bottom respectively).

2 BACKGROUND AND RELATED WORK
This section reviews state-of-the-art research related to identifying information disclosure by individuals or organisations, specifically the literature that has been studied across different privacy domains such as finance, health, and location. We briefly describe the works related to the detection, classification, and sanitization of private information in natural language text. We categorize the related works into three distinct groups based on their methodologies: i) Leveraging Dictionaries, ii) Information Theory and Global Search, and iii) Machine Learning and Statistical Models.

The works in the dictionary category leverage linguistic resources such as a privacy dictionary to automate the content analysis of privacy-related information. Privacy dictionaries are used with existing automated content-analysis software such as LIWC [18]. Vasalou et al. propose a technique that uses such a dictionary of individual words or phrases, each assigned to one or more privacy domains [30]. They showed that the dictionary categories could distinguish documents of privacy discussions from general language by measuring unique linguistic patterns within privacy discussions (e.g., medical records, confidential business documents).

Researchers in the area of information theory combine its theories with large corpora of words to automatically detect sensitive information in textual documents. Sánchez et al. [22] define sensitive information as the pieces of text that either reveal the identity of a private entity or refer to some confidential information about that entity. In their approach, sensitive terms are those that provide more information than common terms due to their specificity. Therefore, the task is to quantify how much information each textual term provides before identifying it as a sensitive term.

Similar document sanitization tasks have been addressed by Chakaravarthy et al., who present a scheme that detects sensitive elements using a database of entities instead of patterns [5]. Each entity in this database (e.g., persons, products, diseases, etc.) is associated with a set of terms related to that entity. Each set is considered the context of an entity. For example, the context of a person-type entity could be his/her birth date, name, etc. Another work, by Abril et al., focuses on domain-independent unstructured documents and proposes using named entity recognition techniques to identify sensitive or private entities [1].

Detection of privacy leaks has also been addressed with machine learning and statistical techniques such as association rule mining [6]. In that approach, Chow et al. employ a model of inference detection that uses a customized web-based corpus as reference, where inferences are based on word co-occurrences. The model is then given a topic (e.g. HIV, human immunodeficiency virus) and asked to identify all the associated keywords. Hart et al. (2011) use automatic text classification algorithms to classify full documents as either sensitive or non-sensitive [9]. Their task is to develop an efficient and automated tool for enterprise data loss prevention (DLP) by keeping sensitive documents secret. They introduce a novel training strategy
called supplement and adjust to create an enterprise-level classifier based on a support vector machine (SVM) with a linear kernel, stop word elimination, and unigram methodology.

2.1 Our Contribution
The limitations of the current studies stem from the fact that they rely solely on the existence of keywords and neglect sentence coherence, ignore grammatical validation, and disregard meaning inference in a piece of content. It has been shown that these limitations, in some cases, result in misclassification and could be resolved by integrating part-of-speech tags, dependency parse tree information, and word embeddings [20]. However, a novel approach is required to take into account the emotional tone or sentiment of the users that is hidden in the textual content. For example, in Figure 1, the text in the red box reveals someone's private (health) information (the patient has cancer), and the text in the green box is about the state of Idaho and represents a public ambience. It is quite easy to distinguish these two pieces of text with keyword spotting techniques [2]. However, in another example, the text in the yellow box (a comment from a doctor about cancer) has keywords similar to those of the patient's post in the red box, and contains valid word sequences and grammatical subjects (i.e. first person) with references. This piece of text is definitely not revealing a private health situation (i.e. the doctor himself does not have cancer). Hence, it is quite challenging to distinguish between these types of content without taking the sentiment of the statements into consideration. To this end, this paper focuses on distinguishing highly similar content based on the users' involvement, sentiment, authorship, and grammatical structure in order to classify texts containing someone's privacy disclosure. However, one of the assumptions of this work is that the proposed model does not solve all the privacy and security requirements of users by providing an entire threat model; rather, it provides a better NLP tool to be integrated into any comprehensive privacy framework.

3 DATASET
We collected 10,000 posts by users (patients and doctors) from a public online health forum, based on the observation (inspired by the example in Figure 1) that patients' posts in that forum tend to disclose their health status, whereas doctors' comments on patients' posts are highly similar in content (having similar keywords and syntactic representation) but usually do not disclose the doctors' health status (the doctors do not have those diseases). Therefore, we labeled patients' posts as disclosure (private) and doctors' comments as non-disclosure (public). For this paper, we crawled 5000 posts and 5000 comments and narrowed our privacy domain down to health only. The posts and comments range in length from 10 words to more than 100 words spread over several sentences.

4 METHODOLOGY
The combination of linguistic operations and an artificial neural network is the core of our methodology. A bigger picture of the framework is depicted in Figure 2. In this section, the data pre-processing, representation, and featurization steps are briefly explained, followed by the details of the neural network architecture.

4.1 Featurization and Data Representation
As can be seen from the examples in Figure 1, many domain-specific keywords can be used in both private and public posts. This makes the problem particularly challenging because we cannot simply rely on the lexical items in the text; we have to consider the intent of the author of the text, and somehow determine whether the intent was for the text to be public or private. To this end, we apply custom tokenization and enrich our data with additional linguistic information such as syntactic dependency relations.

Tokenization. In many text-based natural language processing tasks, the text is pre-processed by removing punctuation and stop words, leaving only the lexical items. However, we found that the way people punctuate their texts gives clues as to whether a text is valid private or public information. Therefore, we use an NLP toolkit to tokenize the sentences in a customized way that ignores redundant tokens such as ",,", ";–", "!!!", ":-)" but keeps the important ones such as ",", ";", ":", ".", "he", "the", "in", etc. Considering all the valid sequential tokens in this way helps our model learn important arrangements of tokens for validating relationships between entities. This is somewhat in contrast to other text analysis literature, where removing all punctuation tends to improve task performance.

Syntactic Structure. In the experiments, dependency parse tree information is also used as an additional underlying feature, which improved the performance of the neural network model. This helps the model observe common sequences of tokens as well as the co-occurrence of dependency tags. We use a Dependency Parser (DP) toolkit to extract the syntactic relation information (which is different from, but in some ways similar to, entity relation information). This allowed us to enrich our data with dependency parse information.

Supplemental Features. In addition to the features mentioned above, more user-specific features or metadata are prepared and provided to the extended variant of our models as supplemental input. Some of these auxiliary data are i) the number of pronouns, ii) the emotional tone, and iii) the number of negations found in the post. This additional information is meant to give the neural network model some distinguishing features for highly similar content of different classes.

4.2 Deep Neural Network Model
After all the necessary pre-processing steps, the data is fed into a multi-input deep neural network that learns the hidden patterns and features distinguishing texts with disclosure from texts without it. The network takes lexical features (word tokens) through one input and syntactic features (dependency parse tree information) through another input, and then merges the resulting feature vectors. Later, these vectors are additionally merged with the supplemental (auxiliary) inputs before going through a further multi-layer perceptron stage. At the end of the network, a single neuron provides the probability used to assign a text to one of the two classes mentioned above. More detail about the architecture is given in Appendix C.

Figure 2: Bigger picture of the disclosure detection framework.

5 EXPERIMENT
In the data pre-processing step, we apply spaCy [26] to perform the linguistic operations on the text. The Keras functional API is used to create the multi-input architecture [17]. For the word embeddings, we use its Embedding layer [16] with pre-trained word embeddings (GloVe) and the trainable flag set to true. In the other input of the multi-input model, the same type of embedding layer, but without pre-trained vectors, is used to learn the embedding space from the dependency parse tree information. For the convolution over the first input channel, we use the Conv1D layer [15] followed immediately by a pooling layer. In the other input of the model, a long short-term memory (LSTM) layer is used over the dependency parse tree information. The concatenate method of Keras then takes the output vectors from the convolution layer and the LSTM layer and merges them into a single vector, which acts as the input to the fully connected layers. At this step, the supplemental input, prepared using the IBM Watson Tone Analyzer [11], is merged with the concatenated vector and followed by another stage of dense layers. Finally, a single neuron with a sigmoid activation function outputs the class probability, with 0.5 as the cutoff value. As false negatives of the classifier may bring dangerous consequences, it may be wise to lower this probability cutoff so that fewer disclosures are missed, depending on how the model is used. Details of the hyperparameters are listed in Appendix B.

6 RESULTS
Before experimenting with the multi-input model, we examined the classification task using baseline models such as a naive Bayes classifier and a simple convolutional neural network. Appendix A shows in detail the comparison of accuracy among all the models, including the model that uses the user-specific supplemental input. The results show that, despite a lack of large amounts of labeled data, a neural network based classifier can be trained that goes beyond simple keyword spotting and uses linguistic features to determine whether a text contains a disclosure with a useful degree of accuracy. Moreover, we observe that integrating user-specific metadata into the models increases the classification accuracy significantly (up to 97%). However, the generalizability of the model has not been well evaluated because of the lack of datasets with similar characteristics (i.e., nearly indistinguishable utterances that carry different meanings).

7 CONCLUSION
Users are in dire need of a practical model of privacy disclosure detection in this era of social networks, which involves activities such as online forum posting, emailing, and text messaging. Accordingly, the development of algorithms and tools that help identify privacy disclosure in textual data is important. While many of the works in this area mainly focus on classifying textual data as public or private at the document level by just spotting keywords, only a few of them address privacy detection while taking the user's context into account.

ACKNOWLEDGMENTS
The authors would like to thank the National Science Foundation for its support through the Computer and Information Science and Engineering (CISE) program and Research Initiation Initiative (CRII) grant number 1657774 of the Secure and Trustworthy Cyberspace (SaTC) program: A System for Privacy Management in Ubiquitous Environments.

REFERENCES
[1] Daniel Abril, Guillermo Navarro-Arribas, and Vicenç Torra. 2011. On the declassification of confidential documents. In International Conference on Modeling Decisions for Artificial Intelligence. Springer, 235–246.
[2] Roy F Baumeister and Kenneth J Cairns. 1992. Repression and self-presentation: When audiences interfere with self-deceptive strategies. Journal of Personality and Social Psychology 62, 5 (1992), 851.
[3] Tom Buchanan, Carina Paine, Adam N Joinson, and Ulf-Dietrich Reips. 2007. Development of measures of online privacy concern and protection for use on the Internet. Journal of the Association for Information Science and Technology 58, 2 (2007), 157–165.
[4] Aylin Caliskan Islam, Jonathan Walsh, and Rachel Greenstadt. 2014. Privacy detective: Detecting private information and collective privacy behavior in a large social network. In Proceedings of the 13th Workshop on Privacy in the Electronic Society. ACM, 35–46.
[5] Venkatesan T Chakaravarthy, Himanshu Gupta, Prasan Roy, and Mukesh K Mohania. 2008. Efficient techniques for document sanitization. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 843–852.
[6] Richard Chow, Philippe Golle, and Jessica Staddon. 2008. Detecting privacy leaks using corpus-based association rules. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 893–901.
[7] Emily Christofides, Amy Muise, and Serge Desmarais. 2009. Information disclosure and control on Facebook: Are they two sides of the same coin or two different processes? Cyberpsychology & behavior 12, 3 (2009), 341–345.
[8] Jamal Greene. 2009. The so-called right to privacy. UC Davis L. Rev. 43 (2009), 715.
[9] Michael Hart, Pratyusa Manadhata, and Rob Johnson. 2011. Text classification for data loss prevention. In International Symposium on Privacy Enhancing Technologies Symposium. Springer, 18–37.
[10] Alex Hern. 2018. Far more than 87m Facebook users had data compromised, MPs told. https://www.theguardian.com/uk-news/2018/apr/17/facebook-users-data-compromised-far-more-than-87m-mps-told/-cambridge-analytica. [Online; accessed 01-April-2019].
[11] IBM. 2019. IBM Watson - Tone Analyzer. https://www.ibm.com/watson/services/tone-analyzer/. [Online; accessed 01-December-2019].
[12] Adam N Joinson, Ulf-Dietrich Reips, Tom Buchanan, and Carina B Paine Schofield. 2010. Privacy, trust, and self-disclosure online. Human–Computer Interaction 25, 1 (2010), 1–24.
[13] Rezvan Joshaghani, Stacy Black, Elena Sherman, and Hoda Mehrpouyan. 2019. Formal specification and verification of user-centric privacy policies for ubiquitous systems. In Proceedings of the 23rd International Database Applications & Engineering Symposium. 1–10.
[14] Rezvan Joshaghani and Hoda Mehrpouyan. 2017. A model-checking approach for enforcing purpose-based privacy policies. In 2017 IEEE Symposium on Privacy-Aware Computing (PAC). IEEE, 178–179.
[15] Keras. 2018. Convolutional Layers - Keras Documentation. https://keras.io/layers/convolutional/. [Online; accessed 01-February-2019].
[16] Keras. 2018. Embedding Layers - Keras Documentation. https://keras.io/layers/embeddings/. [Online; accessed 01-February-2019].
[17] Keras. 2018. Guide to the Functional API - Keras Documentation. https://keras.io/getting-started/functional-api-guide/. [Online; accessed 01-February-2019].
[18] LIWC. 2018. Linguistic Inquiry and Word Count. https://liwc.wpengine.com/. [Online; accessed 01-February-2019].
[19] Naresh K Malhotra, Sung S Kim, and James Agarwal. 2004. Internet users' information privacy concerns (IUIPC): The construct, the scale, and a causal model. Information systems research 15, 4 (2004), 336–355.
[20] Nuhil Mehdy, Casey Kennington, and Hoda Mehrpouyan. 2019. Privacy Disclosures Detection in Natural-Language Text Through Linguistically-motivated Artificial Neural Network. In 2nd EAI International Conference on Security and Privacy in New Computing Environments. EAI.
[21] Joseph Phelps, Glen Nowak, and Elizabeth Ferrell. 2000. Privacy concerns and consumer willingness to provide personal information. Journal of Public Policy & Marketing 19, 1 (2000), 27–41.
[22] David Sánchez, Montserrat Batet, and Alexandre Viejo. 2012. Detecting sensitive information from textual documents: an information-theoretic approach. In International Conference on Modeling Decisions for Artificial Intelligence. Springer, 173–184.
[23] Ashley Savage and Richard Hyde. 2014. Using freedom of information requests to facilitate research. International Journal of Social Research Methodology 17, 3 (2014), 303–317.
[24] Arnon Siegel. 1997. In Pursuit of Privacy: Laws, Ethics, and the Rise of Technology. The Wilson Quarterly 21, 4 (1997), 100.
[25] Olivia Solon. 2018. Facebook says Cambridge Analytica may have gained 37m more users' data. https://www.theguardian.com/technology/2018/apr/04/facebook-cambridge-analytica-user-data-latest-more-than-thought. [Online; accessed 01-April-2019].
[26] Spacy. 2018. Linguistic Features - Named Entities. https://spacy.io/usage/linguistic-features#section-named-entities. [Online; accessed 01-February-2019].
[27] Statista. 2019. Number of U.S. data breaches 2014-2018, by industry. https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/. [Online; accessed 01-April-2019].
[28] Symantec. 2019. 10 cyber security facts and statistics for 2018. https://us.norton.com/internetsecurity-emerging-threats-10-facts-about-todays-cybersecurity-landscape-that-you-should-know.html. [Online; accessed 01-April-2019].
[29] Herman T Tavani. 2007. Philosophical theories of privacy: Implications for an adequate online privacy policy. Metaphilosophy 38, 1 (2007), 1–22.
[30] Asimina Vasalou, Alastair J Gill, Fadhila Mazanderani, Chrysanthi Papoutsi, and Adam Joinson. 2011. Privacy dictionary: A new resource for the automated content analysis of privacy. Journal of the Association for Information Science and Technology 62, 11 (2011), 2095–2105.
[31] Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, and Xinyi Huang. 2016. Targeted online password guessing: An underestimated threat. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. ACM, 1242–1254.
[32] Samuel Warren and Louis Brandeis. 1890. The Right to Privacy. Harvard Law Review 4, 5 (1890), 1.

A RESULTS IN DETAIL
Figure 3: Accuracy of the model as a binary classification.

B MODEL HYPERPARAMETERS
Some hyperparameters worth mentioning are as follows. The pre-trained embedding uses a 100-dimensional GloVe embedding matrix whose weights can be adjusted during training. The convolution uses 32 filters with a kernel size of 4; these layers use the rectified linear unit (ReLU) as the activation function and are followed by global max pooling. The LSTM layer contains 32 neurons with all the default settings given in the Keras documentation. The first stage of dense layers, after the first concatenation, contains 128 and 64 neurons with ReLU as the activation function. The second stage of dense layers contains 64, 32, and 16 neurons with the same activation function, followed by a single output neuron with a sigmoid activation function. We train the model for 20 epochs with a batch size of 32. The model uses binary cross entropy as the loss function and RMSprop as the optimizer.

C NEURAL NETWORK ARCHITECTURE
Architecture of the Neural Network (automatically rendered by the Keras plotter) is given below.