<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Personal Information Privacy: What's Next?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khodor Hammoud</string-name>
          <email>ik19544@etu.parisdescartes.fr</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salima Benbernou</string-name>
          <email>salima.benbernou@parisdescartes.fr</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mourad Ouziri</string-name>
          <email>mourad.ouziri@parisdescartes.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yucel Saygin</string-name>
          <email>ysaygin@sabanciuniv.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafiqul Haque</string-name>
          <email>Rafiqul.haque@intelligenciaia.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yehia Taher</string-name>
          <email>yehia.taher@uvsq.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligencia R&amp;D</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratoire DAVID, Université de Versailles - Paris-Saclay</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sabanci University</institution>
          ,
          <addr-line>Istanbul</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université de Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Université de Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>30</fpage>
      <lpage>37</lpage>
      <abstract>
        <p>In recent events, user privacy has been a main focus for all technological and data-holding companies, due to the global interest in protecting personal information. Regulations like the General Data Protection Regulation (GDPR) set firm laws on the handling and misuse of users' data. These privacy rules apply regardless of the data structure, whether it be structured or unstructured. In this work, we summarize the available algorithms for providing privacy in structured data, and analyze the popular tools that handle privacy in textual data. We found that although these tools provide adequate results in terms of de-identifying medical records by removing personal identifiers (HIPAA PHI), they fall short in terms of being generalizable to satisfy non-medical fields. In addition, the metrics used to measure the performance of these privacy algorithms don't take into account the differences in significance that every identifier has. Finally, we propose the concept of a domain-independent adaptable system that learns the significance of terms in a given text, in terms of person identifiability and text utility, and is then able to provide metrics to help find a balance between user privacy and data usability.</p>
        <p>Index Terms: k-anonymity, l-diversity, t-closeness, NLP, textual data, privacy in text</p>
        <p>I. INTRODUCTION AND MOTIVATION</p>
        <p>The legal right to privacy is a fundamental human right recognized in the UN (United Nations) declaration of human rights [52]. The unprecedented growth of highly advanced technologies in the last two decades has imperiled privacy significantly. Today, different aspects of human life have been digitalized, including communication, socialization, entertainment, purchasing and many others. People adopted digital systems due to the increasing efficiency in day-to-day tasks. In some cases the adoption is forced by social practices, such as the use of social media. Nevertheless, the digital transformation created ample opportunities for various organizations and adversaries to abuse privacy, since digital systems enable them to hold information about people forever. Organizations such as Google can profile anyone without the users being aware of it. Concrete evidence of personal information abuse comes from two recent incidents: the Facebook trial [46] and the Google testimony [47]. It is not only software vendors; social media and hardware companies violate privacy as well. Examples include Samsung's smart TVs recording audio [51], and the recent events unveiling the potential of the giant Chinese tech company Huawei using its mobile phones to spy on users, leading to its phones being blocked from using Google services and banned from the US [50]. There are regulations that govern user data handling, the latest of which is the European General Data Protection Regulation (GDPR) [49], which requires transparency and user anonymity when performing statistical analysis, and places heavy fines on violating parties.</p>
        <p>The task of making the web a safe place for users is a largely difficult problem due to the inherently open, nondeterministic nature of the Web, and the complex, leakage-prone information flow of many Web-based transactions that involve the transfer of sensitive, personal information. Despite considerable attention, Web privacy continues to pose significant threats and challenges. One major step is securing the way companies store, share and publish user information, as data regulations impose data publication, which if not secured, can be used to re-identify the individual owners. Securing stored/published data depends on the way data is stored. In the past, information was almost strictly stored in the form of structured relational databases [44]. Consequently, shared data was in the form of structured datasets. Ensuring privacy for these datasets was first done by deleting the unique identifiers, but then L. Sweeney [28] published a research result proving that users can still be identified from their quasi-identifiers, and proposed a new methodology known as k-anonymity [1]. Following k-anonymity, several solutions were proposed, including l-diversity [3] and t-closeness [4], that address shortcomings discovered in k-anonymity. However, in 2006, Dwork and Aaron introduced differential privacy [6] as a solution for privacy-preserving data analysis, which can be used to provide security for both data storage and analysis. Recently, the changes in applications, user and infrastructure characteristics, mostly of the Web 2.0 domain [53] and cloud</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>This work was made possible thanks to the funding provided by Cognitus.</p>
      <p>platform, led to an exponential growth of the internet and the explosion of data sources such as sensors, social media, etc., and massive workloads. This kind of data is typically referred to as Big Data [7]. This fostered the requirement of a new format of data storage, known as unstructured data [<xref ref-type="bibr" rid="ref40">48</xref>], which is essentially the central focus of this research. To be specific, the textual form of this unstructured data is the key focus of this research. Privacy in unstructured text is critically important for several reasons, yet the most substantial reason is the amount of unstructured data generated by companies. More than 80% of the data generated in the last ten years is unstructured (mostly in textual form). This implies that a massive volume of data is recorded in textual form, yet privacy in unstructured text, to the best of our knowledge, lacks robust solutions. We studied several use cases that belong to different industrial domains, including finance, healthcare, and insurance. According to our study, the banking and healthcare sectors generate a huge volume of unstructured text; both of these industrial domains are facing several challenges concerning the privacy of user information, which is the key motivation of this research. The need for a privacy mechanism for unstructured data confidentiality exceeds expectations, especially for textual data in the healthcare sector [8] [<xref ref-type="bibr" rid="ref1">9</xref>]. A large number of research efforts in this field have aimed at providing anonymity in text. Many works arose that provide different privacy solutions for text, mostly focusing on medical data, governed by the regulations placed by the Health Insurance Portability and Accountability Act (HIPAA) [<xref ref-type="bibr" rid="ref28">36</xref>]. Older work used rule-based approaches, but more recent work is centered around the use of neural networks and deep learning.</p>
      <p>II. PRIVACY IN STRUCTURED DATA</p>
      <p>There are two natural models for privacy mechanisms: interactive and noninteractive. In the noninteractive setting, the data collector, a trusted entity, publishes a sanitized version of the collected data; the literature uses terms such as anonymization and de-identification. Traditionally, sanitization employs techniques such as data perturbation and sub-sampling, as well as removing well-known identifiers such as names, birth dates, and social security numbers. It may also include releasing various types of synopses and statistics. In the interactive setting, the data collector, again trusted, provides an interface through which users may pose queries about the data and get (possibly noisy) answers.</p>
      <p>Originally, data were published in tabular format and made anonymous by simply removing all the explicit identifiers like names and phone numbers. However, in most of these cases, the remaining data can be used to re-identify individuals by linking it to other purposely collected data or by looking at unique characteristics in the released data [<xref ref-type="bibr" rid="ref20">28</xref>] [<xref ref-type="bibr" rid="ref21">29</xref>] [<xref ref-type="bibr" rid="ref22">30</xref>]. Combinations of a few characteristics often combine in populations to uniquely or nearly uniquely identify some individuals. The best-known study on this is one done by Archie et al. [<xref ref-type="bibr" rid="ref21">29</xref>] at the University of Texas, where they applied their own de-anonymization methodology to a dataset published by Netflix (the Netflix Prize dataset) [<xref ref-type="bibr" rid="ref17">25</xref>], which contained anonymous movie ratings of 500,000 Netflix subscribers, and demonstrated that an adversary who knows only little information about an individual subscriber can easily identify this subscriber's record in the dataset.</p>
      <p>In this paper, we provide a review of the different methodologies used for user data privacy in structured data and unstructured textual data. One of our key objectives in this research is to discover the most promising methodologies that have been proposed in the literature. Therefore, we have reviewed the key existing solutions and conducted a deep and wide comparative study. Also, we reviewed the most prominent tools available on the web for natural language processing that can be/are being used for providing privacy in text. In our comparative study, we look into the privacy methodologies used for structured data, before the governance of the use of unstructured data. We also identify the major weaknesses of existing approaches for privacy in natural texts. Based on our findings, we propose a novel methodology that would address these weaknesses. In this paper, we merely present the architecture of our work-in-progress that is aimed at providing user anonymity in text. In addition, our proposed solution is capable of providing metrics concerning the risk of privacy leakage, sensitivity, and usability of a given text document containing personal information.</p>
      <p>A more recent work by Narayanan et al. [<xref ref-type="bibr" rid="ref22">30</xref>] shows a similar context, only this time de-anonymizing the Netflix Prize dataset users using publicly available Amazon review data [<xref ref-type="bibr" rid="ref18">26</xref>] [<xref ref-type="bibr" rid="ref19">27</xref>]. Here, [<xref ref-type="bibr" rid="ref22">30</xref>] were able to uncover more user information, such as a user's full name and shopping habits.</p>
      <p>A. Noninteractive Approach</p>
      <p>1) K-Anonymity: k-anonymity [1] is a property of a dataset that describes its level of anonymity. It was developed in 1998 as a means to address the problem of releasing person-specific data while preserving the anonymity of the individuals to whom the data refers, using generalization and suppression techniques. A dataset is k-anonymous if every combination of identity-revealing characteristics (quasi-identifiers) occurs in at least k different rows of the dataset. Table I shows a dataset that has been 2-anonymized; note how the attributes "Age" and "Gender" are identical in the top 2 and bottom 2 rows.</p>
      <p>2) l-Diversity: l-diversity [3] was developed in 2006 to solve two privacy problems found in k-anonymity.</p>
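As a concrete illustration of the definition above, here is a minimal sketch in Python that computes the k for which a table is k-anonymous; the table values are hypothetical, mirroring the generalized Age/Gender layout described for Table I:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k such that the table is k-anonymous:
    the size of the smallest equivalence class formed by the
    quasi-identifier columns."""
    classes = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(classes.values())

# A toy 2-anonymized table: "Age" and "Gender" are identical
# in the top two and in the bottom two rows.
table = [
    {"Age": "[10-12]", "Gender": "Male",   "Disease": "Cancer"},
    {"Age": "[10-12]", "Gender": "Male",   "Disease": "Heart Disease"},
    {"Age": "[11-12]", "Gender": "Female", "Disease": "Viral Infection"},
    {"Age": "[11-12]", "Gender": "Female", "Disease": "Viral Infection"},
]
print(k_anonymity(table, ["Age", "Gender"]))  # 2
```

Generalization (replacing exact ages by ranges) and suppression (dropping values entirely) are precisely the operations used to grow these equivalence classes until the desired k is reached.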
      <p>The remainder of this paper is organized as follows. We start by discussing privacy methodologies used in structured data in Section II. In Section III we discuss privacy methodologies and tools used for textual data, and discuss their advantages and shortcomings. Then we introduce our approach for privacy in text in Section IV, and finally conclude and debate future work in Section V.</p>
      <p>The first problem is that an attacker can discover the values of sensitive attributes in a k-anonymous dataset when there is little diversity in those sensitive attributes. The second is background knowledge attacks. To give an example, if there are 100 different men aged above 70 years living in area A who all have allergies to peanuts, then I know that Bob, who is 72 years of age and living</p>
      <p>Table I (reconstructed from extraction residue): a 2-anonymized dataset with quasi-identifiers Age ([10-12], [10-12], [11-12], [11-12]) and Gender (Male, Male, Female, Female), generalized Zip Code (1305*, 1305*, 1485*, 1485*), and sensitive attribute Disease (Cancer, Heart Disease, Viral Infection, Viral Infection); the Nationality column could not be recovered.</p>
      <p>in area A, also has an allergy to peanuts. l-diversity aims to solve these problems by applying the following principle: a generalized quasi-identifier q*-block (equivalence class) is l-diverse if it contains a minimum of 'l' properly depicted values under the sensitive attribute present in these blocks.</p>
      <p>In the 2016 WWDC keynote, Apple VP of Engineering Craig Federighi announced Apple's use of the concept to protect user privacy in iOS [<xref ref-type="bibr" rid="ref31">39</xref>]. According to linknovate.com, tech corporations are researching heavily into differential privacy, with Microsoft, Google and Apple being the top entities worldwide leading the innovations and advancements as of the date of publishing this work [<xref ref-type="bibr" rid="ref33">41</xref>]. Google developed new algorithmic techniques for deep learning and a refined analysis of privacy costs within the framework of differential privacy to solve the problem of models exposing private information [<xref ref-type="bibr" rid="ref32">40</xref>]. Google also announced on September 5, 2019 that it is open-sourcing an internal tool, called the differential privacy library, that the company uses to securely draw insights from datasets that contain the private and sensitive personal information of its users [<xref ref-type="bibr" rid="ref34">42</xref>].</p>
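The principle above can be sketched directly. The following hedged Python illustration implements distinct l-diversity, the simplest variant (entropy- and recursive-l-diversity are stricter); the table values are hypothetical, echoing the peanut-allergy example:

```python
def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the l for which the table is (distinct) l-diverse:
    the minimum count of distinct sensitive values within any
    q*-block (equivalence class over the quasi-identifiers)."""
    blocks = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        blocks.setdefault(key, set()).add(row[sensitive])
    return min(len(values) for values in blocks.values())

# 2-anonymous, but the first block is fatally homogeneous:
# every member has the same allergy, enabling the attack above.
table = [
    {"Age": "[70-75]", "Area": "A", "Allergy": "Peanuts"},
    {"Age": "[70-75]", "Area": "A", "Allergy": "Peanuts"},
    {"Age": "[30-35]", "Area": "B", "Allergy": "None"},
    {"Age": "[30-35]", "Area": "B", "Allergy": "Dust"},
]
print(l_diversity(table, ["Age", "Area"], "Allergy"))  # 1
```

A value of 1 flags exactly the homogeneity attack: k-anonymity is satisfied, yet the sensitive attribute of anyone in the first block is fully disclosed.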
      <p>If every q*-block in a dataset is l-diverse, then the dataset meets the l-diversity concept. Table II shows an example of an l-diverse (3-diverse) dataset.</p>
      <p>3) t-Closeness: t-closeness [4] comes as a betterment of l-diversity by decreasing the granularity of the interpreted data. It was introduced in 2007, when Li et al. [4] showed that l-diversity is neither necessary nor sufficient to prevent attribute disclosure, and instead provided t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of that attribute in the overall table. The distance between distributions is measured using the Earth Mover's Distance (EMD). For a categorical attribute, EMD is used to measure the distance between the values according to the minimum level of generalization of these values in the domain hierarchy. Table III shows an example of a dataset that has 0.167-closeness with respect to Salary and 0.278-closeness with respect to Disease.</p>
      <p>Although differential privacy is praised for being an interactive solution that can be adapted to different scenarios (data collection, data analysis, machine learning...), it is not without its flaws. Kifer and Machanavajjhala [<xref ref-type="bibr" rid="ref35">43</xref>] provide a no-free-lunch theorem to show that it is necessary to make assumptions about how the data is generated in order to provide privacy, contrary to what differential privacy claims. There is also the open problem of setting the optimum value of the algorithm's parameters based on the scenario at hand, like the parameter epsilon. In addition, the main criticism against differential privacy is the fact that it produces noisy results, decreasing the accuracy of the output. This means that in order to get decent results from a query, one needs a reasonably large dataset so that the added noise doesn't interfere much with the accuracy of the results.</p>
      <p>III. PRIVACY IN TEXTUAL DATA</p>
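For a numerical (ordered) attribute, Li et al.'s EMD reduces to a closed form over cumulative differences. A small sketch under that assumption follows; the distributions used here are hypothetical, not the Table III data:

```python
def emd_ordered(p, q):
    """Earth Mover's Distance between two distributions p and q
    over the same ordered domain of m values: the sum of absolute
    cumulative differences, normalized by m - 1 so the maximum
    possible distance is 1 (Li et al.'s formula for numerical
    attributes)."""
    cumulative, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cumulative += pi - qi
        total += abs(cumulative)
    return total / (len(p) - 1)

def t_closeness(class_dists, overall):
    """A table has t-closeness for the smallest t bounding the EMD
    between every equivalence class and the overall table."""
    return max(emd_ordered(d, overall) for d in class_dists)

# Overall salary distribution uniform over 3 brackets; the first
# class is concentrated entirely in the lowest bracket.
overall = [1 / 3, 1 / 3, 1 / 3]
classes = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
print(round(t_closeness(classes, overall), 3))  # 0.5
```

The concentrated class dominates: the further an equivalence class drifts from the overall distribution, the larger the t the table must admit.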
      <p>These methods are not applicable for providing privacy for textual data. They were conceived at a time when structured data was the dominant method of data storage.</p>
    </sec>
    <sec id="sec-2">
      <title>Unstructured Data</title>
      <p>Unstructured data has internal structure but is not structured via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It does not fit neatly into the traditional row-and-column structure of relational databases. Examples of unstructured data include:</p>
      <p>emails, videos, audio files, web pages, and social media messages. According to MongoDB, in today's world of Big Data [7], most of the data that is created is unstructured, with some estimates putting it at more than 95% of all data generated [<xref ref-type="bibr" rid="ref13">21</xref>].</p>
      <p>Our work focuses on privacy in textual data. There has been a lot of work on applying privacy to text, mostly in the form of de-identification. Challenges like the n2c2 2006 De-identification and Smoking Challenge [8] and the n2c2 2014 De-identification and Heart Disease Risk Factors Challenge [<xref ref-type="bibr" rid="ref1">9</xref>] (previously housed at i2b2) motivated research in textual data de-identification, namely in the field of healthcare. This influenced most work being done on textual data privacy to primarily target medical documents, due to the relative ease of access to pre-labeled training data; these challenges provided pre-labeled data from the domain to facilitate any training/testing required by the algorithms in development.</p>
      <p>A. Medical Field</p>
      <p>As such, PHI pattern-recognition performance may not be generalizable to different datasets (i.e., data from a different institution or a different type of medical report). Another disadvantage is the need for developers to be aware of all possible PHI patterns that can occur, such as location patterns that use nonstandard abbreviations (e.g., 'Cal' for California). Later work tended to be mostly based on machine-learning methods that classify words as PHI or not PHI, and into different classes of PHI in the former case.</p>
      <p>B. Differential Privacy (Interactive Approach)</p>
      <p>Differential privacy was introduced in 2006 by Dwork and Aaron [6]. It offers a robust mathematical definition of privacy and was developed as a solution for privacy-preserving data analysis. It ensures that the result of an algorithm is not overly dependent on any single instance, and states that there should be a strong probability of producing the same output even if an instance were added to or removed from the dataset. Differential privacy leapt from research papers to tech news headlines with the 2016 WWDC keynote.</p>
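That "strong probability of producing the same output" is typically achieved by adding calibrated noise to query results. A minimal sketch of the Laplace mechanism for a count query, one standard (though not the only) way to satisfy epsilon-differential privacy, assuming the dataset and predicate below as toy inputs:

```python
import math
import random

def laplace_sample(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5          # u in [-0.5, 0.5)
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_count(dataset, predicate, epsilon):
    """Differentially private count query. A count has sensitivity
    1 (adding or removing one record changes the result by at most
    1), so Laplace noise of scale 1/epsilon suffices for
    epsilon-differential privacy."""
    true_count = sum(1 for record in dataset if predicate(record))
    return true_count + laplace_sample(1.0 / epsilon)

ages = [72, 34, 71, 55, 80]
noisy = dp_count(ages, lambda a: a > 70, epsilon=0.5)
print(noisy)  # true count 3, perturbed by Laplace(0, 2) noise
```

Smaller epsilon means larger noise scale and stronger privacy, which is exactly the accuracy trade-off criticized later in this section: on small datasets the noise can swamp the true answer.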
      <p>These machine-learning methods used a range of techniques, from Support Vector Machines to Conditional Random Fields, Decision Trees, and Maximum Entropy [<xref ref-type="bibr" rid="ref7">15</xref>] [<xref ref-type="bibr" rid="ref8">16</xref>] [<xref ref-type="bibr" rid="ref9">17</xref>] [<xref ref-type="bibr" rid="ref10">18</xref>]. More recent work is focused on utilizing neural networks and deep learning to de-identify patient data. Ji Young Lee et al. [<xref ref-type="bibr" rid="ref11">19</xref>] incorporate human-engineered features, as well as features derived from electronic health records, into a neural-network-based de-identification system composed of a Long Short-Term Memory neural network [<xref ref-type="bibr" rid="ref12">20</xref>].</p>
      <p>In many countries including the United States, medical professionals are strongly encouraged to adopt electronic health records (EHRs) and may face financial penalties if they fail to do so [<xref ref-type="bibr" rid="ref26">34</xref>] [<xref ref-type="bibr" rid="ref27">35</xref>]. One of the key components of EHRs is patient notes. However, before patient notes can be shared with medical investigators, some types of information, referred to as protected health information (PHI), must be removed in order to preserve patient confidentiality. In the United States, the</p>
      <p>Health Insurance Portability and Accountability Act (HIPAA) [<xref ref-type="bibr" rid="ref28">36</xref>] defines 18 different types of PHI:
1) Names
2) Dates, except year
3) Telephone numbers
4) Geographic data
5) FAX numbers
6) Social Security numbers
7) Email addresses
8) Medical record numbers
9) Account numbers
10) Health plan beneficiary numbers
11) Certificate/license numbers
12) Vehicle identifiers and serial numbers, including license plates
13) Web URLs
14) Device identifiers and serial numbers
15) Internet protocol addresses
16) Full face photos and comparable images
17) Biometric identifiers (i.e. retinal scans, fingerprints)
18) Any unique identifying number or code</p>
      <p>B. Differential Privacy with Textual Data</p>
      <p>Benjamin Weggenmann et al. provide an automated text anonymization approach that applies differential privacy to the vector space model [54]. They obscure term frequencies in textual documents' TF-IDF vectors in a differentially private manner. Their aim is to prevent a document's author attribution through the evaluation of the document's TF-IDF vectors using different data-mining techniques. They also demonstrate that this approach has a low impact on accuracy when mining these document vectors. Our goal is different from that of Weggenmann in that we aim to provide privacy methods for the actual text documents, not their vector representations.</p>
      <p>C. Review of Available De-Identification Tools</p>
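A heavily simplified sketch of the idea just described (not Weggenmann et al.'s actual mechanism, whose noise calibration is more refined): perturb each component of a document's TF-IDF vector with Laplace noise before release, so the released vector no longer reveals exact term frequencies. The sensitivity value is an assumption for illustration only:

```python
import math
import random

def perturb_tfidf(vector, epsilon, sensitivity=1.0):
    """Add Laplace noise of scale sensitivity/epsilon to each
    TF-IDF weight, then clip at zero since TF-IDF weights are
    non-negative. The default sensitivity here is a stand-in; in
    practice it must be derived from how much a change to one
    document can move the vector."""
    scale = sensitivity / epsilon
    noisy = []
    for weight in vector:
        u = random.random() - 0.5
        noise = -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
        noisy.append(max(0.0, weight + noise))
    return noisy

document_vector = [0.12, 0.0, 0.55, 0.33]   # toy TF-IDF weights
released = perturb_tfidf(document_vector, epsilon=1.0)
print(released)
```

Note how this operates only on the vector representation: the document's actual wording is untouched, which is precisely the gap our work targets.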
    </sec>
    <sec id="sec-3">
      <title>1) MITdeid</title>
      <p>MITdeid [23] is an automated de-identification software package that is generally usable on most medical records, aimed at removing HIPAA PHI and an extended PHI set that includes doctors' names and years of dates. The software achieves this by utilizing lexical look-up tables, regular expressions, and simple heuristics. This tool has a precision of 93.2%, a recall of 99.8%, and an F1-score of 96.4%.</p>
      <p>Performing such a task manually proved to be time-consuming and quite expensive. Douglass et al. [<xref ref-type="bibr" rid="ref29">37</xref>] [<xref ref-type="bibr" rid="ref30">38</xref>] reported that annotators were paid 50 US dollars per hour and read 20,000 words per hour at best. This motivated research in this domain to automate the process.</p>
      <p>2) De-identification of Patient Notes with Recurrent Neural Networks (2016), DeID [<xref ref-type="bibr" rid="ref23">31</xref>]:</p>
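To make the lexical look-up/regular-expression family concrete, here is a toy scrubber in the spirit of MITdeid and NLM-Scrubber. The patterns and the name dictionary are illustrative stand-ins; real tools cover far more PHI categories and formats:

```python
import re

# Minimal rule-based PHI patterns (illustrative, not exhaustive).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}
# Stand-in for a lexical look-up table of known names.
NAME_LOOKUP = {"Dave", "Alice"}

def scrub(text):
    """Replace every matched identifier with a category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for name in NAME_LOOKUP:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text

print(scrub("Dave called 555-123-4567 on 3/14/2019"))
# [NAME] called [PHONE] on [DATE]
```

The strengths and weaknesses discussed below follow directly from this design: adding a rule is trivial, but any identifier format absent from the tables and patterns passes through untouched.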
      <p>Earlier research in the field was oriented towards rule-based or pattern-matching solutions, using complex regular expressions, dictionaries, or a combination of both [<xref ref-type="bibr" rid="ref2">10</xref>] [<xref ref-type="bibr" rid="ref3">11</xref>] [<xref ref-type="bibr" rid="ref4">12</xref>] [<xref ref-type="bibr" rid="ref5">13</xref>] [<xref ref-type="bibr" rid="ref6">14</xref>]. The advantage of rule-based and pattern-matching de-identification methods is that they require little or no annotated training data, and can be easily and quickly modified to improve performance by adding rules, dictionary terms, or regular expressions. Their disadvantages are that developers have to craft many complex algorithms in order to account for different categories of PHI, and the customization required to fit a particular dataset.</p>
      <p>The DeID solution uses RNNs (Recurrent Neural Networks) with LSTM (Long Short-Term Memory) to de-identify medical text documents. The system is composed of 3 layers:
1) Character-enhanced token embedding layer
2) Label prediction layer
3) Label sequence optimization layer
The solution was evaluated on two datasets: n2b2 2014 [<xref ref-type="bibr" rid="ref1">9</xref>] and the MIMIC de-identification dataset (assembled by the authors of the system and twice as large as the n2b2 2014 dataset). It has a precision of 97.9%, a recall of 97.8%, and an F1-score of 97.8%.</p>
      <p>Table residue (first two columns of the comparison table): the algorithms K-anonymity, l-diversity, t-closeness and Differential Privacy target structured data, while the tools MITdeid, DeID, NLM-Scrubber and CliniDeID target textual data.</p>
      <p>Table residue (approach column of the comparison table). K-anonymity: generalization/suppression of quasi-identifiers. l-diversity: adds on k-anonymity by making the sensitive attributes in every equivalence class of the dataset contain a minimum of l properly depicted values. t-closeness: adds on l-diversity by decreasing the granularity of the interpreted data. Differential Privacy: adds noise to ensure that the result of an algorithm is not overly dependent on any single instance. MITdeid: utilizes lexical look-up tables, regular expressions and simple heuristics to remove HIPAA PHI from medical records. DeID: uses RNNs with LSTM to remove HIPAA PHI from medical records. NLM-Scrubber: uses a combination of regular expressions and pattern-matching to remove 12 personal identifier categories in patient medical notes. CliniDeID: uses twelve de-identification models that use deep learning, shallow learning, or rule-based approaches to remove HIPAA PHI.</p>
      <p>3) NLM-Scrubber: NLM-Scrubber [<xref ref-type="bibr" rid="ref24">32</xref>] is advertised as a HIPAA-compliant clinical text de-identification tool designed and developed at the National Library of Medicine. It looks for 12 personal identifier categories in patient medical notes. It uses a combination of regular expressions and pattern-matching to locate and remove identifiers from documents.</p>
      <p>The Stanford CoreNLP tool showed varying inaccuracies when it came to NER, especially with identifying person names. 2) NeuroNER [<xref ref-type="bibr" rid="ref47">56</xref>]: a program that performs named-entity recognition (NER), used by the Stanford CoreNLP tool. It is composed of a 3-layer recurrent neural network with LSTM. The tool was good at detecting names, unless they were lower-cased.</p>
      <p>Testing on some sample text reveals that it is not difficult to fool this tool. For example, a name that isn't capitalized is not detected as a name. Also, the age (65) in a sentence like "Dave is a 65 year old man" is not detected as an age. This tool offers a precision of 93.2%, a recall of 99.8%, and an F1-score of 96.4%.</p>
      <p>NeuroNER was also good at detecting locations, but it does not detect any date or age. 3) Gate [<xref ref-type="bibr" rid="ref48">57</xref>]: Gate offers text analysis services (part-of-speech tagging, NER, ...), but it uses a static approach. It is not so good at detecting named entities, and it is quite easy to trick by changing the structure of the sentence.</p>
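The precision, recall, and F1 scores quoted for these tools are related by the usual F1 formula (the harmonic mean of precision and recall); a quick check confirms the reported figures are internally consistent:

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The reported tool scores round-trip correctly:
print(round(f1(93.2, 99.8), 1))  # 96.4  (MITdeid, NLM-Scrubber)
print(round(f1(97.9, 97.8), 1))  # 97.8  (DeID)
print(round(f1(97.0, 94.4), 1))  # 95.7  (CliniDeID)
```

Because F1 weighs precision and recall equally, it hides exactly the kind of asymmetry visible above: a tool with near-perfect recall but weaker precision scores the same as a more balanced one, even though the privacy consequences of missed identifiers and of over-redaction differ.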
      <p>4) CliniDeID: a tool for de-identifying clinical notes according to the HIPAA Safe Harbor method. Owned by the company ClinAcuity and based on the work done by Youngjun Kim et al. [<xref ref-type="bibr" rid="ref25">33</xref>], it finds identifiers and tags or replaces them with surrogates for anonymity. The tool includes twelve de-identification models that use deep learning, shallow learning, or rule-based approaches. This tool has a precision of 97%, a recall of 94.4% and an F1-score of 95.7%.</p>
      <p>4) IBM's Watson Natural Language Understanding [<xref ref-type="bibr" rid="ref49">58</xref>]: a collection of APIs that offer text analysis tools using Natural Language Processing. Watson's performance in NER was inconclusive for us. On one hand, it offers the most complex analysis, where it not only names the entities with decent granularity (an address is divided into "location" and "facility"), but can also detect the tonality of the speech. On the other hand, it was still not so difficult to mislead.</p>
      </p>
      <p>
        Table III-A compares all the tools and algorithms mentioned in this section so far.
      </p>
      <p>
        5) Named Entity Recognition Tools: Due to the strong correlation between text de-identification and Named Entity Recognition (NER), here we discuss the tools we found that handle text analysis and NER.
      </p>
      <p>
        1) Stanford CoreNLP Tool (2014) [
        <xref ref-type="bibr" rid="ref46">55</xref>
        ]: An NLP tool created by Stanford University, initially developed in 2006; further work led to the system being released as free open-source software in 2010. The tool supports, to varying degrees, Arabic, Chinese, English, French and German. In our experimentation, the tool was quite successful at part-of-speech tagging, but had
      </p>
      <p>D. Discussion</p>
      <p>
        The application of differential privacy to textual data is mostly possible by using the word vectorization of key terms chosen from the text, then applying differential privacy to these vectors. This is useful for running privacy-preserving statistical analysis on these terms, or to prevent a document's author attribution [54]. However, it falls short when it comes to preserving the structure of the text, since it picks out only specific terms, discarding the rest of the text. Our aim
      </p>
      <p>
        Fig. 1. MODS Model Construction Process for a Given Domain D.
      </p>
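      <p>
        The mechanism just described, vectorizing key terms and then randomizing the vectors, can be sketched minimally as follows. The toy embeddings, the epsilon value, and the per-dimension Laplace noise are all illustrative assumptions, not the implementation of the cited work.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one centered Laplace sample via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_vector(vec, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise to every dimension of a word vector (basic DP sketch)."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vec]

# Toy 4-dimensional "embeddings" for two key terms (illustrative values only).
embeddings = {
    "diabetes": [0.2, -0.7, 0.5, 0.1],
    "paris":    [0.9, 0.3, -0.2, 0.4],
}
rng = random.Random(42)
private = {term: perturb_vector(vec, epsilon=2.0, rng=rng)
           for term, vec in embeddings.items()}
```

        A smaller epsilon yields a larger noise scale, hence stronger privacy but less useful statistics over the perturbed vectors.
      </p>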
      <p>Fig. 2. MODI Model Construction Process for a Given Domain D.</p>
      <p>is not to choose specific keywords from a given text to run a specific analysis; rather, we want to conserve most of the text, removing or obscuring only what is necessary to preserve the privacy of any individuals present in it, while preserving most of the text's utility.</p>
      <p>In order to cause a privacy problem, a document should be identifiable (containing personal identifiers) and should contain private information. If either of these conditions is not satisfied, the document will not cause privacy leaks. Therefore, the degree of privacy risk associated with a textual document is a combination of the identifying and private information present in it.</p>
      <p>Because text documents have different priorities as to what is private and what is sensitive depending on the domain they belong to (finance, healthcare, ...), it is important to take into account the differences in each term's "criticality" in a given document based on that domain. In our work, we refer to words in documents as "terms"; not all words, however, but only those that are not stop words.</p>
      <p>Dictionary-based and pattern-matching-based approaches to providing privacy in text, like MITdeid and NLM-Scrubber, are not very complex to implement and require little to no annotated training data. But that comes at the expense of being static: every target term to be captured has to be manually transcribed through complex regular expressions. In addition, solutions using these approaches cannot be generalized to handle different datasets; they are only made to handle a target dataset, or a category of datasets.</p>
      <p>Neural-network-based approaches like DeID and CliniDeID are the ones with the most promise in terms of accuracy, adaptability, and generalizability. Once a neural network model is created and trained to capture specific terms in a piece of text, it can also capture other terms that are semantically equivalent, learned from the context of the text. This is crucial in text analysis, since it is very difficult to predict every possible structure of a sentence in a given language, or every possible use of a term or word. The problem with available solutions is that they are fixated on the field of medicine and on capturing PHIs, and do not take into account texts from other domains such as finance or trading. Besides, these solutions treat all identifiers equally, although one identifier (like a name) can have a higher priority for removal than others (like age), since it can identify an individual more easily.</p>
      <p>The main objective of our work is twofold:
1) Devise a framework to measure the degree of identifiability and sensitivity of entities in a given text domain. Then, using these measures, construct a module to assess the privacy risks of a text document belonging to said domain, if the text were to be published, based on the sensitive information and the personal identifiers present in the document. The provided measures take into consideration the semantics of the documents as the main utility guarantee.
2) Based on the devised framework, provide a set of tools and algorithms for privacy-preserving textual data management, including textual data publishing and mining.</p>
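      <p>
        The static dictionary/pattern-matching style of de-identification discussed above can be illustrated with a short sketch. The two regular expressions are toy rules of our own, far simpler than the hand-curated rule sets of tools such as NLM-Scrubber.

```python
import re

# Illustrative HIPAA-style patterns; real scrubbers maintain much larger,
# manually curated rule sets (these two rules are only a sketch).
PATTERNS = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text):
    """Replace every pattern match with its category placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = scrub("Patient seen on 03/14/2019, callback 555-867-5309.")
# note == "Patient seen on [DATE], callback [PHONE]."
```

        The sketch also exposes the weakness discussed above: a date written as "March 14th" would slip past these static rules entirely.
      </p>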
      <sec id="sec-3-1">
        <title>IV. OUR APPROACH</title>
        <p>A. Term Attribution</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>We associate three attributes to each term in a given text document</title>
      <p>In our work, we concentrate on text; therefore, all operations are done on textual documents, each of which contains natural language text. We assume that a document is associated with a single natural person and, without loss of generality, that multiple documents may refer to a single natural person.</p>
      <p>1) Identifiability: the degree to which a term can identify a real person. For example, a person's name has a high level of identifiability, but their country of residence might have a lower level (there are more people living in the United Kingdom than there are people called Jason).</p>
      <p>TABLE V: SAMPLE OF THE 3 METRICS FOR GIVEN TERMS</p>
      <p>by finding terms in the vendor data that are similar in context to the terms present in the original dictionaries. Figures 1 and 2 represent the construction process of the model for sensitivity (MODS) from the dictionary of sensitive terms (SEEDS), and of the model for identifiability (MODI) from the dictionary of identifiable terms (SEEDI), respectively.</p>
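      <p>
        The expansion step sketched in Figures 1 and 2 can be illustrated minimally, assuming pre-trained word vectors are available; the 2-dimensional vectors, the cosine threshold, and the function names below are our own illustrative choices, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_seeds(seed_terms, vocab_vectors, threshold=0.8):
    """Grow a seed dictionary (e.g. SEEDS or SEEDI) with vocabulary terms
    whose vectors are close in context to any seed term."""
    model = set(seed_terms)
    for term, vec in vocab_vectors.items():
        if term in model:
            continue
        if any(cosine(vec, vocab_vectors[s]) >= threshold
               for s in seed_terms if s in vocab_vectors):
            model.add(term)
    return model

# Toy 2-d vectors standing in for trained word embeddings (illustrative only).
vectors = {
    "diagnosis": [0.9, 0.1],
    "prognosis": [0.85, 0.2],
    "invoice":   [0.1, 0.95],
}
mods = expand_seeds({"diagnosis"}, vectors, threshold=0.9)
# "prognosis" joins the model; "invoice" stays out.
```

        In practice the vocabulary would come from vendor-specific corpora, so the resulting model adapts to the contexts of that vendor's use cases.
      </p>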
      <sec id="sec-4-1">
        <title>C. Restriction Flexibility</title>
        <p>For every given data domain there exists a pre-calculated criticality threshold. This threshold is evaluated through the analysis of domain-specific datasets, and is used as an anchor point for the data holder to fine-tune the level of anonymity applied to the text in a manner that suits their privacy standards. This is important because a compromise must be met between privacy and utility: the stricter the privacy rules being applied, the less utility the document has (since we are removing data from it). Our solution makes it easy to strike the required balance, as privacy risks are quantified, allowing data holders to alter solution parameters to fulfill their privacy requirements.</p>
        <p>2) Privacy sensitivity: is this term a sensitive piece of information? Each domain has its own set of sensitive keywords, so this is domain-specific. A term such as a location could be both privacy-sensitive and identifying of a real person.
3) Semantic value: how much information does this term convey with respect to the general context of the document? Semantic value should be assessed with respect to the application. For example, in the domain of social studies, opinion-bearing terms have more semantic value, while in a health application this may be different.</p>
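        <p>
          The threshold-based decision described above can be sketched by combining the per-term metrics into a single criticality score and comparing it to the domain threshold; the equal weighting and the toy scores below are illustrative assumptions, not the paper's calibrated values.

```python
def criticality(identifiability, sensitivity, w_id=0.5, w_sens=0.5):
    """Combine the normalized identifiability and sensitivity metrics into one
    criticality score (the weighting scheme here is illustrative)."""
    return w_id * identifiability + w_sens * sensitivity

def redact(terms, threshold):
    """Keep a term only while its criticality stays below the domain threshold."""
    return [t["term"] if criticality(t["id"], t["sens"]) < threshold else "[REDACTED]"
            for t in terms]

terms = [
    {"term": "Jason",  "id": 0.9, "sens": 0.2},   # highly identifying
    {"term": "flu",    "id": 0.1, "sens": 0.6},   # sensitive but common
    {"term": "clinic", "id": 0.1, "sens": 0.1},
]
# A stricter (lower) threshold redacts more and preserves less utility.
print(redact(terms, threshold=0.5))   # ['[REDACTED]', 'flu', 'clinic']
```

          Lowering the threshold to 0.3 would also redact "flu", making the privacy-utility trade-off an explicit, tunable parameter.
        </p>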
        <p>V. CONCLUSION AND FUTURE WORK</p>
        <p>Data storage is now mostly moving towards unstructured data, and the use of textual data has become an inseparable part of our daily lives. We only continue to share more of our personal information through online services and social media. In this paper, we have introduced a new concept for a data-oriented solution that provides measures for a given text document to assess its privacy-leak potential, as well as to measure its semantic utility. We believe that this work can pave the way for a new data-driven orientation in the privacy research field.</p>
        <p>For each term in a given document, each of these attributes is marked with a numerical value (a metric) reflecting the significance of the term with respect to the attribute. Each of the three metrics can be considered a normalized weight, where a higher value indicates that the term is more significant.</p>
        <p>Identifiability metric. For each term, this value is calculated by checking the term's uniqueness; for example, given two names (Bob and Jason), if the name Bob appears more often than Jason, then Bob has a lower identifiability metric than Jason, since Jason is more unique.</p>
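        <p>
          The Bob/Jason example can be made concrete with a small sketch; scoring a name by one minus its relative frequency is one simple normalization we assume here for illustration, not the paper's exact formula.

```python
from collections import Counter

def identifiability_metrics(name_counts):
    """Score each name by 1 - relative frequency: rarer names identify a
    person more strongly (normalization choice is illustrative)."""
    total = sum(name_counts.values())
    return {name: 1.0 - count / total for name, count in name_counts.items()}

# Toy corpus counts: "Bob" appears far more often than "Jason".
counts = Counter({"Bob": 90, "Jason": 10})
scores = identifiability_metrics(counts)
# Jason (0.9) scores higher than Bob (~0.1): the rarer name is more identifying.
```

        </p>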
        <p>
          Sensitivity metric. For each term, this value is calculated based on the work done by Sánchez et al. [
          <xref ref-type="bibr" rid="ref37">45</xref>
          ], where, by using the semantics of the text, we can assess the degree of sensitiveness of terms according to the amount of information they provide. Then, this assessment is represented as a normalized numerical value.
        </p>
        <p>REFERENCES</p>
        <p>
[1] P. Samarati and L. Sweeney. Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression. Technical Report SRI-CSL-98-04, 1998.
[2] L. Sweeney. k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (7), 2002.
        </p>
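        <p>
          A common proxy for the amount of information a term provides is its information content, IC(t) = -log2 p(t); the sensitivity metric above can then be sketched as follows. The corpus probabilities and the max-normalization are illustrative assumptions on our part, not the exact method of the cited work.

```python
import math

def sensitivity_metrics(term_probs):
    """Score terms by information content, IC(t) = -log2 p(t), then normalize
    to [0, 1] by the largest IC (normalization choice is illustrative)."""
    ic = {t: -math.log2(p) for t, p in term_probs.items()}
    max_ic = max(ic.values())
    return {t: v / max_ic for t, v in ic.items()}

# Illustrative corpus probabilities: rare clinical terms carry more information.
probs = {"the": 0.05, "hospital": 0.001, "HIV": 0.00001}
sens = sensitivity_metrics(probs)
# "HIV" gets the maximum score of 1.0; "the" scores lowest.
```

        </p>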
        <p>Semantic value metric. Using sentiment analysis, the J2o0u0r2nal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (7),
sentiment of sentences in the text is studied to determine the [3] A. Machanavajjhala &amp; J. Gehrke &amp; D. Kifer &amp; M.
Venkitasubsignificance of each term to the semantic structure of their ramaniam. (2006). l-Diversity: Privacy Beyond k-Anonymity. ACM
respective sentence. The terms that each sentence is centered 1T0ra.1n1sa4c5t/i1o2n1s72o9n9.1K2n1o7w3l0e0d.ge Discovery From Data - TKDD. 1. 24.
around are considered as terms of semantic significance for [4] N. Li, T. Li and S. Venkatasubramanian, ”t-Closeness: Privacy
Bethe text. This significance is then represented as a normalized yond k-Anonymity and l-Diversity,” 2007 IEEE 23rd International
numerical value. 1C0o.n1f1e0r9en/IcCeDoEn.20D07at.a367E8n5g6ineering, Istanbul, 2007, pp. 106-115. doi:
A sample table is provided in Table V. [5] Rajendran, Keerthana &amp; Jayabalan, Manoj &amp; Rana, Muhammad Ehsan.
(2017). A Study on k-anonymity, l-diversity, and t-closeness Techniques
B. Adaptability focusing Medical Data. 17.</p>
        <p>Our solution contains data-driven models adaptable to dif- [6] FoCu.nDdwatoiornks, Aan.dRoTtrhe,nTdsheinATlghoeroitrhemticiaclFCooumndpauttieornsScoifenDcieffeVroeln.ti9a,l NProivs.ac3y4,
ferent data domains, with the ability to customize these models (2014) 211407, 2014 C. Dwork and A. Roth DOI: 10.1561/0400000042
further through training on vendor-specific data. This concept [7] Chang, F., J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows,
is based on word vectorization, where dictionaries of vendor- sTy.sCtehmanfdorra,sAtru.cFtiukreesdadnadtaR. .PEr.oGceruebdeinr,g2o0f06th.eB7igthtabUleS:EANIdXistSriybmutpeodssiutomraogne
specific terms relating to identifiability and sensitivity are Operating Systems Design and Implementation (OSDI 06). Berkeley, CA,
first constructed, then expanded upon using domain-specific USA, pp: 15-15.
knowledge-base to form a model, which is adapted to the [8] hnt2tpcs2://port2a0l.0d6b:mi.hmDse.hidarevnatirfdi.ceadtiuo/nprojecatsn/dn2c2-2S0m0o6k/,ingAccesCsehdallenogne,
contexts used in vendor use cases. The model is constructed 2019-10-20</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <fpage>n2c2</fpage>
          2014:
          <article-title>De-identification and Heart Disease Risk Factors Challenge</article-title>
          , https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/, Accessed on 2019-
          <volume>10</volume>
          -21
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Beckwith</surname>
            <given-names>BA</given-names>
          </string-name>
          .
          <article-title>Development and evaluation of an open source software tool for deidentification of pathology reports</article-title>
          .
          <source>BMC Med Inform Decis Mak</source>
          .
          <year>2006</year>
          . p.
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Berman</surname>
            <given-names>JJ</given-names>
          </string-name>
          .
          <article-title>Concept-match medical data scrubbing. How pathology text can be used in research</article-title>
          .
          <source>Arch Pathol Lab Med</source>
          .
          <year>2003</year>
          . pp.
          <fpage>680</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Fielstein</surname>
            <given-names>EM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            <given-names>SH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speroff</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Algorithmic De-identification of VA Medical Exam Text for HIPAA Privacy Compliance: Preliminary Findings</article-title>
          . Medinfo.
          <year>2004</year>
          . p.
          <fpage>1590</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Friedlin</surname>
            <given-names>FJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            <given-names>CJ</given-names>
          </string-name>
          .
          <article-title>A software tool for removing patient identifying information from clinical documents</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          .
          <year>2008</year>
          ;
          <volume>15</volume>
          (
          <issue>5</issue>
          ):
          <fpage>601</fpage>
          -
          <lpage>610</lpage>
          . doi: 10.1197/jamia.M2702.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Gupta</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saul</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilbertson</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Evaluation of a deidentification (DeId) software engine to share pathology reports and clinical documents for research</article-title>
          .
          <source>Am J Clin Pathol</source>
          .
          <year>2004</year>
          . pp.
          <fpage>176</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Aramaki</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>Automatic Deidentification by using Sentence Features and Label Consistency</article-title>
          .
          <source>i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data</source>
          , Washington, DC.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Guo</surname>
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Identifying Personal Health Information Using Support Vector Machines</article-title>
          .
          <source>i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data</source>
          , Washington, DC.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Szarvas</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farkas</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kocsor</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>A multilingual named entity recognition system using boosting and c4.5 decision tree learning algorithms</article-title>
          .
          <source>9th Int Conf Disc Sci (DS2006)</source>
          ,
          <source>LNAI</source>
          .
          <year>2006</year>
          . pp.
          <fpage>267</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Wellner</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>Rapidly retargetable approaches to de-identification in medical records</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          .
          <year>2007</year>
          . pp.
          <fpage>564</fpage>
          -
          <lpage>73</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Ji</given-names>
          </string-name>
          &amp; Dernoncourt, Franck &amp; Uzuner, Ozlem &amp; Szolovits,
          <string-name>
            <surname>Peter.</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Feature-Augmented Neural Networks for Patient Note Deidentification</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Woo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting</article-title>
          . ArXiv, abs/1506.04214.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [21] mongoDB:
          <article-title>Unstructured Data In Big Data</article-title>
          , https://www.mongodb.com/scale/unstructured-data-in-big-data,
          <source>Accessed on 2019-10-15.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [22]
          <article-title>National Association of Health Data Organizations, A Guide to StateLevel Ambulatory Care Data Collection Activities (Falls Church: National Association of Health Data Organizations</article-title>
          , Oct.
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Kayaalp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Browne</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodd</surname>
            ,
            <given-names>Z. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>De-identification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports</article-title>
          .
          <source>AMIA ... Annual Symposium proceedings. AMIA Symposium</source>
          ,
          <year>2014</year>
          ,
          <fpage>767</fpage>
          -
          <lpage>776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [24]
          <article-title>Group Insurance Commission testimony before the Massachusetts Health Care Committee</article-title>
          .
          <source>See Session of the Joint Committee on Health Care</source>
          , Massachusetts State Legislature, (March 19,
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [25] Netflix Prize Dataset: https://www.kaggle.com/netflix-inc/netflix-prizedata.
          <source>Downloaded on July 15</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <article-title>Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>WWW</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [27]
          <string-name>
            <surname>J. McAuley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Targett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Shi</surname>
          </string-name>
          , A. van den Hengel.
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>SIGIR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <article-title>Uniqueness of Simple Demographics in the U.S. Population, LIDAPWP4</article-title>
          . Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA:
          <year>2000</year>
          .
          <article-title>Forthcoming book entitled</article-title>
          ,
          <source>The Identifiability of Data.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Archie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gershon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katcoff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2018</year>
          ). Who's Watching?
          <article-title>De-anonymization of Netflix Reviews using Amazon Reviews</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Arvind</given-names>
            <surname>Narayanan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vitaly</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Robust Deanonymization of Large Sparse Datasets</article-title>
          .
          <source>In Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP '08)</source>
          . IEEE Computer Society, Washington, DC, USA,
          <fpage>111</fpage>
          -
          <lpage>125</lpage>
          . DOI: https://doi.org/10.1109/SP.2008.33
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Franck</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          , Ji Young Lee, Ozlem Uzuner,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <article-title>De-identification of patient notes with recurrent neural networks</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          , Volume
          <volume>24</volume>
          , Issue
          <issue>3</issue>
          , May 2017, Pages
          <fpage>596</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Kayaalp</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Browne</surname>
            <given-names>AC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodd</surname>
            <given-names>ZA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagan</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            <given-names>CJ</given-names>
          </string-name>
          .
          <article-title>Deidentification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports</article-title>
          .
          <source>AMIA Annu Symp Proc. 2014 Nov</source>
          <volume>14</volume>
          ;
          <year>2014</year>
          :
          <fpage>767</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Meystre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Ensemble-based Methods to Improve De-identification of Electronic Health Record Narratives</article-title>
          .
          <source>AMIA ... Annual Symposium proceedings. AMIA Symposium</source>
          ,
          <year>2018</year>
          ,
          <volume>663672</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [34]
          <string-name>
            <surname>DesRoches</surname>
            <given-names>CM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worzala</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bates</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>Some hospitals are falling behind in meeting meaningful use criteria and could be vulnerable to penalties in 2015</article-title>
          . Health Affairs.
          <year>2013</year>
          ;
          <volume>32</volume>
          :
          <fpage>1355</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Wright</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henkin</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feblowitz</surname>
          </string-name>
          , et al.
          <article-title>Early results of the meaningful use program for electronic health records</article-title>
          .
          <source>New Engl J Med</source>
          .
          <year>2013</year>
          ;
          <volume>368</volume>
          :
          <fpage>779</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [36]
          <article-title>Office for Civil Rights H. Standards for privacy of individually identifiable health information</article-title>
          .
          <source>Final rule. Federal Register</source>
          .
          <year>2002</year>
          ;
          <volume>67</volume>
          :
          <fpage>53181</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Douglass</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reisner</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>De-identification algorithm for free-text nursing notes</article-title>
          .
          <source>Comput Cardiol</source>
          .
          <year>2005</year>
          :
          <fpage>331</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Douglas</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reisner</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>Computer-assisted deidentification of free text in the MIMIC II database</article-title>
          .
          <source>Comput Cardiol</source>
          .
          <year>2004</year>
          :
          <fpage>341</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [39]
          <article-title>Apple Adopts Differential Privacy</article-title>
          , https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [40]
          <string-name>
            <surname>Abadi</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McMahan</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mironov</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwar</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>L</given-names>
          </string-name>
          .
          <article-title>Deep Learning with Differential Privacy</article-title>
          , https://arxiv.org/abs/1607.00133,
          <source>23rd ACM Conference on Computer and Communications Security (ACM CCS)</source>
          ,
          <year>2016</year>
          ,
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [41]
          <article-title>Top 10 entities worldwide leading the innovations and advancements in Differential Privacy</article-title>
          , https://www.linknovate.com/search/?query=%22differential%20privacy%22%2C%22privacy%20by%20design%22,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [42]
          <article-title>Enabling developers and organizations to use differential privacy</article-title>
          , https://developers.googleblog.com/2019/09/enabling-developers-andorganizations.html,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Kifer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ashwin</given-names>
            <surname>Machanavajjhala</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>No free lunch in data privacy</article-title>
          .
          <source>In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD '11)</source>
          . ACM, New York, NY, USA,
          <fpage>193</fpage>
          -
          <lpage>204</lpage>
          . DOI: https://doi.org/10.1145/1989323.1989345
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>David</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <source>Theory of Relational Databases</source>
          , Computer Science Press; 1st edition (March
          <year>1983</year>
          ), ISBN 978-0914894421.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>David</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Montserrat</given-names>
            <surname>Batet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Viejo</surname>
          </string-name>
          ,
          <source>Detecting Sensitive Information from Textual Documents: An Information-Theoretic Approach, Modeling Decisions for Artificial Intelligence</source>
          , Springer Berlin Heidelberg,
          <year>2012</year>
          ,
          <fpage>173</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [46]
          <article-title>The Facebook and Cambridge Analytica scandal, explained with a simple diagram</article-title>
          , https://www.vox.com/policy-and-politics/2018/3/23/17151916/facebook-cambridge-analytica-trump-diagram,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [47]
          <article-title>Google's Sundar Pichai was grilled on privacy, data collection, and China during congressional hearing</article-title>
          , https://www.cnbc.com/2018/12/11/google-ceo-sundar-pichai-testifies-before-congress-on-bias-privacy.html,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [48]
          <string-name>
            <surname>Abdullah</surname>
            <given-names>Ahmad</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zhuge</surname>
            <given-names>Qingfeng</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>From Relational Databases to NoSQL Databases: Performance Evaluation</article-title>
          .
          <source>Research Journal of Applied Sciences, Engineering and Technology</source>
          .
          <volume>11</volume>
          .
          <fpage>434</fpage>
          -
          <lpage>439</lpage>
          . DOI: 10.19026/rjaset.11.1799.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [49] GDPR, https://eugdpr.org,
          <source>Accessed on 2019-10-15</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [50]
          <article-title>Trump order bans US firms from dealing with Huawei</article-title>
          , https://www.techradar.com/news/trump-order-bans-us-firms-from-dealing-with-huawei,
          <source>Accessed on 2019-10-25.</source>
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [51]
          <article-title>If you have a smart TV, take a closer look at your privacy settings</article-title>
          , https://www.cnbc.com/2017/03/09/if-you-have-a-smart-tv-take-a-closer-look-at-your-privacy-settings.html,
          <source>Accessed on 2019-10-12</source>
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [52]
          <article-title>Privacy and Human Rights</article-title>
          , http://gilc.org/privacy/survey/intro.html,
          <source>Accessed on 2019-10-15</source>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [53]
          <string-name>
            <surname>Hecht</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jablonski</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>NoSQL evaluation</article-title>
          .
          <source>Proceedings of the International Conference on Cloud and Service Computing</source>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>41</lpage>
          .
          [54]
          <string-name>
            <surname>Weggenmann</surname>
            <given-names>B.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kerschbaum</surname>
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining</article-title>
          .
          <source>SIGIR '18</source>
          , July
          <fpage>8</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2018</year>
          , Ann Arbor, MI, USA
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [55]
          <article-title>Stanford CoreNLP</article-title>
          , https://stanfordnlp.github.io/CoreNLP,
          <source>Accessed on 2019-10-02</source>
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [56]
          <string-name>
            <surname>Dernoncourt</surname>
            <given-names>Franck</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Lee</surname>
            <given-names>Ji</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Szolovits</surname>
            <given-names>Peter</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>NeuroNER: an easy-to-use program for named-entity recognition based on neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [57]
          <article-title>GATE</article-title>
          , https://gate.ac.uk/gate/doc/papers.html,
          <source>Accessed on 2019-10-19</source>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [58]
          <article-title>Natural Language Understanding</article-title>
          , https://www.ibm.com/watson/services/natural-language-understanding,
          <source>Accessed on 2019-10-10</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>