<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Personal Information Privacy: What's Next?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khodor Hammoud</string-name>
          <email>ik19544@etu.parisdescartes.fr</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salima Benbernou</string-name>
          <email>salima.benbernou@parisdescartes.fr</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mourad Ouziri</string-name>
          <email>mourad.ouziri@parisdescartes.fr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yucel Saygin</string-name>
          <email>ysaygin@sabanciuniv.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafiqul Haque</string-name>
          <email>Rafiqul.haque@intelligenciaia.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yehia Taher</string-name>
          <email>yehia.taher@uvsq.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligencia R&amp;D</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratoire DAVID, Université de Versailles - Paris-Saclay</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sabanci University</institution>
          ,
          <addr-line>Istanbul</addr-line>
          ,
          <country country="TR">Turkey</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Université de Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Université de Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>30</fpage>
      <lpage>37</lpage>
      <abstract>
        <p>In recent events, user privacy has been a main focus for all technological and data-holding companies, due to the global interest in protecting personal information. Regulations like the General Data Protection Regulation (GDPR) set firm laws on the handling and misuse of users' data. These privacy rules apply regardless of the data structure, whether it be structured or unstructured. In this work, we summarize the available algorithms for providing privacy in structured data, and analyze the popular tools that handle privacy in textual data. We found that although these tools provide adequate results in terms of de-identifying medical records by removing personal identifiers (HIPAA PHI), they fall short in terms of being generalizable to satisfy non-medical fields. In addition, the metrics used to measure the performance of these privacy algorithms don't take into account the differences in significance that every identifier has. Finally, we propose the concept of a domain-independent adaptable system that learns the significance of terms in a given text, in terms of person identifiability and text utility, and is then able to provide metrics to help find a balance between user privacy and data usability.</p>
        <p>Index Terms: k-anonymity, l-diversity, t-closeness, NLP, textual data, privacy in text</p>
        <p>I. INTRODUCTION AND MOTIVATION</p>
        <p>The legal right to privacy is a fundamental human right recognized in the UN (United Nations) declaration of human rights [52]. The unprecedented growth of highly advanced technologies in the last two decades has imperiled privacy significantly. Today, different aspects of human life have been digitalized, including communication, socialization, entertainment, purchasing and many others. People adopted digital systems due to the increasing efficiency in day-to-day tasks. In some cases the adoption is forced by social practices, such as the use of social media. Nevertheless, the digital transformation created ample opportunities for various organizations and adversaries to abuse privacy, since digital systems enable them to hold information about people forever. Organizations such as Google can profile anyone without the users being aware of it. Concrete evidence of personal information abuse comes from two recent incidents: the Facebook trial [46] and the Google testimony [47]. It is not only software vendors; social media and hardware companies violate privacy as well. Examples include Samsung's smart TVs recording audio [51], and the recent events unveiling the potential of the giant Chinese tech company Huawei using its mobile phones to spy on users, leading to its phones being blocked from using Google services and banned from the US [50]. There are regulations that govern user data handling, the latest of which is the European General Data Protection Regulation (GDPR) [49], which requires transparency and user anonymity when performing statistical analysis, and places heavy fines on violating parties.</p>
        <p>The task of making the web a safe place for users is a largely difficult problem due to the inherently open, nondeterministic nature of the Web, and the complex, leakage-prone information flow of many Web-based transactions that involve the transfer of sensitive, personal information. Despite considerable attention, Web privacy continues to pose significant threats and challenges. One major step is securing the way companies store, share and publish user information, as data regulations impose data publication, which if not secured, can be used to re-identify the individual owners. Securing stored/published data depends on the way data is stored. In the past, information was almost strictly stored in the form of structured relational databases [44]. Consequently, shared data was in the form of structured datasets. Ensuring privacy for these datasets was first done by deleting the unique identifiers, but then L. Sweeney [28] published a research result proving that users can still be identified from their quasi-identifiers, and proposed a new methodology known as k-anonymity [1]. Following k-anonymity, several solutions were proposed, including l-diversity [3] and t-closeness [4], that address shortcomings discovered in k-anonymity. However, in 2006, Dwork and Aaron introduced differential privacy [6] as a solution for privacy-preserving data analysis, which can be used to provide security for both data storage and analysis. Recently, the changes in applications, user and infrastructure characteristics, mostly of the Web 2.0 domain [53] and cloud</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>This work was made possible thanks to the funding provided by Cognitus.</p>
      <p>platform, led to an exponential growth of the internet and the explosion of data sources such as sensors, social media, etc., and massive workloads. This kind of data is typically referred to as Big Data [7]. This fostered the requirement of a new format of data storage, known as unstructured data [<xref ref-type="bibr" rid="ref40">48</xref>], which is essentially the central focus of this research. To be specific, the textual form of this unstructured data is the key focus of this research. Privacy in unstructured text is critically important for several reasons, yet the most substantial reason is the amount of unstructured data generated by companies. More than 80% of the data generated in the last ten years is unstructured (mostly in textual form). This implies that a massive volume of data is recorded in textual form, yet privacy in unstructured text, to the best of our knowledge, lacks robust solutions. We studied several use cases that belong to different industrial domains, including finance, healthcare, and insurance. According to our study, the banking and healthcare sectors generate a huge volume of unstructured text; both of these industrial domains are facing several challenges concerning the privacy of user information, which is the key motivation of this research. The need for a privacy mechanism for unstructured data confidentiality exceeds expectations, especially for textual data in the healthcare sector [8] [<xref ref-type="bibr" rid="ref1">9</xref>]. A large number of research efforts in this field have aimed at providing anonymity in text. Many works arose that provide different privacy solutions for text, mostly focusing on medical data, governed by the regulations placed by the Health Insurance Portability and Accountability Act (HIPAA) [<xref ref-type="bibr" rid="ref28">36</xref>]. Older work used rule-based approaches, but more recent work is centered around the use of neural networks and deep learning.</p>
      <p>II. PRIVACY IN STRUCTURED DATA</p>
      <p>There are two natural models for privacy mechanisms: interactive and noninteractive. In the noninteractive setting, the data collector, a trusted entity, publishes a sanitized version of the collected data; the literature uses terms such as anonymization and de-identification. Traditionally, sanitization employs techniques such as data perturbation and sub-sampling, as well as removing well-known identifiers such as names, birth dates, and social security numbers. It may also include releasing various types of synopses and statistics. In the interactive setting, the data collector, again trusted, provides an interface through which users may pose queries about the data and get (possibly noisy) answers.</p>
      <p>Originally, data were published in tabular format and made anonymous by simply removing all the explicit identifiers like names and phone numbers. However, in most of these cases, the remaining data can be used to re-identify individuals by linking it to other purposely collected data or by looking at unique characteristics in the released data [<xref ref-type="bibr" rid="ref20">28</xref>] [<xref ref-type="bibr" rid="ref21">29</xref>] [<xref ref-type="bibr" rid="ref22">30</xref>]. Combinations of a few characteristics often combine in populations to uniquely or nearly uniquely identify some individuals. The best-known study on this is one done by Archie et al. [<xref ref-type="bibr" rid="ref21">29</xref>] at the University of Texas, where they applied their own de-anonymization methodology to a dataset published by Netflix (the Netflix Prize dataset) [<xref ref-type="bibr" rid="ref17">25</xref>], which contained anonymous movie ratings of 500,000 Netflix subscribers, and demonstrated that an adversary who knows only little information about an individual subscriber can easily identify this subscriber's record in the dataset.</p>
      <p>In this paper, we provide a review of the different methodologies used for user data privacy in structured data and unstructured textual data. One of our key objectives in this research is to discover the most promising methodologies that have been proposed in the literature. Therefore, we have reviewed the key existing solutions and conducted a deep and wide comparative study. Also, we reviewed the most prominent tools available on the web for natural language processing that can be/are being used for providing privacy in text. In our comparative study, we look into the privacy methodologies used for structured data, before the governance of the use of unstructured data. We also identify the major weaknesses of existing approaches for privacy in natural texts. Based on our findings, we propose a novel methodology that would address these weaknesses. In this paper, we merely present the architecture of our work-in-progress that is aimed at providing user anonymity in text. In addition, our proposed solution is capable of providing metrics concerning the risk of privacy leakage, sensitivity, and usability of a given text document containing personal information.</p>
      <p>A more recent work by Narayanan et al. [<xref ref-type="bibr" rid="ref22">30</xref>] shows a similar context, only this time de-anonymizing the Netflix Prize dataset users using publicly available Amazon review data [<xref ref-type="bibr" rid="ref18">26</xref>] [<xref ref-type="bibr" rid="ref19">27</xref>]. Here, [<xref ref-type="bibr" rid="ref22">30</xref>] were able to uncover more user information, such as a user's full name and shopping habits.</p>
      <p>A. Noninteractive Approach</p>
      <p>1) K-Anonymity: k-anonymity [1] is a property of a dataset that describes its level of anonymity. It was developed in 1998 as a means to address the problem of releasing person-specific data while preserving the anonymity of the individuals to whom the data refers, using generalization and suppression techniques. A dataset is k-anonymous if every combination of identity-revealing characteristics (quasi-identifiers) occurs in at least k different rows of the dataset. Table I shows a dataset that has been 2-anonymized; note how the attributes "Age" and "Gender" are identical in the top 2 and bottom 2 rows.</p>
      <p>2) l-Diversity: l-diversity [3] was developed in 2006 to solve two privacy problems found in k-anonymity.</p>
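As a concrete illustration of the definition above, here is a minimal sketch in Python that computes the k for which a table is k-anonymous; the table values are hypothetical, mirroring the generalized Age/Gender layout described for Table I:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k such that the table is k-anonymous:
    the size of the smallest equivalence class formed by the
    quasi-identifier columns."""
    classes = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(classes.values())

# A toy 2-anonymized table: "Age" and "Gender" are identical
# in the top two and in the bottom two rows.
table = [
    {"Age": "[10-12]", "Gender": "Male",   "Disease": "Cancer"},
    {"Age": "[10-12]", "Gender": "Male",   "Disease": "Heart Disease"},
    {"Age": "[11-12]", "Gender": "Female", "Disease": "Viral Infection"},
    {"Age": "[11-12]", "Gender": "Female", "Disease": "Viral Infection"},
]
print(k_anonymity(table, ["Age", "Gender"]))  # 2
```

Generalization (replacing exact ages by ranges) and suppression (dropping values entirely) are precisely the operations used to grow these equivalence classes until the desired k is reached.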
      <p>The remainder of this paper is organized as follows. We start by discussing privacy methodologies used in structured data in Section II. In Section III we discuss privacy methodologies and tools used for textual data, and discuss their advantages and shortcomings. Then we introduce our approach for privacy in text in Section IV, and finally conclude and debate future work in Section V.</p>
      <p>The first problem is that an attacker can discover the values of sensitive attributes in a k-anonymous dataset when there is little diversity in those sensitive attributes. The second is background knowledge attacks. To give an example, if there are 100 different men aged above 70 years living in area A who all have allergies to peanuts, then I know that Bob, who is 72 years of age and living</p>
      <p>Table I (reconstructed from extraction residue): a 2-anonymized dataset with quasi-identifiers Age ([10-12], [10-12], [11-12], [11-12]) and Gender (Male, Male, Female, Female), generalized Zip Code (1305*, 1305*, 1485*, 1485*), and sensitive attribute Disease (Cancer, Heart Disease, Viral Infection, Viral Infection); the Nationality column could not be recovered.</p>
      <p>in area A, also has an allergy to peanuts. l-diversity aims to solve these problems by applying the following principle: a generalized quasi-identifier q*-block (equivalence class) is l-diverse if it contains a minimum of 'l' properly depicted values under the sensitive attribute present in these blocks.</p>
      <p>In the 2016 WWDC keynote, Apple VP of Engineering Craig Federighi announced Apple's use of the concept to protect user privacy in iOS [<xref ref-type="bibr" rid="ref31">39</xref>]. According to linknovate.com, tech corporations are researching heavily into differential privacy, with Microsoft, Google and Apple being the top entities worldwide leading the innovations and advancements as of the date of publishing this work [<xref ref-type="bibr" rid="ref33">41</xref>]. Google developed new algorithmic techniques for deep learning and a refined analysis of privacy costs within the framework of differential privacy to solve the problem of models exposing private information [<xref ref-type="bibr" rid="ref32">40</xref>]. Google also announced on September 5, 2019 that it is open-sourcing an internal tool, called the differential privacy library, that the company uses to securely draw insights from datasets that contain the private and sensitive personal information of its users [<xref ref-type="bibr" rid="ref34">42</xref>].</p>
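The principle above can be sketched directly. The following hedged Python illustration implements distinct l-diversity, the simplest variant (entropy- and recursive-l-diversity are stricter); the table values are hypothetical, echoing the peanut-allergy example:

```python
def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the l for which the table is (distinct) l-diverse:
    the minimum count of distinct sensitive values within any
    q*-block (equivalence class over the quasi-identifiers)."""
    blocks = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        blocks.setdefault(key, set()).add(row[sensitive])
    return min(len(values) for values in blocks.values())

# 2-anonymous, but the first block is fatally homogeneous:
# every member has the same allergy, enabling the attack above.
table = [
    {"Age": "[70-75]", "Area": "A", "Allergy": "Peanuts"},
    {"Age": "[70-75]", "Area": "A", "Allergy": "Peanuts"},
    {"Age": "[30-35]", "Area": "B", "Allergy": "None"},
    {"Age": "[30-35]", "Area": "B", "Allergy": "Dust"},
]
print(l_diversity(table, ["Age", "Area"], "Allergy"))  # 1
```

A value of 1 flags exactly the homogeneity attack: k-anonymity is satisfied, yet the sensitive attribute of anyone in the first block is fully disclosed.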
      <p>If every q*-block in a dataset is l-diverse, then the dataset meets the l-diversity concept. Table II shows an example of an l-diverse (3-diverse) dataset.</p>
      <p>3) t-Closeness: t-closeness [4] comes as a betterment of l-diversity by decreasing the granularity of the interpreted data. It was introduced in 2007, when Li et al. [4] showed that l-diversity is neither necessary nor sufficient to prevent attribute disclosure, and instead provided t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of that attribute in the overall table. The distance between distributions is measured using the Earth Mover's Distance (EMD). For a categorical attribute, EMD is used to measure the distance between the values according to the minimum level of generalization of these values in the domain hierarchy. Table III shows an example of a dataset that has 0.167-closeness with respect to Salary and 0.278-closeness with respect to Disease.</p>
      <p>Although differential privacy is praised for being an interactive solution that can be adapted to different scenarios (data collection, data analysis, machine learning...), it is not without its flaws. Kifer and Machanavajjhala [<xref ref-type="bibr" rid="ref35">43</xref>] provide a no-free-lunch theorem to show that it is necessary to make assumptions about how the data is generated in order to provide privacy, contrary to what differential privacy claims. There is also the open problem of setting the optimum value of the algorithm's parameters based on the scenario at hand, like the parameter epsilon. In addition, the main criticism against differential privacy is the fact that it produces noisy results, decreasing the accuracy of the output. This means that in order to get decent results from a query, one needs a reasonably large dataset so that the added noise doesn't interfere much with the accuracy of the results.</p>
      <p>III. PRIVACY IN TEXTUAL DATA</p>
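For a numerical (ordered) attribute, Li et al.'s EMD reduces to a closed form over cumulative differences. A small sketch under that assumption follows; the distributions used here are hypothetical, not the Table III data:

```python
def emd_ordered(p, q):
    """Earth Mover's Distance between two distributions p and q
    over the same ordered domain of m values: the sum of absolute
    cumulative differences, normalized by m - 1 so the maximum
    possible distance is 1 (Li et al.'s formula for numerical
    attributes)."""
    cumulative, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cumulative += pi - qi
        total += abs(cumulative)
    return total / (len(p) - 1)

def t_closeness(class_dists, overall):
    """A table has t-closeness for the smallest t bounding the EMD
    between every equivalence class and the overall table."""
    return max(emd_ordered(d, overall) for d in class_dists)

# Overall salary distribution uniform over 3 brackets; the first
# class is concentrated entirely in the lowest bracket.
overall = [1 / 3, 1 / 3, 1 / 3]
classes = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
print(round(t_closeness(classes, overall), 3))  # 0.5
```

The concentrated class dominates: the further an equivalence class drifts from the overall distribution, the larger the t the table must admit.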
      <p>These methods are not applicable for providing privacy for textual data. They were conceived at a time when structured data was the dominant method of data storage.</p>
    </sec>
    <sec id="sec-2">
      <title>Unstructured Data</title>
      <p>Unstructured data has internal structure but is not structured via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It does not fit neatly into the traditional row-and-column structure of relational databases. Examples of unstructured data include:</p>
      <p>emails, videos, audio files, web pages, and social media messages. According to MongoDB, in today's world of Big Data [7], most of the data that is created is unstructured, with some estimates putting it at more than 95% of all data generated [<xref ref-type="bibr" rid="ref13">21</xref>].</p>
      <p>Our work focuses on privacy in textual data. There has been a lot of work on applying privacy to text, mostly in the form of de-identification. Challenges like the n2c2 2006 De-identification and Smoking Challenge [8] and the n2c2 2014 De-identification and Heart Disease Risk Factors Challenge [<xref ref-type="bibr" rid="ref1">9</xref>] (previously housed at i2b2) motivated research in textual data de-identification, namely in the field of healthcare. This influenced most work being done on textual data privacy to primarily target medical documents, due to the relative ease of access to pre-labeled training data; these challenges provided pre-labeled data from the domain to facilitate any training/testing required by the algorithms in development.</p>
      <p>A. Medical Field</p>
      <p>As such, PHI pattern-recognition performance may not be generalizable to different datasets (i.e., data from a different institution or a different type of medical report). Another disadvantage is the need for developers to be aware of all possible PHI patterns that can occur, such as location patterns that use nonstandard abbreviations (e.g., 'Cal' for California). Later work tended to be mostly based on machine-learning methods that classify words as PHI or not PHI, and into different classes of PHI in the former case.</p>
      <p>B. Differential Privacy (Interactive Approach)</p>
      <p>Differential privacy was introduced in 2006 by Dwork and Aaron [6]. It offers a robust mathematical definition of privacy and was developed as a solution for privacy-preserving data analysis. It ensures that the result of an algorithm is not overly dependent on any single instance, and states that there should be a strong probability of producing the same output even if an instance were added to or removed from the dataset. Differential privacy leapt from research papers to tech news headlines with the 2016 WWDC keynote.</p>
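That "strong probability of producing the same output" is typically achieved by adding calibrated noise to query results. A minimal sketch of the Laplace mechanism for a count query, one standard (though not the only) way to satisfy epsilon-differential privacy, assuming the dataset and predicate below as toy inputs:

```python
import math
import random

def laplace_sample(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5          # u in [-0.5, 0.5)
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_count(dataset, predicate, epsilon):
    """Differentially private count query. A count has sensitivity
    1 (adding or removing one record changes the result by at most
    1), so Laplace noise of scale 1/epsilon suffices for
    epsilon-differential privacy."""
    true_count = sum(1 for record in dataset if predicate(record))
    return true_count + laplace_sample(1.0 / epsilon)

ages = [72, 34, 71, 55, 80]
noisy = dp_count(ages, lambda a: a > 70, epsilon=0.5)
print(noisy)  # true count 3, perturbed by Laplace(0, 2) noise
```

Smaller epsilon means larger noise scale and stronger privacy, which is exactly the accuracy trade-off criticized later in this section: on small datasets the noise can swamp the true answer.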
      <p>These machine-learning methods used a range of techniques, from Support Vector Machines to Conditional Random Fields, Decision Trees, and Maximum Entropy [<xref ref-type="bibr" rid="ref7">15</xref>] [<xref ref-type="bibr" rid="ref8">16</xref>] [<xref ref-type="bibr" rid="ref9">17</xref>] [<xref ref-type="bibr" rid="ref10">18</xref>]. More recent work is focused on utilizing neural networks and deep learning to de-identify patient data. Ji Young Lee et al. [<xref ref-type="bibr" rid="ref11">19</xref>] incorporate human-engineered features, as well as features derived from electronic health records, into a neural-network-based de-identification system composed of a Long Short-Term Memory neural network [<xref ref-type="bibr" rid="ref12">20</xref>].</p>
      <p>In many countries including the United States, medical professionals are strongly encouraged to adopt electronic health records (EHRs) and may face financial penalties if they fail to do so [<xref ref-type="bibr" rid="ref26">34</xref>] [<xref ref-type="bibr" rid="ref27">35</xref>]. One of the key components of EHRs is patient notes. However, before patient notes can be shared with medical investigators, some types of information, referred to as protected health information (PHI), must be removed in order to preserve patient confidentiality. In the United States, the</p>
      <p>Health Insurance Portability and Accountability Act (HIPAA) [<xref ref-type="bibr" rid="ref28">36</xref>] defines 18 different types of PHI:
1) Names
2) Dates, except year
3) Telephone numbers
4) Geographic data
5) FAX numbers
6) Social Security numbers
7) Email addresses
8) Medical record numbers
9) Account numbers
10) Health plan beneficiary numbers
11) Certificate/license numbers
12) Vehicle identifiers and serial numbers, including license plates
13) Web URLs
14) Device identifiers and serial numbers
15) Internet protocol addresses
16) Full face photos and comparable images
17) Biometric identifiers (i.e. retinal scans, fingerprints)
18) Any unique identifying number or code</p>
      <p>B. Differential Privacy with Textual Data</p>
      <p>Benjamin Weggenmann et al. provide an automated text anonymization approach that applies differential privacy to the vector space model [54]. They obscure term frequencies in textual documents' TF-IDF vectors in a differentially private manner. Their aim is to prevent a document's author attribution through the evaluation of the document's TF-IDF vectors using different data-mining techniques. They also demonstrate that this approach has a low impact on accuracy when mining these document vectors. Our goal is different from that of Weggenmann in that we aim to provide privacy methods for the actual text documents, not their vector representations.</p>
      <p>C. Review of Available De-Identification Tools</p>
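A heavily simplified sketch of the idea just described (not Weggenmann et al.'s actual mechanism, whose noise calibration is more refined): perturb each component of a document's TF-IDF vector with Laplace noise before release, so the released vector no longer reveals exact term frequencies. The sensitivity value is an assumption for illustration only:

```python
import math
import random

def perturb_tfidf(vector, epsilon, sensitivity=1.0):
    """Add Laplace noise of scale sensitivity/epsilon to each
    TF-IDF weight, then clip at zero since TF-IDF weights are
    non-negative. The default sensitivity here is a stand-in; in
    practice it must be derived from how much a change to one
    document can move the vector."""
    scale = sensitivity / epsilon
    noisy = []
    for weight in vector:
        u = random.random() - 0.5
        noise = -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
        noisy.append(max(0.0, weight + noise))
    return noisy

document_vector = [0.12, 0.0, 0.55, 0.33]   # toy TF-IDF weights
released = perturb_tfidf(document_vector, epsilon=1.0)
print(released)
```

Note how this operates only on the vector representation: the document's actual wording is untouched, which is precisely the gap our work targets.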
    </sec>
    <sec id="sec-3">
      <title>1) MITdeid</title>
      <p>MITdeid [23] is an automated de-identification software package that is generally usable on most medical records, aimed at removing HIPAA PHI and an extended PHI set that includes doctors' names and years of dates. The software achieves this by utilizing lexical look-up tables, regular expressions, and simple heuristics. This tool has a precision of 93.2%, a recall of 99.8%, and an F1-score of 96.4%.</p>
      <p>Performing such a task manually proved to be time-consuming and quite expensive. Douglass et al. [<xref ref-type="bibr" rid="ref29">37</xref>] [<xref ref-type="bibr" rid="ref30">38</xref>] reported that annotators were paid 50 US dollars per hour and read 20,000 words per hour at best. This motivated research in this domain to automate the process.</p>
      <p>2) De-identification of Patient Notes with Recurrent Neural Networks (2016), DeID [<xref ref-type="bibr" rid="ref23">31</xref>]:</p>
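To make the lexical look-up/regular-expression family concrete, here is a toy scrubber in the spirit of MITdeid and NLM-Scrubber. The patterns and the name dictionary are illustrative stand-ins; real tools cover far more PHI categories and formats:

```python
import re

# Minimal rule-based PHI patterns (illustrative, not exhaustive).
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}
# Stand-in for a lexical look-up table of known names.
NAME_LOOKUP = {"Dave", "Alice"}

def scrub(text):
    """Replace every matched identifier with a category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for name in NAME_LOOKUP:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text

print(scrub("Dave called 555-123-4567 on 3/14/2019"))
# [NAME] called [PHONE] on [DATE]
```

The strengths and weaknesses discussed below follow directly from this design: adding a rule is trivial, but any identifier format absent from the tables and patterns passes through untouched.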
      <p>Earlier research in the field was oriented towards rule-based or pattern-matching solutions, using complex regular expressions, dictionaries, or a combination of both [<xref ref-type="bibr" rid="ref2">10</xref>] [<xref ref-type="bibr" rid="ref3">11</xref>] [<xref ref-type="bibr" rid="ref4">12</xref>] [<xref ref-type="bibr" rid="ref5">13</xref>] [<xref ref-type="bibr" rid="ref6">14</xref>]. The advantage of rule-based and pattern-matching de-identification methods is that they require little or no annotated training data, and can be easily and quickly modified to improve performance by adding rules, dictionary terms, or regular expressions. Their disadvantages are that developers have to craft many complex algorithms in order to account for different categories of PHI, and the customization required to fit a particular dataset.</p>
      <p>The DeID solution uses RNNs (Recurrent Neural Networks) with LSTM (Long Short-Term Memory) to de-identify medical text documents. The system is composed of 3 layers:
1) Character-enhanced token embedding layer
2) Label prediction layer
3) Label sequence optimization layer
The solution was evaluated on two datasets: n2b2 2014 [<xref ref-type="bibr" rid="ref1">9</xref>] and the MIMIC de-identification dataset (assembled by the authors of the system and twice as large as the n2b2 2014 dataset). It has a precision of 97.9%, a recall of 97.8%, and an F1-score of 97.8%.</p>
      <p>Table residue (first two columns of the comparison table): the algorithms K-anonymity, l-diversity, t-closeness and Differential Privacy target structured data, while the tools MITdeid, DeID, NLM-Scrubber and CliniDeID target textual data.</p>
      <p>Table residue (approach column of the comparison table). K-anonymity: generalization/suppression of quasi-identifiers. l-diversity: adds on k-anonymity by making the sensitive attributes in every equivalence class of the dataset contain a minimum of l properly depicted values. t-closeness: adds on l-diversity by decreasing the granularity of the interpreted data. Differential Privacy: adds noise to ensure that the result of an algorithm is not overly dependent on any single instance. MITdeid: utilizes lexical look-up tables, regular expressions and simple heuristics to remove HIPAA PHI from medical records. DeID: uses RNNs with LSTM to remove HIPAA PHI from medical records. NLM-Scrubber: uses a combination of regular expressions and pattern-matching to remove 12 personal identifier categories in patient medical notes. CliniDeID: uses twelve de-identification models that use deep learning, shallow learning, or rule-based approaches to remove HIPAA PHI.</p>
      <p>3) NLM-Scrubber: NLM-Scrubber [<xref ref-type="bibr" rid="ref24">32</xref>] is advertised as a HIPAA-compliant clinical text de-identification tool designed and developed at the National Library of Medicine. It looks for 12 personal identifier categories in patient medical notes. It uses a combination of regular expressions and pattern-matching to locate and remove identifiers from documents.</p>
      <p>The Stanford CoreNLP tool showed varying inaccuracies when it came to NER, especially with identifying person names. 2) NeuroNER [<xref ref-type="bibr" rid="ref47">56</xref>]: a program that performs named-entity recognition (NER), used by the Stanford CoreNLP tool. It is composed of a 3-layer recurrent neural network with LSTM. The tool was good at detecting names, unless they were lower-cased.</p>
      <p>Testing on some sample text reveals that it is not difficult to fool this tool. For example, a name that isn't capitalized is not detected as a name. Also, the age (65) in a sentence like "Dave is a 65 year old man" is not detected as an age. This tool offers a precision of 93.2%, a recall of 99.8%, and an F1-score of 96.4%.</p>
      <p>NeuroNER was also good at detecting locations, but it does not detect any date or age. 3) Gate [<xref ref-type="bibr" rid="ref48">57</xref>]: Gate offers text analysis services (part-of-speech tagging, NER, ...), but it uses a static approach. It is not so good at detecting named entities, and it is quite easy to trick by changing the structure of the sentence.</p>
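The precision, recall, and F1 scores quoted for these tools are related by the usual F1 formula (the harmonic mean of precision and recall); a quick check confirms the reported figures are internally consistent:

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The reported tool scores round-trip correctly:
print(round(f1(93.2, 99.8), 1))  # 96.4  (MITdeid, NLM-Scrubber)
print(round(f1(97.9, 97.8), 1))  # 97.8  (DeID)
print(round(f1(97.0, 94.4), 1))  # 95.7  (CliniDeID)
```

Because F1 weighs precision and recall equally, it hides exactly the kind of asymmetry visible above: a tool with near-perfect recall but weaker precision scores the same as a more balanced one, even though the privacy consequences of missed identifiers and of over-redaction differ.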
      <p>4) CliniDeID: a tool for de-identifying clinical notes according to the HIPAA Safe Harbor method. Owned by the company ClinAcuity and based on the work done by Youngjun Kim et al. [<xref ref-type="bibr" rid="ref25">33</xref>], it finds identifiers and tags or replaces them with surrogates for anonymity. The tool includes twelve de-identification models that use deep learning, shallow learning, or rule-based approaches. This tool has a precision of 97%, a recall of 94.4% and an F1-score of 95.7%.</p>
      <p>4) IBM's Watson Natural Language Understanding [<xref ref-type="bibr" rid="ref49">58</xref>]: a collection of APIs that offer text analysis tools using Natural Language Processing. Watson's performance in NER was inconclusive for us. On one hand, it offers the most complex analysis, where it not only names the entities with decent granularity (an address is divided into "location" and "facility"), but can also detect the tonality of the speech. On the other hand, it was still not so difficult to mislead.</p>
      </p>
      <p>
        Table III-A compares all the tools and algorithms mentioned in this section so far.
      </p>
      <p>
        5) Named Entity Recognition Tools: Due to the strong correlation between text de-identification and Named Entity Recognition (NER), here we discuss the tools we found that handle text analysis and NER.
      </p>
      <p>
        1) Stanford CoreNLP Tool (2014) [
        <xref ref-type="bibr" rid="ref46">55</xref>
        ]: An NLP tool created by Stanford University, initially developed in 2006; further work led to the system being released as free open-source software in 2010. The tool supports, to varying degrees, Arabic, Chinese, English, French and German. In our experimentation, the tool was quite successful at part-of-speech tagging, but had
      </p>
      <p>D. Discussion</p>
      <p>
        The application of differential privacy to textual data is mostly possible by using the word vectorization of key terms chosen from the text, then applying differential privacy to these vectors. This is useful for running privacy-preserving statistical analysis on these terms, or to prevent a document's author attribution [54]. However, it falls short when it comes to preserving the structure of the text, since it picks out only specific terms, discarding the rest of the text. Our aim
      </p>
      <p>
        Fig. 1. MODS Model Construction Process for a Given Domain D.
      </p>
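      <p>
        The mechanism just described, vectorizing key terms and then randomizing the vectors, can be sketched minimally as follows. The toy embeddings, the epsilon value, and the per-dimension Laplace noise are all illustrative assumptions, not the implementation of the cited work.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one centered Laplace sample via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_vector(vec, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise to every dimension of a word vector (basic DP sketch)."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vec]

# Toy 4-dimensional "embeddings" for two key terms (illustrative values only).
embeddings = {
    "diabetes": [0.2, -0.7, 0.5, 0.1],
    "paris":    [0.9, 0.3, -0.2, 0.4],
}
rng = random.Random(42)
private = {term: perturb_vector(vec, epsilon=2.0, rng=rng)
           for term, vec in embeddings.items()}
```

        A smaller epsilon yields a larger noise scale, hence stronger privacy but less useful statistics over the perturbed vectors.
      </p>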
      <p>Fig. 2. MODI Model Construction Process for a Given Domain D.</p>
      <p>is not to choose specific keywords from a given text to run a specific analysis; rather, we want to conserve most of the text, removing or obscuring only what is necessary to preserve the privacy of any individuals present in it, while preserving most of the text's utility.</p>
      <p>In order to cause a privacy problem, a document should be identifiable (containing personal identifiers) and should contain private information. If either of these conditions is not satisfied, the document will not cause privacy leaks. Therefore, the degree of privacy risk associated with a textual document is a combination of the identifying and private information present in it.</p>
      <p>Because text documents have different priorities as to what is private and what is sensitive depending on the domain they belong to (finance, healthcare, ...), it is important to take into account the differences in each term's "criticality" in a given document based on that domain. In our work, we refer to words in documents as "terms"; not all words, however, but only those that are not stop words.</p>
      <p>Dictionary-based and pattern-matching-based approaches to providing privacy in text, like MITdeid and NLM-Scrubber, are not very complex to implement and require little to no annotated training data. But that comes at the expense of being static: every target term to be captured has to be manually transcribed through complex regular expressions. In addition, solutions using these approaches cannot be generalized to handle different datasets; they are only made to handle a target dataset, or a category of datasets.</p>
      <p>Neural-network-based approaches like DeID and CliniDeID are the ones with the most promise in terms of accuracy, adaptability, and generalizability. Once a neural network model is created and trained to capture specific terms in a piece of text, it can also capture other terms that are semantically equivalent, learned from the context of the text. This is crucial in text analysis, since it is very difficult to predict every possible structure of a sentence in a given language, or every possible use of a term or word. The problem with available solutions is that they are fixated on the field of medicine and on capturing PHIs, and do not take into account texts from other domains such as finance or trading. Besides, these solutions treat all identifiers equally, although one identifier (like a name) can have a higher priority for removal than others (like age), since it can identify an individual more easily.</p>
      <p>The main objective of our work is twofold:
1) Devise a framework to measure the degree of identifiability and sensitivity of entities in a given text domain. Then, using these measures, construct a module to assess the privacy risks of a text document belonging to said domain, if the text were to be published, based on the sensitive information and the personal identifiers present in the document. The provided measures take into consideration the semantics of the documents as the main utility guarantee.
2) Based on the devised framework, provide a set of tools and algorithms for privacy-preserving textual data management, including textual data publishing and mining.</p>
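      <p>
        The static dictionary/pattern-matching style of de-identification discussed above can be illustrated with a short sketch. The two regular expressions are toy rules of our own, far simpler than the hand-curated rule sets of tools such as NLM-Scrubber.

```python
import re

# Illustrative HIPAA-style patterns; real scrubbers maintain much larger,
# manually curated rule sets (these two rules are only a sketch).
PATTERNS = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text):
    """Replace every pattern match with its category placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = scrub("Patient seen on 03/14/2019, callback 555-867-5309.")
# note == "Patient seen on [DATE], callback [PHONE]."
```

        The sketch also exposes the weakness discussed above: a date written as "March 14th" would slip past these static rules entirely.
      </p>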
      <sec id="sec-3-1">
        <title>IV. OUR APPROACH</title>
        <p>A. Term Attribution</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>We associate three attributes to each term in a given text document</title>
      <p>In our work, we concentrate on text; therefore, all operations are done on textual documents, each of which contains natural language text. We assume that a document is associated with a single natural person and, without loss of generality, that multiple documents may refer to a single natural person.</p>
      <p>1) Identifiability: the degree to which a term can identify a real person. For example, a person's name has a high level of identifiability, but their country of residence might have a lower level (there are more people living in the United Kingdom than there are people called Jason).</p>
      <p>TABLE V: SAMPLE OF THE 3 METRICS FOR GIVEN TERMS</p>
      <p>by finding terms in the vendor data that are similar in context to the terms present in the original dictionaries. Figures 1 and 2 represent the construction process of the model for sensitivity (MODS) from the dictionary of sensitive terms (SEEDS), and of the model for identifiability (MODI) from the dictionary of identifiable terms (SEEDI), respectively.</p>
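      <p>
        The expansion step sketched in Figures 1 and 2 can be illustrated minimally, assuming pre-trained word vectors are available; the 2-dimensional vectors, the cosine threshold, and the function names below are our own illustrative choices, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_seeds(seed_terms, vocab_vectors, threshold=0.8):
    """Grow a seed dictionary (e.g. SEEDS or SEEDI) with vocabulary terms
    whose vectors are close in context to any seed term."""
    model = set(seed_terms)
    for term, vec in vocab_vectors.items():
        if term in model:
            continue
        if any(cosine(vec, vocab_vectors[s]) >= threshold
               for s in seed_terms if s in vocab_vectors):
            model.add(term)
    return model

# Toy 2-d vectors standing in for trained word embeddings (illustrative only).
vectors = {
    "diagnosis": [0.9, 0.1],
    "prognosis": [0.85, 0.2],
    "invoice":   [0.1, 0.95],
}
mods = expand_seeds({"diagnosis"}, vectors, threshold=0.9)
# "prognosis" joins the model; "invoice" stays out.
```

        In practice the vocabulary would come from vendor-specific corpora, so the resulting model adapts to the contexts of that vendor's use cases.
      </p>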
      <sec id="sec-4-1">
        <title>C. Restriction Flexibility</title>
        <p>For every given data domain there exists a pre-calculated criticality threshold. This threshold is evaluated through the analysis of domain-specific datasets, and is used as an anchor point for the data holder to fine-tune the level of anonymity applied to the text in a manner that suits their privacy standards. This is important because a compromise must be met between privacy and utility: the stricter the privacy rules being applied, the less utility the document has (since we are removing data from it). Our solution makes it easy to strike the required balance, as privacy risks are quantified, allowing data holders to alter solution parameters to fulfill their privacy requirements.</p>
        <p>2) Privacy sensitivity: is this term a sensitive piece of information? Each domain has its own set of sensitive keywords, so this is domain-specific. A term such as a location could be both privacy-sensitive and identifying of a real person.
3) Semantic value: how much information does this term convey with respect to the general context of the document? Semantic value should be assessed with respect to the application. For example, in the domain of social studies, opinion-bearing terms have more semantic value, while in a health application this may be different.</p>
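        <p>
          The threshold-based decision described above can be sketched by combining the per-term metrics into a single criticality score and comparing it to the domain threshold; the equal weighting and the toy scores below are illustrative assumptions, not the paper's calibrated values.

```python
def criticality(identifiability, sensitivity, w_id=0.5, w_sens=0.5):
    """Combine the normalized identifiability and sensitivity metrics into one
    criticality score (the weighting scheme here is illustrative)."""
    return w_id * identifiability + w_sens * sensitivity

def redact(terms, threshold):
    """Keep a term only while its criticality stays below the domain threshold."""
    return [t["term"] if criticality(t["id"], t["sens"]) < threshold else "[REDACTED]"
            for t in terms]

terms = [
    {"term": "Jason",  "id": 0.9, "sens": 0.2},   # highly identifying
    {"term": "flu",    "id": 0.1, "sens": 0.6},   # sensitive but common
    {"term": "clinic", "id": 0.1, "sens": 0.1},
]
# A stricter (lower) threshold redacts more and preserves less utility.
print(redact(terms, threshold=0.5))   # ['[REDACTED]', 'flu', 'clinic']
```

          Lowering the threshold to 0.3 would also redact "flu", making the privacy-utility trade-off an explicit, tunable parameter.
        </p>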
        <p>V. CONCLUSION AND FUTURE WORK</p>
        <p>Data storage is now mostly moving towards unstructured data, and the use of textual data has become an inseparable part of our daily lives. We only continue to share more of our personal information through online services and social media. In this paper, we have introduced a new concept for a data-oriented solution that provides measures for a given text document to assess its privacy-leak potential, as well as to measure its semantic utility. We believe that this work can pave the way for a new data-driven orientation in the privacy research field.</p>
        <p>For each term in a given document, each of these attributes is marked with a numerical value (a metric) reflecting the significance of the term with respect to the attribute. Each of the three metrics can be considered a normalized weight, where a higher value indicates that the term is more significant.</p>
        <p>Identifiability metric. For each term, this value is calculated by checking the term's uniqueness; for example, given two names (Bob and Jason), if the name Bob appears more often than Jason, then Bob has a lower identifiability metric than Jason, since Jason is more unique.</p>
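        <p>
          The Bob/Jason example can be made concrete with a small sketch; scoring a name by one minus its relative frequency is one simple normalization we assume here for illustration, not the paper's exact formula.

```python
from collections import Counter

def identifiability_metrics(name_counts):
    """Score each name by 1 - relative frequency: rarer names identify a
    person more strongly (normalization choice is illustrative)."""
    total = sum(name_counts.values())
    return {name: 1.0 - count / total for name, count in name_counts.items()}

# Toy corpus counts: "Bob" appears far more often than "Jason".
counts = Counter({"Bob": 90, "Jason": 10})
scores = identifiability_metrics(counts)
# Jason (0.9) scores higher than Bob (~0.1): the rarer name is more identifying.
```

        </p>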
        <p>
          Sensitivity metric. For each term, this value is calculated based on the work done by Sánchez et al. [
          <xref ref-type="bibr" rid="ref37">45</xref>
          ], where, by using the semantics of the text, we can assess the degree of sensitiveness of terms according to the amount of information they provide. Then, this assessment is represented as a normalized numerical value.
        </p>
        <p>REFERENCES</p>
        <p>
[1] P. Samarati and L. Sweeney. Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression. Technical Report SRI-CSL-98-04, 1998.
[2] L. Sweeney. k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (7), 2002.
        </p>
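        <p>
          A common proxy for the amount of information a term provides is its information content, IC(t) = -log2 p(t); the sensitivity metric above can then be sketched as follows. The corpus probabilities and the max-normalization are illustrative assumptions on our part, not the exact method of the cited work.

```python
import math

def sensitivity_metrics(term_probs):
    """Score terms by information content, IC(t) = -log2 p(t), then normalize
    to [0, 1] by the largest IC (normalization choice is illustrative)."""
    ic = {t: -math.log2(p) for t, p in term_probs.items()}
    max_ic = max(ic.values())
    return {t: v / max_ic for t, v in ic.items()}

# Illustrative corpus probabilities: rare clinical terms carry more information.
probs = {"the": 0.05, "hospital": 0.001, "HIV": 0.00001}
sens = sensitivity_metrics(probs)
# "HIV" gets the maximum score of 1.0; "the" scores lowest.
```

        </p>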
        <p>Semantic value metric. Using sentiment analysis, the J2o0u0r2nal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (7),
sentiment of sentences in the text is studied to determine the [3] A. Machanavajjhala &amp; J. Gehrke &amp; D. Kifer &amp; M.
Venkitasubsignificance of each term to the semantic structure of their ramaniam. (2006). l-Diversity: Privacy Beyond k-Anonymity. ACM
respective sentence. The terms that each sentence is centered 1T0ra.1n1sa4c5t/i1o2n1s72o9n9.1K2n1o7w3l0e0d.ge Discovery From Data - TKDD. 1. 24.
around are considered as terms of semantic significance for [4] N. Li, T. Li and S. Venkatasubramanian, ”t-Closeness: Privacy
Bethe text. This significance is then represented as a normalized yond k-Anonymity and l-Diversity,” 2007 IEEE 23rd International
numerical value. 1C0o.n1f1e0r9en/IcCeDoEn.20D07at.a367E8n5g6ineering, Istanbul, 2007, pp. 106-115. doi:
A sample table is provided in Table V. [5] Rajendran, Keerthana &amp; Jayabalan, Manoj &amp; Rana, Muhammad Ehsan.
(2017). A Study on k-anonymity, l-diversity, and t-closeness Techniques
B. Adaptability focusing Medical Data. 17.</p>
        <p>Our solution contains data-driven models adaptable to dif- [6] FoCu.nDdwatoiornks, Aan.dRoTtrhe,nTdsheinATlghoeroitrhemticiaclFCooumndpauttieornsScoifenDcieffeVroeln.ti9a,l NProivs.ac3y4,
ferent data domains, with the ability to customize these models (2014) 211407, 2014 C. Dwork and A. Roth DOI: 10.1561/0400000042
further through training on vendor-specific data. This concept [7] Chang, F., J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows,
is based on word vectorization, where dictionaries of vendor- sTy.sCtehmanfdorra,sAtru.cFtiukreesdadnadtaR. .PEr.oGceruebdeinr,g2o0f06th.eB7igthtabUleS:EANIdXistSriybmutpeodssiutomraogne
specific terms relating to identifiability and sensitivity are Operating Systems Design and Implementation (OSDI 06). Berkeley, CA,
first constructed, then expanded upon using domain-specific USA, pp: 15-15.
knowledge-base to form a model, which is adapted to the [8] hnt2tpcs2://port2a0l.0d6b:mi.hmDse.hidarevnatirfdi.ceadtiuo/nprojecatsn/dn2c2-2S0m0o6k/,ingAccesCsehdallenogne,
contexts used in vendor use cases. The model is constructed 2019-10-20</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <fpage>n2c2</fpage>
          2014:
          <article-title>De-identification and Heart Disease Risk Factors Challenge</article-title>
          , https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/, Accessed on 2019-
          <volume>10</volume>
          -21
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Beckwith</surname>
            <given-names>BA</given-names>
          </string-name>
          .
          <article-title>Development and evaluation of an open source software tool for deidentification of pathology reports</article-title>
          .
          <source>BMC Med Inform Decis Mak</source>
          .
          <year>2006</year>
          . p.
          <fpage>12</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Berman</surname>
            <given-names>JJ</given-names>
          </string-name>
          .
          <article-title>Concept-match medical data scrubbing. How pathology text can be used in research</article-title>
          .
          <source>Arch Pathol Lab Med</source>
          .
          <year>2003</year>
          . pp.
          <fpage>680</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Fielstein</surname>
            <given-names>EM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            <given-names>SH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speroff</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Algorithmic De-identification of VA Medical Exam Text for HIPAA Privacy Compliance: Preliminary Findings</article-title>
          . Medinfo.
          <year>2004</year>
          . p.
          <fpage>1590</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Friedlin</surname>
            <given-names>FJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            <given-names>CJ</given-names>
          </string-name>
          .
          <article-title>A software tool for removing patient identifying information from clinical documents</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          .
          <year>2008</year>
          ;
          <volume>15</volume>
          (
          <issue>5</issue>
          ):
          <fpage>601</fpage>
          -
          <lpage>610</lpage>
          . doi: 10.1197/jamia.M2702.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Gupta</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saul</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilbertson</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Evaluation of a deidentification (DeId) software engine to share pathology reports and clinical documents for research</article-title>
          .
          <source>Am J Clin Pathol</source>
          .
          <year>2004</year>
          . pp.
          <fpage>176</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Aramaki</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>Automatic Deidentification by using Sentence Features and Label Consistency</article-title>
          .
          <source>i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data</source>
          , Washington, DC.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Guo</surname>
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Identifying Personal Health Information Using Support Vector Machines</article-title>
          .
          <source>i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data</source>
          , Washington, DC.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Szarvas</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farkas</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kocsor</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>A multilingual named entity recognition system using boosting and c4.5 decision tree learning algorithms</article-title>
          .
          <source>9th Int Conf Disc Sci (DS2006)</source>
          ,
          <source>LNAI</source>
          .
          <year>2006</year>
          . pp.
          <fpage>267</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Wellner</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>Rapidly retargetable approaches to de-identification in medical records</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          .
          <year>2007</year>
          . pp.
          <fpage>564</fpage>
          -
          <lpage>73</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Ji</given-names>
          </string-name>
          &amp; Dernoncourt, Franck &amp; Uzuner, Ozlem &amp; Szolovits,
          <string-name>
            <surname>Peter.</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Feature-Augmented Neural Networks for Patient Note Deidentification</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Woo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting</article-title>
          . ArXiv, abs/1506.04214.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [21] mongoDB:
          <article-title>Unstructured Data In Big Data</article-title>
          , https://www.mongodb.com/scale/unstructured-data-in-big-data,
          <source>Accessed on 2019-10-15.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [22]
          <article-title>National Association of Health Data Organizations, A Guide to StateLevel Ambulatory Care Data Collection Activities (Falls Church: National Association of Health Data Organizations</article-title>
          , Oct.
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Kayaalp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Browne</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodd</surname>
            ,
            <given-names>Z. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>De-identification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports</article-title>
          .
          <source>AMIA ... Annual Symposium proceedings. AMIA Symposium</source>
          ,
          <year>2014</year>
          ,
          <fpage>767</fpage>
          -
          <lpage>776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [24]
          <article-title>Group Insurance Commission testimony before the Massachusetts Health Care Committee</article-title>
          .
          <source>See Session of the Joint Committee on Health Care</source>
          , Massachusetts State Legislature, (March 19,
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [25] Netflix Prize Dataset: https://www.kaggle.com/netflix-inc/netflix-prizedata.
          <source>Downloaded on July 15</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <article-title>Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>WWW</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [27]
          <string-name>
            <surname>J. McAuley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Targett</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Shi</surname>
          </string-name>
          , A. van den Hengel.
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>SIGIR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <article-title>Uniqueness of Simple Demographics in the U.S. Population, LIDAPWP4</article-title>
          . Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA:
          <year>2000</year>
          .
          <article-title>Forthcoming book entitled</article-title>
          ,
          <source>The Identifiability of Data.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Archie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gershon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katcoff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2018</year>
          ). Who's Watching?
          <article-title>De-anonymization of Netflix Reviews using Amazon Reviews</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Arvind</given-names>
            <surname>Narayanan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vitaly</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Robust Deanonymization of Large Sparse Datasets</article-title>
          .
          <source>In Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP '08)</source>
          . IEEE Computer Society, Washington, DC, USA,
          <fpage>111</fpage>
          -
          <lpage>125</lpage>
          . DOI: https://doi.org/10.1109/SP.2008.33
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Franck</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          , Ji Young Lee, Ozlem Uzuner,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <article-title>De-identification of patient notes with recurrent neural networks</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          , Volume
          <volume>24</volume>
          , Issue
          <issue>3</issue>
          , May 2017, Pages
          <fpage>596</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Kayaalp</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Browne</surname>
            <given-names>AC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodd</surname>
            <given-names>ZA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sagan</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            <given-names>CJ</given-names>
          </string-name>
          .
          <article-title>Deidentification of Address, Date, and Alphanumeric Identifiers in Narrative Clinical Reports</article-title>
          .
          <source>AMIA Annu Symp Proc. 2014 Nov</source>
          <volume>14</volume>
          ;
          <year>2014</year>
          :
          <fpage>767</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heider</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Meystre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Ensemble-based Methods to Improve De-identification of Electronic Health Record Narratives</article-title>
          .
          <source>AMIA ... Annual Symposium proceedings. AMIA Symposium</source>
          ,
          <year>2018</year>
          ,
          <volume>663672</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [34]
          <string-name>
            <surname>DesRoches</surname>
            <given-names>CM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worzala</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bates</surname>
            <given-names>S.</given-names>
          </string-name>
          <article-title>Some hospitals are falling behind in meeting meaningful use criteria and could be vulnerable to penalties in 2015</article-title>
          . Health Affairs.
          <year>2013</year>
          ;
          <volume>32</volume>
          :
          <fpage>1355</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Wright</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henkin</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feblowitz</surname>
          </string-name>
          , et al.
          <article-title>Early results of the meaningful use program for electronic health records</article-title>
          .
          <source>New Engl J Med</source>
          .
          <year>2013</year>
          ;
          <volume>368</volume>
          :
          <fpage>779</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [36]
          <article-title>Office for Civil Rights H. Standards for privacy of individually identifiable health information</article-title>
          .
          <source>Final rule. Federal Register</source>
          .
          <year>2002</year>
          ;
          <volume>67</volume>
          :
          <fpage>53181</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Douglass</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reisner</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>De-identification algorithm for free-text nursing notes</article-title>
          .
          <source>Comput Cardiol</source>
          .
          <year>2005</year>
          :
          <fpage>331</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Douglas</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reisner</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <article-title>Computer-assisted deidentification of free text in the MIMIC II database</article-title>
          .
          <source>Comput Cardiol</source>
          .
          <year>2004</year>
          :
          <fpage>341</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [39]
          <article-title>Apple Adopts Differential Privacy</article-title>
          , https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [40]
          <string-name>
            <surname>Abadi</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McMahan</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mironov</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwar</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>L</given-names>
          </string-name>
          .
          <article-title>Deep Learning with Differential Privacy</article-title>
          , https://arxiv.org/abs/1607.00133,
          <source>23rd ACM Conference on Computer and Communications Security (ACM CCS)</source>
          ,
          <year>2016</year>
          ,
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [41]
          <article-title>Top 10 entities worldwide leading the innovations and advancements in Differential Privacy</article-title>
          , https://www.linknovate.com/search/?query=%22differential%20privacy%22%2C%22privacy%20by%20design%22,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [42]
          <article-title>Enabling developers and organizations to use differential privacy</article-title>
          , https://developers.googleblog.com/2019/09/enabling-developers-andorganizations.html,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Kifer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ashwin</given-names>
            <surname>Machanavajjhala</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>No free lunch in data privacy</article-title>
          .
          <source>In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD '11)</source>
          . ACM, New York, NY, USA,
          <fpage>193</fpage>
          -
          <lpage>204</lpage>
          . DOI: https://doi.org/10.1145/1989323.1989345
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>David</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <source>Theory of Relational Databases</source>
          , Computer Science Press; 1st edition (March
          <year>1983</year>
          ), ISBN 978-0914894421.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>David</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Montserrat</given-names>
            <surname>Batet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Viejo</surname>
          </string-name>
          ,
          <source>Detecting Sensitive Information from Textual Documents: An Information-Theoretic Approach, Modeling Decisions for Artificial Intelligence</source>
          , Springer Berlin Heidelberg,
          <year>2012</year>
          ,
          <fpage>173</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [46]
          <article-title>The Facebook and Cambridge Analytica scandal, explained with a simple diagram</article-title>
          , https://www.vox.com/policy-and-politics/2018/3/23/17151916/facebook-cambridge-analytica-trump-diagram,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [47]
          <article-title>Google's Sundar Pichai was grilled on privacy, data collection, and China during congressional hearing</article-title>
          , https://www.cnbc.com/2018/12/11/google-ceo-sundar-pichai-testifies-before-congress-on-bias-privacy.html,
          <source>Accessed on 2019-10-20</source>
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [48]
          <string-name>
            <surname>Abdullah</surname>
            <given-names>Ahmad</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zhuge</surname>
            <given-names>Qingfeng</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>From Relational Databases to NoSQL Databases: Performance Evaluation</article-title>
          .
          <source>Research Journal of Applied Sciences, Engineering and Technology</source>
          .
          <volume>11</volume>
          .
          <fpage>434</fpage>
          -
          <lpage>439</lpage>
          . DOI: 10.19026/rjaset.11.1799.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [49] GDPR, https://eugdpr.org,
          <source>Accessed on 2019-10-15</source>
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [50]
          <article-title>Trump order bans US firms from dealing with Huawei</article-title>
          , https://www.techradar.com/news/trump-order-bans-us-firms-from-dealing-with-huawei,
          <source>Accessed on 2019-10-25.</source>
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [51]
          <article-title>If you have a smart TV, take a closer look at your privacy settings</article-title>
          , https://www.cnbc.com/2017/03/09/if-you-have-a-smart-tv-take-a-closer-look-at-your-privacy-settings.html,
          <source>Accessed on 2019-10-12</source>
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [52]
          <article-title>Privacy and Human Rights</article-title>
          , http://gilc.org/privacy/survey/intro.html,
          <source>Accessed on 2019-10-15</source>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [53]
          <string-name>
            <surname>Hecht</surname>
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jablonski</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>NoSQL evaluation</article-title>
          .
          <source>Proceedings of the International Conference on Cloud and Service Computing</source>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>41</lpage>
          .
          [54]
          <string-name>
            <surname>Weggenmann</surname>
            <given-names>B.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kerschbaum</surname>
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining</article-title>
          .
          <source>SIGIR '18</source>
          , July
          <fpage>8</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2018</year>
          , Ann Arbor, MI, USA
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [55]
          <article-title>Stanford CoreNLP</article-title>
          , https://stanfordnlp.github.io/CoreNLP,
          <source>Accessed on 2019-10-02</source>
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [56]
          <string-name>
            <surname>Dernoncourt</surname>
            <given-names>Franck</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Lee</surname>
            <given-names>Ji</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Szolovits</surname>
            <given-names>Peter</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>NeuroNER: an easy-to-use program for named-entity recognition based on neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [57]
          <article-title>GATE</article-title>
          , https://gate.ac.uk/gate/doc/papers.html,
          <source>Accessed on 2019-10-19</source>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [58]
          <article-title>Natural Language Understanding</article-title>
          , https://www.ibm.com/watson/services/natural-language-understanding,
          <source>Accessed on 2019-10-10</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>