<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Trienes</string-name>
          <email>jan.trienes@nedap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dolf Trieschnigg</string-name>
          <email>dolf.trieschnigg@nedap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christin Seifert</string-name>
          <email>c.seifert@utwente.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Djoerd Hiemstra</string-name>
          <email>djoerd.hiemstra@ru.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nedap Healthcare</institution>
          ,
          <addr-line>Groenlo</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Radboud University</institution>
          ,
          <addr-line>Nijmegen</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Twente</institution>
          ,
          <addr-line>Enschede</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        With the strong adoption of electronic health records (EHRs), large
quantities of unstructured medical patient data become available.
This data offers significant opportunities to advance medical
research and to improve healthcare-related services. However, it has
to be ensured that the privacy of a patient is protected when
performing secondary analysis of medical data. This is not only an
ethical prerequisite, but also a legal requirement imposed by
privacy legislations such as the US Health Insurance Portability and
Accountability Act (HIPAA) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the European General Data
Protection Regulation (GDPR) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To facilitate privacy protection,
de-identification has been proposed as a process that removes or
masks any kind of protected health information (PHI) of a patient
such that it becomes difficult to establish a link between an
individual and the data [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. What type of information constitutes PHI
is in part defined by privacy laws of the corresponding country.
      </p>
      <p>
        For instance, the HIPAA regulation defines 18 categories of PHI
including names, geographic locations, and phone numbers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
According to the HIPAA safe-harbor rule, data is no longer
personally identifying and subject to the privacy regulation if these 18 PHI
categories have been removed. As the GDPR does not provide such
clear PHI definitions, we employ the HIPAA definitions throughout
this paper.
      </p>
      <p>
        As most EHRs consist of unstructured, free-form text, manual
de-identification is a time-consuming and error-prone process which
does not scale to the amounts of data needed for many data mining
and machine learning scenarios [
        <xref ref-type="bibr" rid="ref21 ref7">7, 21</xref>
        ]. Therefore, automatic
de-identification methods are desirable. Previous research proposed
a wide range of methods that make use of natural language
processing techniques including rule-based matching and machine
learning [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. However, most evaluations are constrained to
medical records written in the English language. The generalizability of
de-identification methods across languages and domains is largely
unexplored.
      </p>
      <p>
        To test the generalizability of existing de-identification methods,
we annotated a new dataset of 1260 medical records from three
sectors of Dutch healthcare: elderly care, mental care and disabled
care (Section 3). Figure 1 shows an example record with annotated
PHI. We then compare the performance of the following three
de-identification methods on this data (Section 4):
(1) A rule-based system named DEDUCE developed for Dutch
psychiatric clinical notes [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
(2) A feature-based Conditional Random Field (CRF) as described
in Liu et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
(3) A deep neural network with a bidirectional long short-term
memory architecture and a CRF output layer (BiLSTM-CRF) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
We test the transferability of each method across three domains
of Dutch healthcare. Finally, the generalizability of the methods is
compared across languages using two widely used English
benchmark corpora (Section 5).
      </p>
      <p>This paper makes three main contributions. First, our
experiments show that the only openly available de-identification method
for the Dutch language fails to generalize to other Dutch medical
domains. This highlights the importance of a thorough evaluation of
the generalizability of de-identification methods. Second, we offer a
novel comparison of several state-of-the-art de-identification
methods both across languages and domains. Our experiments show that
a popular neural architecture generalizes best even when limited
amounts of training data are available. The neural method only
considers word/character sequences which we find to be sufficient and
more robust across languages and domains compared to the
structural features employed by traditional machine learning approaches.
However, our experiments also reveal that the neural method may
still experience a substantially lower performance in new domains.
A direct consequence for de-identification practitioners is that
pretrained models require additional fine-tuning to be fully applicable
to new domains. Third, we share our pre-trained models and code
with the research community. The creation of these resources
requires a significant time effort and access to
sensitive medical data. We anticipate that this resource is of direct value
to text mining researchers.</p>
      <p>Figure 1: Example record with annotated PHI. PHI tags are shown in square brackets after each annotated span.</p>
      <p>Medische overdracht Datum 26-04-2017 [DATE] (patiënt nr. 64088 [ID])
Instelling Duinendaal [CARE INSTITUTE]
Datum verrichting 24-04-2017 [DATE] Tijdstip 23:45
S regel: VG ALS Heeft sonde deze is eruit, alle medicatie al gekregen.
Familie is boos, dhr heeft last van slijmvorming. Is hier iets aan te doen?
O regel: NV
E regel: Slijmvorming
P regel: Nu niet direct op te lossen.</p>
      <p>ICPC code A45.00 (Advies/observatie/voorlichting/dieet)
Patiënt Dhr. Jan P. Jansen [NAME]
(M), 06-11-1956 [DATE]</p>
      <p>Arts
J.O. Besteman [NAME] Adres Wite Mar 782 Kamerik [ADDRESS]
Verrichting Telefonisch consult ANW (t: 06-7802651 [PHONE/FAX])</p>
      <p>English translation:</p>
      <p>Medical transfer date 26-04-2017 [DATE] (patient no. 64088 [ID])
Institution Duinendaal [CARE INSTITUTE]
Date 24-04-2017 [DATE] Time 23:45
Subjective (S): VG ALS got feeding tube removed, already received all
medication. Family is upset, Mr. suffers from increased mucus formation. Can
anything be done about that?
Objective (O): NV
Evaluation (E): Mucus formation
Plan (P): Cannot be solved immediately.</p>
      <p>ICPC code A45.00 (Advice/observation/information/diet)
Patient Mr. Jan P. Jansen [NAME]
(M), 06-11-1956 [DATE]</p>
      <p>Doctor
J.O. Besteman [NAME] Address Wite Mar 782 Kamerik [ADDRESS]
Provided phone consult ANW (t: 06-7802651 [PHONE/FAX])</p>
      <p>
        This work was presented at the first Health Search and Data
Mining Workshop (HSDM 2020) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The implementation of the
de-identification systems, pre-trained models and code for running
the experiments is available at: github.com/nedap/deidentify.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>Previous work on de-identification can be roughly organized into
four groups: (1) creation of benchmark corpora, (2) approaches to
de-identification, (3) work on languages other than English, and (4)
cross-domain de-identification.</p>
      <p>
        Various English benchmark corpora have been created including
nursing notes, longitudinal patient records and psychiatric intake
notes [
        <xref ref-type="bibr" rid="ref21 ref29 ref31">21, 29, 31</xref>
        ]. Furthermore, Deléger et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] created a
heterogeneous dataset comprised of 22 different document types. Contrary
to the existing datasets which only contain records from at most two
different medical institutes, the data used in this paper was sampled
from a total of 9 institutes that are active in the Dutch healthcare
sector. The contents, structure and writing style of the documents
strongly depend on the processes and individuals specific to an
institute which contributes to a heterogeneous corpus.
      </p>
      <p>
        Most existing de-identification approaches are either rule-based
or machine learning based. Rule-based methods combine various
heuristics in the form of patterns, lookup lists and fuzzy string
matching to identify PHI [
        <xref ref-type="bibr" rid="ref11 ref21">11, 21</xref>
        ]. The majority of machine learning
approaches employ feature-based CRFs [
        <xref ref-type="bibr" rid="ref1 ref12">1, 12</xref>
        ], ensembles combining
CRFs with rules [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and most recently also neural networks [
        <xref ref-type="bibr" rid="ref18 ref5">5, 18</xref>
        ].
A thorough overview of the different de-identification methods
is given in Meystre [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In this study, we compare several
state-of-the-art de-identification methods. With respect to rule-based
approaches, we apply DEDUCE, a recently developed method for
Dutch data [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. To the best of our knowledge, this is the only
openly available de-identification method tailored to Dutch data.
For a feature-based machine learning method, we re-implement
the token-level CRF by Liu et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Previous work on neural
de-identification used a BiLSTM-CRF architecture with character-level
and ELMo embeddings [
        <xref ref-type="bibr" rid="ref15 ref5">5, 15</xref>
        ]. Similarly, we use a BiLSTM-CRF
but apply recent advances in neural sequence modeling by using
contextual string embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        To the best of our knowledge, this is the first study to offer a
comparison of de-identification methods across languages. With
respect to de-identification in languages other than English, only
three studies consider Dutch data. Scheurwegs et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] applied a
Support Vector Machine and a Random Forest classifier to a dataset
of 200 clinical records. Menger et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] developed and released a
rule-based method on 400 psychiatric nursing notes and treatment
plans of a single Dutch hospital. Tjong Kim Sang et al. [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]
evaluated an existing named entity tagger for the de-identification of
autobiographic emails on publicly available Wikipedia texts.
Furthermore, de-identification in several other languages has been
studied including German, French, Korean and Swedish [
        <xref ref-type="bibr" rid="ref22 ref26">22, 26</xref>
        ].
      </p>
      <p>
        With respect to cross-domain de-identification, the 2016 CEGS
N-GRID shared task evaluated the portability of pre-trained
de-identification methods to a new set of English psychiatric records [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
Overall, the existing systems did not perform well on the new data.
Here, we provide a similar comparison by cross-testing on three
domains of Dutch healthcare.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 DATASETS</title>
      <p>
        This section describes the construction of our Dutch benchmark
dataset called NUT (Nedap/University of Twente). The data was
sampled from 9 healthcare institutes and annotated for PHI
according to a tagging scheme derived from Stubbs and Uzuner [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
Furthermore, following common practice in the preparation of
de-identification corpora, we replaced PHI instances with realistic
surrogates to comply with privacy regulations. To compare the
performance of the de-identification methods across languages, we use
the English i2b2/UTHealth and the nursing notes corpus [
        <xref ref-type="bibr" rid="ref21 ref31">21, 31</xref>
        ].
An overview of the three datasets can be found in Table 1.
      </p>
      <p>
We sample data from a snapshot of the databases of 9 healthcare
institutes with a total of 83,000 patients. Three domains of healthcare
are equally represented in this snapshot: elderly care, mental care
and disabled care. We consider two classes of documents to sample
from: surveys and progress reports. Surveys are questionnaire-like
forms which are used by the medical staff to take notes during
intake interviews, record the outcomes of medical tests or to formalize
the treatment plan of a patient. Progress reports are short
documents describing the current conditions of a patient receiving care,
sometimes on a daily basis. The use of surveys and progress reports
differs strongly across healthcare institutes and domains. In total,
this snapshot consists of 630,000 surveys and 13 million progress
reports.
      </p>
      <p>
        When sampling from the snapshot described above, we aim to
maximize both the variety of document types, and the variety of PHI,
two essential properties of a de-identification benchmark corpus
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. First, to ensure a wide variety of document types, we select
surveys in a stratified fashion according to their type label provided
by the EHR system (e.g., intake interview, care plan, etc.). Second, to
maximize the variety in PHI, we sample medical reports on a patient
basis: for each patient, a random selection of 10 medical reports is
combined into a patient file. We then select patient files uniformly at
random to ensure that no patient appears multiple times within the
sample. Furthermore, to control the annotation effort, we impose
two subjective limits on the document length. A document has
to contain at least 50 tokens, but no more than 1000 tokens to be
included in the sample. For each of the 9 healthcare institutes, we
sample 140 documents (70 surveys and 70 patient files), which yields
a total sample size of 1260 documents (see Table 1).
      </p>
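      <p>The sampling constraints described above can be sketched as follows. This is a minimal illustration and not the sampling code used for the corpus: the record layout (a list of patient id and text pairs), the whitespace tokenization, the decision to apply the length limits to the combined patient file, and all function names are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict

MIN_TOKENS, MAX_TOKENS = 50, 1000   # length limits stated in the text
REPORTS_PER_PATIENT = 10            # reports combined into one patient file


def within_length_limits(text):
    """Keep documents with 50 to 1000 tokens (whitespace split is an assumption)."""
    return MIN_TOKENS <= len(text.split()) <= MAX_TOKENS


def build_patient_files(reports, rng):
    """Combine up to 10 randomly chosen reports per patient into one file.

    `reports` is a hypothetical list of (patient_id, text) pairs.
    """
    by_patient = defaultdict(list)
    for patient_id, text in reports:
        by_patient[patient_id].append(text)
    files = {}
    for patient_id, texts in by_patient.items():
        chosen = rng.sample(texts, min(REPORTS_PER_PATIENT, len(texts)))
        files[patient_id] = "\n".join(chosen)
    return files


def sample_patient_files(files, n, rng):
    """Draw n distinct patients uniformly at random, so no patient repeats."""
    eligible = [pid for pid, text in files.items() if within_length_limits(text)]
    return rng.sample(eligible, min(n, len(eligible)))
```
]]></preformat>
      <p>With 70 surveys and 70 patient files drawn for each of the 9 institutes, this procedure yields 9 × 140 = 1260 documents.</p>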
      <p>We received approval for the collection and use of our dataset
from the ethics review board of our institution. Due to privacy
regulations, the dataset constructed in this paper cannot be shared.</p>
    </sec>
    <sec id="sec-4">
      <title>3.2 Annotation Scheme</title>
      <p>
        Since the GDPR does not provide any strict rules about which types
of PHI should be removed during de-identification, we base our PHI
tagging scheme on the guidelines defined by the US HIPAA
regulations. In particular, we closely follow the annotation guidelines
and the tagging scheme used by Stubbs and Uzuner [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] which
consists of 32 PHI tags among 8 classes: Name, Profession, Location,
Age, Date, Contact Information, IDs and Other. The Other category
is used for information that can be used to identify a patient, but
which does not fall into any of the remaining categories. For
example, the sentence “the patient was a guest speaker on the subject of
diabetes in the Channel 2 talkshow.” would be tagged as Other. It is
worth mentioning that this tagging scheme does not only capture
direct identifiers relating to a patient (e.g., name and date of birth),
but also indirect identifiers that could be used in combination with
other information to reveal the identity of a patient. Indirect
identifiers include, for example, the doctor’s name, information about
the hospital and a patient’s profession.
      </p>
      <p>
        We made two adjustments to the tagging scheme by Stubbs and
Uzuner [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. First, to reduce the annotation effort, we merged some
of the 32 fine-grained PHI tags to a more generic set of 16 tags
(see Table 2). For example, the fine-grained location tags Street,
City, State, ZIP, and Country were merged into a generic Address
tag. While this simplifies the annotation process, it complicates the
generation of realistic surrogates. Given an address string, one has
to infer its format to replace the individual parts with surrogates of
the same semantic type. We address this issue in Section 3.4. Second,
due to the high frequency of care institutes in our dataset, we
decided to introduce a separate Care Institute tag that complements
the Organization tag. This allows for a straightforward surrogate
generation where the name of a care institute is replaced with that of another
care institute rather than with a more generic company name (e.g.,
Google).
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.3 Annotation Process</title>
      <p>
        Following previous work on the construction of de-identification
benchmark corpora [
        <xref ref-type="bibr" rid="ref31 ref4">4, 31</xref>
        ], we employ a double-annotation
strategy: two annotators read and tag the same documents. In total, 12
non-domain experts annotated the sample of 1260 medical records
independently and in parallel. The documents were randomly split
into 6 sets and we randomly assigned a pair of annotators to each
set. To ensure that the annotators had a common understanding of
the annotation instructions, an evaluation session was held after
each pair of annotators completed the first 20 documents (see footnote 1). In total,
it took 77 hours to double-annotate the entire dataset of 1260
documents, or approximately 3.7 minutes per document. We measured
the inter-annotator agreement (IAA) using entity-level F1 scores (see footnote 2).
Table 3 shows the IAA per PHI category. Overall, the agreement
level is fairly high (0.84). However, we find that location names (i.e.,
care institutes, hospitals, organizations and internal locations) are
often highly ambiguous which is reflected by the low agreement
scores of these categories (between 0.29 and 0.52).
      </p>
      <p>
        To improve annotation efficiency, we integrated the rule-based
de-identification tool DEDUCE [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] with our annotation software
to pre-annotate each document. This functionality could be
activated on a document basis by each annotator. If an annotator
used this functionality, they had to review the pre-annotations,
correct potential errors and check for missed PHI instances. During
the evaluation sessions, annotators mentioned that the existing
tool proved helpful when annotating repetitive names, dates and
email addresses. Note that this pre-annotation strategy might give
DEDUCE a slight advantage. However, the low performance of
DEDUCE in the formal benchmark in Section 5 does not reflect this.
      </p>
      <p>
        After annotation, the main author of this paper reviewed 19,165
annotations and resolved any disagreements between the two
annotators to form the gold standard of 17,464 PHI annotations. Table 3
shows the distribution of PHI tags after adjudication. Overall, the
adjudication was risk-averse: if only one annotator
identified a piece of text as PHI, we assume that the other annotator
missed this potential PHI instance. In addition to the manual
adjudication, we performed two automatic checks: (1) we ensured
that PHI instances occurring in multiple files received the same
PHI tag, and (2) any instances that were tagged in one part of the
corpus but not in the other were manually reviewed and added
to the gold standard. We used the BRAT annotation tool for both
annotation and adjudication [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
      </p>
      <p>
        Footnote 1: We include the annotation instructions that were provided to the annotators in
the online repository of this paper. The instructions are in large parts based on the
annotation guidelines in Stubbs and Uzuner [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
      <p>
        Footnote 2: It has been shown that the F-score is more suitable to quantify IAA in
sequence-tagging scenarios compared to other measures such as the Kappa score [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.4 Surrogate Generation</title>
      <p>
        As the annotated dataset consists of personally identifying
information which is protected by the GDPR, we generate artificial
replacements for each of the PHI instances before using the data
for the development of de-identification methods. This process is
known as surrogate generation, a common practice in the
preparation of de-identification corpora [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. As surrogate generation will
inevitably alter the semantics of the corpus to an extent where it
affects the de-identification performance, it is important that this
step is done as thoroughly as possible [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Here, we follow the
semi-automatic surrogate generation procedure that has been used
to prepare the i2b2/UTHealth shared task corpora. Below, we
summarize this procedure and mention the language-specific resources
we used. We refer the reader to Stubbs et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] for a thorough
discussion of the method. After running the automatic replacement
scripts, we reviewed each of the surrogates to ensure that
continuity within a document is preserved and no PHI is leaked into the
new dataset.
      </p>
      <p>
        We adapt the surrogate generation method of Stubbs et al. [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] to
the Dutch language as follows. A list of the 10,000 most common family
names and given names is used to generate random surrogates
for name PHI instances (see footnote 3). We replace dates by first parsing the
format (e.g., “12 nov. 2018” → “%d %b. %Y”; see footnote 4), and then randomly
shifting all dates within a document by the same number of years
and days into the future. For addresses, we match names of cities,
streets, and countries with a dictionary of Dutch locations (see footnote 5), and
then pick random replacements from that dictionary. As Dutch ZIP
codes follow a standard format (“1234AB”), their replacement is
straightforward. Names of hospitals, care institutes, organizations
and internal locations are randomly shuffled within the dataset. PHI
instances of type Age are capped at 89 years. Finally, alphanumeric
strings such as Phone/FAX, Email, URL/IP, SSN and IDs are replaced
by substituting each alphanumeric character with another character
of the same class. We manually rewrite Profession and Other tags,
as an automatic replacement is not applicable.
      </p>
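      <p>Three of the automatic replacement rules above (consistent date shifting, capping ages, and same-class character substitution) can be illustrated with a short sketch. It is not the replacement script used for the corpus: the function names, the numeric date format, the handling of 29 February, and the choice to treat digits, lowercase letters and uppercase letters as separate character classes are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import random
import string
from datetime import datetime, timedelta


def shift_date(text, fmt, years, days):
    """Shift a date string into the future and render it in its original format."""
    parsed = datetime.strptime(text, fmt)
    try:
        shifted = parsed.replace(year=parsed.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year (assumed rule).
        shifted = parsed.replace(year=parsed.year + years, day=28)
    return (shifted + timedelta(days=days)).strftime(fmt)


def cap_age(age, limit=89):
    """Ages are capped at 89 years."""
    return min(age, limit)


def scramble_alphanumeric(text, rng):
    """Replace each character with a random one of the same class.

    Punctuation and whitespace are kept, so the overall format survives.
    A character may by chance be replaced with itself.
    """
    out = []
    for ch in text:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(ch)
    return "".join(out)
```
]]></preformat>
      <p>Using one offset per document keeps the intervals between dates intact: shifting both “26-04-2017” and “24-04-2017” by the same number of years and days preserves the two-day gap between them.</p>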
    </sec>
    <sec id="sec-7">
      <title>4 METHODS</title>
      <p>This section presents the three de-identification methods and the
evaluation procedure.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Rule-based Method: DEDUCE</title>
      <p>
        DEDUCE is an unsupervised de-identification method specifically
developed for Dutch medical records [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. It is based on lookup
tables, decision rules and fuzzy string matching and has been
validated on a corpus of 400 psychiatric nursing notes and treatment
plans of a single hospital. Following the authors’ recommendations,
we customize the method to include a list of 1200 institutions that
are common in our domain. Also, we resolve two incompatibilities
between the PHI coding schemes of our dataset and the DEDUCE
output. First, as DEDUCE does not distinguish between hospitals,
care institutes, organizations and internal locations, we group these
four PHI tags under a single Named Location tag. Second, our Name
annotations do not include titles (e.g., “Dr.” or “Ms.”). Therefore,
titles are stripped from the DEDUCE output.
      </p>
      <sec id="sec-8-1">
        <p>Footnote 3: See www.naamkunde.net, accessed 2019-12-09.</p>
        <p>Footnote 4: Rule-based date parser: github.com/nedap/dateinfer, accessed 2019-12-09.</p>
        <p>Footnote 5: See openov.nl, accessed 2019-12-09.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4.2 Feature-based Method: Conditional Random Field</title>
      <p>
        CRFs and hybrid rule-based systems provide state-of-the-art
performance in recent shared tasks [
        <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
        ]. Therefore, we implement
a CRF approach to contrast with the unsupervised rule-based
system. In particular, we re-implement the token-based CRF method
by Liu et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and re-use a subset (see footnote 6) of their features (see
Table 4). The linear-chain CRF is trained using LBFGS and elastic net
regularization [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Using a validation set, we optimize the two
regularization coefficients of the L1 and L2 norms with a random
search in the log<sub>10</sub> space of [10<sup>-4</sup>, 10<sup>1</sup>] with 250 trials. We use the
CRFSuite implementation by Okazaki [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
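      <p>The random search over the two regularization coefficients can be sketched as follows. This is an illustration under stated assumptions rather than the tuning code used in the experiments: the exponents are drawn uniformly, so that the coefficients are log-uniform in [1e-4, 10], and <monospace>evaluate</monospace> is a hypothetical callback that would train a CRF with the given L1 and L2 coefficients (named c1 and c2 in CRFSuite) and return the validation F1 score.</p>
      <preformat><![CDATA[
```python
import random


def sample_coefficients(rng, low_exp=-4.0, high_exp=1.0):
    """Draw (c1, c2) log-uniformly, i.e. uniformly in log10 space."""
    c1 = 10 ** rng.uniform(low_exp, high_exp)
    c2 = 10 ** rng.uniform(low_exp, high_exp)
    return c1, c2


def random_search(evaluate, n_trials=250, seed=0):
    """Return the best (c1, c2) pair and its score after n_trials random draws.

    `evaluate(c1, c2)` is a hypothetical function returning a validation
    score where higher is better.
    """
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = sample_coefficients(rng)
        score = evaluate(*params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```
]]></preformat>
      <p>Sampling the exponent rather than the coefficient itself spreads the trials evenly over the five orders of magnitude, instead of concentrating almost all of them between 1 and 10.</p>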
    </sec>
    <sec id="sec-11">
      <title>4.3 Neural Method: BiLSTM-CRF</title>
      <p>
        To reduce the need for hand-crafted features in traditional
CRF-based de-identification, recent work applies neural methods [
        <xref ref-type="bibr" rid="ref15 ref18 ref5">5, 15, 18</xref>
        ]. Here, we re-implement a BiLSTM-CRF architecture with
contextual string embeddings, which has recently been shown to provide
state-of-the-art results for sequence labeling tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Hyperparameters are set to the best performing configuration in Akbik et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: we use stochastic gradient descent with no momentum and
an initial learning rate of 0.1. If the training loss does not decrease
for 3 consecutive epochs, the learning rate is halved. Training is
stopped if the learning rate falls below 10<sup>-4</sup> or 150 epochs are
reached. Furthermore, the number of hidden layers in the LSTM
is set to 1 with 256 recurrent units. We employ locked dropout
with a value of 0.5 and use a mini-batch size of 32. With respect to
the embedding layer, we use the pre-trained GloVe (English) and
fastText (Dutch) embeddings on a word-level, and concatenate them
with the pre-trained contextualized string embeddings included in
Flair (see footnote 7) [
        <xref ref-type="bibr" rid="ref10 ref2 ref24">2, 10, 24</xref>
        ].
      </p>
      <p>
        Footnote 6: We disregard word-representation features as Liu et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] found that they had a
negative performance impact.
      </p>
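      <p>The learning-rate schedule and stopping rule described above can be simulated with a few lines of code. This is a simplified sketch and not the Flair training loop: it assumes that “does not decrease” means no improvement over the best training loss seen so far, and the function name and the list of per-epoch losses are hypothetical.</p>
      <preformat><![CDATA[
```python
def learning_rate_history(losses, initial_lr=0.1, patience=3, factor=0.5,
                          min_lr=1e-4, max_epochs=150):
    """Return the learning rate used in each epoch that is actually run.

    `losses` is a hypothetical sequence of training losses, one per epoch.
    """
    lr = initial_lr
    best_loss = float("inf")
    epochs_without_improvement = 0
    history = []
    for loss in losses[:max_epochs]:          # hard stop after 150 epochs
        history.append(lr)
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                lr *= factor                  # halve the learning rate
                epochs_without_improvement = 0
        if lr < min_lr:                       # stop once the rate drops below 1e-4
            break
    return history
```
]]></preformat>
      <p>Under these assumptions, a loss that never improves after the first epoch triggers ten halvings (0.1 × 0.5^10 is the first value below 1e-4), so training ends after 31 epochs.</p>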
    </sec>
    <sec id="sec-12">
      <title>4.4 Preprocessing and Sequence Tagging</title>
      <p>
        We use a common preprocessing routine for all three datasets. For
tokenization and sentence segmentation, the spaCy tokenizer is
used (see footnote 8). The POS/NER features of the CRF method are generated
by the built-in spaCy models. After sentence segmentation, we
tag each token according to the Beginning, Inside, Outside (BIO)
scheme. On rare occasions, sequence labeling methods may produce
invalid transitions (e.g., O- → I-). In a post-processing step, we
replace invalid I- tags with B- tags [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
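      <p>The post-processing step for invalid transitions can be sketched as below. The function name is hypothetical, and treating an I- tag that follows an entity of a different type as invalid is an assumption of this example; the text only mentions the transition from O to I-.</p>
      <preformat><![CDATA[
```python
def fix_invalid_bio(tags):
    """Replace I- tags that do not continue an entity of the same type with B- tags."""
    fixed = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # A valid I- tag must follow B- or I- of the same entity type.
            if previous not in ("B-" + entity_type, "I-" + entity_type):
                tag = "B-" + entity_type
        fixed.append(tag)
        previous = tag
    return fixed
```
]]></preformat>
      <p>Because each corrected tag is used as the context for the next token, a run such as O, I-Name, I-Name becomes O, B-Name, I-Name, which keeps the span as a single entity.</p>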
    </sec>
    <sec id="sec-13">
      <title>4.5 Evaluation</title>
      <p>
        The de-identification methods are assessed according to precision,
recall and F1 computed on an entity-level, the standard
evaluation approach for NER systems [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. In an entity-level evaluation,
predicted PHI offsets and types have to match exactly. Following
the evaluation of de-identification shared tasks, we use the
micro-averaged entity-level F1 score as primary metric [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] (see footnote 9).
      </p>
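      <p>Entity-level scoring with micro-averaging can be illustrated as follows. This is a minimal sketch, not the evaluation script used for the benchmark: it assumes BIO-tagged sentences as input, uses token indices as offsets, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((start, i, current))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return entities


def micro_prf(gold_sequences, predicted_sequences):
    """Micro-averaged precision, recall and F1 over exact span and type matches."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold_sequences, predicted_sequences):
        gold = extract_entities(gold_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)   # same offsets and same type
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
]]></preformat>
      <p>Note that a prediction covering only the first token of a two-token name counts as both a false positive and a false negative, which is what makes the entity-level evaluation stricter than a token-level one.</p>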
      <p>
        We randomly split our dataset and the nursing notes corpus into
training, validation and testing sets with a 60/20/20 ratio. As the
i2b2 corpus has a pre-defined test set of 40%, a random set of 20% of
the training documents serves as validation data. Finally, we test for
statistical significance using two-sided approximate randomization
with N = 9999 [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ].
      </p>
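      <p>A two-sided approximate randomization test can be sketched as below. This is a generic illustration under stated assumptions, not the significance-testing code used in the experiments: <monospace>metric</monospace> is a hypothetical function that scores a list of system outputs against the gold standard, the unit that is swapped between the two systems (for example a document) is an assumption, and the p-value uses the common (r + 1) / (N + 1) estimate.</p>
      <preformat><![CDATA[
```python
import random


def approximate_randomization(metric, gold, outputs_a, outputs_b,
                              trials=9999, seed=0):
    """Estimate a two-sided p-value for the score difference of two systems.

    `metric(gold, outputs)` is a hypothetical scoring function, e.g. a
    micro-averaged F1 score over all units in `outputs`.
    """
    rng = random.Random(seed)
    observed = abs(metric(gold, outputs_a) - metric(gold, outputs_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(outputs_a, outputs_b):
            if rng.random() < 0.5:   # swap the two systems' outputs for this unit
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        difference = abs(metric(gold, shuffled_a) - metric(gold, shuffled_b))
        if difference >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)
```
]]></preformat>
      <p>With N = 9999 trials, the smallest p-value this estimate can return is 1 / 10000.</p>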
    </sec>
    <sec id="sec-14">
      <title>5 RESULTS</title>
      <p>In this section, we first discuss the de-identification results obtained
on our Dutch dataset (Section 5.1). Afterwards, we present an error
analysis of the best performing method (Section 5.2). This section is
concluded with the benchmark for the English datasets (Section 5.3)
and the cross-domain de-identification (Section 5.4).
</p>
    </sec>
    <sec id="sec-15">
      <title>5.1 De-identification of Dutch Dataset</title>
      <p>Both machine learning methods outperform the rule-based system
DEDUCE by a large margin (see Table 5). Furthermore, the
BiLSTM-CRF provides a substantial improvement of 10 percentage points in recall over
the traditional CRF method, while maintaining precision. Overall,
the neural method has an entity-level recall of 87.1% while achieving
a recall of 95.6% for names, showing that the neural method is
operational for many de-identification scenarios. In addition, we
make the following observations.</p>
      <p><sup>7</sup> github.com/zalandoresearch/flair, accessed 2019-12-09.</p>
      <p><sup>8</sup> spacy.io, accessed 2019-12-09.</p>
      <p><sup>9</sup> De-identification systems are often also evaluated at a less strict token level. As a
system that scores high at the entity level will also score high at the token level, we
only measure according to the stricter level of evaluation.</p>
      <sec id="sec-15-1">
        <p>Neural method performs at least as well as the rule-based
method. By inspecting the model performance at the PHI-tag level,
we observe that the neural method outperforms DEDUCE for all
classes of PHI (see Table 6). Only for the Phone and Email
category, the rule-based method has a slightly higher precision.
Similarly, we studied the impact of the training data set size on the
de-identification performance. Both machine learning methods
outperform DEDUCE even with as little training data as 10% of the
total sentences (see Figure 2). This suggests that in most
environments where training data are available (or can be obtained), the
machine learning methods are to be preferred.</p>
        <p>Rule-based method can provide a “safety net.” It can be
observed that DEDUCE performs reasonably well for names, phone
numbers, email addresses and URLs (see Table 6). As these PHI
instances are likely to directly reveal the identity of an individual,
their removal is essential. However, DEDUCE does not generalize
beyond the PHI types mentioned above. Especially named locations
are non-trivial to capture with a rule-based system as their
identification strongly relies on the availability of exhaustive lookup lists.
In contrast, the neural method provides a significant improvement
for named locations (65.9% recall vs. 5.8% for DEDUCE). We assume that
word-level and character-level embeddings provide an effective tool to
capture these entities.</p>
        <p>Initials, IDs and professions are hard to detect. During
annotation, we observed a low F1 annotator agreement of 0.46, 0.43,
and 0.31 for initials, IDs and professions, respectively. This shows
that these PHI types are among the hardest to identify, even for
humans (see Table 3). One possible cause for this is that IDs and
initials are often hard to discriminate from abbreviations and
medical measurements. We observe that the BiLSTM-CRF detects those
PHI classes with high precision but low recall. With respect to
professions, we find that phrases are often wrongly tagged. For
example, colloquial job descriptions (e.g., “works behind the cash
desk”) as opposed to the job title (e.g., “cashier”) make it infeasible
to tackle this problem with lookup lists, while a machine learner
likely requires more training data to capture this PHI.
</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>5.2 Error Analysis on Dutch Dataset</title>
      <p>
        To gain a better understanding of the best performing model and an
intuition for its limitations, we conduct a manual error analysis of
the false positives (FPs) and false negatives (FNs) produced by the
BiLSTM-CRF on the test set. We discuss the error categorization
scheme in Section 5.2.1 and present the results in Section 5.2.2.
      </p>
      <p>
        5.2.1 Error Categorization. We distinguish between two error groups:
(1) modeling errors, and (2) annotation/preprocessing errors. We
define modeling errors to be problems that can be addressed with
different de-identification techniques and additional training data.
In contrast, annotation and preprocessing errors are not directly
caused by the sequence labeling model, but are issues in the
training data or the preprocessing pipeline which need to be addressed
manually. Inspired by the classification scheme of Dernoncourt
et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we consider the following sources of modeling errors:
        <list list-type="bullet">
          <list-item><p>Abbreviation. PHI instances which are abbreviations or acronyms for names, care institutes and companies. These are hard to detect and can be ambiguous as they are easily confused with medical terms and measurements.</p></list-item>
          <list-item><p>Ambiguity. A human reader may be unable to decide whether a given text fragment is PHI.</p></list-item>
          <list-item><p>Debatable. It can be argued that the token should not have been annotated as PHI.</p></list-item>
          <list-item><p>Prefix. Names of internal locations, organizations and companies are often prefixed with articles (i.e., “de” and “het”). Sometimes, it is unclear whether the prefix is part of the official name or part of the sentence construction. This ambiguity is reflected in the training data, which causes the model to inconsistently include or exclude those prefixes.</p></list-item>
          <list-item><p>Common Language. PHI instances consisting of common language are hard to discriminate from the surrounding text.</p></list-item>
          <list-item><p>Other. Remaining modeling errors that do not fall into the categories mentioned above. In those cases, it is not immediately apparent why the misclassification occurs.</p></list-item>
        </list>
      </p>
      <p>Annotation and preprocessing errors are categorized as follows:
        <list list-type="bullet">
          <list-item><p>Missing Annotation. The text fragment is PHI, but was missed during the annotation phase.</p></list-item>
          <list-item><p>Annotation Error. The annotator assigned an invalid entity boundary.</p></list-item>
          <list-item><p>Tokenization Error. The annotated text span could not be split into a compatible token span. Those tokens were marked as “Outside (O)” during BIO tagging.</p></list-item>
        </list>
      </p>
      <p>We consider all error categories to be mutually exclusive.</p>
      <p>5.2.2 Results of Error Analysis. Table 7 summarizes the error
analysis results and shows the absolute and relative frequency of each
error category. Overall, we find that the majority of modeling
errors cannot be easily explained through human inspection (“Other
reason” in Table 7). The remaining errors are mainly caused by
ambiguous PHI instances and preprocessing errors. In more detail,
we make the following observations:</p>
      <sec id="sec-16-1">
        <title>Abbreviations are the second most common cause of modeling errors (13.9% of FNs, 9.7% of FPs).</title>
        <p>We hypothesize that more
training data will likely not in itself help to correctly identify this
type of PHI. It is conceivable to design custom features (e.g., based
on shape, positioning in a sentence, presence/absence in a medical
dictionary) to increase precision. However, it is an open question
how recall can be improved.</p>
        <p>[Table 7 (header residue): counts and proportions per error category are reported separately for FNs (n = 469) and FPs (n = 288).]</p>
      </sec>
      <sec id="sec-16-2">
        <title>PHI instances consisting of common language are likely to be wrongly tagged (7.5% of FNs, 3.1% of FPs).</title>
        <p>This is caused by the
fact that there are insufficient training examples where common
language is used to refer to PHI. For example, the organization
name in the sentence “Vandaag heb ik Beter Horen gebeld” (Eng:
“I called Beter Horen today”) was incorrectly classified as non-PHI.
Each individual word, and also the combination of the two words,
can be used in different contexts without referring to PHI. However,
in this specific context, it is apparent that “Beter Horen” must refer
to an organization.</p>
        <p>A substantial number of errors are due to annotation and
preprocessing issues. Annotation errors (4.5% of FNs, 6.3% of FPs) can
be resolved by correcting the respective PHI offsets in the gold
standard. Tokenization errors (8.7% of FNs, 6.9% of FPs) need to be fixed
through a different preprocessing routine. For example, the
annotation &lt;DATE 2016&gt;/&lt;DATE 2017&gt; should have been split into [2016,
/, 2017] with BIO tagging [B, O, B]. However, the spaCy
tokenizer segmented this text into a single token [2016/2017]. In this
case, entity boundaries no longer align with token boundaries,
which results in an invalid BIO tagging of [O] for the entire span.</p>
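        <p>The tokenization error above can be reproduced with a small alignment routine. The following is a minimal sketch and not the preprocessing code used in the experiments: the function name spans_to_bio and the representation of tokens and annotations as character offsets are assumptions.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

Token = Tuple[int, int]      # hypothetical (start, end) character offsets
Span = Tuple[int, int, str]  # hypothetical (start, end, PHI type)


def spans_to_bio(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Map character-level annotations onto tokens using the BIO scheme."""
    tags = ["O"] * len(tokens)
    starts = {start: i for i, (start, _) in enumerate(tokens)}
    ends = {end: i for i, (_, end) in enumerate(tokens)}
    for start, end, phi_type in spans:
        # A span whose boundaries cut through a token cannot be aligned;
        # its tokens keep the "O" tag, as in the error described above.
        if start not in starts or end not in ends:
            continue
        first, last = starts[start], ends[end]
        tags[first] = "B-" + phi_type
        for i in range(first + 1, last + 1):
            tags[i] = "I-" + phi_type
    return tags


# Annotations for the text "2016/2017": two separate dates.
annotations = [(0, 4, "Date"), (5, 9, "Date")]
print(spans_to_bio([(0, 4), (4, 5), (5, 9)], annotations))  # three tokens
print(spans_to_bio([(0, 9)], annotations))                  # one fused token
```
]]></preformat>
        <p>With three tokens, both dates are tagged; with a single fused token, neither annotation aligns and the token is left as Outside.</p>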
      </sec>
      <sec id="sec-16-3">
        <title>Several false positives are in fact PHI and should be annotated.</title>
        <p>The model identifies several PHI instances which were missed
during the annotation phase (11.5% of the FPs). Once more, this
demonstrates that proper de-identification is an error-prone task
for human annotators.
</p>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>5.3 De-identification of English Datasets</title>
      <p>
        When training and testing both machine learning methods on the
English i2b2 and the nursing notes datasets, we can observe that
the BiLSTM-CRF significantly outperforms the CRF in both cases
(see Table 5). Similar to our Dutch dataset, the neural method
provides an increase of up to 11.2 percentage points in recall (nursing notes)
while the precision remains relatively stable. This shows that the
neural method has the best generalization capabilities even across
languages. More importantly, it does not require the development of
domain-specific lookup lists or sophisticated pattern matching rules.
To put the results into perspective: the second-highest ranked team
in the i2b2 2014 challenge used a sophisticated ensemble combining
a CRF with domain-specific rules [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Their system obtained an
entity-level F1 score of 0.9124, which is on par with the performance
of our neural method that requires no configuration. We can expect
that the performance of the neural method further improves after
hyperparameter optimization. Finally, note that both machine
learning methods can be easily applied to a new PHI tagging scheme,
whereas rule-based methods are limited to the PHI definition they
were developed for.
      </p>
    </sec>
    <sec id="sec-18">
      <title>5.4 Cross-domain De-identification</title>
      <p>In many de-identification scenarios, heterogeneous training data
from multiple medical institutes and domains are not available.
This raises the question of how well a model that has been trained
on a homogeneous set of medical records generalizes to records
of other medical domains. We trained the three de-identification
methods on one domain of Dutch healthcare (e.g., elderly care) and
tested each model on the records of the remaining two domains (e.g.,
disabled care and mental care). We followed the same training and
evaluation procedures described in Section 4.5. Table 8 summarizes
the performance of each method on the different tasks.</p>
      <p>Again, the neural method consistently outperforms the
rule-based and feature-based methods in all three domains, which
suggests that it is a fair default choice for de-identification. This is
underlined by the fact that the amount of training data is severely
limited in this experiment: each domain only has 420 documents,
of which 20% are reserved for testing. Interestingly,
DEDUCE performs rather stably and even outperforms the CRF
within the domain of elderly care.</p>
      <p>Given an ideal de-identification method, one would expect that
performance on unseen data of a different domain is similar to
the test score obtained on the available (homogeneous) data.
Table 9 shows a performance breakdown for each of the three testing
domains for the neural method. It can be seen that in 4 out of 6
cases, the test score in a new domain is lower than the test score
obtained on the in-domain data. The largest delta between the observed
in-domain test score (disabled care, 0.919 F1) and the performance in
the transfer domain (elderly care, 0.698 F1) is 0.221 in F1. This raises
an important point when performing de-identification in practice:
while the neural method shows the best generalization capabilities
compared to the other de-identification methods, the performance
can still be significantly lower when applying a pre-trained model
in new domains.</p>
      <p>[Table 9 (header residue): Training Domain and Test Domain, each covering Elderly, Disabled and Mental.]</p>
    </sec>
    <sec id="sec-19">
      <title>5.5 Limitations</title>
      <p>
        While the contextual string embeddings used in this paper have been
shown to provide state-of-the-art results for NER [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
transformer-based architectures for contextual embeddings have also gained
significant attention (e.g., BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). It would make an interesting
experiment to benchmark different types of pre-trained
embeddings for the task of de-identification. Furthermore, we observe
that the neural method provides strong performance even with
limited training data (see Figure 2). It is unclear what contribution
large pre-trained embeddings make in those scenarios, which
warrants an ablation study testing different model configurations. We
leave the exploration of those ideas to future research.
      </p>
    </sec>
    <sec id="sec-20">
      <title>6 CONCLUSION</title>
      <p>This paper presents the construction of a novel Dutch dataset and
a comparison of state-of-the-art de-identification methods across
Dutch and English medical records. Our experiments show the
following. (1) An existing rule-based method for the Dutch language
does not generalize well to new domains. (2) If one is looking for an
out-of-the-box de-identification method, neural approaches show
the best generalization performance across languages and domains.
(3) When testing across different domains, a substantial decrease in
performance has to be expected, an important consideration when
applying de-identification in practice.</p>
      <p>
        There are several directions for future work. Motivated by the
limited generalizability of pre-trained models across different
domains, transfer learning techniques can provide a way forward. A
preliminary study by Lee et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] shows that they can be
beneficial for de-identification. Finally, our experiments show that phrases
such as professions are among the most difficult information to
de-identify. It is an open challenge how to design methods that can
capture this type of information.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>John S.</given-names>
            <surname>Aberdeen</surname>
          </string-name>
          , Samuel Bayer, Reyyan Yeniterzi, Benjamin Wellner, Cheryl Clark,
          <string-name>
            <given-names>David A.</given-names>
            <surname>Hanauer</surname>
          </string-name>
          , Bradley Malin, and
          <string-name>
            <given-names>Lynette</given-names>
            <surname>Hirschman</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The MITRE Identification Scrubber Toolkit: Design, training, and assessment</article-title>
          .
          <source>I. J. Medical Informatics</source>
          <volume>79</volume>
          ,
          <issue>12</issue>
          (
          <year>2010</year>
          ),
          <fpage>849</fpage>
          -
          <lpage>859</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Akbik</surname>
          </string-name>
          , Tanja Bergmann, and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Pooled Contextualized Embeddings for Named Entity Recognition</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <fpage>724</fpage>
          -
          <lpage>728</lpage>
          . https://doi.org/10.18653/v1/N19-1078
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Alan</given-names>
            <surname>Akbik</surname>
          </string-name>
          , Duncan Blythe, and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Contextual String Embeddings for Sequence Labeling</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics</source>
          ,
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          . https://www.aclweb.org/anthology/C18-1139
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Louise</given-names>
            <surname>Deléger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Todd</given-names>
            <surname>Lingren</surname>
          </string-name>
          , Megan Kaiser, Katalin Molnár, Laura Stoutenborough, Michal Kouril, Keith Marsolo, and
          <string-name>
            <given-names>Imre</given-names>
            <surname>Solti</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Building Gold Standard Corpora for Medical Natural Language Processing Tasks</article-title>
          .
          <source>In AMIA 2012, American Medical Informatics Association Annual Symposium, Chicago, Illinois, USA, November 3-7, 2012</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Franck</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          , Ji Young Lee,
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Uzuner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Szolovits</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>De-identification of patient notes with recurrent neural networks</article-title>
          .
          <source>JAMIA 24</source>
          ,
          <issue>3</issue>
          (
          <year>2017</year>
          ),
          <fpage>596</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>Computing Research Repository</source>
          arXiv:1810.04805 (
          <year>2018</year>
          ). http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Margaret</given-names>
            <surname>Douglass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gari D.</given-names>
            <surname>Clifford</surname>
          </string-name>
          , Andrew Reisner, George B. Moody, and Roger G. Mark.
          <year>2004</year>
          .
          <article-title>Computer-assisted de-identification of free text in the MIMIC II database</article-title>
          . In Computers in Cardiology,
          <year>2004</year>
          . IEEE,
          <fpage>341</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          , Yubin Kim, and
          <string-name>
            <given-names>Ryen</given-names>
            <surname>White</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Overview of the Health Search and Data Mining (HSDM 2020) Workshop</article-title>
          .
          <source>In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20)</source>
          . ACM, New York, NY, USA. https://doi.org/10.1145/3336191.3371879
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>GDPR.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (Data Protection Directive)</article-title>
          .
          <source>Official Journal of the European Union L119</source>
          (
          <year>2016</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learning Word Vectors for 157 Languages</article-title>
          .
          <source>In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)</source>
          . https://www.aclweb.org/anthology/L18-1550
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Dilip</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Melissa Saul, and
          <string-name>
            <given-names>John</given-names>
            <surname>Gilbertson</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Evaluation of a Deidentification (De-Id) Software Engine to Share Pathology Reports and Clinical Documents for Research</article-title>
          .
          <source>American Journal of Clinical Pathology</source>
          <volume>121</volume>
          ,
          <issue>2</issue>
          (
          <year>2004</year>
          ),
          <fpage>176</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Bin</given-names>
            <surname>He</surname>
          </string-name>
          , Yi Guan, Jianyi Cheng, Keting Cen, and
          <string-name>
            <given-names>Wenlan</given-names>
            <surname>Hua</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>CRFs Based De-identification of Medical Records</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          , S (
          <year>2015</year>
          ),
          <fpage>S39</fpage>
          -
          <lpage>S46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>HIPAA</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Health Insurance Portability and Accountability Act</article-title>
          .
          <source>Public Law 104-191</source>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>HIPAA</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule</article-title>
          .
          Retrieved December 09, 2019 from https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/deidentification/index.html
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Kaung</given-names>
            <surname>Khin</surname>
          </string-name>
          , Philipp Burckhardt, and
          <string-name>
            <given-names>Rema</given-names>
            <surname>Padman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Deep Learning Architecture for De-identification of Patient Notes: Implementation and Evaluation</article-title>
          .
          <source>Computing Research Repository</source>
          arXiv:1810.01570 (
          <year>2018</year>
          ). http://arxiv.org/abs/1810.01570
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Ji Young</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Franck</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Szolovits</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Transfer Learning for Named-Entity Recognition with Neural Networks</article-title>
          .
          <source>In Proceedings of the 11th Language Resources and Evaluation Conference</source>
          . Miyazaki, Japan,
          <fpage>4470</fpage>
          -
          <lpage>4473</lpage>
          . https://www.aclweb.org/anthology/L18-1708
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Zengjian</given-names>
            <surname>Liu</surname>
          </string-name>
          , Yangxin Chen, Buzhou Tang, Xiaolong Wang, Qingcai Chen,
          <string-name>
            <given-names>Haodi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jingfeng</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          Qiwen Deng, and
          <string-name>
            <given-names>Suisong</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Automatic de-identification of electronic medical records using token-level and character-level conditional random fields</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          (
          <year>2015</year>
          ),
          <fpage>S47</fpage>
          -
          <lpage>S52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Zengjian</given-names>
            <surname>Liu</surname>
          </string-name>
          , Buzhou Tang,
          <string-name>
            <given-names>Xiaolong</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Qingcai</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Deidentification of Clinical Notes via Recurrent Neural Network and Conditional Random Field</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          , S (
          <year>2017</year>
          ),
          <fpage>S34</fpage>
          -
          <lpage>S42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Menger</surname>
          </string-name>
          , Floor Scheepers, Lisette Maria van Wijk, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Spruit</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text</article-title>
          .
          <source>Telematics and Informatics</source>
          <volume>35</volume>
          ,
          <issue>4</issue>
          (
          <year>2018</year>
          ),
          <fpage>727</fpage>
          -
          <lpage>736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Stephane M.</given-names>
            <surname>Meystre</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>De-identification of Unstructured Clinical Data for Patient Privacy Protection</article-title>
          .
          <source>In Medical Data Privacy Handbook, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.)</source>
          . Springer International Publishing,
          <fpage>697</fpage>
          -
          <lpage>716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Ishna</given-names>
            <surname>Neamatullah</surname>
          </string-name>
          , Margaret M. Douglass, Li-Wei H. Lehman, Andrew T. Reisner, Mauricio Villarroel, William J. Long, Peter Szolovits, George B. Moody, Roger G. Mark, and
          <string-name>
            <given-names>Gari D.</given-names>
            <surname>Clifford</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Automated de-identification of free-text medical records</article-title>
          .
          <source>BMC Med. Inf. &amp; Decision Making</source>
          <volume>8</volume>
          (
          <year>2008</year>
          ),
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Aurélie</given-names>
            <surname>Névéol</surname>
          </string-name>
          , Hercules Dalianis, Sumithra Velupillai, Guergana Savova, and
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Clinical Natural Language Processing in languages other than English: opportunities and challenges</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <volume>9</volume>
          ,
          <issue>1</issue>
          (
          <year>2018</year>
          ),
          <fpage>12:1</fpage>
          -
          <lpage>12:13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Naoaki</given-names>
            <surname>Okazaki</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>CRFsuite: a fast implementation of Conditional Random Fields (CRFs)</article-title>
          .
          Retrieved December 09, 2019 from http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . https://doi.org/10.3115/v1/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks</article-title>
          .
          <source>Computing Research Repository arXiv:1707.06799</source>
          (
          <year>2017</year>
          ). http://arxiv.org/abs/1707.06799
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Richter-Pechanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Riezler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Dieterich</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>De-Identification of German Medical Admission Notes</article-title>
          .
          <source>Studies in Health Technology and Informatics</source>
          <volume>253</volume>
          (
          <year>2018</year>
          ),
          <fpage>165</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Elyne</given-names>
            <surname>Scheurwegs</surname>
          </string-name>
          , Kim Luyckx, Filip Van der Schueren, and Tim Van den Bulcke.
          <year>2013</year>
          .
          <article-title>De-Identification of Clinical Free Text in Dutch with Limited Training Data: A Case Study</article-title>
          .
          <source>In Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP 2013</source>
          .
          <fpage>18</fpage>
          -
          <lpage>23</lpage>
          . https://www.aclweb.org/anthology/W13-5103
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Pontus</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          , Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>BRAT: A Web-based Tool for NLP-assisted Text Annotation</article-title>
          .
          <source>In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Stroudsburg, PA, USA,
          <fpage>102</fpage>
          -
          <lpage>107</lpage>
          . https://www.aclweb.org/anthology/E12-2021
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Amber</given-names>
            <surname>Stubbs</surname>
          </string-name>
          , Michele Filannino, and
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Uzuner</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          (
          <year>2017</year>
          ),
          <fpage>S4</fpage>
          -
          <lpage>S18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Amber</given-names>
            <surname>Stubbs</surname>
          </string-name>
          , Christopher Kotfila, and
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Uzuner</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          (
          <year>2015</year>
          ),
          <fpage>S11</fpage>
          -
          <lpage>S19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Amber</given-names>
            <surname>Stubbs</surname>
          </string-name>
          and
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Uzuner</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          (
          <year>2015</year>
          ),
          <fpage>S20</fpage>
          -
          <lpage>S29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Amber</given-names>
            <surname>Stubbs</surname>
          </string-name>
          , Özlem Uzuner, Christopher Kotfila, Ira Goldstein, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Szolovits</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Challenges in Synthesizing Surrogate PHI in Narrative EMRs</article-title>
          .
          <source>In Medical Data Privacy Handbook, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.)</source>
          . Springer International Publishing,
          <fpage>717</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          , Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof, and
          <string-name>
            <given-names>Anneke</given-names>
            <surname>Sools</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>De-identification of Dutch Medical Text</article-title>
          .
          <source>In 2nd Healthcare Text Analytics Conference (HealTAC2019)</source>
          . Cardiff, Wales, UK.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Erik F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fien</given-names>
            <surname>De Meulder</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition</article-title>
          .
          <source>In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4 (CONLL '03)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Stroudsburg, PA, USA,
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          . https://doi.org/10.3115/1119176.1119195
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Yeh</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>More Accurate Tests for the Statistical Significance of Result Differences</article-title>
          .
          <source>In Proceedings of the 18th Conference on Computational Linguistics - Volume 2 (COLING '00)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Stroudsburg, PA, USA,
          <fpage>947</fpage>
          -
          <lpage>953</lpage>
          . https://doi.org/10.3115/992730.992783
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Reyyan</given-names>
            <surname>Yeniterzi</surname>
          </string-name>
          , John S. Aberdeen, Samuel Bayer, Benjamin Wellner, Lynette Hirschman, and
          <string-name>
            <given-names>Bradley</given-names>
            <surname>Malin</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Effects of personal identifier resynthesis on clinical text de-identification</article-title>
          .
          <source>JAMIA</source>
          <volume>17</volume>
          ,
          <issue>2</issue>
          (
          <year>2010</year>
          ),
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Hui</given-names>
            <surname>Zou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Hastie</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Regularization and variable selection via the Elastic Net</article-title>
          .
          <source>Journal of the Royal Statistical Society, Series B</source>
          <volume>67</volume>
          (
          <year>2005</year>
          ),
          <fpage>301</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>