Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records

Jan Trienes, Nedap Healthcare, Groenlo, Netherlands, jan.trienes@nedap.com
Dolf Trieschnigg, Nedap Healthcare, Groenlo, Netherlands, dolf.trieschnigg@nedap.com
Christin Seifert, University of Twente, Enschede, Netherlands, c.seifert@utwente.nl
Djoerd Hiemstra, Radboud University, Nijmegen, Netherlands, djoerd.hiemstra@ru.nl

ABSTRACT
Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text, and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.

KEYWORDS
natural language processing, machine learning, privacy protection, medical records

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
With the strong adoption of electronic health records (EHRs), large quantities of unstructured medical patient data become available. This data offers significant opportunities to advance medical research and to improve healthcare-related services. However, the privacy of patients has to be protected when performing secondary analysis of medical data. This is not only an ethical prerequisite, but also a legal requirement imposed by privacy legislation such as the US Health Insurance Portability and Accountability Act (HIPAA) [13] and the European General Data Protection Regulation (GDPR) [9]. To facilitate privacy protection, de-identification has been proposed as a process that removes or masks any kind of protected health information (PHI) of a patient such that it becomes difficult to establish a link between an individual and the data [20]. What type of information constitutes PHI is in part defined by the privacy laws of the corresponding country. For instance, the HIPAA regulation defines 18 categories of PHI including names, geographic locations, and phone numbers [14]. According to the HIPAA safe-harbor rule, data is no longer personally identifying and subject to the privacy regulation if these 18 PHI categories have been removed. As the GDPR does not provide such clear PHI definitions, we employ the HIPAA definitions throughout this paper.

As most EHRs consist of unstructured, free-form text, manual de-identification is a time-consuming and error-prone process which does not scale to the amounts of data needed for many data mining and machine learning scenarios [7, 21]. Therefore, automatic de-identification methods are desirable. Previous research proposed a wide range of methods that make use of natural language processing techniques, including rule-based matching and machine learning [20]. However, most evaluations are constrained to medical records written in the English language. The generalizability of de-identification methods across languages and domains is largely unexplored.

To test the generalizability of existing de-identification methods, we annotated a new dataset of 1260 medical records from three sectors of Dutch healthcare: elderly care, mental care and disabled care (Section 3). Figure 1 shows an example record with annotated PHI. We then compare the performance of the following three de-identification methods on this data (Section 4):

(1) A rule-based system named DEDUCE developed for Dutch psychiatric clinical notes [19]
(2) A feature-based Conditional Random Field (CRF) as described in Liu et al. [17]
(3) A deep neural network with a bidirectional long short-term memory architecture and a CRF output layer (BiLSTM-CRF) [3]

We test the transferability of each method across three domains of Dutch healthcare. Finally, the generalizability of the methods is compared across languages using two widely used English benchmark corpora (Section 5).

This paper makes three main contributions. First, our experiments show that the only openly available de-identification method for the Dutch language fails to generalize to other Dutch medical domains. This highlights the importance of a thorough evaluation of the generalizability of de-identification methods. Second, we offer a novel comparison of several state-of-the-art de-identification methods both across languages and domains. Our experiments show that a popular neural architecture generalizes best, even when limited amounts of training data are available. The neural method only considers word/character sequences, which we find to be sufficient and more robust across languages and domains compared to the structural features employed by traditional machine learning approaches. However, our experiments also reveal that the neural method may still experience substantially lower performance in new domains. A direct consequence for de-identification practitioners is that pre-trained models require additional fine-tuning to be fully applicable to new domains. Third, we share our pre-trained models and code with the research community. The creation of these resources is connected to a significant time effort and requires access to sensitive medical data. We anticipate that this resource is of direct value to text mining researchers.

This work was presented at the first Health Search and Data Mining Workshop (HSDM 2020) [8]. The implementation of the de-identification systems, pre-trained models and code for running the experiments is available at: github.com/nedap/deidentify.

Figure 1: Excerpt of a medical record in our dataset with annotated protected health information (PHI). Sensitive PHI was replaced with automatically generated surrogates. (The figure shows a Dutch transfer note together with an English translation; annotated PHI includes dates such as "26-04-2017" (DATE), a patient number (ID), the care institute "Duinendaal" (CARE INSTITUTE), the patient name "Dhr. Jan P. Jansen" (NAME), a date of birth (DATE), the doctor name "J.O. Besteman" (NAME), the address "Wite Mar 782 Kamerik" (ADDRESS), and a phone number (PHONE/FAX).)

2 RELATED WORK
Previous work on de-identification can be roughly organized into four groups: (1) creation of benchmark corpora, (2) approaches to de-identification, (3) work on languages other than English, and (4) cross-domain de-identification.

Various English benchmark corpora have been created, including nursing notes, longitudinal patient records and psychiatric intake notes [21, 29, 31]. Furthermore, Deléger et al. [4] created a heterogeneous dataset comprised of 22 different document types. Contrary to the existing datasets, which only contain records from at most two medical institutes, the data used in this paper was sampled from a total of 9 institutes that are active in the Dutch healthcare sector. The contents, structure and writing style of the documents strongly depend on the processes and individuals specific to an institute, which contributes to a heterogeneous corpus.

Most existing de-identification approaches are either rule-based or machine learning based. Rule-based methods combine various heuristics in the form of patterns, lookup lists and fuzzy string matching to identify PHI [11, 21]. The majority of machine learning approaches employ feature-based CRFs [1, 12], ensembles combining CRFs with rules [30] and, most recently, neural networks [5, 18]. A thorough overview of the different de-identification methods is given in Meystre [20]. In this study, we compare several state-of-the-art de-identification methods. With respect to rule-based approaches, we apply DEDUCE, a recently developed method for Dutch data [19]. To the best of our knowledge, this is the only openly available de-identification method tailored to Dutch data. For a feature-based machine learning method, we re-implement the token-level CRF by Liu et al. [17]. Previous work on neural de-identification used a BiLSTM-CRF architecture with character-level and ELMo embeddings [5, 15]. Similarly, we use a BiLSTM-CRF but apply recent advances in neural sequence modeling by using contextual string embeddings [3].

To the best of our knowledge, ours is the first study to offer a comparison of de-identification methods across languages. With respect to de-identification in languages other than English, only three studies consider Dutch data. Scheurwegs et al. [27] applied a Support Vector Machine and a Random Forest classifier to a dataset of 200 clinical records. Menger et al. [19] developed and released a rule-based method on 400 psychiatric nursing notes and treatment plans of a single Dutch hospital. Tjong Kim Sang et al. [33] evaluated an existing named entity tagger for the de-identification of autobiographic emails on publicly available Wikipedia texts. Furthermore, de-identification in several other languages has been studied, including German, French, Korean and Swedish [22, 26].

With respect to cross-domain de-identification, the 2016 CEGS N-GRID shared task evaluated the portability of pre-trained de-identification methods to a new set of English psychiatric records [29]. Overall, the existing systems did not perform well on the new data. Here, we provide a similar comparison by cross-testing on three domains of Dutch healthcare.
3 DATASETS
This section describes the construction of our Dutch benchmark dataset called NUT (Nedap/University of Twente). The data was sampled from 9 healthcare institutes and annotated for PHI according to a tagging scheme derived from Stubbs and Uzuner [31]. Furthermore, following common practice in the preparation of de-identification corpora, we replaced PHI instances with realistic surrogates to comply with privacy regulations. To compare the performance of the de-identification methods across languages, we use the English i2b2/UTHealth and nursing notes corpora [21, 31]. An overview of the three datasets can be found in Table 1.

Table 1: Overview of the datasets used in this study.

Dataset | NUT | i2b2 [31] | Nursing [21]
Language | Dutch | English | English
Domain(s) | elderly, mental and disabled care | clinical | clinical
Institutes | 9 (3 per domain) | 2 | 1
Documents | 1260 | 1304 | 2434
Patients | 1260 | 296 | 148
Tokens | 448,795 | 1,057,302 | 444,484
Vocabulary | 25,429 | 36,743 | 19,482
PHI categories | 16 | 32 | 10
PHI instances | 17,464 | 28,872 | 1779
Median PHI/doc. | 9 | 18 | 0

Table 2: PHI tags used to annotate our dataset (NUT). The tagging scheme was derived from the i2b2 tags.

Category | i2b2 [31] | NUT
Name | Patient, Doctor, Username | Name, Initials
Profession | Profession | Profession
Location | Room, Department | Internal Location
Location | Hospital, Organization | Hospital, Organization, Care Institute
Location | Street, City, State, ZIP, Country | Address
Age | Over 90, Under 90 | Age
Date | Date | Date
Contact | Phone, FAX, Email | Phone/FAX, Email
Contact | URL, IP | URL/IP
IDs | SSN, 8 fine-grained ID tags | SSN, ID
Other | Other | Other

3.1 Data Sampling
We sample data from a snapshot of the databases of 9 healthcare institutes with a total of 83,000 patients. Three domains of healthcare are equally represented in this snapshot: elderly care, mental care and disabled care. We consider two classes of documents to sample from: surveys and progress reports. Surveys are questionnaire-like forms which are used by the medical staff to take notes during intake interviews, to record the outcomes of medical tests, or to formalize the treatment plan of a patient. Progress reports are short documents describing the current condition of a patient receiving care, sometimes on a daily basis. The use of surveys and progress reports differs strongly across healthcare institutes and domains. In total, this snapshot consists of 630,000 surveys and 13 million progress reports.

When sampling from the snapshot described above, we aim to maximize both the variety of document types and the variety of PHI, two essential properties of a de-identification benchmark corpus [4]. First, to ensure a wide variety of document types, we select surveys in a stratified fashion according to their type label provided by the EHR system (e.g., intake interview, care plan, etc.). Second, to maximize the variety in PHI, we sample medical reports on a patient basis: for each patient, a random selection of 10 medical reports is combined into a patient file. We then select patient files uniformly at random to ensure that no patient appears multiple times within the sample. Furthermore, to control the annotation effort, we impose two subjective limits on the document length: a document has to contain at least 50 tokens, but no more than 1000 tokens, to be included in the sample. For each of the 9 healthcare institutes, we sample 140 documents (70 surveys and 70 patient files), which yields a total sample size of 1260 documents (see Table 1).

We received approval for the collection and use of our dataset from the ethics review board of our institution. Due to privacy regulations, the dataset constructed in this paper cannot be shared.
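For concreteness, the following is a minimal sketch of the per-institute document selection described above (length limits of 50–1000 tokens, 70 surveys selected stratified by type label, 70 patient files of 10 progress reports each). It is not the authors' sampling code: the record fields (doc_type, patient_id, token_count, text), the helper names, and the exact stratification strategy are assumptions for illustration, as is the decision to apply the length filter to surveys only.

import random
from collections import defaultdict

MIN_TOKENS, MAX_TOKENS = 50, 1000  # document-length limits from Section 3.1

def length_ok(doc):
    return MIN_TOKENS <= doc["token_count"] <= MAX_TOKENS

def sample_surveys(surveys, n=70, seed=0):
    """Select surveys stratified by their EHR type label (e.g., intake interview).
    Round-robin over the type strata is one possible reading of 'stratified'."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for doc in filter(length_ok, surveys):
        by_type[doc["doc_type"]].append(doc)
    strata = [rng.sample(docs, len(docs)) for docs in by_type.values()]
    sample = []
    while len(sample) < n and any(strata):
        for stratum in strata:
            if stratum and len(sample) < n:
                sample.append(stratum.pop())
    return sample

def sample_patient_files(reports, n_patients=70, reports_per_patient=10, seed=0):
    """Combine 10 random progress reports per patient into a patient file and
    sample patients uniformly at random, so no patient appears twice."""
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for doc in reports:
        by_patient[doc["patient_id"]].append(doc)
    eligible = [p for p, docs in by_patient.items() if len(docs) >= reports_per_patient]
    files = []
    for patient in rng.sample(eligible, n_patients):
        selection = rng.sample(by_patient[patient], reports_per_patient)
        files.append({"patient_id": patient,
                      "text": "\n\n".join(d["text"] for d in selection)})
    return files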
3.2 Annotation Scheme
Since the GDPR does not provide any strict rules about which types of PHI should be removed during de-identification, we base our PHI tagging scheme on the guidelines defined by the US HIPAA regulations. In particular, we closely follow the annotation guidelines and the tagging scheme used by Stubbs and Uzuner [31], which consists of 32 PHI tags among 8 classes: Name, Profession, Location, Age, Date, Contact Information, IDs and Other. The Other category is used for information that can be used to identify a patient, but which does not fall into any of the remaining categories. For example, the sentence "the patient was a guest speaker on the subject of diabetes in the Channel 2 talkshow." would be tagged as Other. It is worth mentioning that this tagging scheme does not only capture direct identifiers relating to a patient (e.g., name and date of birth), but also indirect identifiers that could be used in combination with other information to reveal the identity of a patient. Indirect identifiers include, for example, the doctor's name, information about the hospital, and a patient's profession.

We made two adjustments to the tagging scheme by Stubbs and Uzuner [31]. First, to reduce the annotation effort, we merged some of the 32 fine-grained PHI tags into a more generic set of 16 tags (see Table 2). For example, the fine-grained location tags Street, City, State, ZIP, and Country were merged into a generic Address tag. While this simplifies the annotation process, it complicates the generation of realistic surrogates: given an address string, one has to infer its format to replace the individual parts with surrogates of the same semantic type. We address this issue in Section 3.4. Second, due to the high frequency of care institutes in our dataset, we decided to introduce a separate Care Institute tag that complements the Organization tag. This allows for a straightforward surrogate generation where the names of care institutes are replaced with the name of another care institute rather than with more generic company names (e.g., Google).
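The tag merging of Table 2 amounts to a simple lookup from fine-grained i2b2 tags to the coarser NUT tags. The sketch below is reconstructed from Table 2 and is not part of the released code; the key spellings follow the table rather than the official i2b2 distribution, the eight fine-grained ID tags are not enumerated, and the NUT tags Initials and Care Institute have no single i2b2 source tag.

# Coarsening of fine-grained i2b2 PHI tags into the 16-tag NUT scheme (Table 2).
I2B2_TO_NUT = {
    "Patient": "Name", "Doctor": "Name", "Username": "Name",
    "Profession": "Profession",
    "Room": "Internal Location", "Department": "Internal Location",
    "Hospital": "Hospital", "Organization": "Organization",
    "Street": "Address", "City": "Address", "State": "Address",
    "ZIP": "Address", "Country": "Address",
    "Over 90": "Age", "Under 90": "Age",
    "Date": "Date",
    "Phone": "Phone/FAX", "FAX": "Phone/FAX", "Email": "Email",
    "URL": "URL/IP", "IP": "URL/IP",
    "SSN": "SSN",
    "Other": "Other",
    # The 8 fine-grained i2b2 ID tags (not listed in Table 2) all collapse to "ID".
}

def to_nut(i2b2_tag: str) -> str:
    """Map a fine-grained i2b2 tag to its NUT counterpart (raises KeyError for ID tags,
    which must be added explicitly once their exact spellings are known)."""
    return I2B2_TO_NUT[i2b2_tag]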
3.3 Annotation Process
Following previous work on the construction of de-identification benchmark corpora [4, 31], we employ a double-annotation strategy: two annotators read and tag the same documents. In total, 12 non-domain experts annotated the sample of 1260 medical records independently and in parallel. The documents were randomly split into 6 sets, and we randomly assigned a pair of annotators to each set. To ensure that the annotators had a common understanding of the annotation instructions, an evaluation session was held after each pair of annotators completed the first 20 documents. (The annotation instructions provided to the annotators are included in the online repository of this paper; they are in large parts based on the annotation guidelines in Stubbs and Uzuner [31].) In total, it took 77 hours to double-annotate the entire dataset of 1260 documents, or approximately 3.7 minutes per document. We measured the inter-annotator agreement (IAA) using entity-level F1 scores, which have been shown to be more suitable for quantifying IAA in sequence-tagging scenarios than measures such as the Kappa score [4]. Table 3 shows the IAA per PHI category. Overall, the agreement level is fairly high (0.84). However, we find that location names (i.e., care institutes, hospitals, organizations and internal locations) are often highly ambiguous, which is reflected by the low agreement scores of these categories (between 0.29 and 0.52).

Table 3: Distribution of PHI tags in our dataset. The inter-annotator agreement (IAA) as measured by the micro-averaged F1 score is shown per category.

PHI Tag | Count | Frac. (%) | IAA
Name | 9558 | 54.73 | 0.96
Date | 3676 | 21.05 | 0.86
Care Institute | 997 | 5.71 | 0.52
Initials | 778 | 4.45 | 0.46
Address | 748 | 4.28 | 0.75
Organization | 712 | 4.08 | 0.38
Internal Location | 242 | 1.39 | 0.29
Age | 175 | 1.00 | 0.39
Profession | 122 | 0.70 | 0.31
ID | 114 | 0.65 | 0.43
Phone/Fax | 97 | 0.56 | 0.93
Email | 95 | 0.54 | 0.94
Hospital | 92 | 0.53 | 0.42
Other | 33 | 0.19 | 0.03
URL/IP | 23 | 0.13 | 0.70
SSN | 2 | 0.01 | 0.50
Total | 17,464 | 100 | 0.84

To improve annotation efficiency, we integrated the rule-based de-identification tool DEDUCE [19] with our annotation software to pre-annotate each document. This functionality could be activated on a per-document basis by each annotator. If an annotator used this functionality, they had to review the pre-annotations, correct potential errors, and check for missed PHI instances. During the evaluation sessions, annotators mentioned that the existing tool proved helpful when annotating repetitive names, dates and email addresses. Note that this pre-annotation strategy might give DEDUCE a slight advantage. However, the low performance of DEDUCE in the formal benchmark in Section 5 does not reflect this.

After annotation, the main author of this paper reviewed 19,165 annotations and resolved any disagreements between the two annotators to form the gold standard of 17,464 PHI annotations. Table 3 shows the distribution of PHI tags after adjudication. Overall, the adjudication was done risk-averse: if only one annotator identified a piece of text as PHI, we assume that the other annotator missed this potential PHI instance. In addition to the manual adjudication, we performed two automatic checks: (1) we ensured that PHI instances occurring in multiple files received the same PHI tag, and (2) any instances that were tagged in one part of the corpora but not in the other were manually reviewed and added to the gold standard. We used the BRAT annotation tool for both annotation and adjudication [28].
3.4 Surrogate Generation
As the annotated dataset consists of personally identifying information which is protected by the GDPR, we generate artificial replacements for each of the PHI instances before using the data for the development of de-identification methods. This process is known as surrogate generation, a common practice in the preparation of de-identification corpora [32]. As surrogate generation will inevitably alter the semantics of the corpus to an extent where it affects de-identification performance, it is important that this step is done as thoroughly as possible [36]. Here, we follow the semi-automatic surrogate generation procedure that has been used to prepare the i2b2/UTHealth shared task corpora. Below, we summarize this procedure and mention the language-specific resources we used. We refer the reader to Stubbs et al. [32] for a thorough discussion of the method. After running the automatic replacement scripts, we reviewed each of the surrogates to ensure that continuity within a document is preserved and no PHI is leaked into the new dataset.

We adapt the surrogate generation method of Stubbs et al. [32] to the Dutch language as follows. A list of the 10,000 most common family names and given names is used to generate random surrogates for name PHI instances (see www.naamkunde.net, accessed 2019-12-09). We replace dates by first parsing the format (e.g., "12 nov. 2018" → "%d %b. %Y", using the rule-based date parser at github.com/nedap/dateinfer, accessed 2019-12-09), and then randomly shifting all dates within a document by the same amount of years and days into the future. For addresses, we match names of cities, streets, and countries against a dictionary of Dutch locations (see openov.nl, accessed 2019-12-09), and then pick random replacements from that dictionary. As Dutch ZIP codes follow a standard format ("1234AB"), their replacement is straightforward. Names of hospitals, care institutes, organizations and internal locations are randomly shuffled within the dataset. PHI instances of type Age are capped at 89 years. Finally, alphanumeric strings such as Phone/FAX, Email, URL/IP, SSN and IDs are replaced by substituting each alphanumeric character with another character of the same class. We manually rewrite Profession and Other tags, as an automatic replacement is not applicable.
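To make two of these automatic replacement steps concrete, the sketch below shows document-level date shifting and character-class substitution for alphanumeric identifiers. It is a minimal illustration, not the released surrogate-generation code: the offset range, the fixed date format (the paper infers formats with a date parser), and the function names are assumptions.

import random
import string
from datetime import datetime, timedelta

def shift_dates(dates, fmt="%d-%m-%Y", seed=0):
    """Shift all dates of one document into the future by the same random offset,
    preserving the intervals between them (Section 3.4). The offset range of up to
    three years is an assumption; the paper only states 'years and days'."""
    rng = random.Random(seed)
    offset = timedelta(days=rng.randint(1, 3 * 365))
    return [(datetime.strptime(d, fmt) + offset).strftime(fmt) for d in dates]

def substitute_alphanumeric(value, seed=0):
    """Replace each character of an identifier (phone, ID, ...) with a random
    character of the same class: digit -> digit, letter -> letter."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-' or '@'
    return "".join(out)

print(shift_dates(["26-04-2017", "24-04-2017"]))  # the same offset is applied to both dates
print(substitute_alphanumeric("06-7802651"))      # another phone-like digit string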
4 METHODS
This section presents the three de-identification methods and the evaluation procedure.

4.1 Rule-based Method: DEDUCE
DEDUCE is an unsupervised de-identification method specifically developed for Dutch medical records [19]. It is based on lookup tables, decision rules and fuzzy string matching, and has been validated on a corpus of 400 psychiatric nursing notes and treatment plans of a single hospital. Following the authors' recommendations, we customize the method to include a list of 1200 institutions that are common in our domain. Also, we resolve two incompatibilities between the PHI coding schemes of our dataset and the DEDUCE output. First, as DEDUCE does not distinguish between hospitals, care institutes, organizations and internal locations, we group these four PHI tags under a single Named Location tag. Second, our Name annotations do not include titles (e.g., "Dr." or "Ms."). Therefore, titles are stripped from the DEDUCE output.

4.2 Feature-based Method: Conditional Random Field
CRFs and hybrid rule-based systems provide state-of-the-art performance in recent shared tasks [29, 30]. Therefore, we implement a CRF approach to contrast with the unsupervised rule-based system. In particular, we re-implement the token-level CRF method by Liu et al. [17] and re-use a subset of their features (see Table 4); we disregard the word-representation features, as Liu et al. [17] found that they had a negative impact on performance. The linear-chain CRF is trained using L-BFGS and elastic net regularization [37]. Using a validation set, we optimize the two regularization coefficients of the L1 and L2 norms with a random search in the log10 space of [10^-4, 10^1] with 250 trials. We use the CRFsuite implementation by Okazaki [23].

Table 4: Features used by the CRF method. The features are identical to those of Liu et al. [17], but we exclude word-representation features.

Group | Description
Bag-of-words (BOW) | Token unigrams, bigrams and trigrams within a window of [-2, 2] of the current token.
Part-of-speech (POS) | Same as above, but with POS n-grams.
BOW + POS | Combinations of the previous, current and next token and their POS tags.
Sentence | Length in tokens, presence of an end-mark such as '.', '?', '!', and whether the sentence contains unmatched brackets.
Affixes | Prefix and suffix of length 1 to 5.
Orthographic | Binary indicators of word shape: is all caps, is capitalized, has capital letters inside, contains a digit, contains punctuation, consists of only ASCII characters.
Word Shapes | The abstract shape of a token. For example, "7534-Df" becomes "####-Aa".
Named-entity recognition (NER) | NER tag assigned by the spaCy tagger.
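As an illustration of how this baseline can be assembled, the sketch below builds a handful of the Table 4 feature groups (word shape, affixes, orthographic flags, neighboring tokens) and tunes the two elastic-net coefficients with a random search in the log10 space of [10^-4, 10^1]. It uses sklearn-crfsuite, a common Python wrapper around CRFsuite, rather than the authors' exact re-implementation; the feature set is deliberately abbreviated, the toy run uses 2 trials instead of the paper's 250, and model selection here is on a flat token-level F1 rather than the entity-level score used in the paper.

import random
import re
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_shape(token):
    """Abstract token shape as in Table 4, e.g. '7534-Df' -> '####-Aa'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "#", shape)

def token_features(sent, i):
    tok = sent[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "prefix3": tok[:3], "suffix3": tok[-3:],   # affixes, abbreviated to length 3
        "is_upper": tok.isupper(), "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "prev": sent[i - 1].lower() if i > 0 else "BOS",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "EOS",
    }

def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]

def random_search(X_train, y_train, X_val, y_val, trials=250, seed=0):
    """Random search over the L1/L2 coefficients in the log10 space [1e-4, 1e1]."""
    rng = random.Random(seed)
    best_model, best_f1 = None, -1.0
    for _ in range(trials):
        c1, c2 = 10 ** rng.uniform(-4, 1), 10 ** rng.uniform(-4, 1)
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=c1, c2=c2,
                                   max_iterations=100, all_possible_transitions=True)
        crf.fit(X_train, y_train)
        f1 = metrics.flat_f1_score(y_val, crf.predict(X_val), average="micro")
        if f1 > best_f1:
            best_model, best_f1 = crf, f1
    return best_model, best_f1

# Toy usage with BIO labels (real training uses the NUT sentences and 250 trials).
train_sents = [["Patient", "Jan", "Jansen", "is", "opgenomen", "."]]
train_tags = [["O", "B-Name", "I-Name", "O", "O", "O"]]
model, score = random_search(featurize(train_sents), train_tags,
                             featurize(train_sents), train_tags, trials=2)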
4.3 Neural Method: BiLSTM-CRF
To reduce the need for hand-crafted features in traditional CRF-based de-identification, recent work applies neural methods [5, 15, 18]. Here, we re-implement a BiLSTM-CRF architecture with contextual string embeddings, which has recently been shown to provide state-of-the-art results for sequence labeling tasks [3]. Hyperparameters are set to the best performing configuration in Akbik et al. [3]: we use stochastic gradient descent with no momentum and an initial learning rate of 0.1. If the training loss does not decrease for 3 consecutive epochs, the learning rate is halved. Training is stopped if the learning rate falls below 10^-4 or 150 epochs are reached. Furthermore, the number of hidden layers in the LSTM is set to 1 with 256 recurrent units. We employ locked dropout with a value of 0.5 and use a mini-batch size of 32. With respect to the embedding layer, we use the pre-trained GloVe (English) and fastText (Dutch) embeddings on the word level, and concatenate them with the pre-trained contextualized string embeddings included in Flair (github.com/zalandoresearch/flair, accessed 2019-12-09) [2, 10, 24].
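The following sketch shows how a BiLSTM-CRF with stacked word and contextual string embeddings can be configured in Flair with the hyperparameters listed above (SGD, learning rate 0.1, annealing factor 0.5, patience 3, up to 150 epochs, hidden size 256, locked dropout 0.5, mini-batch size 32). It assumes the Flair API of around version 0.4 and a CoNLL-style column corpus, and is a sketch of the configuration rather than the authors' released training script; the file paths are placeholders.

from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-tagged sentences in two-column format (token <tab> tag), one file per split.
corpus = ColumnCorpus("data/nut_bio", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")

# Dutch fastText word embeddings stacked with forward/backward contextual string embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings("nl"),
    FlairEmbeddings("dutch-forward"),
    FlairEmbeddings("dutch-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,                   # 256 recurrent units
    rnn_layers=1,                      # a single BiLSTM layer
    embeddings=embeddings,
    tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
    tag_type="ner",
    use_crf=True,                      # CRF output layer
    locked_dropout=0.5,
)

trainer = ModelTrainer(tagger, corpus)  # plain SGD (no momentum) is Flair's default optimizer
trainer.train(
    "models/bilstm-crf-nut",
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=150,
    anneal_factor=0.5,                 # halve the learning rate ...
    patience=3,                        # ... after 3 epochs without improvement
    min_learning_rate=1e-4,            # stop once the learning rate falls below 10^-4
)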
4.4 Preprocessing and Sequence Tagging
We use a common preprocessing routine for all three datasets. For tokenization and sentence segmentation, the spaCy tokenizer is used (spacy.io, accessed 2019-12-09). The POS/NER features of the CRF method are generated by the built-in spaCy models. After sentence segmentation, we tag each token according to the Beginning, Inside, Outside (BIO) scheme. On rare occasions, sequence labeling methods may produce invalid transitions (e.g., O → I-). In a post-processing step, we replace invalid I- tags with B- tags [25].
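This post-processing step can be implemented as a single pass over a predicted tag sequence: any I- tag that does not continue an entity of the same type is promoted to a B- tag. The function below is a minimal sketch of that repair, not the exact code from the released pipeline.

def repair_bio(tags):
    """Promote invalid I- tags to B- tags so every entity starts with B- (Section 4.4)."""
    repaired = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            prev = repaired[i - 1] if i > 0 else "O"
            # An I- tag is only valid if the previous tag continues the same entity type.
            if prev == "O" or prev[2:] != tag[2:]:
                tag = "B-" + tag[2:]
        repaired.append(tag)
    return repaired

# Example: the leading I-Name does not continue an entity and becomes B-Name.
print(repair_bio(["O", "I-Name", "I-Name", "O", "I-Date"]))
# ['O', 'B-Name', 'I-Name', 'O', 'B-Date']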
4.5 Evaluation
The de-identification methods are assessed according to precision, recall and F1 computed on the entity level, the standard evaluation approach for NER systems [34]. In an entity-level evaluation, predicted PHI offsets and types have to match exactly. Following the evaluation of de-identification shared tasks, we use the micro-averaged entity-level F1 score as the primary metric [30]. (De-identification systems are often also evaluated on the less strict token level. As a system that scores high on the entity level will also score high on the token level, we only measure according to the stricter level of evaluation.)

We randomly split our dataset and the nursing notes corpus into training, validation and testing sets with a 60/20/20 ratio. As the i2b2 corpus has a pre-defined test set of 40%, a random set of 20% of the training documents serves as validation data. Finally, we test for statistical significance using two-sided approximate randomization with N = 9999 [35].
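For clarity, the sketch below spells out the two evaluation ingredients: micro-averaged entity-level precision/recall/F1 over exact (document, span, type) matches, and a two-sided approximate randomization test on the F1 difference between two systems. It is a compact illustration under our reading of the metrics, not the evaluation scripts released with the paper; in particular, shuffling predictions at the document level is an assumption.

import random

def micro_prf(gold, pred):
    """Entity-level micro P/R/F1. Entities are (doc_id, start, end, phi_type) tuples;
    offsets and type must match exactly to count as a true positive."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def approximate_randomization(gold_by_doc, pred_a, pred_b, n=9999, seed=0):
    """Two-sided approximate randomization test on the micro-F1 difference of systems
    A and B: per trial, swap the two systems' predictions per document with prob. 0.5."""
    rng = random.Random(seed)
    docs = list(gold_by_doc)
    gold = [e for d in docs for e in gold_by_doc[d]]
    observed = abs(micro_prf(gold, sum((pred_a[d] for d in docs), []))[2]
                   - micro_prf(gold, sum((pred_b[d] for d in docs), []))[2])
    extreme = 0
    for _ in range(n):
        a, b = [], []
        for d in docs:
            if rng.random() < 0.5:
                a += pred_b[d]; b += pred_a[d]  # swap this document's predictions
            else:
                a += pred_a[d]; b += pred_b[d]
        if abs(micro_prf(gold, a)[2] - micro_prf(gold, b)[2]) >= observed:
            extreme += 1
    return (extreme + 1) / (n + 1)  # p-value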
5 RESULTS
In this section, we first discuss the de-identification results obtained on our Dutch dataset (Section 5.1). Afterwards, we present an error analysis of the best performing method (Section 5.2). The section concludes with the benchmark on the English datasets (Section 5.3) and the cross-domain de-identification experiment (Section 5.4).

Table 5: Evaluation summary: micro-averaged scores are shown for each dataset and method. Statistically significant improvements over the score on the previous line are marked with ▲ (p < 0.01); ◦ indicates no significance. The rule-based method DEDUCE is not applicable to the English datasets.

Method | NUT (Dutch) Prec. / Rec. / F1 | i2b2 (English) Prec. / Rec. / F1 | Nursing Notes (English) Prec. / Rec. / F1
DEDUCE | 0.807 / 0.564 / 0.664 | - / - / - | - / - / -
CRF | 0.919▲ / 0.775▲ / 0.841▲ | 0.952 / 0.796 / 0.867 | 0.914 / 0.685 / 0.783
BiLSTM-CRF | 0.917◦ / 0.871▲ / 0.893▲ | 0.959▲ / 0.869▲ / 0.912▲ | 0.886◦ / 0.797▲ / 0.839▲

5.1 De-identification of Dutch Dataset
Both machine learning methods outperform the rule-based system DEDUCE by a large margin (see Table 5). Furthermore, the BiLSTM-CRF provides a substantial improvement of 10 percentage points in recall over the traditional CRF method, while maintaining precision. Overall, the neural method has an entity-level recall of 87.1% while achieving a recall of 95.6% for names, showing that the neural method is operational for many de-identification scenarios. In addition, we make the following observations.

Neural method performs at least as well as rule-based method. Inspecting the model performance on a PHI-tag level, we observe that the neural method outperforms DEDUCE for all classes of PHI (see Table 6). Only for the Phone and Email categories does the rule-based method have a slightly higher precision. Similarly, we studied the impact of the training set size on the de-identification performance. Both machine learning methods outperform DEDUCE even with as little training data as 10% of the total sentences (see Figure 2). This suggests that in most environments where training data are available (or can be obtained), the machine learning methods are to be preferred.

Figure 2: Entity-level F1-score for varying training set sizes. The full training set (100%) consists of all training and validation sentences in NUT (34,714). The F1-score is measured on the test set. For each subset size, we draw 3 random samples and train/test each model 3 times. The lines show the averaged scores along with the 95% confidence interval. The rule-based tagger DEDUCE is shown as a baseline. (Plot axes: training set size from 10% to 100% on the x-axis; F1-score from 0.65 to 0.90 on the y-axis; one line each for BiLSTM-CRF, CRF and DEDUCE.)

Table 6: Entity-level precision and recall per PHI category on the NUT dataset. Scores are compared between the rule-based tagger DEDUCE [19] and the BiLSTM-CRF model. The Named Loc. tag is the union of the 4 specific location tags, which are not supported by DEDUCE. Tags are ordered by frequency, with location tags fixed at the bottom.

PHI Tag | BiLSTM-CRF Prec. / Rec. | DEDUCE Prec. / Rec.
Name | 0.965 / 0.956 | 0.849 / 0.805
Date | 0.926 / 0.920 | 0.857 / 0.441
Initials | 0.828 / 0.624 | 0.000 / 0.000
Address | 0.835 / 0.846 | 0.804 / 0.526
Age | 0.789 / 0.732 | 0.088 / 0.122
Profession | 0.917 / 0.262 | 0.000 / 0.000
ID | 0.800 / 0.480 | 0.000 / 0.000
Phone/Fax | 0.889 / 1.000 | 0.929 / 0.812
Email | 0.909 / 1.000 | 1.000 / 0.900
Other | 0.000 / 0.000 | 0.000 / 0.000
URL/IP | 1.000 / 0.750 | 0.750 / 0.750
Named Loc. | 0.797 / 0.659 | 0.279 / 0.058
Care Institute | 0.686 / 0.657 | n/a
Organization | 0.780 / 0.522 | n/a
Internal Loc. | 0.737 / 0.509 | n/a
Hospital | 0.778 / 0.700 | n/a

Rule-based method can provide a "safety net." DEDUCE performs reasonably well for names, phone numbers, email addresses and URLs (see Table 6). As these PHI instances are likely to directly reveal the identity of an individual, their removal is essential. However, DEDUCE does not generalize beyond the PHI types mentioned above. Especially named locations are non-trivial to capture with a rule-based system, as their identification strongly relies on the availability of exhaustive lookup lists. In contrast, the neural method provides a significant improvement for named locations (5.8% vs. 65.9% recall). We assume that word-level and character-level embeddings provide an effective tool to capture these entities.

Initials, IDs and professions are hard to detect. During annotation, we observed a low F1 annotator agreement of 0.46, 0.43, and 0.31 for initials, IDs and professions, respectively. This shows that these PHI types are among the hardest to identify, even for humans (see Table 3). One possible cause is that IDs and initials are often hard to discriminate from abbreviations and medical measurements. We observe that the BiLSTM-CRF detects those PHI classes with high precision but low recall. With respect to professions, we find that phrases are often wrongly tagged. For example, colloquial job descriptions (e.g., "works behind the cash desk") as opposed to job titles (e.g., "cashier") make it infeasible to tackle this problem with lookup lists, while a machine learner likely requires more training data to capture this PHI.

5.2 Error Analysis on Dutch Dataset
To gain a better understanding of the best performing model and an intuition for its limitations, we conduct a manual error analysis of the false positives (FPs) and false negatives (FNs) produced by the BiLSTM-CRF on the test set. We discuss the error categorization scheme in Section 5.2.1 and present the results in Section 5.2.2.

Table 7: Summary of the manual error analysis of false negatives (FNs) and false positives (FPs) produced by the BiLSTM-CRF. All error categories are mutually exclusive.

Category | FNs (n = 469) Count / Part | FPs (n = 288) Count / Part
Model Errors
Abbreviation | 65 / 13.9% | 28 / 9.7%
Ambiguity | 15 / 3.2% | 7 / 2.4%
Debatable | 7 / 1.5% | 4 / 1.4%
Prefix | 10 / 2.1% | 10 / 3.5%
Common language | 35 / 7.5% | 9 / 3.1%
Other reason | 275 / 58.6% | 159 / 55.2%
Annotation/Preprocessing Errors
Missing Annotation | - / - | 33 / 11.5%
Annotation Error | 21 / 4.5% | 18 / 6.3%
Tokenization Error | 41 / 8.7% | 20 / 6.9%
Total | 469 / 100% | 288 / 100%

5.2.1 Error Categorization. We distinguish between two error groups: (1) modeling errors, and (2) annotation/preprocessing errors. We define modeling errors to be problems that can be addressed with different de-identification techniques and additional training data. In contrast, annotation and preprocessing errors are not directly caused by the sequence labeling model, but are issues in the training data or the preprocessing pipeline which need to be addressed manually. Inspired by the classification scheme of Dernoncourt et al. [5], we consider the following sources of modeling errors:

• Abbreviation. PHI instances which are abbreviations or acronyms of names, care institutes and companies. These are hard to detect and can be ambiguous, as they are easily confused with medical terms and measurements.
• Ambiguity. A human reader may be unable to decide whether a given text fragment is PHI.
• Debatable. It can be argued that the token should not have been annotated as PHI.
• Prefix. Names of internal locations, organizations and companies are often prefixed with articles (i.e., "de" and "het"). Sometimes, it is unclear whether the prefix is part of the official name or part of the sentence construction. This ambiguity is reflected in the training data, which causes the model to inconsistently include or exclude those prefixes.
• Common Language. PHI instances consisting of common language are hard to discriminate from the surrounding text.
• Other. Remaining modeling errors that do not fall into the categories mentioned above. In those cases, it is not immediately apparent why the misclassification occurs.

Preprocessing errors are categorized as follows:

• Missing Annotation. The text fragment is PHI, but was missed during the annotation phase.
• Annotation Error. The annotator assigned an invalid entity boundary.
• Tokenization Error. The annotated text span could not be split into a compatible token span. Those tokens were marked as "Outside (O)" during BIO tagging.

We consider all error categories to be mutually exclusive.

5.2.2 Results of Error Analysis. Table 7 summarizes the error analysis results and shows the absolute and relative frequency of each error category. Overall, we find that the majority of modeling errors cannot be easily explained through human inspection ("Other reason" in Table 7). The remaining errors are mainly caused by ambiguous PHI instances and preprocessing errors. In more detail, we make the following observations:

Abbreviations are the second most common cause of modeling errors (13.9% of FNs, 9.7% of FPs). We hypothesize that more training data will likely not in itself help to correctly identify this type of PHI. It is conceivable to design custom features (e.g., based on shape, position in a sentence, or presence/absence in a medical dictionary) to increase precision. However, it is an open question how recall can be improved.

PHI instances consisting of common language are likely to be wrongly tagged (7.5% of FNs, 3.1% of FPs). This is caused by the fact that there are insufficient training examples where common language is used to refer to PHI. For example, the organization name in the sentence "Vandaag heb ik Beter Horen gebeld" (Eng.: "I called Beter Horen today") was incorrectly classified as non-PHI. Each individual word, and also the combination of the two words, can be used in different contexts without referring to PHI. However, in this specific context, it is apparent that "Beter Horen" must refer to an organization.

A substantial amount of errors is due to annotation and preprocessing issues. Annotation errors (4.5% of FNs, 6.3% of FPs) can be resolved by correcting the respective PHI offsets in the gold standard. Tokenization errors (8.7% of FNs, 6.9% of FPs) need to be fixed through a different preprocessing routine. For example, the annotated text "2016/2017", in which the two years are annotated as separate PHI instances, should have been split into the tokens [2016, /, 2017] with BIO tagging [B, O, B]. However, the spaCy tokenizer segmented this text into a single token [2016/2017]. In this case, entity boundaries no longer align with token boundaries, which results in an invalid BIO tagging of [O] for the entire span.

Several false positives are in fact PHI and should be annotated. The model identifies several PHI instances which were missed during the annotation phase (11.5% of the FPs). Once more, this demonstrates that proper de-identification is an error-prone task for human annotators.
5.3 De-identification of English Datasets
When training and testing both machine learning methods on the English i2b2 and nursing notes datasets, we observe that the BiLSTM-CRF significantly outperforms the CRF in both cases (see Table 5). Similar to our Dutch dataset, the neural method provides an increase of up to 11.2 percentage points in recall (nursing notes) while the precision remains relatively stable. This shows that the neural method has the best generalization capabilities even across languages. More importantly, it does not require the development of domain-specific lookup lists or sophisticated pattern matching rules. To put the results into perspective: the second-highest ranked team in the i2b2 2014 challenge used a sophisticated ensemble combining a CRF with domain-specific rules [30]. Their system obtained an entity-level F1 score of 0.9124, which is on par with the performance of our neural method that requires no configuration. We can expect that the performance of the neural method further improves after hyperparameter optimization. Finally, note that both machine learning methods can be easily applied to a new PHI tagging scheme, whereas rule-based methods are limited to the PHI definition they were developed for.

5.4 Cross-domain De-identification
In many de-identification scenarios, heterogeneous training data from multiple medical institutes and domains are rarely available. This raises the question how well a model that has been trained on a homogeneous set of medical records generalizes to records of other medical domains. We trained the three de-identification methods on one domain of Dutch healthcare (e.g., elderly care) and tested each model on the records of the remaining two domains (e.g., disabled care and mental care). We followed the same training and evaluation procedures described in Section 4.5. Table 8 summarizes the performance of each method on the different tasks.

Table 8: Summary of the transfer learning experiment on our Dutch dataset. Each method is trained on data of one care domain and tested on the other two domains. All scores are micro-averaged entity-level F1.

Method | Training domain: Elderly | Disabled | Mental
DEDUCE | 0.683 | 0.565 | 0.675
CRF | 0.414 | 0.697 | 0.719
BiLSTM-CRF | 0.775 | 0.775 | 0.839

Table 9: Detailed performance analysis of the BiLSTM-CRF method in the transfer learning experiment. In-domain test scores are shown on the diagonal. All scores are micro-averaged entity-level F1.

Test domain | Training domain: Elderly | Disabled | Mental
Elderly | 0.746 | 0.698 | 0.703
Disabled | 0.796 | 0.919 | 0.879
Mental | 0.744 | 0.806 | 0.871

Again, the neural method consistently outperforms the rule-based and feature-based methods in all three domains, which suggests that it is a fair default choice for de-identification. This is underlined by the fact that the amount of training data is severely limited in this experiment: each domain only has 420 documents, of which 20% of the records are reserved for testing. Interestingly, DEDUCE performs rather stably and even outperforms the CRF within the domain of elderly care.

Given an ideal de-identification method, one would expect that performance on unseen data of a different domain is similar to the test score obtained on the available (homogeneous) data. Table 9 shows a performance breakdown for each of the three test domains for the neural method. It can be seen that in 4 out of 6 cases, the test score in a new domain is lower than the test score obtained on the in-domain data. The largest delta between the observed in-domain test score (disabled care, 0.919 F1) and the performance in the transfer domain (elderly care, 0.698 F1) is 0.221 in F1. This raises an important point when performing de-identification in practice: while the neural method shows the best generalization capabilities compared to the other de-identification methods, the performance can still be significantly lower when applying a pre-trained model in new domains.

5.5 Limitations
While the contextual string embeddings used in this paper have been shown to provide state-of-the-art results for NER [3], transformer-based architectures for contextual embeddings have also gained significant attention (e.g., BERT [6]). It would make an interesting experiment to benchmark different types of pre-trained embeddings for the task of de-identification. Furthermore, we observe that the neural method provides strong performance even with limited training data (see Figure 2). It is unclear what contribution large pre-trained embeddings have in those scenarios, which warrants an ablation study testing different model configurations. We leave the exploration of those ideas to future research.

6 CONCLUSION
This paper presents the construction of a novel Dutch dataset and a comparison of state-of-the-art de-identification methods across Dutch and English medical records. Our experiments show the following. (1) An existing rule-based method for the Dutch language does not generalize well to new domains. (2) If one is looking for an out-of-the-box de-identification method, neural approaches show the best generalization performance across languages and domains. (3) When testing across different domains, a substantial decrease in performance has to be expected, an important consideration when applying de-identification in practice.

There are several directions for future work. Motivated by the limited generalizability of pre-trained models across different domains, transfer learning techniques may provide a way forward. A preliminary study by Lee et al. [16] shows that they can be beneficial for de-identification. Finally, our experiments show that phrases such as professions are among the most difficult information to de-identify. It is an open challenge how to design methods that can capture this type of information.
REFERENCES
[1] John S. Aberdeen, Samuel Bayer, Reyyan Yeniterzi, Benjamin Wellner, Cheryl Clark, David A. Hanauer, Bradley Malin, and Lynette Hirschman. 2010. The MITRE Identification Scrubber Toolkit: Design, training, and assessment. I. J. Medical Informatics 79, 12 (2010), 849–859.
[2] Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled Contextualized Embeddings for Named Entity Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 724–728. https://doi.org/10.18653/v1/N19-1078
[3] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 1638–1649. https://www.aclweb.org/anthology/C18-1139
[4] Louise Deléger, Qi Li, Todd Lingren, Megan Kaiser, Katalin Molnár, Laura Stoutenborough, Michal Kouril, Keith Marsolo, and Imre Solti. 2012. Building Gold Standard Corpora for Medical Natural Language Processing Tasks. In AMIA 2012, American Medical Informatics Association Annual Symposium, Chicago, Illinois, USA, November 3-7, 2012.
[5] Franck Dernoncourt, Ji Young Lee, Özlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. JAMIA 24, 3 (2017), 596–606.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Computing Research Repository arXiv:1810.04805 (2018). http://arxiv.org/abs/1810.04805
[7] Margaret Douglass, Gari D. Clifford, Andrew Reisner, George B. Moody, and Roger G. Mark. 2004. Computer-assisted de-identification of free text in the MIMIC II database. In Computers in Cardiology, 2004. IEEE, 341–344.
[8] Carsten Eickhoff, Yubin Kim, and Ryen White. 2020. Overview of the Health Search and Data Mining (HSDM 2020) Workshop. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20). ACM, New York, NY, USA. https://doi.org/10.1145/3336191.3371879
[9] GDPR. 2016. Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (Data Protection Directive). Official Journal of the European Union L119 (2016), 1–88.
[10] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018). https://www.aclweb.org/anthology/L18-1550
[11] Dilip Gupta, Melissa Saul, and John Gilbertson. 2004. Evaluation of a Deidentification (De-Id) Software Engine to Share Pathology Reports and Clinical Documents for Research. American Journal of Clinical Pathology 121, 2 (2004), 176–186.
[12] Bin He, Yi Guan, Jianyi Cheng, Keting Cen, and Wenlan Hua. 2015. CRFs Based De-identification of Medical Records. Journal of Biomedical Informatics 58, S (2015), S39–S46.
[13] HIPAA. 1996. Health Insurance Portability and Accountability Act. Public Law 104-191 (1996).
[14] HIPAA. 2012. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Retrieved December 09, 2019 from https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
[15] Kaung Khin, Philipp Burckhardt, and Rema Padman. 2018. A Deep Learning Architecture for De-identification of Patient Notes: Implementation and Evaluation. Computing Research Repository arXiv:1810.01570 (2018). http://arxiv.org/abs/1810.01570
[16] Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2018. Transfer Learning for Named-Entity Recognition with Neural Networks. In Proceedings of the 11th Language Resources and Evaluation Conference. Miyazaki, Japan, 4470–4473. https://www.aclweb.org/anthology/L18-1708
[17] Zengjian Liu, Yangxin Chen, Buzhou Tang, Xiaolong Wang, Qingcai Chen, Haodi Li, Jingfeng Wang, Qiwen Deng, and Suisong Zhu. 2015. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. Journal of Biomedical Informatics 58 (2015), S47–S52.
[18] Zengjian Liu, Buzhou Tang, Xiaolong Wang, and Qingcai Chen. 2017. De-identification of Clinical Notes via Recurrent Neural Network and Conditional Random Field. Journal of Biomedical Informatics 75, S (2017), S34–S42.
[19] Vincent Menger, Floor Scheepers, Lisette Maria van Wijk, and Marco Spruit. 2018. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telematics and Informatics 35, 4 (2018), 727–736.
[20] Stephane M. Meystre. 2015. De-identification of Unstructured Clinical Data for Patient Privacy Protection. In Medical Data Privacy Handbook, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.). Springer International Publishing, 697–716.
[21] Ishna Neamatullah, Margaret M. Douglass, Li-Wei H. Lehman, Andrew T. Reisner, Mauricio Villarroel, William J. Long, Peter Szolovits, George B. Moody, Roger G. Mark, and Gari D. Clifford. 2008. Automated de-identification of free-text medical records. BMC Med. Inf. & Decision Making 8 (2008), 32.
[22] Aurélie Névéol, Hercules Dalianis, Sumithra Velupillai, Guergana Savova, and Pierre Zweigenbaum. 2018. Clinical Natural Language Processing in languages other than English: opportunities and challenges. Journal of Biomedical Semantics 9, 1 (2018), 12:1–12:13.
[23] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). Retrieved December 09, 2019 from http://www.chokkan.org/software/crfsuite/
[24] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. https://doi.org/10.3115/v1/D14-1162
[25] Nils Reimers and Iryna Gurevych. 2017. Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks. Computing Research Repository arXiv:1707.06799 (2017). http://arxiv.org/abs/1707.06799
[26] Phillip Richter-Pechanski, Stefan Riezler, and Christoph Dieterich. 2018. De-Identification of German Medical Admission Notes. Studies in Health Technology and Informatics 253 (2018), 165–169.
[27] Elyne Scheurwegs, Kim Luyckx, Filip Van der Schueren, and Tim Van den Bulcke. 2013. De-Identification of Clinical Free Text in Dutch with Limited Training Data: A Case Study. In Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP 2013. 18–23. https://www.aclweb.org/anthology/W13-5103
[28] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A Web-based Tool for NLP-assisted Text Annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12). Association for Computational Linguistics, Stroudsburg, PA, USA, 102–107. https://www.aclweb.org/anthology/E12-2021
[29] Amber Stubbs, Michele Filannino, and Özlem Uzuner. 2017. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1. Journal of Biomedical Informatics 75 (2017), S4–S18.
[30] Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics 58 (2015), S11–S19.
[31] Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics 58 (2015), S20–S29.
[32] Amber Stubbs, Özlem Uzuner, Christopher Kotfila, Ira Goldstein, and Peter Szolovits. 2015. Challenges in Synthesizing Surrogate PHI in Narrative EMRs. In Medical Data Privacy Handbook, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.). Springer International Publishing, 717–735.
[33] Erik Tjong Kim Sang, Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof, and Anneke Sools. 2019. De-identification of Dutch Medical Text. In 2nd Healthcare Text Analytics Conference (HealTAC 2019). Cardiff, Wales, UK.
[34] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4 (CONLL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 142–147. https://doi.org/10.3115/1119176.1119195
[35] Alexander Yeh. 2000. More Accurate Tests for the Statistical Significance of Result Differences. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2 (COLING '00). Association for Computational Linguistics, Stroudsburg, PA, USA, 947–953. https://doi.org/10.3115/992730.992783
[36] Reyyan Yeniterzi, John S. Aberdeen, Samuel Bayer, Benjamin Wellner, Lynette Hirschman, and Bradley Malin. 2010. Effects of personal identifier resynthesis on clinical text de-identification. JAMIA 17, 2 (2010), 159–168.
[37] Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B 67 (2005), 301–320.