Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records

Jan Trienes, Nedap Healthcare, Groenlo, Netherlands, jan.trienes@nedap.com
Dolf Trieschnigg, Nedap Healthcare, Groenlo, Netherlands, dolf.trieschnigg@nedap.com
Christin Seifert, University of Twente, Enschede, Netherlands, c.seifert@utwente.nl
Djoerd Hiemstra, Radboud University, Nijmegen, Netherlands, djoerd.hiemstra@ru.nl

ABSTRACT
Unstructured information in electronic health records provides an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text, and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods, the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.

KEYWORDS
natural language processing, machine learning, privacy protection, medical records

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
With the strong adoption of electronic health records (EHRs), large quantities of unstructured medical patient data become available. This data offers significant opportunities to advance medical research and to improve healthcare-related services. However, the privacy of patients has to be protected when performing secondary analysis of medical data. This is not only an ethical prerequisite, but also a legal requirement imposed by privacy legislation such as the US Health Insurance Portability and Accountability Act (HIPAA) [13] and the European General Data Protection Regulation (GDPR) [9]. To facilitate privacy protection, de-identification has been proposed as a process that removes or masks any kind of protected health information (PHI) of a patient such that it becomes difficult to establish a link between an individual and the data [20]. What type of information constitutes PHI is in part defined by the privacy laws of the corresponding country. For instance, the HIPAA regulation defines 18 categories of PHI including names, geographic locations, and phone numbers [14]. According to the HIPAA safe-harbor rule, data is no longer personally identifying and subject to the privacy regulation if these 18 PHI categories have been removed. As the GDPR does not provide such clear PHI definitions, we employ the HIPAA definitions throughout this paper.

As most EHRs consist of unstructured, free-form text, manual de-identification is a time-consuming and error-prone process which does not scale to the amounts of data needed for many data mining and machine learning scenarios [7, 21]. Therefore, automatic de-identification methods are desirable. Previous research proposed a wide range of methods that make use of natural language processing techniques, including rule-based matching and machine learning [20]. However, most evaluations are constrained to medical records written in the English language. The generalizability of de-identification methods across languages and domains is largely unexplored.

To test the generalizability of existing de-identification methods, we annotated a new dataset of 1260 medical records from three sectors of Dutch healthcare: elderly care, mental care and disabled care (Section 3). Figure 1 shows an example record with annotated PHI. We then compare the performance of the following three de-identification methods on this data (Section 4):

(1) A rule-based system named DEDUCE developed for Dutch psychiatric clinical notes [19]
(2) A feature-based Conditional Random Field (CRF) as described in Liu et al. [17]
(3) A deep neural network with a bidirectional long short-term memory architecture and a CRF output layer (BiLSTM-CRF) [3]

We test the transferability of each method across three domains of Dutch healthcare. Finally, the generalizability of the methods is compared across languages using two widely used English benchmark corpora (Section 5).

This paper makes three main contributions. First, our experiments show that the only openly available de-identification method for the Dutch language fails to generalize to other Dutch medical domains. This highlights the importance of a thorough evaluation of the generalizability of de-identification methods. Second, we offer a novel comparison of several state-of-the-art de-identification methods both across languages and domains. Our experiments show that a popular neural architecture generalizes best, even when limited amounts of training data are available. The neural method only considers word/character sequences, which we find to be sufficient and more robust across languages and domains compared to the structural features employed by traditional machine learning approaches. However, our experiments also reveal that the neural method may still experience substantially lower performance in new domains. A direct consequence for de-identification practitioners is that pre-trained models require additional fine-tuning to be fully applicable to new domains. Third, we share our pre-trained models and code with the research community. The creation of these resources is connected to a significant time effort and requires access to sensitive medical data. We anticipate that this resource is of direct value to text mining researchers.

This work was presented at the first Health Search and Data Mining Workshop (HSDM 2020) [8]. The implementation of the de-identification systems, pre-trained models and code for running the experiments is available at: github.com/nedap/deidentify.

Figure 1: Excerpt of a medical record in our dataset with annotated protected health information (PHI). Sensitive PHI was replaced with automatically generated surrogates. (The figure shows a Dutch transfer note together with an English translation; annotated PHI includes dates such as "26-04-2017" (DATE), a patient number (ID), the care institute "Duinendaal" (CARE INSTITUTE), the patient name "Dhr. Jan P. Jansen" (NAME), a date of birth (DATE), the doctor name "J.O. Besteman" (NAME), the address "Wite Mar 782 Kamerik" (ADDRESS), and a phone number (PHONE/FAX).)

2 RELATED WORK
Previous work on de-identification can be roughly organized into four groups: (1) creation of benchmark corpora, (2) approaches to de-identification, (3) work on languages other than English, and (4) cross-domain de-identification.

Various English benchmark corpora have been created, including nursing notes, longitudinal patient records and psychiatric intake notes [21, 29, 31]. Furthermore, Deléger et al. [4] created a heterogeneous dataset comprised of 22 different document types. Contrary to the existing datasets, which only contain records from at most two medical institutes, the data used in this paper was sampled from a total of 9 institutes that are active in the Dutch healthcare sector. The contents, structure and writing style of the documents strongly depend on the processes and individuals specific to an institute, which contributes to a heterogeneous corpus.

Most existing de-identification approaches are either rule-based or machine learning based. Rule-based methods combine various heuristics in the form of patterns, lookup lists and fuzzy string matching to identify PHI [11, 21]. The majority of machine learning approaches employ feature-based CRFs [1, 12], ensembles combining CRFs with rules [30] and, most recently, neural networks [5, 18]. A thorough overview of the different de-identification methods is given in Meystre [20]. In this study, we compare several state-of-the-art de-identification methods. With respect to rule-based approaches, we apply DEDUCE, a recently developed method for Dutch data [19]. To the best of our knowledge, this is the only openly available de-identification method tailored to Dutch data. For a feature-based machine learning method, we re-implement the token-level CRF by Liu et al. [17]. Previous work on neural de-identification used a BiLSTM-CRF architecture with character-level and ELMo embeddings [5, 15]. Similarly, we use a BiLSTM-CRF but apply recent advances in neural sequence modeling by using contextual string embeddings [3].

To the best of our knowledge, ours is the first study to offer a comparison of de-identification methods across languages. With respect to de-identification in languages other than English, only three studies consider Dutch data. Scheurwegs et al. [27] applied a Support Vector Machine and a Random Forest classifier to a dataset of 200 clinical records. Menger et al. [19] developed and released a rule-based method on 400 psychiatric nursing notes and treatment plans of a single Dutch hospital. Tjong Kim Sang et al. [33] evaluated an existing named entity tagger for the de-identification of autobiographic emails on publicly available Wikipedia texts. Furthermore, de-identification in several other languages has been studied, including German, French, Korean and Swedish [22, 26].

With respect to cross-domain de-identification, the 2016 CEGS N-GRID shared task evaluated the portability of pre-trained de-identification methods to a new set of English psychiatric records [29]. Overall, the existing systems did not perform well on the new data. Here, we provide a similar comparison by cross-testing on three domains of Dutch healthcare.
3 DATASETS
This section describes the construction of our Dutch benchmark dataset called NUT (Nedap/University of Twente). The data was sampled from 9 healthcare institutes and annotated for PHI according to a tagging scheme derived from Stubbs and Uzuner [31]. Furthermore, following common practice in the preparation of de-identification corpora, we replaced PHI instances with realistic surrogates to comply with privacy regulations. To compare the performance of the de-identification methods across languages, we use the English i2b2/UTHealth and nursing notes corpora [21, 31]. An overview of the three datasets can be found in Table 1.

Table 1: Overview of the datasets used in this study.

Dataset | NUT | i2b2 [31] | Nursing [21]
Language | Dutch | English | English
Domain(s) | elderly, mental and disabled care | clinical | clinical
Institutes | 9 (3 per domain) | 2 | 1
Documents | 1260 | 1304 | 2434
Patients | 1260 | 296 | 148
Tokens | 448,795 | 1,057,302 | 444,484
Vocabulary | 25,429 | 36,743 | 19,482
PHI categories | 16 | 32 | 10
PHI instances | 17,464 | 28,872 | 1779
Median PHI/doc. | 9 | 18 | 0

Table 2: PHI tags used to annotate our dataset (NUT). The tagging scheme was derived from the i2b2 tags.

Category | i2b2 [31] | NUT
Name | Patient, Doctor, Username | Name, Initials
Profession | Profession | Profession
Location | Room, Department | Internal Location
Location | Hospital, Organization | Hospital, Organization, Care Institute
Location | Street, City, State, ZIP, Country | Address
Age | Over 90, Under 90 | Age
Date | Date | Date
Contact | Phone, FAX, Email | Phone/FAX, Email
Contact | URL, IP | URL/IP
IDs | SSN, 8 fine-grained ID tags | SSN, ID
Other | Other | Other

3.1 Data Sampling
We sample data from a snapshot of the databases of 9 healthcare institutes with a total of 83,000 patients. Three domains of healthcare are equally represented in this snapshot: elderly care, mental care and disabled care. We consider two classes of documents to sample from: surveys and progress reports. Surveys are questionnaire-like forms which are used by the medical staff to take notes during intake interviews, to record the outcomes of medical tests, or to formalize the treatment plan of a patient. Progress reports are short documents describing the current condition of a patient receiving care, sometimes on a daily basis. The use of surveys and progress reports differs strongly across healthcare institutes and domains. In total, this snapshot consists of 630,000 surveys and 13 million progress reports.

When sampling from the snapshot described above, we aim to maximize both the variety of document types and the variety of PHI, two essential properties of a de-identification benchmark corpus [4]. First, to ensure a wide variety of document types, we select surveys in a stratified fashion according to their type label provided by the EHR system (e.g., intake interview, care plan, etc.). Second, to maximize the variety in PHI, we sample medical reports on a patient basis: for each patient, a random selection of 10 medical reports is combined into a patient file. We then select patient files uniformly at random to ensure that no patient appears multiple times within the sample. Furthermore, to control the annotation effort, we impose two subjective limits on the document length: a document has to contain at least 50 tokens, but no more than 1000 tokens, to be included in the sample. For each of the 9 healthcare institutes, we sample 140 documents (70 surveys and 70 patient files), which yields a total sample size of 1260 documents (see Table 1).

We received approval for the collection and use of our dataset from the ethics review board of our institution. Due to privacy regulations, the dataset constructed in this paper cannot be shared.
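For concreteness, the following is a minimal sketch of the per-institute document selection described above (length limits of 50–1000 tokens, 70 surveys selected stratified by type label, 70 patient files of 10 progress reports each). It is not the authors' sampling code: the record fields (doc_type, patient_id, token_count, text), the helper names, and the exact stratification strategy are assumptions for illustration, as is the decision to apply the length filter to surveys only.

import random
from collections import defaultdict

MIN_TOKENS, MAX_TOKENS = 50, 1000  # document-length limits from Section 3.1

def length_ok(doc):
    return MIN_TOKENS <= doc["token_count"] <= MAX_TOKENS

def sample_surveys(surveys, n=70, seed=0):
    """Select surveys stratified by their EHR type label (e.g., intake interview).
    Round-robin over the type strata is one possible reading of 'stratified'."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for doc in filter(length_ok, surveys):
        by_type[doc["doc_type"]].append(doc)
    strata = [rng.sample(docs, len(docs)) for docs in by_type.values()]
    sample = []
    while len(sample) < n and any(strata):
        for stratum in strata:
            if stratum and len(sample) < n:
                sample.append(stratum.pop())
    return sample

def sample_patient_files(reports, n_patients=70, reports_per_patient=10, seed=0):
    """Combine 10 random progress reports per patient into a patient file and
    sample patients uniformly at random, so no patient appears twice."""
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for doc in reports:
        by_patient[doc["patient_id"]].append(doc)
    eligible = [p for p, docs in by_patient.items() if len(docs) >= reports_per_patient]
    files = []
    for patient in rng.sample(eligible, n_patients):
        selection = rng.sample(by_patient[patient], reports_per_patient)
        files.append({"patient_id": patient,
                      "text": "\n\n".join(d["text"] for d in selection)})
    return files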
3.2 Annotation Scheme
Since the GDPR does not provide any strict rules about which types of PHI should be removed during de-identification, we base our PHI tagging scheme on the guidelines defined by the US HIPAA regulations. In particular, we closely follow the annotation guidelines and the tagging scheme used by Stubbs and Uzuner [31], which consists of 32 PHI tags among 8 classes: Name, Profession, Location, Age, Date, Contact Information, IDs and Other. The Other category is used for information that can be used to identify a patient, but which does not fall into any of the remaining categories. For example, the sentence "the patient was a guest speaker on the subject of diabetes in the Channel 2 talkshow." would be tagged as Other. It is worth mentioning that this tagging scheme does not only capture direct identifiers relating to a patient (e.g., name and date of birth), but also indirect identifiers that could be used in combination with other information to reveal the identity of a patient. Indirect identifiers include, for example, the doctor's name, information about the hospital, and a patient's profession.

We made two adjustments to the tagging scheme by Stubbs and Uzuner [31]. First, to reduce the annotation effort, we merged some of the 32 fine-grained PHI tags into a more generic set of 16 tags (see Table 2). For example, the fine-grained location tags Street, City, State, ZIP, and Country were merged into a generic Address tag. While this simplifies the annotation process, it complicates the generation of realistic surrogates: given an address string, one has to infer its format to replace the individual parts with surrogates of the same semantic type. We address this issue in Section 3.4. Second, due to the high frequency of care institutes in our dataset, we decided to introduce a separate Care Institute tag that complements the Organization tag. This allows for a straightforward surrogate generation where the names of care institutes are replaced with the name of another care institute rather than with more generic company names (e.g., Google).
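The tag merging of Table 2 amounts to a simple lookup from fine-grained i2b2 tags to the coarser NUT tags. The sketch below is reconstructed from Table 2 and is not part of the released code; the key spellings follow the table rather than the official i2b2 distribution, the eight fine-grained ID tags are not enumerated, and the NUT tags Initials and Care Institute have no single i2b2 source tag.

# Coarsening of fine-grained i2b2 PHI tags into the 16-tag NUT scheme (Table 2).
I2B2_TO_NUT = {
    "Patient": "Name", "Doctor": "Name", "Username": "Name",
    "Profession": "Profession",
    "Room": "Internal Location", "Department": "Internal Location",
    "Hospital": "Hospital", "Organization": "Organization",
    "Street": "Address", "City": "Address", "State": "Address",
    "ZIP": "Address", "Country": "Address",
    "Over 90": "Age", "Under 90": "Age",
    "Date": "Date",
    "Phone": "Phone/FAX", "FAX": "Phone/FAX", "Email": "Email",
    "URL": "URL/IP", "IP": "URL/IP",
    "SSN": "SSN",
    "Other": "Other",
    # The 8 fine-grained i2b2 ID tags (not listed in Table 2) all collapse to "ID".
}

def to_nut(i2b2_tag: str) -> str:
    """Map a fine-grained i2b2 tag to its NUT counterpart (raises KeyError for ID tags,
    which must be added explicitly once their exact spellings are known)."""
    return I2B2_TO_NUT[i2b2_tag]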
3.3 Annotation Process
Following previous work on the construction of de-identification benchmark corpora [4, 31], we employ a double-annotation strategy: two annotators read and tag the same documents. In total, 12 non-domain experts annotated the sample of 1260 medical records independently and in parallel. The documents were randomly split into 6 sets, and we randomly assigned a pair of annotators to each set. To ensure that the annotators had a common understanding of the annotation instructions, an evaluation session was held after each pair of annotators completed the first 20 documents. (The annotation instructions provided to the annotators are included in the online repository of this paper; they are in large parts based on the annotation guidelines in Stubbs and Uzuner [31].) In total, it took 77 hours to double-annotate the entire dataset of 1260 documents, or approximately 3.7 minutes per document. We measured the inter-annotator agreement (IAA) using entity-level F1 scores, which have been shown to be more suitable for quantifying IAA in sequence-tagging scenarios than measures such as the Kappa score [4]. Table 3 shows the IAA per PHI category. Overall, the agreement level is fairly high (0.84). However, we find that location names (i.e., care institutes, hospitals, organizations and internal locations) are often highly ambiguous, which is reflected by the low agreement scores of these categories (between 0.29 and 0.52).

Table 3: Distribution of PHI tags in our dataset. The inter-annotator agreement (IAA) as measured by the micro-averaged F1 score is shown per category.

PHI Tag | Count | Frac. (%) | IAA
Name | 9558 | 54.73 | 0.96
Date | 3676 | 21.05 | 0.86
Care Institute | 997 | 5.71 | 0.52
Initials | 778 | 4.45 | 0.46
Address | 748 | 4.28 | 0.75
Organization | 712 | 4.08 | 0.38
Internal Location | 242 | 1.39 | 0.29
Age | 175 | 1.00 | 0.39
Profession | 122 | 0.70 | 0.31
ID | 114 | 0.65 | 0.43
Phone/Fax | 97 | 0.56 | 0.93
Email | 95 | 0.54 | 0.94
Hospital | 92 | 0.53 | 0.42
Other | 33 | 0.19 | 0.03
URL/IP | 23 | 0.13 | 0.70
SSN | 2 | 0.01 | 0.50
Total | 17,464 | 100 | 0.84

To improve annotation efficiency, we integrated the rule-based de-identification tool DEDUCE [19] with our annotation software to pre-annotate each document. This functionality could be activated on a per-document basis by each annotator. If an annotator used this functionality, they had to review the pre-annotations, correct potential errors, and check for missed PHI instances. During the evaluation sessions, annotators mentioned that the existing tool proved helpful when annotating repetitive names, dates and email addresses. Note that this pre-annotation strategy might give DEDUCE a slight advantage. However, the low performance of DEDUCE in the formal benchmark in Section 5 does not reflect this.

After annotation, the main author of this paper reviewed 19,165 annotations and resolved any disagreements between the two annotators to form the gold standard of 17,464 PHI annotations. Table 3 shows the distribution of PHI tags after adjudication. Overall, the adjudication was done risk-averse: if only one annotator identified a piece of text as PHI, we assume that the other annotator missed this potential PHI instance. In addition to the manual adjudication, we performed two automatic checks: (1) we ensured that PHI instances occurring in multiple files received the same PHI tag, and (2) any instances that were tagged in one part of the corpora but not in the other were manually reviewed and added to the gold standard. We used the BRAT annotation tool for both annotation and adjudication [28].
3.4 Surrogate Generation
As the annotated dataset consists of personally identifying information which is protected by the GDPR, we generate artificial replacements for each of the PHI instances before using the data for the development of de-identification methods. This process is known as surrogate generation, a common practice in the preparation of de-identification corpora [32]. As surrogate generation will inevitably alter the semantics of the corpus to an extent where it affects de-identification performance, it is important that this step is done as thoroughly as possible [36]. Here, we follow the semi-automatic surrogate generation procedure that has been used to prepare the i2b2/UTHealth shared task corpora. Below, we summarize this procedure and mention the language-specific resources we used. We refer the reader to Stubbs et al. [32] for a thorough discussion of the method. After running the automatic replacement scripts, we reviewed each of the surrogates to ensure that continuity within a document is preserved and no PHI is leaked into the new dataset.

We adapt the surrogate generation method of Stubbs et al. [32] to the Dutch language as follows. A list of the 10,000 most common family names and given names is used to generate random surrogates for name PHI instances (see www.naamkunde.net, accessed 2019-12-09). We replace dates by first parsing the format (e.g., "12 nov. 2018" → "%d %b. %Y", using the rule-based date parser at github.com/nedap/dateinfer, accessed 2019-12-09), and then randomly shifting all dates within a document by the same amount of years and days into the future. For addresses, we match names of cities, streets, and countries against a dictionary of Dutch locations (see openov.nl, accessed 2019-12-09), and then pick random replacements from that dictionary. As Dutch ZIP codes follow a standard format ("1234AB"), their replacement is straightforward. Names of hospitals, care institutes, organizations and internal locations are randomly shuffled within the dataset. PHI instances of type Age are capped at 89 years. Finally, alphanumeric strings such as Phone/FAX, Email, URL/IP, SSN and IDs are replaced by substituting each alphanumeric character with another character of the same class. We manually rewrite Profession and Other tags, as an automatic replacement is not applicable.
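To make two of these automatic replacement steps concrete, the sketch below shows document-level date shifting and character-class substitution for alphanumeric identifiers. It is a minimal illustration, not the released surrogate-generation code: the offset range, the fixed date format (the paper infers formats with a date parser), and the function names are assumptions.

import random
import string
from datetime import datetime, timedelta

def shift_dates(dates, fmt="%d-%m-%Y", seed=0):
    """Shift all dates of one document into the future by the same random offset,
    preserving the intervals between them (Section 3.4). The offset range of up to
    three years is an assumption; the paper only states 'years and days'."""
    rng = random.Random(seed)
    offset = timedelta(days=rng.randint(1, 3 * 365))
    return [(datetime.strptime(d, fmt) + offset).strftime(fmt) for d in dates]

def substitute_alphanumeric(value, seed=0):
    """Replace each character of an identifier (phone, ID, ...) with a random
    character of the same class: digit -> digit, letter -> letter."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-' or '@'
    return "".join(out)

print(shift_dates(["26-04-2017", "24-04-2017"]))  # the same offset is applied to both dates
print(substitute_alphanumeric("06-7802651"))      # another phone-like digit string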
4 METHODS
This section presents the three de-identification methods and the evaluation procedure.

4.1 Rule-based Method: DEDUCE
DEDUCE is an unsupervised de-identification method specifically developed for Dutch medical records [19]. It is based on lookup tables, decision rules and fuzzy string matching, and has been validated on a corpus of 400 psychiatric nursing notes and treatment plans of a single hospital. Following the authors' recommendations, we customize the method to include a list of 1200 institutions that are common in our domain. Also, we resolve two incompatibilities between the PHI coding schemes of our dataset and the DEDUCE output. First, as DEDUCE does not distinguish between hospitals, care institutes, organizations and internal locations, we group these four PHI tags under a single Named Location tag. Second, our Name annotations do not include titles (e.g., "Dr." or "Ms."). Therefore, titles are stripped from the DEDUCE output.

4.2 Feature-based Method: Conditional Random Field
CRFs and hybrid rule-based systems provide state-of-the-art performance in recent shared tasks [29, 30]. Therefore, we implement a CRF approach to contrast with the unsupervised rule-based system. In particular, we re-implement the token-level CRF method by Liu et al. [17] and re-use a subset of their features (see Table 4); we disregard the word-representation features, as Liu et al. [17] found that they had a negative impact on performance. The linear-chain CRF is trained using L-BFGS and elastic net regularization [37]. Using a validation set, we optimize the two regularization coefficients of the L1 and L2 norms with a random search in the log10 space of [10^-4, 10^1] with 250 trials. We use the CRFsuite implementation by Okazaki [23].

Table 4: Features used by the CRF method. The features are identical to those of Liu et al. [17], but we exclude word-representation features.

Group | Description
Bag-of-words (BOW) | Token unigrams, bigrams and trigrams within a window of [-2, 2] of the current token.
Part-of-speech (POS) | Same as above, but with POS n-grams.
BOW + POS | Combinations of the previous, current and next token and their POS tags.
Sentence | Length in tokens, presence of an end-mark such as '.', '?', '!', and whether the sentence contains unmatched brackets.
Affixes | Prefix and suffix of length 1 to 5.
Orthographic | Binary indicators of word shape: is all caps, is capitalized, has capital letters inside, contains a digit, contains punctuation, consists of only ASCII characters.
Word Shapes | The abstract shape of a token. For example, "7534-Df" becomes "####-Aa".
Named-entity recognition (NER) | NER tag assigned by the spaCy tagger.
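As an illustration of how this baseline can be assembled, the sketch below builds a handful of the Table 4 feature groups (word shape, affixes, orthographic flags, neighboring tokens) and tunes the two elastic-net coefficients with a random search in the log10 space of [10^-4, 10^1]. It uses sklearn-crfsuite, a common Python wrapper around CRFsuite, rather than the authors' exact re-implementation; the feature set is deliberately abbreviated, the toy run uses 2 trials instead of the paper's 250, and model selection here is on a flat token-level F1 rather than the entity-level score used in the paper.

import random
import re
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_shape(token):
    """Abstract token shape as in Table 4, e.g. '7534-Df' -> '####-Aa'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "#", shape)

def token_features(sent, i):
    tok = sent[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "prefix3": tok[:3], "suffix3": tok[-3:],   # affixes, abbreviated to length 3
        "is_upper": tok.isupper(), "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "prev": sent[i - 1].lower() if i > 0 else "BOS",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "EOS",
    }

def featurize(sentences):
    return [[token_features(s, i) for i in range(len(s))] for s in sentences]

def random_search(X_train, y_train, X_val, y_val, trials=250, seed=0):
    """Random search over the L1/L2 coefficients in the log10 space [1e-4, 1e1]."""
    rng = random.Random(seed)
    best_model, best_f1 = None, -1.0
    for _ in range(trials):
        c1, c2 = 10 ** rng.uniform(-4, 1), 10 ** rng.uniform(-4, 1)
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=c1, c2=c2,
                                   max_iterations=100, all_possible_transitions=True)
        crf.fit(X_train, y_train)
        f1 = metrics.flat_f1_score(y_val, crf.predict(X_val), average="micro")
        if f1 > best_f1:
            best_model, best_f1 = crf, f1
    return best_model, best_f1

# Toy usage with BIO labels (real training uses the NUT sentences and 250 trials).
train_sents = [["Patient", "Jan", "Jansen", "is", "opgenomen", "."]]
train_tags = [["O", "B-Name", "I-Name", "O", "O", "O"]]
model, score = random_search(featurize(train_sents), train_tags,
                             featurize(train_sents), train_tags, trials=2)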
4.3 Neural Method: BiLSTM-CRF
To reduce the need for hand-crafted features in traditional CRF-based de-identification, recent work applies neural methods [5, 15, 18]. Here, we re-implement a BiLSTM-CRF architecture with contextual string embeddings, which has recently been shown to provide state-of-the-art results for sequence labeling tasks [3]. Hyperparameters are set to the best performing configuration in Akbik et al. [3]: we use stochastic gradient descent with no momentum and an initial learning rate of 0.1. If the training loss does not decrease for 3 consecutive epochs, the learning rate is halved. Training is stopped if the learning rate falls below 10^-4 or 150 epochs are reached. Furthermore, the number of hidden layers in the LSTM is set to 1 with 256 recurrent units. We employ locked dropout with a value of 0.5 and use a mini-batch size of 32. With respect to the embedding layer, we use the pre-trained GloVe (English) and fastText (Dutch) embeddings on the word level, and concatenate them with the pre-trained contextualized string embeddings included in Flair (github.com/zalandoresearch/flair, accessed 2019-12-09) [2, 10, 24].
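The following sketch shows how a BiLSTM-CRF with stacked word and contextual string embeddings can be configured in Flair with the hyperparameters listed above (SGD, learning rate 0.1, annealing factor 0.5, patience 3, up to 150 epochs, hidden size 256, locked dropout 0.5, mini-batch size 32). It assumes the Flair API of around version 0.4 and a CoNLL-style column corpus, and is a sketch of the configuration rather than the authors' released training script; the file paths are placeholders.

from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-tagged sentences in two-column format (token <tab> tag), one file per split.
corpus = ColumnCorpus("data/nut_bio", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")

# Dutch fastText word embeddings stacked with forward/backward contextual string embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings("nl"),
    FlairEmbeddings("dutch-forward"),
    FlairEmbeddings("dutch-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,                   # 256 recurrent units
    rnn_layers=1,                      # a single BiLSTM layer
    embeddings=embeddings,
    tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
    tag_type="ner",
    use_crf=True,                      # CRF output layer
    locked_dropout=0.5,
)

trainer = ModelTrainer(tagger, corpus)  # plain SGD (no momentum) is Flair's default optimizer
trainer.train(
    "models/bilstm-crf-nut",
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=150,
    anneal_factor=0.5,                 # halve the learning rate ...
    patience=3,                        # ... after 3 epochs without improvement
    min_learning_rate=1e-4,            # stop once the learning rate falls below 10^-4
)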
4.4 Preprocessing and Sequence Tagging
We use a common preprocessing routine for all three datasets. For tokenization and sentence segmentation, the spaCy tokenizer is used (spacy.io, accessed 2019-12-09). The POS/NER features of the CRF method are generated by the built-in spaCy models. After sentence segmentation, we tag each token according to the Beginning, Inside, Outside (BIO) scheme. On rare occasions, sequence labeling methods may produce invalid transitions (e.g., O → I-). In a post-processing step, we replace invalid I- tags with B- tags [25].
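This post-processing step can be implemented as a single pass over a predicted tag sequence: any I- tag that does not continue an entity of the same type is promoted to a B- tag. The function below is a minimal sketch of that repair, not the exact code from the released pipeline.

def repair_bio(tags):
    """Promote invalid I- tags to B- tags so every entity starts with B- (Section 4.4)."""
    repaired = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            prev = repaired[i - 1] if i > 0 else "O"
            # An I- tag is only valid if the previous tag continues the same entity type.
            if prev == "O" or prev[2:] != tag[2:]:
                tag = "B-" + tag[2:]
        repaired.append(tag)
    return repaired

# Example: the leading I-Name does not continue an entity and becomes B-Name.
print(repair_bio(["O", "I-Name", "I-Name", "O", "I-Date"]))
# ['O', 'B-Name', 'I-Name', 'O', 'B-Date']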
4.5 Evaluation
The de-identification methods are assessed according to precision, recall and F1 computed on the entity level, the standard evaluation approach for NER systems [34]. In an entity-level evaluation, predicted PHI offsets and types have to match exactly. Following the evaluation of de-identification shared tasks, we use the micro-averaged entity-level F1 score as the primary metric [30]. (De-identification systems are often also evaluated on the less strict token level. As a system that scores high on the entity level will also score high on the token level, we only measure according to the stricter level of evaluation.)

We randomly split our dataset and the nursing notes corpus into training, validation and testing sets with a 60/20/20 ratio. As the i2b2 corpus has a pre-defined test set of 40%, a random set of 20% of the training documents serves as validation data. Finally, we test for statistical significance using two-sided approximate randomization with N = 9999 [35].
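For clarity, the sketch below spells out the two evaluation ingredients: micro-averaged entity-level precision/recall/F1 over exact (document, span, type) matches, and a two-sided approximate randomization test on the F1 difference between two systems. It is a compact illustration under our reading of the metrics, not the evaluation scripts released with the paper; in particular, shuffling predictions at the document level is an assumption.

import random

def micro_prf(gold, pred):
    """Entity-level micro P/R/F1. Entities are (doc_id, start, end, phi_type) tuples;
    offsets and type must match exactly to count as a true positive."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def approximate_randomization(gold_by_doc, pred_a, pred_b, n=9999, seed=0):
    """Two-sided approximate randomization test on the micro-F1 difference of systems
    A and B: per trial, swap the two systems' predictions per document with prob. 0.5."""
    rng = random.Random(seed)
    docs = list(gold_by_doc)
    gold = [e for d in docs for e in gold_by_doc[d]]
    observed = abs(micro_prf(gold, sum((pred_a[d] for d in docs), []))[2]
                   - micro_prf(gold, sum((pred_b[d] for d in docs), []))[2])
    extreme = 0
    for _ in range(n):
        a, b = [], []
        for d in docs:
            if rng.random() < 0.5:
                a += pred_b[d]; b += pred_a[d]  # swap this document's predictions
            else:
                a += pred_a[d]; b += pred_b[d]
        if abs(micro_prf(gold, a)[2] - micro_prf(gold, b)[2]) >= observed:
            extreme += 1
    return (extreme + 1) / (n + 1)  # p-value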
5 RESULTS
In this section, we first discuss the de-identification results obtained on our Dutch dataset (Section 5.1). Afterwards, we present an error analysis of the best performing method (Section 5.2). The section concludes with the benchmark on the English datasets (Section 5.3) and the cross-domain de-identification experiment (Section 5.4).

Table 5: Evaluation summary: micro-averaged scores are shown for each dataset and method. Statistically significant improvements over the score on the previous line are marked with ▲ (p < 0.01); ◦ indicates no significance. The rule-based method DEDUCE is not applicable to the English datasets.

Method | NUT (Dutch) Prec. / Rec. / F1 | i2b2 (English) Prec. / Rec. / F1 | Nursing Notes (English) Prec. / Rec. / F1
DEDUCE | 0.807 / 0.564 / 0.664 | - / - / - | - / - / -
CRF | 0.919▲ / 0.775▲ / 0.841▲ | 0.952 / 0.796 / 0.867 | 0.914 / 0.685 / 0.783
BiLSTM-CRF | 0.917◦ / 0.871▲ / 0.893▲ | 0.959▲ / 0.869▲ / 0.912▲ | 0.886◦ / 0.797▲ / 0.839▲

5.1 De-identification of Dutch Dataset
Both machine learning methods outperform the rule-based system DEDUCE by a large margin (see Table 5). Furthermore, the BiLSTM-CRF provides a substantial improvement of 10 percentage points in recall over the traditional CRF method, while maintaining precision. Overall, the neural method has an entity-level recall of 87.1% while achieving a recall of 95.6% for names, showing that the neural method is operational for many de-identification scenarios. In addition, we make the following observations.

Neural method performs at least as well as rule-based method. Inspecting the model performance on a PHI-tag level, we observe that the neural method outperforms DEDUCE for all classes of PHI (see Table 6). Only for the Phone and Email categories does the rule-based method have a slightly higher precision. Similarly, we studied the impact of the training set size on the de-identification performance. Both machine learning methods outperform DEDUCE even with as little training data as 10% of the total sentences (see Figure 2). This suggests that in most environments where training data are available (or can be obtained), the machine learning methods are to be preferred.

Figure 2: Entity-level F1-score for varying training set sizes. The full training set (100%) consists of all training and validation sentences in NUT (34,714). The F1-score is measured on the test set. For each subset size, we draw 3 random samples and train/test each model 3 times. The lines show the averaged scores along with the 95% confidence interval. The rule-based tagger DEDUCE is shown as a baseline. (Plot axes: training set size from 10% to 100% on the x-axis; F1-score from 0.65 to 0.90 on the y-axis; one line each for BiLSTM-CRF, CRF and DEDUCE.)

Table 6: Entity-level precision and recall per PHI category on the NUT dataset. Scores are compared between the rule-based tagger DEDUCE [19] and the BiLSTM-CRF model. The Named Loc. tag is the union of the 4 specific location tags, which are not supported by DEDUCE. Tags are ordered by frequency, with location tags fixed at the bottom.

PHI Tag | BiLSTM-CRF Prec. / Rec. | DEDUCE Prec. / Rec.
Name | 0.965 / 0.956 | 0.849 / 0.805
Date | 0.926 / 0.920 | 0.857 / 0.441
Initials | 0.828 / 0.624 | 0.000 / 0.000
Address | 0.835 / 0.846 | 0.804 / 0.526
Age | 0.789 / 0.732 | 0.088 / 0.122
Profession | 0.917 / 0.262 | 0.000 / 0.000
ID | 0.800 / 0.480 | 0.000 / 0.000
Phone/Fax | 0.889 / 1.000 | 0.929 / 0.812
Email | 0.909 / 1.000 | 1.000 / 0.900
Other | 0.000 / 0.000 | 0.000 / 0.000
URL/IP | 1.000 / 0.750 | 0.750 / 0.750
Named Loc. | 0.797 / 0.659 | 0.279 / 0.058
Care Institute | 0.686 / 0.657 | n/a
Organization | 0.780 / 0.522 | n/a
Internal Loc. | 0.737 / 0.509 | n/a
Hospital | 0.778 / 0.700 | n/a

Rule-based method can provide a "safety net." DEDUCE performs reasonably well for names, phone numbers, email addresses and URLs (see Table 6). As these PHI instances are likely to directly reveal the identity of an individual, their removal is essential. However, DEDUCE does not generalize beyond the PHI types mentioned above. Especially named locations are non-trivial to capture with a rule-based system, as their identification strongly relies on the availability of exhaustive lookup lists. In contrast, the neural method provides a significant improvement for named locations (5.8% vs. 65.9% recall). We assume that word-level and character-level embeddings provide an effective tool to capture these entities.

Initials, IDs and professions are hard to detect. During annotation, we observed a low F1 annotator agreement of 0.46, 0.43, and 0.31 for initials, IDs and professions, respectively. This shows that these PHI types are among the hardest to identify, even for humans (see Table 3). One possible cause is that IDs and initials are often hard to discriminate from abbreviations and medical measurements. We observe that the BiLSTM-CRF detects those PHI classes with high precision but low recall. With respect to professions, we find that phrases are often wrongly tagged. For example, colloquial job descriptions (e.g., "works behind the cash desk") as opposed to job titles (e.g., "cashier") make it infeasible to tackle this problem with lookup lists, while a machine learner likely requires more training data to capture this PHI.

5.2 Error Analysis on Dutch Dataset
To gain a better understanding of the best performing model and an intuition for its limitations, we conduct a manual error analysis of the false positives (FPs) and false negatives (FNs) produced by the BiLSTM-CRF on the test set. We discuss the error categorization scheme in Section 5.2.1 and present the results in Section 5.2.2.

Table 7: Summary of the manual error analysis of false negatives (FNs) and false positives (FPs) produced by the BiLSTM-CRF. All error categories are mutually exclusive.

Category | FNs (n = 469) Count / Part | FPs (n = 288) Count / Part
Model Errors
Abbreviation | 65 / 13.9% | 28 / 9.7%
Ambiguity | 15 / 3.2% | 7 / 2.4%
Debatable | 7 / 1.5% | 4 / 1.4%
Prefix | 10 / 2.1% | 10 / 3.5%
Common language | 35 / 7.5% | 9 / 3.1%
Other reason | 275 / 58.6% | 159 / 55.2%
Annotation/Preprocessing Errors
Missing Annotation | - / - | 33 / 11.5%
Annotation Error | 21 / 4.5% | 18 / 6.3%
Tokenization Error | 41 / 8.7% | 20 / 6.9%
Total | 469 / 100% | 288 / 100%

5.2.1 Error Categorization. We distinguish between two error groups: (1) modeling errors, and (2) annotation/preprocessing errors. We define modeling errors to be problems that can be addressed with different de-identification techniques and additional training data. In contrast, annotation and preprocessing errors are not directly caused by the sequence labeling model, but are issues in the training data or the preprocessing pipeline which need to be addressed manually. Inspired by the classification scheme of Dernoncourt et al. [5], we consider the following sources of modeling errors:

• Abbreviation. PHI instances which are abbreviations or acronyms of names, care institutes and companies. These are hard to detect and can be ambiguous, as they are easily confused with medical terms and measurements.
• Ambiguity. A human reader may be unable to decide whether a given text fragment is PHI.
• Debatable. It can be argued that the token should not have been annotated as PHI.
• Prefix. Names of internal locations, organizations and companies are often prefixed with articles (i.e., "de" and "het"). Sometimes, it is unclear whether the prefix is part of the official name or part of the sentence construction. This ambiguity is reflected in the training data, which causes the model to inconsistently include or exclude those prefixes.
• Common Language. PHI instances consisting of common language are hard to discriminate from the surrounding text.
• Other. Remaining modeling errors that do not fall into the categories mentioned above. In those cases, it is not immediately apparent why the misclassification occurs.

Preprocessing errors are categorized as follows:

• Missing Annotation. The text fragment is PHI, but was missed during the annotation phase.
• Annotation Error. The annotator assigned an invalid entity boundary.
• Tokenization Error. The annotated text span could not be split into a compatible token span. Those tokens were marked as "Outside (O)" during BIO tagging.

We consider all error categories to be mutually exclusive.

5.2.2 Results of Error Analysis. Table 7 summarizes the error analysis results and shows the absolute and relative frequency of each error category. Overall, we find that the majority of modeling errors cannot be easily explained through human inspection ("Other reason" in Table 7). The remaining errors are mainly caused by ambiguous PHI instances and preprocessing errors. In more detail, we make the following observations:

Abbreviations are the second most common cause of modeling errors (13.9% of FNs, 9.7% of FPs). We hypothesize that more training data will likely not in itself help to correctly identify this type of PHI. It is conceivable to design custom features (e.g., based on shape, position in a sentence, or presence/absence in a medical dictionary) to increase precision. However, it is an open question how recall can be improved.

PHI instances consisting of common language are likely to be wrongly tagged (7.5% of FNs, 3.1% of FPs). This is caused by the fact that there are insufficient training examples where common language is used to refer to PHI. For example, the organization name in the sentence "Vandaag heb ik Beter Horen gebeld" (Eng.: "I called Beter Horen today") was incorrectly classified as non-PHI. Each individual word, and also the combination of the two words, can be used in different contexts without referring to PHI. However, in this specific context, it is apparent that "Beter Horen" must refer to an organization.

A substantial amount of errors is due to annotation and preprocessing issues. Annotation errors (4.5% of FNs, 6.3% of FPs) can be resolved by correcting the respective PHI offsets in the gold standard. Tokenization errors (8.7% of FNs, 6.9% of FPs) need to be fixed through a different preprocessing routine. For example, the annotated text "2016/2017", in which the two years are annotated as separate PHI instances, should have been split into the tokens [2016, /, 2017] with BIO tagging [B, O, B]. However, the spaCy tokenizer segmented this text into a single token [2016/2017]. In this case, entity boundaries no longer align with token boundaries, which results in an invalid BIO tagging of [O] for the entire span.

Several false positives are in fact PHI and should be annotated. The model identifies several PHI instances which were missed during the annotation phase (11.5% of the FPs). Once more, this demonstrates that proper de-identification is an error-prone task for human annotators.
5.3 De-identification of English Datasets
When training and testing both machine learning methods on the English i2b2 and nursing notes datasets, we observe that the BiLSTM-CRF significantly outperforms the CRF in both cases (see Table 5). Similar to our Dutch dataset, the neural method provides an increase of up to 11.2 percentage points in recall (nursing notes) while the precision remains relatively stable. This shows that the neural method has the best generalization capabilities even across languages. More importantly, it does not require the development of domain-specific lookup lists or sophisticated pattern matching rules. To put the results into perspective: the second-highest ranked team in the i2b2 2014 challenge used a sophisticated ensemble combining a CRF with domain-specific rules [30]. Their system obtained an entity-level F1 score of 0.9124, which is on par with the performance of our neural method that requires no configuration. We can expect that the performance of the neural method further improves after hyperparameter optimization. Finally, note that both machine learning methods can be easily applied to a new PHI tagging scheme, whereas rule-based methods are limited to the PHI definition they were developed for.

5.4 Cross-domain De-identification
In many de-identification scenarios, heterogeneous training data from multiple medical institutes and domains are rarely available. This raises the question how well a model that has been trained on a homogeneous set of medical records generalizes to records of other medical domains. We trained the three de-identification methods on one domain of Dutch healthcare (e.g., elderly care) and tested each model on the records of the remaining two domains (e.g., disabled care and mental care). We followed the same training and evaluation procedures described in Section 4.5. Table 8 summarizes the performance of each method on the different tasks.

Table 8: Summary of the transfer learning experiment on our Dutch dataset. Each method is trained on data of one care domain and tested on the other two domains. All scores are micro-averaged entity-level F1.

Method | Training domain: Elderly | Disabled | Mental
DEDUCE | 0.683 | 0.565 | 0.675
CRF | 0.414 | 0.697 | 0.719
BiLSTM-CRF | 0.775 | 0.775 | 0.839

Table 9: Detailed performance analysis of the BiLSTM-CRF method in the transfer learning experiment. In-domain test scores are shown on the diagonal. All scores are micro-averaged entity-level F1.

Test domain | Training domain: Elderly | Disabled | Mental
Elderly | 0.746 | 0.698 | 0.703
Disabled | 0.796 | 0.919 | 0.879
Mental | 0.744 | 0.806 | 0.871

Again, the neural method consistently outperforms the rule-based and feature-based methods in all three domains, which suggests that it is a fair default choice for de-identification. This is underlined by the fact that the amount of training data is severely limited in this experiment: each domain only has 420 documents, of which 20% of the records are reserved for testing. Interestingly, DEDUCE performs rather stably and even outperforms the CRF within the domain of elderly care.

Given an ideal de-identification method, one would expect that performance on unseen data of a different domain is similar to the test score obtained on the available (homogeneous) data. Table 9 shows a performance breakdown for each of the three test domains for the neural method. It can be seen that in 4 out of 6 cases, the test score in a new domain is lower than the test score obtained on the in-domain data. The largest delta between the observed in-domain test score (disabled care, 0.919 F1) and the performance in the transfer domain (elderly care, 0.698 F1) is 0.221 in F1. This raises an important point when performing de-identification in practice: while the neural method shows the best generalization capabilities compared to the other de-identification methods, the performance can still be significantly lower when applying a pre-trained model in new domains.

5.5 Limitations
While the contextual string embeddings used in this paper have been shown to provide state-of-the-art results for NER [3], transformer-based architectures for contextual embeddings have also gained significant attention (e.g., BERT [6]). It would make an interesting experiment to benchmark different types of pre-trained embeddings for the task of de-identification. Furthermore, we observe that the neural method provides strong performance even with limited training data (see Figure 2). It is unclear what contribution large pre-trained embeddings have in those scenarios, which warrants an ablation study testing different model configurations. We leave the exploration of those ideas to future research.

6 CONCLUSION
This paper presents the construction of a novel Dutch dataset and a comparison of state-of-the-art de-identification methods across Dutch and English medical records. Our experiments show the following. (1) An existing rule-based method for the Dutch language does not generalize well to new domains. (2) If one is looking for an out-of-the-box de-identification method, neural approaches show the best generalization performance across languages and domains. (3) When testing across different domains, a substantial decrease in performance has to be expected, an important consideration when applying de-identification in practice.

There are several directions for future work. Motivated by the limited generalizability of pre-trained models across different domains, transfer learning techniques may provide a way forward. A preliminary study by Lee et al. [16] shows that they can be beneficial for de-identification. Finally, our experiments show that phrases such as professions are among the most difficult information to de-identify. It is an open challenge how to design methods that can capture this type of information.
REFERENCES
[1] John S. Aberdeen, Samuel Bayer, Reyyan Yeniterzi, Benjamin Wellner, Cheryl Clark, David A. Hanauer, Bradley Malin, and Lynette Hirschman. 2010. The MITRE Identification Scrubber Toolkit: Design, training, and assessment. I. J. Medical Informatics 79, 12 (2010), 849–859.
[2] Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled Contextualized Embeddings for Named Entity Recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 724–728. https://doi.org/10.18653/v1/N19-1078
[3] Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 1638–1649. https://www.aclweb.org/anthology/C18-1139
[4] Louise Deléger, Qi Li, Todd Lingren, Megan Kaiser, Katalin Molnár, Laura Stoutenborough, Michal Kouril, Keith Marsolo, and Imre Solti. 2012. Building Gold Standard Corpora for Medical Natural Language Processing Tasks. In AMIA 2012, American Medical Informatics Association Annual Symposium, Chicago, Illinois, USA, November 3-7, 2012.
[5] Franck Dernoncourt, Ji Young Lee, Özlem Uzuner, and Peter Szolovits. 2017. De-identification of patient notes with recurrent neural networks. JAMIA 24, 3 (2017), 596–606.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Computing Research Repository arXiv:1810.04805 (2018). http://arxiv.org/abs/1810.04805
[7] Margaret Douglass, Gari D. Clifford, Andrew Reisner, George B. Moody, and Roger G. Mark. 2004. Computer-assisted de-identification of free text in the MIMIC II database. In Computers in Cardiology, 2004. IEEE, 341–344.
[8] Carsten Eickhoff, Yubin Kim, and Ryen White. 2020. Overview of the Health Search and Data Mining (HSDM 2020) Workshop. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20). ACM, New York, NY, USA. https://doi.org/10.1145/3336191.3371879
[9] GDPR. 2016. Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (Data Protection Directive). Official Journal of the European Union L119 (2016), 1–88.
[10] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018). https://www.aclweb.org/anthology/L18-1550
[11] Dilip Gupta, Melissa Saul, and John Gilbertson. 2004. Evaluation of a Deidentification (De-Id) Software Engine to Share Pathology Reports and Clinical Documents for Research. American Journal of Clinical Pathology 121, 2 (2004), 176–186.
[12] Bin He, Yi Guan, Jianyi Cheng, Keting Cen, and Wenlan Hua. 2015. CRFs Based De-identification of Medical Records. Journal of Biomedical Informatics 58, S (2015), S39–S46.
[13] HIPAA. 1996. Health Insurance Portability and Accountability Act. Public Law 104-191 (1996).
[14] HIPAA. 2012. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Retrieved December 09, 2019 from https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
[15] Kaung Khin, Philipp Burckhardt, and Rema Padman. 2018. A Deep Learning Architecture for De-identification of Patient Notes: Implementation and Evaluation. Computing Research Repository arXiv:1810.01570 (2018). http://arxiv.org/abs/1810.01570
[16] Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2018. Transfer Learning for Named-Entity Recognition with Neural Networks. In Proceedings of the 11th Language Resources and Evaluation Conference. Miyazaki, Japan, 4470–4473. https://www.aclweb.org/anthology/L18-1708
[17] Zengjian Liu, Yangxin Chen, Buzhou Tang, Xiaolong Wang, Qingcai Chen, Haodi Li, Jingfeng Wang, Qiwen Deng, and Suisong Zhu. 2015. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. Journal of Biomedical Informatics 58 (2015), S47–S52.
[18] Zengjian Liu, Buzhou Tang, Xiaolong Wang, and Qingcai Chen. 2017. De-identification of Clinical Notes via Recurrent Neural Network and Conditional Random Field. Journal of Biomedical Informatics 75, S (2017), S34–S42.
[19] Vincent Menger, Floor Scheepers, Lisette Maria van Wijk, and Marco Spruit. 2018. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telematics and Informatics 35, 4 (2018), 727–736.
[20] Stephane M. Meystre. 2015. De-identification of Unstructured Clinical Data for Patient Privacy Protection. In Medical Data Privacy Handbook, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.). Springer International Publishing, 697–716.
[21] Ishna Neamatullah, Margaret M. Douglass, Li-Wei H. Lehman, Andrew T. Reisner, Mauricio Villarroel, William J. Long, Peter Szolovits, George B. Moody, Roger G. Mark, and Gari D. Clifford. 2008. Automated de-identification of free-text medical records. BMC Med. Inf. & Decision Making 8 (2008), 32.
[22] Aurélie Névéol, Hercules Dalianis, Sumithra Velupillai, Guergana Savova, and Pierre Zweigenbaum. 2018. Clinical Natural Language Processing in languages other than English: opportunities and challenges. Journal of Biomedical Semantics 9, 1 (2018), 12:1–12:13.
[23] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). Retrieved December 09, 2019 from http://www.chokkan.org/software/crfsuite/
[24] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. https://doi.org/10.3115/v1/D14-1162
[25] Nils Reimers and Iryna Gurevych. 2017. Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks. Computing Research Repository arXiv:1707.06799 (2017). http://arxiv.org/abs/1707.06799
[26] Phillip Richter-Pechanski, Stefan Riezler, and Christoph Dieterich. 2018. De-Identification of German Medical Admission Notes. Studies in Health Technology and Informatics 253 (2018), 165–169.
[27] Elyne Scheurwegs, Kim Luyckx, Filip Van der Schueren, and Tim Van den Bulcke. 2013. De-Identification of Clinical Free Text in Dutch with Limited Training Data: A Case Study. In Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP 2013. 18–23. https://www.aclweb.org/anthology/W13-5103
[28] Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A Web-based Tool for NLP-assisted Text Annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12). Association for Computational Linguistics, Stroudsburg, PA, USA, 102–107. https://www.aclweb.org/anthology/E12-2021
[29] Amber Stubbs, Michele Filannino, and Özlem Uzuner. 2017. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1. Journal of Biomedical Informatics 75 (2017), S4–S18.
[30] Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of Biomedical Informatics 58 (2015), S11–S19.
[31] Amber Stubbs and Özlem Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics 58 (2015), S20–S29.
[32] Amber Stubbs, Özlem Uzuner, Christopher Kotfila, Ira Goldstein, and Peter Szolovits. 2015. Challenges in Synthesizing Surrogate PHI in Narrative EMRs. In Medical Data Privacy Handbook, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.). Springer International Publishing, 717–735.
[33] Erik Tjong Kim Sang, Ben de Vries, Wouter Smink, Bernard Veldkamp, Gerben Westerhof, and Anneke Sools. 2019. De-identification of Dutch Medical Text. In 2nd Healthcare Text Analytics Conference (HealTAC 2019). Cardiff, Wales, UK.
[34] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4 (CONLL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 142–147. https://doi.org/10.3115/1119176.1119195
[35] Alexander Yeh. 2000. More Accurate Tests for the Statistical Significance of Result Differences. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2 (COLING '00). Association for Computational Linguistics, Stroudsburg, PA, USA, 947–953. https://doi.org/10.3115/992730.992783
[36] Reyyan Yeniterzi, John S. Aberdeen, Samuel Bayer, Benjamin Wellner, Lynette Hirschman, and Bradley Malin. 2010. Effects of personal identifier resynthesis on clinical text de-identification. JAMIA 17, 2 (2010), 159–168.
[37] Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B 67 (2005), 301–320.