Automatic Annotation of Training Data for Deep
Learning Based De-identification of Narrative Clinical
Text
Martin Sundahl Laursen1 , Jannik Skyttegaard Pedersen1 , Pernille Just Vinholt2 and
Thiusius Rajeeth Savarimuthu1
1 The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Denmark
2 Department of Clinical Biochemistry, Odense University Hospital, Denmark


Abstract
Electronic health records contain information about patients' medical history which is important for research, but the text must be de-identified before use. This study utilized dictionaries constructed from publicly available lists of identifiers to automatically annotate a training dataset for a named entity recognition model to de-identify names, streets, and locations in Danish narrative clinical text. Ambiguous identifiers were not annotated if they occurred more often than expected for an identifier. The model had recall 93.43%, precision 86.10%, and F1 89.62%. We found that the model generalized from the training data to achieve better performance than simply using the dictionaries to directly annotate text.

Keywords
de-identification, electronic health records, named entity recognition, automatic annotation, deep learning




1. Introduction
Electronic health records (EHR) contain information about patients’ contact with the healthcare
system including important information about medical history, e.g. symptoms, diagnoses, and
treatments. Diagnoses are also registered using International Classification of Diseases 10 codes
for administrative purposes. However, not all relevant patient information is represented in
codes, e.g. symptoms. Further, codes are often incorrect [1, 2, 3, 4, 5] and can therefore not
replace the narrative clinical text in EHRs as a source of information.
   Apart from treatment of patients, the EHR data are important for e.g. research and education
but as they contain personally identifiable information, explicit consent from the affected
individual must be given, or the data must be de-identified before being used for secondary
purposes [6, 7]. The US Health Insurance Portability and Accountability Act (HIPAA) defines
which identifiers must be removed according to the Safe Harbor method for de-identification1 .

WNLPe-Health 2022, December 18, 2022, Delhi, India
msla@mmmi.sdu.dk (M. S. Laursen); jasp@mmmi.sdu.dk (J. S. Pedersen); pernille.vinholt@rsyd.dk (P. J. Vinholt);
trs@mmmi.sdu.dk (T. R. Savarimuthu)
ORCID: 0000-0001-5684-1325 (M. S. Laursen); 0000-0002-7066-1563 (J. S. Pedersen); 0000-0002-2035-0169 (P. J. Vinholt); 0000-0002-2478-8694 (T. R. Savarimuthu)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




1 Guide available at https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
The identifiers include, among others, names, street addresses, and locations including city,
county, and precinct.
   Manual de-identification is a time consuming task and, therefore, large datasets are impractical
and expensive to de-identify manually. Natural language processing techniques for automatic
de-identification may alleviate this task.
   This study utilizes dictionaries of identifiers and a novel way of dealing with ambiguous
identifiers to automatically annotate a training dataset for a named entity recognition (NER)
model to de-identify names, streets, and locations in Danish narrative clinical text.
   A method for automatic annotation of training datasets is useful for developing de-identification
deep learning models for low-resource languages like Danish where annotated datasets and
trained models for de-identification of specific identifier types are not always publicly available.
   The main contributions of this paper are:

    • We train a NER model to de-identify names, streets, and locations in Danish narrative
      clinical text with recall 93.43%, precision 86.10%, and F1 89.62%.
    • We use dictionary-based automatic annotation of training data for the NER model utilizing
      our novel method for annotation of ambiguous identifiers guided by occurrence rates in
      the text and population.
    • We find that the NER model can generalize from the dataset to achieve better performance
      than simply using the dictionaries to directly annotate text.


2. Related Work
Previous studies on automatic de-identification of narrative clinical text used rule-based methods,
machine learning methods, and hybrid methods combining both. We found no studies that,
similar to ours, used automatic annotation of data for training a machine learning model to
de-identify narrative clinical text.
   In studies that used rule-based methods, pattern matching or dictionaries were used to search
for identifiers in the text [8, 9, 10, 11, 12, 13]. Rule-based methods rely on domain experts to
define the rules and it is difficult to cover all cases. They generally cannot distinguish ambiguous
identifiers, i.e. words that can both be an identifier and a non-identifier depending on the context.
Pantazos et al. [12] de-identified Danish text using dictionaries with an F1 score of 95.7% on a
random sample of 369 EHRs. They identified ambiguous identifiers by matching identifiers to a
database of non-identifiers. As they replaced identifiers with pseudo-identifiers, they deleted
records containing an ambiguous identifier unless the identifier appeared more than 200 times,
so as not to disclose their replacement rule.
   Studies that used machine learning methods mainly used recurrent neural networks, condi-
tional random fields, and combinations of the two [14, 15, 16, 17, 18, 19]. All studies that utilized
machine learning used a manually annotated dataset for training the models. Machine learning
methods and in particular deep learning architectures such as Long Short-Term Memory [20]
and transformer [21] networks are able to distinguish ambiguous identifiers based on the context
of the whole sentence. Some recent studies have used transformer networks for automatic
de-identification of narrative clinical text [22, 19, 23, 24, 25, 26]. A disadvantage of machine
learning methods is their need for a large expert-annotated training dataset specific to the
domain.
   Finally, the studies that were most similar to ours used hybrid methods, combining rule-based
and machine learning methods in ensembles or pipelines to improve the annotation workload
and model performance [27, 28, 29, 30, 25]. Two studies used rule-based methods in other ways
than for directly classifying identifiers. McMurry et al. [27] used pattern matching to contribute
part of a feature set for classification by a machine learning model, which resulted in an F1 score
of 76% on a custom test set of 220 discharge summaries. Jian et al. [28] used pattern matching
to create a dense corpus of identifiers for manual annotation before being input to a machine
learning model. It had an F1 score of 94.6% when cross-validating on 3,000 clinical documents.


3. Methods
In this paper, we first constructed lists of name, street, and location identifiers. We compared
the identifiers to a database of non-identifying words to determine which identifiers were
ambiguous—e.g. the name ‘Hans’ is ambiguous because it is also a pronoun (Danish for ‘his’).
This dictionary-based method is similar to that of e.g. Pantazos et al. [12], except that in this paper,
we used it to annotate training data for a deep learning model instead of using direct dictionary-
based de-identification. Additionally, we utilized a novel method for annotation of ambiguous
identifiers and tested different ceiling values above which words were removed from the list of
identifiers if they occurred in the text at a higher rate than would be expected for an identifier.
We searched for and annotated identifiers in Danish narrative clinical text and constructed
a training set of sentences with no or only unambiguous identifiers. Finally, the training set
was used to train a NER de-identification model. The goal was for the model to generalize
from the training samples with no or only unambiguous identifiers to also correctly classify
ambiguous identifiers. This process is detailed in the rest of this section. We make our code
publicly available2 .

3.1. Data
3.1.1. Corpus
We extracted 150,000 random sentences with a length between 8 and 70 words from EHRs from
Odense University Hospital between 2015 and 2020. Sentences were lowercased and tokenized,
and consecutive underscores and hyphens were reduced to a single instance.
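For illustration, the preprocessing could look like the following sketch (the paper does not specify the tokenizer, so plain whitespace splitting is an assumption):

```python
import re

def preprocess(sentence: str) -> list[str]:
    """Lowercase and reduce runs of underscores/hyphens to a single instance."""
    sentence = sentence.lower()
    sentence = re.sub(r"_{2,}", "_", sentence)  # collapse consecutive underscores
    sentence = re.sub(r"-{2,}", "-", sentence)  # collapse consecutive hyphens
    return sentence.split()  # assumed whitespace tokenization
```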

3.1.2. Identifiers
The identifier types were names, streets, and locations. Locations included cities, municipalities,
regions, and provinces.
  For the name identifiers, we obtained lists of all male first names, female first names, and last
names in the Danish population as of January 2021 from Statistics Denmark.


2 https://github.com/jannikskytt/clinical_de-identification
   For the street identifiers, we used a database of all Danish addresses from the Address Web
Services of the Agency for Data Supply and Efficiency of Denmark3 . Each address included
street name, addressing street name (could be identical to street name), city name, potential
supplemental city name, municipality, region, and province. From the database, a list of unique
street names including addressing street names was constructed.
   For the location identifiers, we used the same database of all Danish addresses. A list of
unique locations including city names, supplemental city names, municipalities, regions, and
provinces was constructed.
   Data cleaning consisted of lowercasing and removing single-letter and empty and corrupted
identifiers including various placeholders.
   A rate of occurrence in the Danish population was calculated for each identifier by dividing
their occurrence in the population by the sum of all occurrences for that identifier type. For
each identifier type, duplicates were merged by adding the rates of occurrence.
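A minimal sketch of the rate calculation, assuming the cleaned identifiers of one type come with their population counts:

```python
from collections import defaultdict

def population_rates(entries: list[tuple[str, int]]) -> dict[str, float]:
    """Occurrence rate per identifier for one identifier type.

    `entries` holds (identifier, population count) pairs gathered from all
    source lists of the type (e.g. male, female, and last names). Each rate
    is the count divided by the sum of all counts; duplicate identifiers
    are merged by adding their rates, as described above.
    """
    total = sum(count for _, count in entries)
    rates = defaultdict(float)
    for identifier, count in entries:
        rates[identifier] += count / total  # merge duplicates by summing rates
    return dict(rates)
```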

3.1.3. Non-identifiers
Non-identifiers were words that in their context did not identify names, streets, or locations.
Such words included both common general domain words and specialized words from the
clinical domain such as symptoms, diseases, and treatments. The database of non-identifiers
was constructed from multiple text sources from the general and clinical domains which did
not contain any of the three identifier types.
   The text sources were:

    • The Danish orthographic dictionary containing all Danish words, their conjugations, and
      abbreviations [31].
    • Product names from the list of authorized medicinal products in Denmark4 .
    • Medical abbreviations collected from different electronic sources (Appendix A)
    • All term entries in the Description tables of the SNOMED CT vocabulary of clinical
      terminology (international version with Danish extension).
    • The Danish healthcare system’s classification system for symptoms, diagnoses, and
      operations5 .

3.2. Ambiguous Identifiers
An identifier could be ambiguous for two reasons. One reason was that it had multiple different
identifier types, e.g. ‘Kolding’ is both a location and a name. In that case, the identifiers’ rates of
occurrence were added. Another reason was that it was also a non-identifier. To find those cases,
identifiers were matched against the database of non-identifiers using a case-insensitive regular
expression (regex). If an identifier matched a non-identifier, it was ambiguous.
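A sketch of this matching step; since both lists are lowercased during cleaning, the case-insensitive regex match reduces to a set intersection (a simplification of the regex-based search described above):

```python
def split_by_ambiguity(
    identifiers: set[str], non_identifiers: set[str]
) -> tuple[set[str], set[str]]:
    """Split identifiers into unambiguous and ambiguous ones.

    An identifier is ambiguous if it also occurs in the non-identifier
    database (e.g. 'hans', which is both a name and a pronoun).
    """
    ambiguous = identifiers & non_identifiers
    return identifiers - ambiguous, ambiguous
```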


3 All datasets were downloaded from https://download.aws.dk/
4 Available at https://laegemiddelstyrelsen.dk/en/
5 Available at https://sundhedsdatastyrelsen.dk/
3.3. Automatic Annotation
For the automatic annotation of identifiers in sentences, specifically dealing with ambiguous
identifiers, we introduced a measure for the likelihood of a word being a non-identifier vs.
identifier for the specific corpus. The measure was the ratio between the rate with which the
word occurred in sentences in the corpus as either identifier or non-identifier, and the rate of
occurrence in the Danish population as identifier: $ratio = r_{corpus} / r_{population}$. A ratio above 1
meant that the word had a higher rate of occurrence in the corpus than as an identifier in the
Danish population. This could indicate that it in most cases occurred as a non-identifier in the
corpus. A ratio below 1 could indicate that the word in most cases occurred as an identifier.
   The rate of occurrence in the corpus was calculated for each identifier by searching through
all sentences using a regex, counting the number of occurrences, and dividing by the total
number of sentences. The ratio was then calculated using the equation above.
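A sketch of the ratio computation under the definitions above (whether repeated matches within one sentence count separately, and the exact regex, are assumptions; both rates must be on the same scale):

```python
import re

def corpus_population_ratio(
    identifier: str, sentences: list[str], population_rate: float
) -> float:
    """ratio = r_corpus / r_population for a single identifier.

    r_corpus: number of regex matches over all corpus sentences divided by
    the total number of sentences; population_rate: the identifier's rate
    of occurrence in the Danish population, on the same scale as r_corpus.
    """
    pattern = re.compile(rf"\b{re.escape(identifier)}\b")  # corpus is lowercased
    occurrences = sum(len(pattern.findall(s)) for s in sentences)
    r_corpus = occurrences / len(sentences)
    return r_corpus / population_rate
```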
   Next, a regex was used to search for and annotate identifiers in the sentences. Words that
were unambiguous identifiers were annotated with their single identifier type. Words that were
ambiguous because they had multiple identifier types were annotated with both. Words that
were ambiguous because they were both an identifier and a non-identifier were annotated with
their identifier type and a non-identifier tag with two exceptions: (1) if the ratio was below 1,
they were annotated only with their identifier type, and (2) if the ratio was above a set ratio
ceiling, the identifier was not annotated, i.e. kept as a non-identifying word.
   Finally, all annotated sentences were postprocessed in the following order:

   1. If an ambiguous identifier was the same type as a neighbor identifier, it was converted to
      that type.
   2. If a single letter was between two name identifiers, it was taken as a middle initial and
      converted to a name identifier.
   3. Identifiers of the same type which were next to each other were converted to a single
      identifier consisting of multiple words.
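The annotation decision for a word that is both an identifier and a non-identifier, together with postprocessing rules 1 and 2, might be sketched as follows (the tag names and the per-token tag-set representation are hypothetical; rule 3 is left to span extraction):

```python
def annotate_ambiguous(id_type: str, ratio: float, ceiling: float) -> set[str]:
    """Tag a word that is both an identifier and a non-identifier.

    Below 1, the word mostly occurs as an identifier; above the ceiling it
    is kept as a non-identifying word; in between it stays ambiguous, and
    the sentence is later excluded from the training set.
    """
    if ratio < 1:
        return {id_type}       # exception (1): identifier type only
    if ratio > ceiling:
        return {"O"}           # exception (2): kept as non-identifier
    return {id_type, "O"}      # ambiguous: identifier type plus non-identifier tag

def postprocess(tokens: list[str], tags: list[set[str]]) -> list[set[str]]:
    """Postprocessing rules 1 and 2 on per-token tag sets."""
    n = len(tokens)
    # Rule 1: an ambiguous token adopts the type of an unambiguous neighbour.
    for i in range(n):
        if len(tags[i]) > 1:
            for j in (i - 1, i + 1):
                if 0 <= j < n and len(tags[j]) == 1:
                    shared = tags[i] & tags[j]
                    if shared and shared != {"O"}:
                        tags[i] = shared
    # Rule 2: a single letter between two name tokens is a middle initial.
    for i in range(1, n - 1):
        if len(tokens[i]) == 1 and tags[i - 1] == {"NAME"} == tags[i + 1]:
            tags[i] = {"NAME"}
    return tags
```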

  We tested values for the ratio ceiling on a binary logarithmic scale from 1 to 262,144.

3.4. Named Entity Recognition Model
We used the automatically annotated sentences to create multiple datasets, based on different
values for the ratio ceiling, for training Princeton University Relation Extraction system (PURE)
[32] NER models to de-identify name, street, and location identifiers in the corpus.

3.4.1. Datasets
The validation and test sets each contained 1,500 sentences. They were annotated for names,
streets, and locations by one of the authors using the CLAMP software [33]. The sentences for
the validation and test sets were selected by setting the ratio ceiling to the median ratio of all
identifiers and choosing 500 sentences with no identifiers, 500 with only unambiguous identifiers,
and 500 with at least one ambiguous identifier. The distributions of types of ambiguous and
unambiguous identifiers were approximately the same as in the entire corpus. Selecting the
validation and test sets in this way ensured that as many models as possible would see sentences
that varied with regard to identifier types, ambiguity, and number of identifiers.
   While the validation and test sets were human annotated and fixed for all models, the training
sets were annotated automatically using the described method and varied with each of the
tested ratio ceilings used for the automatic annotation. Training sets were constructed from all
sentences not used for the validation and test sets. Only sentences with no or unambiguous
identifiers were selected for the training sets since the NER model was only trained with
unambiguous samples. In cases where the number of sentences containing no identifiers was
higher than the number containing identifiers, the former was downsampled to the latter.
   All datasets were converted to the structure used by PURE.
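A sketch of the training set construction, including the downsampling step; the JSON layout mirrors the DyGIE++-style format that PURE consumes, but the exact keys used by the authors' converter are an assumption:

```python
import json
import random

def build_training_set(sentences, out_path: str, seed: int = 0) -> None:
    """Write unambiguous sentences as a PURE-style JSONL training set.

    `sentences` is a list of (tokens, spans) pairs where spans are
    [start, end, label] token offsets; sentences containing ambiguous
    identifiers are assumed to have been filtered out already.
    """
    rng = random.Random(seed)
    with_ids = [s for s in sentences if s[1]]
    without_ids = [s for s in sentences if not s[1]]
    # Downsample identifier-free sentences to the number with identifiers.
    if len(without_ids) > len(with_ids):
        without_ids = rng.sample(without_ids, len(with_ids))
    with open(out_path, "w", encoding="utf-8") as f:
        for i, (tokens, spans) in enumerate(with_ids + without_ids):
            doc = {"doc_key": f"sent_{i}", "sentences": [tokens],
                   "ner": [spans], "relations": [[]]}
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
```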

3.4.2. Training of Model
For each training set automatically annotated with the different ratio ceilings, a PURE NER
model was trained with a publicly available uncased Danish pretrained BERT [34] model6 as
base. The default hyperparameters of PURE were used (see [32]) except a context window of 0.
Models were trained until convergence (maximum 100 epochs). The F1 score on the validation
set was used to select the best model checkpoint from each training.

3.4.3. Evaluation of Model
The best performing model on the current data was found by evaluating the F1 scores on the test
set. Performance on the three identifier types was evaluated in a confusion matrix. Additionally,
for each model, we compared its test set performance to that of the dictionary-based method
used for annotating its training set to see if the model generalized from its training data to
improve performance.
   The ratio ceiling used for automatic annotation of the training set for the best performing
model was tested for model training with less available data to evaluate the minimum amount
needed for top model performance.
   Finally, we analyzed the effect of lowering the ratio ceiling to produce more training samples
when there was less data than the minimum amount needed for top model performance.


4. Results
4.1. Identifiers
The list of identifiers had 449,997 unambiguous identifiers: 397,348 names, 48,859 streets, and
3,790 locations. 18,057 identifiers were ambiguous: 16,582 had a name type, 2,505 a street type,
and 3,133 a location type. 3,859 of the ambiguous identifiers had more than one identifier type.
7,148 ambiguous identifiers matched a non-identifier in the Danish orthographic dictionary,
312 in authorized medicinal products, 406 in medical abbreviations, 9,890 in SNOMED CT, and
2,013 in the healthcare system’s classification system. Identifiers had rates of occurrence in the
population between 8.58e-06% and 30.39% with median 1.72e-05%.
6 Available at https://github.com/certainlyio/nordic_bert
[Figure 1 omitted: line chart over ratio ceilings 1–262,144.]
Figure 1: Model F1 and dictionary-based F1 in blue, right axis. Number of samples in the training
set in green, left axis.


   Automatically annotated identifiers had corpus vs. population ratios between 4.42e-03 and
9.82e+06 (median 306.73). The highest ratio belonged to 'og' (ambiguous name and the conjunction
'and') while the lowest belonged to 'københavn' (ambiguous location).

4.2. Named Entity Recognition Model

Table 1
Identifiers in the validation and test sets.
              Validation set   Test set
   Name             943           965
   Street            77            97
   Location         353           324
   Total          1,373         1,386

   Table 1 shows the distribution of identifier types in the human annotated validation and test
sets.
   Figure 1 shows the test set F1 scores for the models whose training sets were automatically
annotated with different ratio ceilings. The F1 score of the dictionary-based method and the
amount of samples in the training sets are also plotted (further details in Appendix B).
   The best model was trained with data automatically annotated with a 512 ratio ceiling (32,970
training set samples). It had a recall of 93.43%, a precision of 86.10%, and an F1 of 89.62%. There
was an upward trend in F1 across ratio ceilings 1–512 and a downward trend across 512–262,144.
All model F1 scores were higher than the corresponding dictionary-based F1 scores. Most training
set samples were produced with the ratio ceiling at 2 (42,894) while the fewest were produced by
ratio ceiling 262,144 (16,614). Training time for the best model was 2 hours and 4 minutes for 20
epochs on an Nvidia Tesla V100 GPU.

[Figure 2 omitted: confusion matrix.]
Figure 2: Test set confusion matrix. 'O' denotes non-identifiers, for which only errors were counted.
   Figure 2 shows the confusion matrix for model performance. 94% of street and name identifiers,
and 91% of location identifiers were classified correctly. Non-identifiers were most often
misclassified as names (75% of misclassifications).
   Comparing test set performance to the dictionary-based method, the model correctly classified
283 identifiers that the dictionary-based method misclassified. The dictionary-based method
correctly classified 13 identifiers that the model misclassified. 70 identifiers were misclassified
by both the model and the dictionary-based method. Appendix C shows the performance of
the model and the dictionary-based method on words that occurred in the test set both as
non-identifiers and identifiers. E.g., for the word ‘per’, the model correctly classified it as an
identifier (name) in 100% of cases and as a non-identifier (preposition: ‘per’) in 91% of cases. For
the dictionary-based method, it was 57% and 100%, respectively. Note that the dictionary-based
method could classify the same word differently because of the postprocessing steps where an
ambiguous identifier could be converted to an unambiguous identifier under certain conditions.
Among all words that occurred both as non-identifiers and identifiers, the model classified 92%
of non-identifiers and 84% of identifiers correctly. For the dictionary-based method, it was 96%
and 50%, respectively.

4.3. Analysis of Ratio Ceiling
We analyzed the effect of lowering the ratio ceiling to produce more training samples when
there was less data than needed for top model performance.
   The best performing model was trained on data automatically annotated with ratio ceiling
512 and had 147,000 sentences available from which 32,970 sentences were used for the training
set. We lowered the amount of available data for automatic annotation with ratio ceiling 512
from 147,000 down to 12,000 sentences without any reduction in performance.
   Next, we tested which ratio ceiling performed best when lowering the amount of available
data below 12,000 sentences. We included ratio ceilings between 512 and 16 since they generated
increasingly more samples for the training set (Figure 3). When fewer than 8,000 sentences were
available, performance with lower ratio ceilings surpassed that of the 512 ratio ceiling in some
cases (Figure 4).

[Figure 3 omitted: line chart, one line per ratio ceiling 16–512.]
Figure 3: Number of training set samples generated from the available data by automatically
annotating with ratio ceilings 16–512.

[Figure 4 omitted: line chart, one line per ratio ceiling 16–512.]
Figure 4: F1 scores for models with training sets automatically annotated with ratio ceilings
16–512, by amount of available data.


5. Discussion
We used an automatically annotated training set to train a PURE NER deep learning model to
de-identify names, streets, and locations in Danish narrative clinical text with a recall of 93.43%,
precision of 86.10%, and F1 score of 89.62%. Non-identifiers were most often misclassified as
names, which may be caused by the greater variability of names compared with streets and locations.
   We took an approach similar to that of Pantazos et al. [12] to identify ambiguous identifiers by
matching identifiers to a database of non-identifiers. While, for de-identification, they deleted
records containing ambiguous identifiers that occurred fewer than 200 times, we trained a deep
learning model on an automatically annotated training set to de-identify ambiguous identifiers. For the
automatic annotation, we handled ambiguous identifiers by calculating the ratio between the
rate of occurrence in the corpus and the rate of occurrence in the population for every identifier.
This method allowed an individual assessment of whether each should be annotated as an identifier
or not in the training data, increasing the chance of model generalization. The ratio ceiling also
made it possible to balance the quality and amount of training data. Analyzing the ratio ceiling, we
found that when fewer than 8,000 sentences were available, the extra samples provided by a lower
ratio ceiling became more important than using the ratio ceiling that gave the highest quality
training data. Lower ratio ceilings produced more training data because more ambiguous identifiers
were considered non-identifiers, resulting in fewer ambiguous sentences that had to be discarded
from the training set.
   We saw an increase in F1 from dictionary-based de-identification to annotating a training set
with the dictionary-based method, training a NER model, and de-identifying with the trained
model. This showed that the model generalized from the training data to better classify the
ambiguous identifiers that the dictionary-based approach could not differentiate, and achieve
better performance than simply using the dictionaries to directly annotate text. This is supported
by the model correctly de-identifying 84% of words that occurred in the test set both as identifier
and non-identifier. Only 50% of these words were de-identified by the dictionary-based method.

5.1. Limitations
It is a limitation of the study that the data came only from Odense University Hospital, while the
ratios were calculated using rates of occurrence in the entire population of Denmark.
    Future work includes de-identification of the rest of the HIPAA Safe Harbor identifiers since
there is no guarantee that the presented methods will generalize to other identifiers. Since
this study used lowercased data because only a lowercased Danish BERT base was available,
exploring performance when keeping the case of training data is also part of future work.


6. Conclusions
We trained a NER deep learning model using automatically annotated data to de-identify names,
streets, and locations in Danish narrative clinical text with recall 93.43%, precision 86.10%, and
F1 89.62%. A model trained on data annotated with a dictionary-based method can generalize
and surpass the performance of the dictionary-based method. A ratio ceiling of 512 works best
for Danish narrative clinical text when more than 8,000 sentences are available.
   The automatic de-identification method presented in this study can be adapted to all languages
and domains if lists of identifiers and non-identifiers are available. Apart from the lists, the
method does not need any external data as the input data to the de-identification model is used
to train the model itself. This makes the method particularly useful for low-resource languages
where annotated datasets and trained models for de-identification of specific identifier types
are not always publicly available.


References
 [1] V. E. Valkhoff, P. M. Coloma, G. M. Masclee, R. Gini, F. Innocenti, F. Lapi, M. Molokhia,
     M. Mosseveld, M. S. Nielsson, M. Schuemie, F. Thiessard, J. van der Lei, M. C. Sturken-
     boom, G. Trifirò, Validation study in four health-care databases: upper gastrointesti-
     nal bleeding misclassification affects precision but not magnitude of drug-related up-
     per gastrointestinal bleeding risk, Journal of Clinical Epidemiology 67 (2014) 921–931.
     URL: https://www.sciencedirect.com/science/article/pii/S0895435614000845. doi:10.1016/j.jclinepi.2014.02.020.
 [2] L. R. Øie, M. A. Madsbu, C. Giannadakis, A. Vorhaug, H. Jensberg, Ø. Salvesen, S. Gulati,
     Validation of intracranial hemorrhage in the Norwegian patient registry, Brain and
     Behavior 8 (2018) e00900. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/brb3.900.
     doi:10.1002/brb3.900.
 [3] J. Delekta, S. M. Hansen, K. S. AlZuhairi, C. S. Bork, A. M. Joensen, The validity of the
     diagnosis of heart failure (I50.0-I50.9) in the Danish national patient register, Dan Med J 65
     (2018).
 [4] T. L. Higgins, A. Deshpande, M. D. Zilberberg, P. K. Lindenauer, P. B. Imrey, P.-C.
     Yu, S. D. Haessler, S. S. Richter, M. B. Rothberg, Assessment of the Accuracy of
     Using ICD-9 Diagnosis Codes to Identify Pneumonia Etiology in Patients Hospitalized
     With Pneumonia, JAMA Network Open 3 (2020) e207750. URL: https://doi.org/10.1001/
     jamanetworkopen.2020.7750. doi:10.1001/jamanetworkopen.2020.7750.
 [5] N. Wabe, L. Li, R. Lindeman, J. J. Post, M. R. Dahm, J. Li, J. I. Westbrook, A. Georgiou,
     Evaluation of the accuracy of diagnostic coding for influenza compared to laboratory
     results: the availability of test results before hospital discharge facilitates improved coding
     accuracy, BMC Medical Informatics and Decision Making 21 (2021) 168. URL: https://doi.org/
     10.1186/s12911-021-01531-9. doi:10.1186/s12911-021-01531-9.
 [6] GDPR, Regulation (EU) 2016/679 (General Data Protection Regulation), 2016. URL: https://gdpr-info.eu/.
 [7] HIPAA, Health Insurance Portability and Accountability Act of 1996 (HIPAA), Public Law
     104-191, 1996. URL: https://www.hhs.gov/hipaa/for-professionals/index.html.
 [8] B. A. Beckwith, R. Mahaadevan, U. J. Balis, F. Kuo, Development and evaluation of an open
     source software tool for deidentification of pathology reports, BMC Medical Informatics
     and Decision Making 6 (2006) 12.
 [9] I. Neamatullah, M. M. Douglass, L.-W. H. Lehman, A. Reisner, M. Villarroel, W. J. Long,
     P. Szolovits, G. B. Moody, R. G. Mark, G. D. Clifford, Automated de-identification of
     free-text medical records, BMC Medical Informatics and Decision Making 8 (2008) 32.
[10] E. Chazard, C. Mouret, G. Ficheur, A. Schaffar, J.-B. Beuscart, R. Beuscart, Proposal and
     evaluation of fasdim, a fast and simple de-identification method for unstructured free-
     text clinical records, International Journal of Medical Informatics 83 (2014) 303–312.
     URL: https://www.sciencedirect.com/science/article/pii/S1386505613002463. doi:10.1016/j.ijmedinf.2013.11.005.
[11] S.-Y. Shin, Y. R. Park, et al., A de-identification method for bilingual clinical texts of various
     note types, Journal of Korean Medical Science 30 (2015) 7–15. URL: http://www.e-sciencecentral.org/
     articles/?scid=1022920. doi:10.3346/jkms.2015.30.1.7.
[12] K. Pantazos, S. Lauesen, S. Lippert, Preserving medical correctness, readability and
     consistency in de-identified health records, Health Informatics Journal 23 (2017) 291–303.
     URL: https://doi.org/10.1177/1460458216647760. doi:10.1177/1460458216647760.
[13] V. Menger, F. Scheepers, L. M. van Wijk, M. Spruit, DEDUCE: A pattern matching
     method for automatic de-identification of Dutch medical text, Telematics and In-
     formatics 35 (2018) 727–736. URL: https://www.sciencedirect.com/science/article/pii/
     S0736585316307365. doi:10.1016/j.tele.2017.08.002.
[14] H. Fabregat, A. Duque, J. Martinez-Romo, L. Araujo, De-identification through named
     entity recognition for medical document anonymization, Proceedings of the Iberian
     Languages Evaluation Forum (IberLEF 2019) (2019).
[15] K. Kajiyama, H. Horiguchi, T. Okumura, M. Morita, Y. Kano, De-identifying free text of
     Japanese electronic health records, Journal of Biomedical Semantics 11 (2020) 11.
[16] L. Lange, H. Adel, J. Strötgen, Closing the gap: Joint de-identification and concept
     extraction in the clinical domain, in: Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics, Association for Computational Linguistics,
     Online, 2020, pp. 6945–6952. URL: https://aclanthology.org/2020.acl-main.621. doi:10.18653/v1/2020.acl-main.621.
[17] J. L. Leevy, T. M. Khoshgoftaar, F. Villanustre, Survey on RNN and CRF models for
     de-identification of medical free text, Journal of Big Data 7 (2020) 73.
[18] I. Pérez-Díez, R. Pérez-Moraga, A. López-Cerdán, J.-M. Salinas-Serrano, M. de la Iglesia-
     Vayá, De-identifying Spanish medical texts - named entity recognition applied to radiology
     reports, Journal of Biomedical Semantics 12 (2021) 6.
[19] R. Catelli, V. Casola, G. De Pietro, H. Fujita, M. Esposito, Combining contextualized
     word representation and sub-document level analysis through bi-lstm+crf architecture
     for clinical de-identification, Knowledge-Based Systems 213 (2021) 106649. URL: https:
     //www.sciencedirect.com/science/article/pii/S0950705120307784. doi:10.1016/j.knosys.2020.106649.
[20] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997)
     1735–1780. doi:10.1162/neco.1997.9.8.1735 .
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
     I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
     R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
     Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/
     paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[22] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, M. Esposito, Crosslingual named
     entity recognition for clinical de-identification applied to a COVID-19 Italian data set, Applied
     Soft Computing 97 (2020) 106779. URL: https://www.sciencedirect.com/science/article/pii/
     S1568494620307171. doi:10.1016/j.asoc.2020.106779.
[23] R. Catelli, F. Gargiulo, E. Damiano, M. Esposito, G. De Pietro, Clinical de-identification
     using sub-document analysis and ELECTRA, in: 2021 IEEE International Conference on
     Digital Health (ICDH), 2021, pp. 266–275. doi:10.1109/ICDH52753.2021.00050.
[24] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, M. Esposito, A novel COVID-19
     data set and an effective deep learning approach for the de-identification of Italian medical
     records, IEEE Access 9 (2021) 19097–19110. doi:10.1109/ACCESS.2021.3054479.
[25] K. Murugadoss, A. Rajasekharan, B. Malin, V. Agarwal, S. Bade, J. R. Anderson, J. L.
     Ross, W. A. Faubion, J. D. Halamka, V. Soundararajan, S. Ardhanari, Building a best-in-
     class automated de-identification tool for electronic health records through ensemble
     learning, Patterns 2 (2021) 100255. URL: https://www.sciencedirect.com/science/article/
     pii/S2666389921000817. doi:10.1016/j.patter.2021.100255.
[26] C. Meaney, W. Hakimpour, S. Kalia, R. Moineddin, A comparative evaluation of transformer
     models for de-identification of clinical text data, 2022. URL: https://arxiv.org/abs/2204.07056.
     doi:10.48550/arXiv.2204.07056.
[27] A. J. McMurry, B. Fitch, G. Savova, I. S. Kohane, B. Y. Reis, Improved de-identification
     of physician notes through integrative modeling of both public and private medical text,
     BMC Medical Informatics and Decision Making 13 (2013) 112.
[28] Z. Jian, X. Guo, S. Liu, H. Ma, S. Zhang, R. Zhang, J. Lei, A cascaded approach for
     Chinese clinical text de-identification with less annotation effort, Journal of Biomedi-
     cal Informatics 73 (2017) 76–83. URL: https://www.sciencedirect.com/science/article/pii/
     S1532046417301776. doi:10.1016/j.jbi.2017.07.017.
[29] Y. Kim, P. Heider, S. Meystre, Ensemble-based methods to improve de-identification of
     electronic health record narratives, AMIA Annu Symp Proc 2018 (2018) 663–672.
[30] P. Richter-Pechanski, S. Riezler, C. Dieterich, De-identification of German medical admis-
     sion notes, Stud Health Technol Inform 253 (2018) 165–169.
[31] Danish Language Council, Retskrivningsordbogen, 4th edition including 8 digital issues,
     Danish Language Council, Bogense, Denmark, 2012.
[32] Z. Zhong, D. Chen, A frustratingly easy approach for entity and relation extraction, in:
     Proceedings of the 2021 Conference of the North American Chapter of the Association
     for Computational Linguistics: Human Language Technologies, Association for Computa-
     tional Linguistics, Online, 2021, pp. 50–61. URL: https://aclanthology.org/2021.naacl-main.5.
     doi:10.18653/v1/2021.naacl-main.5.
[33] E. Soysal, J. Wang, M. Jiang, Y. Wu, S. Pakhomov, H. Liu, H. Xu, CLAMP – a toolkit for
     efficiently building customized clinical natural language processing pipelines, Journal of
     the American Medical Informatics Association 25 (2017) 331–336. URL: https://doi.org/10.
     1093/jamia/ocx132. doi:10.1093/jamia/ocx132.
[34] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423.
     doi:10.18653/v1/N19-1423.



A. Medical Abbreviations

Table 2
Sources of medical abbreviations
Description                                                                    Link                     Deep link
List of recognized abbreviations and symbols used in Hospital Sønderjylland    skoletube.dk             https://www.skoletube.dk/download.php?key=MGVhNTVjOTFkZDdiYjIyNDU0ZGI
Symbols and abbreviations used at the laboratory, Hospital Sønderjylland       regionsyddanmark.dk      https://shs.regionsyddanmark.dk/wm272641
Abbreviations and symbols in patient related data                              regionsjaelland.dk       http://dok.regionsjaelland.dk/view.aspx?DokID=217238
Definitions and abbreviations, Department of clinical biochemistry,
Herlev and Gentofte Hospital                                                   gentoftehospital.dk      https://www.gentoftehospital.dk/afdelinger-og-klinikker/klinisk-biokemisk-afdeling/metodeblade/Documents/Definitioner%20og%20forkortelser%20%E2%80%93%20Klinisk%20Biokemisk%20Afdeling_2020_11_03.pdf
Abbreviations and symbols, psychiatry in Region Nordjylland                    pri.rn.dk                https://pri.rn.dk/Assets/15764/Forkortelser-og-symboler-gaeldende-for-sygeplejefaglige-optegnelser.pdf
Medical abbreviations                                                          medicinskefagudtryk.dk   http://medicinskefagudtryk.dk/sidebar/forkortelser/
Ordinary medical abbreviations                                                 medviden.dk              https://www.medviden.dk/vaerktoejer/andre/almindelige-medicinske-forkortelser/
Guidelines for journal writing                                                 gyldendal.dk             http://guga.gyldendal.dk/Munksgaard/~/media/Munksgaard/Medicinsk%20Fagsprog%202/Retningslinjer.ashx
Abbreviations and designations                                                 medicin.dk               https://pro.medicin.dk/Artikler/Artikel/215
Abbreviations used in Department R, FMB                                        docplayer.dk             https://docplayer.dk/5008471-Forkortelser-anvendt-i-afd-r-s-instrukser-og-vejledninger-120204-fmb.html
Clinical biochemistry test results                                             pri.rn.dk                https://pri.rn.dk/Sider/10307.aspx
Family tree abbreviations                                                      pri.rn.dk                https://pri.rn.dk/Sider/29580.aspx
Abbreviations and symbols in the eye speciality                                pri.rn.dk                https://pri.rn.dk/Sider/17827.aspx
Accepted abbreviations at vascular surgery department                          pri.rn.dk                https://pri.rn.dk/Sider/17293.aspx
Renal abbreviations                                                            pri.rn.dk                https://pri.rn.dk/Sider/16695.aspx
Heart Lung surgery abbreviations                                               pri.rn.dk                https://pri.rn.dk/Sider/10530.aspx
Intensive Therapy abbreviations                                                docplayer.dk             https://docplayer.dk/11055021-Hyppigt-anvendte-forkortelser-og-termer-i-intensiv-terapi.html




B. Results Table

Table 3
Distribution of identifiers in each of the training sets automatically annotated with different ratio
ceilings and the resulting model and dictionary-based F1 scores.
 Ratio ceiling   With identifiers   Without identifiers   Total sentences   Dictionary-based F1 %   Model F1 %   Name tags   Street tags   Location tags   Total tags
 1               21,343             21,343                42,686            33.45                   43.86        26,775      1,015         4,309           32,099
 2               21,447             21,447                42,894            48.80                   65.17        29,003      1,128         3,870           34,001
 4               21,131             21,131                42,262            56.02                   72.79        28,596      1,217         3,781           33,594
 8               21,185             21,185                42,370            61.52                   79.19        28,388      1,437         3,749           33,574
 16              20,923             20,923                41,846            64.56                   82.21        27,988      1,508         3,803           33,299
 32              20,597             20,597                41,194            69.11                   85.52        27,342      1,660         3,707           32,709
 64              19,750             19,750                39,500            71.05                   87.36        26,382      2,130         3,533           32,045
 128             19,354             19,354                38,708            72.19                   87.73        25,963      2,248         3,446           31,657
 256             18,058             18,058                36,116            74.02                   89.10        24,191      2,391         3,138           29,720
 512             16,485             16,485                32,970            74.43                   89.62        22,277      2,396         2,897           27,570
 1,024           15,490             15,490                30,980            74.42                   88.24        21,001      2,323         2,652           25,976
 2,048           14,246             14,246                28,492            74.42                   88.25        19,617      2,290         2,475           24,382
 4,096           13,041             13,041                26,082            73.97                   87.85        18,127      2,205         2,332           22,664
 8,192           12,424             12,424                24,848            73.82                   86.35        17,450      2,196         2,244           21,890
 16,384          11,205             11,205                22,410            72.84                   87.48        15,826      2,133         1,931           19,890
 32,768          10,278             10,278                20,556            72.63                   87.06        14,732      2,090         1,756           18,578
 65,536          8,879              8,879                 17,758            69.77                   85.34        12,848      1,915         1,462           16,225
 131,072         8,482              8,482                 16,964            68.98                   86.01        12,330      1,903         1,380           15,613
 262,144         8,307              8,307                 16,614            68.46                   85.37        12,295      1,893         1,291           15,479




C. Ambiguous Performance
Table 4
A comparison of the model and dictionary-based performance on words that occurred both as non-
identifiers and identifiers in the test set. Only rows where there is a difference in performance are
shown. The total is calculated over all words.
                           Non-identifiers                                 Identifiers
             Dictionary-based % (total) Model % (total)   Dictionary-based % (total) Model % (total)
 per                 100% (32)              91% (32)               57% (7)             100% (7)
 hans                100% (25)              88% (25)               20% (5)             100% (5)
 maria                100% (4)              75% (4)                38% (13)            100% (13)
 ringe               100% (11)             100% (11)                0% (3)              100% (3)
 plads                100% (9)              100% (9)                0% (1)              100% (1)
 rask                 100% (8)              88% (8)                50% (2)             100% (2)
 bak                  100% (1)               0% (1)                100% (4)             100% (4)
 bo                   100% (1)              100% (1)               50% (4)             100% (4)
 do                   100% (3)              100% (3)               100% (1)              0% (1)
 tønder               100% (1)              100% (1)                0% (3)              100% (3)
 hammer               100% (1)               0% (1)                100% (2)             100% (2)
 rene                 100% (1)              100% (1)               50% (2)             100% (2)
 slagelse             100% (2)               0% (2)                 0% (1)              100% (1)
 hammel               100% (1)              100% (1)                0% (1)              100% (1)
 land                 100% (1)              100% (1)                0% (1)              100% (1)
 langeland            100% (1)               0% (1)                 0% (1)              100% (1)
 stokke               100% (1)              100% (1)                0% (1)              100% (1)
 Total               96% (296)             92% (296)              50% (88)             84% (88)