=Paper=
{{Paper
|id=Vol-3416/paper_5
|storemode=property
|title=Automatic Annotation of Training Data for Deep Learning Based De-identification of Narrative Clinical Text
|pdfUrl=https://ceur-ws.org/Vol-3416/paper_5.pdf
|volume=Vol-3416
|authors=Martin Sundahl Laursen,Jannik Skyttegaard Pedersen,Pernille Just Vinholt,Thiusius Rajeeth Savarimuthu
|dblpUrl=https://dblp.org/rec/conf/icon-nlp/LaursenPVS22
}}
==Automatic Annotation of Training Data for Deep Learning Based De-identification of Narrative Clinical Text==
Martin Sundahl Laursen¹, Jannik Skyttegaard Pedersen¹, Pernille Just Vinholt² and Thiusius Rajeeth Savarimuthu¹
¹ The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Denmark
² Department of Clinical Biochemistry, Odense University Hospital, Denmark
Abstract
Electronic health records contain information about patients’ medical history, which is important for research, but the text must be de-identified before use. This study utilized dictionaries constructed
from publicly available lists of identifiers to automatically annotate a training dataset for a named
entity recognition model to de-identify names, streets, and locations in Danish narrative clinical text.
Ambiguous identifiers were not annotated if they occurred more often than would be expected for an identifier. The
model had recall 93.43%, precision 86.10%, and F1 89.62%. We found that the model generalized from the
training data to achieve better performance than simply using the dictionaries to directly annotate text.
Keywords
de-identification, electronic health records, named entity recognition, automatic annotation, deep learning
1. Introduction
Electronic health records (EHR) contain information about patients’ contact with the healthcare
system including important information about medical history, e.g. symptoms, diagnoses, and
treatments. Diagnoses are also registered using International Classification of Diseases 10 codes
for administrative purposes. However, not all relevant patient information is represented in
codes, e.g. symptoms. Further, codes are often incorrect [1, 2, 3, 4, 5] and can therefore not
replace the narrative clinical text in EHRs as a source of information.
Apart from the treatment of patients, EHR data are important for e.g. research and education, but as they contain personally identifiable information, explicit consent from the affected individual must be given, or the data must be de-identified before being used for secondary purposes [6, 7]. The US Health Insurance Portability and Accountability Act (HIPAA) defines which identifiers must be removed according to the Safe Harbor method for de-identification.1
WNLPe-Health 2022, December 18, 2022, Delhi, India
msla@mmmi.sdu.dk (M. S. Laursen); jasp@mmmi.sdu.dk (J. S. Pedersen); pernille.vinholt@rsyd.dk (P. J. Vinholt); trs@mmmi.sdu.dk (T. R. Savarimuthu)
ORCID: 0000-0001-5684-1325 (M. S. Laursen); 0000-0002-7066-1563 (J. S. Pedersen); 0000-0002-2035-0169 (P. J. Vinholt); 0000-0002-2478-8694 (T. R. Savarimuthu)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Guide available at https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
The identifiers include, among others, names, street addresses, and locations including city,
county, and precinct.
Manual de-identification is a time-consuming task, and large datasets are therefore impractical and expensive to de-identify manually. Natural language processing techniques for automatic de-identification may alleviate this task.
This study utilizes dictionaries of identifiers and a novel way of dealing with ambiguous
identifiers to automatically annotate a training dataset for a named entity recognition (NER)
model to de-identify names, streets, and locations in Danish narrative clinical text.
A method for automatic annotation of training datasets is useful for developing de-identification
deep learning models for low-resource languages like Danish where annotated datasets and
trained models for de-identification of specific identifier types are not always publicly available.
The main contributions of this paper are:
• We train a NER model to de-identify names, streets, and locations in Danish narrative
clinical text with recall 93.43%, precision 86.10%, and F1 89.62%.
• We use dictionary-based automatic annotation of training data for the NER model utilizing
our novel method for annotation of ambiguous identifiers guided by occurrence rates in
the text and population.
• We find that the NER model can generalize from the dataset to achieve better performance
than simply using the dictionaries to directly annotate text.
2. Related Work
Previous studies on automatic de-identification of narrative clinical text used rule-based methods,
machine learning methods, and hybrid methods combining both. We found no studies that,
similar to ours, used automatic annotation of data for training a machine learning model to
de-identify narrative clinical text.
In studies that used rule-based methods, pattern matching or dictionaries were used to search
for identifiers in the text [8, 9, 10, 11, 12, 13]. Rule-based methods rely on domain experts to
define the rules and it is difficult to cover all cases. They generally cannot distinguish ambiguous
identifiers, i.e. words that can both be an identifier and a non-identifier depending on the context.
Pantazos et al. [12] de-identified Danish text using dictionaries with an F1 score of 95.7% on a random sample of 369 EHRs. They identified ambiguous identifiers by matching identifiers against a database of non-identifiers. As they replaced identifiers with pseudo-identifiers, their approach to ambiguous identifiers was to delete the record unless the identifier appeared more than 200 times, so as not to disclose their replacement rule.
Studies that used machine learning methods mainly used recurrent neural networks, condi-
tional random fields, and combinations of the two [14, 15, 16, 17, 18, 19]. All studies that utilized
machine learning used a manually annotated dataset for training the models. Machine learning
methods and in particular deep learning architectures such as Long Short-Term Memory [20]
and transformer [21] networks are able to distinguish ambiguous identifiers based on the context
of the whole sentence. Some recent studies have used transformer networks for automatic
de-identification of narrative clinical text [22, 19, 23, 24, 25, 26]. A disadvantage of machine
learning methods is their need for a large expert-annotated training dataset specific to the
domain.
Finally, the studies that were most similar to ours used hybrid methods, combining rule-based and machine learning methods in ensembles or pipelines to reduce the annotation workload and improve model performance [27, 28, 29, 30, 25]. Two studies used rule-based methods in other ways than for directly classifying identifiers. McMurry et al. [27] used pattern matching to contribute part of a feature set for classification by a machine learning model, which resulted in an F1 score of 76% on a custom test set of 220 discharge summaries. Jian et al. [28] used pattern matching to create a dense corpus of identifiers for manual annotation before input to a machine learning model. It achieved an F1 score of 94.6% when cross-validating on 3,000 clinical documents.
3. Methods
In this paper, we first constructed lists of name, street, and location identifiers. We compared
the identifiers to a database of non-identifying words to determine which identifiers were
ambiguous—e.g. the name ‘Hans’ is ambiguous because it is also a pronoun (Danish for ‘his’).
This dictionary-based method is similar to that of e.g. Pantazos et al. [12] except in this paper,
we used it to annotate training data for a deep learning model instead of using direct dictionary-
based de-identification. Additionally, we utilized a novel method for annotation of ambiguous
identifiers and tested different ceiling values above which words were removed from the list of
identifiers if they occurred in the text at a higher rate than would be expected for an identifier.
We searched for and annotated identifiers in Danish narrative clinical text and constructed
a training set of sentences with no or only unambiguous identifiers. Finally, the training set
was used to train a NER de-identification model. The goal was for the model to generalize
from the training samples with no or only unambiguous identifiers to also correctly classify
ambiguous identifiers. This process is detailed in the rest of this section. We make our code
publicly available.2
3.1. Data
3.1.1. Corpus
We extracted 150,000 random sentences with a length between 8 and 70 words from EHRs from
Odense University Hospital between 2015 and 2020. Sentences were lowercased and tokenized,
and consecutive underscores and hyphens were reduced to a single instance.
3.1.2. Identifiers
The identifier types were names, streets, and locations. Locations included cities, municipalities,
regions, and provinces.
For the name identifiers, we obtained lists of all male first names, female first names, and last
names in the Danish population as of January 2021 from Statistics Denmark.
2 https://github.com/jannikskytt/clinical_de-identification
For the street identifiers, we used a database of all Danish addresses from the Address Web
Services of the Agency for Data Supply and Efficiency of Denmark3 . Each address included
street name, addressing street name (could be identical to street name), city name, potential
supplemental city name, municipality, region, and province. From the database, a list of unique
street names including addressing street names was constructed.
For the location identifiers, we used the same database of all Danish addresses. A list of
unique locations including city names, supplemental city names, municipalities, regions, and
provinces was constructed.
Data cleaning consisted of lowercasing and removing single-letter, empty, and corrupted identifiers, including various placeholders.
A rate of occurrence in the Danish population was calculated for each identifier by dividing its occurrence in the population by the sum of all occurrences for that identifier type. For each identifier type, duplicates were merged by adding the rates of occurrence.
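As a hedged illustration of this computation, the sketch below assumes each identifier list is available as a mapping from identifier to its count in the population; the function and variable names are ours, not the authors'.

```python
from collections import defaultdict

def population_rates(counts: dict[str, int]) -> dict[str, float]:
    """Rate of occurrence per identifier: its count divided by the sum of
    all counts for that identifier type."""
    total = sum(counts.values())
    return {ident: n / total for ident, n in counts.items()}

def merge_duplicates(*rate_dicts: dict[str, float]) -> dict[str, float]:
    """Merge duplicates within an identifier type by adding their rates,
    e.g. a name occurring in both the male and female first-name lists."""
    merged: dict[str, float] = defaultdict(float)
    for rates in rate_dicts:
        for ident, rate in rates.items():
            merged[ident] += rate
    return dict(merged)
```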
3.1.3. Non-identifiers
Non-identifiers were words that in their context did not identify names, streets, or locations.
Such words included both common general domain words and specialized words from the
clinical domain such as symptoms, diseases, and treatments. The database of non-identifiers
was constructed from multiple text sources from the general and clinical domains which did
not contain any of the three identifier types.
The text sources were:
• The Danish orthographic dictionary containing all Danish words, their conjugations, and
abbreviations [31].
• Product names from the list of authorized medicinal products in Denmark4 .
• Medical abbreviations collected from different electronic sources (Appendix A).
• All term entries in the Description tables of the SNOMED CT vocabulary of clinical
terminology (international version with Danish extension).
• The Danish healthcare system’s classification system for symptoms, diagnoses, and
operations5 .
3.2. Ambiguous Identifiers
An identifier could be ambiguous for two reasons. One reason was that it had multiple different
identifier types, e.g. ‘Kolding’ is both a location and a name. In that case, the identifiers’ rates of
occurrence were added. Another reason was that it was also a non-identifier. To find those cases,
identifiers were matched against the database of non-identifiers using a case-insensitive regular expression (regex). If an identifier matched a non-identifier, it was ambiguous.
3 All datasets were downloaded from https://download.aws.dk/
4 Available at https://laegemiddelstyrelsen.dk/en/
5 Available at https://sundhedsdatastyrelsen.dk/
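A minimal sketch of the two ambiguity checks, assuming the identifiers and non-identifiers are available as plain sets of strings; set lookups on lowercased words stand in for the paper's case-insensitive regex matching.

```python
def find_ambiguous(identifiers_by_type: dict[str, set[str]],
                   non_identifiers: set[str]) -> set[str]:
    """Return words that are ambiguous for either of the two reasons above."""
    non_ident = {w.lower() for w in non_identifiers}
    type_of: dict[str, str] = {}
    ambiguous: set[str] = set()
    for id_type, idents in identifiers_by_type.items():
        for ident in (w.lower() for w in idents):
            # Reason 1: the word occurs under more than one identifier type
            # (its rates of occurrence are then added).
            if ident in type_of and type_of[ident] != id_type:
                ambiguous.add(ident)
            type_of.setdefault(ident, id_type)
            # Reason 2: the word is also a non-identifier.
            if ident in non_ident:
                ambiguous.add(ident)
    return ambiguous
```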
3.3. Automatic Annotation
For the automatic annotation of identifiers in sentences, specifically for dealing with ambiguous identifiers, we introduced a measure of the likelihood of a word being a non-identifier vs. an identifier in the specific corpus. The measure was the ratio between the rate with which the
word occurred in sentences in the corpus as either identifier or non-identifier, and the rate of
occurrence in the Danish population as an identifier: $ratio = r_{corpus} / r_{population}$. A ratio above 1 meant that the word occurred at a higher rate in the corpus than it does as an identifier in the Danish population, which could indicate that it mostly occurred as a non-identifier in the corpus. A ratio below 1 could indicate that the word mostly occurred as an identifier.
The rate of occurrence in the corpus was calculated for each identifier by searching through
all sentences using a regex, counting the number of occurrences, and dividing by the total
number of sentences. The ratio was then calculated using the equation.
Next, a regex was used to search for and annotate identifiers in the sentences. Words that
were unambiguous identifiers were annotated with their single identifier type. Words that were
ambiguous because they had multiple identifier types were annotated with both. Words that
were ambiguous because they were both an identifier and a non-identifier were annotated with
their identifier type and a non-identifier tag with two exceptions: (1) if the ratio was below 1,
they were annotated only with their identifier type, and (2) if the ratio was above a set ratio
ceiling, the identifier was not annotated, i.e. kept as a non-identifying word.
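A sketch of this decision logic, under the assumption that the per-word rates have already been computed; the tag name "O" for the non-identifier tag is our convention, not the paper's.

```python
import re

def corpus_rate(identifier: str, sentences: list[str]) -> float:
    """r_corpus: occurrences of the identifier across all sentences divided
    by the total number of sentences (the corpus is already lowercased)."""
    pattern = re.compile(rf"\b{re.escape(identifier)}\b")
    hits = sum(len(pattern.findall(s)) for s in sentences)
    return hits / len(sentences)

def annotation_tags(id_types: list[str], also_non_identifier: bool,
                    r_corpus: float, r_population: float,
                    ratio_ceiling: float) -> list[str]:
    """Tags attached to an occurrence of an identifier during automatic
    annotation, following the rules described above."""
    if not also_non_identifier:
        return id_types            # one or several identifier types only
    ratio = r_corpus / r_population
    if ratio < 1:
        return id_types            # exception (1): mostly an identifier
    if ratio > ratio_ceiling:
        return []                  # exception (2): kept as a non-identifying word
    return id_types + ["O"]        # identifier type(s) plus non-identifier tag
```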
Finally, all annotated sentences were postprocessed in the following order (a sketch of the steps follows the list):
1. If an ambiguous identifier had the same type as a neighboring identifier, it was converted to that type.
2. If a single letter was between two name identifiers, it was taken as a middle initial and
converted to a name identifier.
3. Identifiers of the same type which were next to each other were converted to a single
identifier consisting of multiple words.
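The sketch below implements the three steps on per-token tag sets; this token-level representation, and the restriction to single-type neighbors in step 1, are our assumptions.

```python
def postprocess(tokens: list[str], tags: list[set[str]]) -> list[set[str]]:
    """Apply the three postprocessing steps; "O" marks a non-identifier."""
    n = len(tokens)
    # 1. Resolve an ambiguous identifier to the type of a neighboring identifier.
    for i in range(n):
        if len(tags[i]) > 1:
            for j in (i - 1, i + 1):
                if (0 <= j < n and len(tags[j]) == 1
                        and tags[j] != {"O"} and tags[j] < tags[i]):
                    tags[i] = set(tags[j])
    # 2. A single letter between two name identifiers is a middle initial.
    for i in range(1, n - 1):
        if len(tokens[i]) == 1 and tags[i - 1] == {"name"} and tags[i + 1] == {"name"}:
            tags[i] = {"name"}
    # 3. Adjacent identifiers of the same type form one multi-word identifier;
    #    in this per-token representation the merge happens when spans are
    #    emitted, so no tag changes are needed here.
    return tags
```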
We tested values for the ratio ceiling on a binary logarithmic scale from 1 to 262,144.
3.4. Named Entity Recognition Model
We used the automatically annotated sentences to create multiple datasets, based on different
values for the ratio ceiling, for training Princeton University Relation Extraction system (PURE)
[32] NER models to de-identify name, street, and location identifiers in the corpus.
3.4.1. Datasets
The validation and test sets each contained 1,500 sentences. They were annotated for names,
streets, and locations by one of the authors using the CLAMP software [33]. The sentences for
the validation and test sets were selected by setting the ratio ceiling to the median ratio of all
identifiers and choosing 500 sentences with no identifiers, 500 with only unambiguous identifiers,
and 500 with at least one ambiguous identifier. The distributions of types of ambiguous and
unambiguous identifiers were approximately the same as in the entire corpus. Selecting the
validation and test sets in this way ensured that as many models as possible would be evaluated on sentences that varied with regard to identifier types, ambiguity, and number of identifiers.
While the validation and test sets were human annotated and fixed for all models, the training
sets were annotated automatically using the described method and varied with each of the
tested ratio ceilings used for the automatic annotation. Training sets were constructed from all
sentences not used for the validation and test sets. Only sentences with no or unambiguous
identifiers were selected for the training sets since the NER model was only trained with
unambiguous samples. In cases where the number of sentences containing no identifiers was
higher than the number containing identifiers, the former was downsampled to the latter.
All datasets were converted to the structure used by PURE.
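PURE reads one JSON document per line with doc_key, sentences, ner, and relations fields; the converter below follows our reading of PURE's input format and should be checked against the PURE repository before use.

```python
import json

def to_pure_record(doc_key: str, tokens: list[str],
                   spans: list[tuple[int, int, str]]) -> str:
    """Serialize one annotated sentence as a PURE-style jsonl line.

    spans holds (start, end, label) token offsets with end inclusive,
    e.g. [(0, 1, "name")] for a two-token name at the sentence start.
    """
    return json.dumps({
        "doc_key": doc_key,
        "sentences": [tokens],                       # one sentence per document
        "ner": [[[s, e, label] for s, e, label in spans]],
        "relations": [[]],                           # unused for de-identification
    })
```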
3.4.2. Training of Model
For each training set automatically annotated with the different ratio ceilings, a PURE NER
model was trained with a publicly available uncased Danish pretrained BERT [34] model6 as the base. The default hyperparameters of PURE were used (see [32]), except for a context window of 0.
Models were trained until convergence (maximum 100 epochs). The F1 score on the validation
set was used to select the best model checkpoint from each training.
3.4.3. Evaluation of Model
The best performing model on the current data was found by evaluating the F1 scores on the test
set. Performance on the three identifier types was evaluated in a confusion matrix. Additionally,
for each model, we compared its test set performance to that of the dictionary-based method
used for annotating its training set to see if the model generalized from its training data to
improve performance.
The ratio ceiling used for automatic annotation of the training set for the best performing
model was tested for model training with less available data to evaluate the minimum amount
needed for top model performance.
Finally, we analyzed the effect of lowering the ratio ceiling to produce more training samples
when there was less data than the minimum amount needed for top model performance.
4. Results
4.1. Identifiers
The list of identifiers had 449,997 unambiguous identifiers: 397,348 names, 48,859 streets, and
3,790 locations. 18,057 identifiers were ambiguous: 16,582 had a name type, 2,505 a street type,
and 3,133 a location type. 3,859 of the ambiguous identifiers had more than one identifier type.
7,148 ambiguous identifiers matched a non-identifier in the Danish orthographic dictionary,
312 in authorized medicinal products, 406 in medical abbreviations, 9,890 in SNOMED CT, and
2,013 in the healthcare system’s classification system. Identifiers had rates of occurrence in the
population between 8.58e-06% and 30.39% with median 1.72e-05%.
6 Available at https://github.com/certainlyio/nordic_bert
[Figure 1: line chart; x-axis: ratio ceiling (1 to 262,144, binary logarithmic scale); left y-axis: generated training samples, n; right y-axis: F1 score, %. Marked values: model F1 89.62 and dictionary-based F1 74.43 at ratio ceiling 512, with 32,970 training samples.]
Figure 1: Model F1 and dictionary-based F1 in blue, right axis. Number of samples in the training set in green, left axis.
Automatically annotated identifiers had corpus vs. population ratios between 4.42e-03 and
9.82e+06 (median 306.73). The highest ratio was ‘københavn’ (ambiguous location) while the
lowest ratio was ‘og’ (ambiguous name and conjunction ‘and’).
4.2. Named Entity Recognition Model
Table 1
Identifiers in the validation and test sets.

         | Validation set | Test set
Name     | 943            | 965
Street   | 77             | 97
Location | 353            | 324
Total    | 1,373          | 1,386
Table 1 shows the distribution of identifier types in the human annotated validation and test
sets.
Figure 1 shows the test set F1 scores for the models whose training sets were automatically annotated with different ratio ceilings. The F1 score of the dictionary-based method and the number of samples in the training sets are also plotted (further details in Appendix B).
The best model was trained with data automatically annotated with a ratio ceiling of 512 (32,970 training set samples). It had a recall of 93.43%, a precision of 86.10%, and an F1 of 89.62%. There was an upward trend in F1 across the 1–512 ratio ceilings and a downward trend across 512–262,144. All model F1 scores were higher than the corresponding dictionary-based F1 scores. The most training set samples were produced with the ratio ceiling at 2 (42,894), while the fewest were produced by ratio ceiling 262,144 (16,614). Training time for the best model was 2 hours 4 minutes for 20 epochs on an Nvidia Tesla V100 GPU.

Figure 2: Test set confusion matrix. ‘O’ is non-identifiers, for which only errors were counted.
Figure 2 shows the confusion matrix for model performance. 94% of street and name identifiers,
and 91% of location identifiers were classified correctly. Non-identifiers were most often
misclassified as names (75% of misclassifications).
Comparing test set performance to the dictionary-based method, the model correctly classified
283 identifiers that the dictionary-based method misclassified. The dictionary-based method
correctly classified 13 identifiers that the model misclassified. 70 identifiers were misclassified
by both the model and the dictionary-based method. Appendix C shows the performance of
the model and the dictionary-based method on words that occurred in the test set both as
non-identifiers and identifiers. E.g., for the word ‘per’, the model correctly classified it as an
identifier (name) in 100% of cases and as a non-identifier (preposition: ‘per’) in 91% of cases. For
the dictionary-based method, it was 57% and 100%, respectively. Note that the dictionary-based
method could classify the same word differently because of the postprocessing steps where an
ambiguous identifier could be converted to an unambiguous identifier under certain conditions.
Among all words that occurred both as non-identifiers and identifiers, the model classified 92%
of non-identifiers and 84% of identifiers correctly. For the dictionary-based method, it was 96%
and 50%, respectively.
4.3. Analysis of Ratio Ceiling
We analyzed the effect of lowering the ratio ceiling to produce more training samples when
there was less data than needed for top model performance.
The best performing model was trained on data automatically annotated with ratio ceiling
512 and had 147,000 sentences available from which 32,970 sentences were used for the training
set. We lowered the amount of available data for automatic annotation with ratio ceiling 512
from 147,000 down to 12,000 sentences without any reduction in performance.
Next, we tested which ratio ceiling was best when lowering the amount of available data below 12,000 sentences. We included ratio ceilings between 512 and 16 since they generated increasingly more samples for the training set (Figure 3). When the available data was less than 8,000 sentences, performance with lower ratio ceilings surpassed that of the 512 ratio ceiling in some cases (Figure 4).

[Figure 3: line chart; x-axis: available sentences, n (2,000–10,000); y-axis: generated training samples, n; one line per ratio ceiling (16, 32, 64, 128, 256, 512).]
Figure 3: Number of training set samples generated from the available data by automatically annotating with ratio ceilings 16–512.

[Figure 4: line chart; x-axis: available sentences, n (2,000–10,000); y-axis: F1 score, % (35–95); one line per ratio ceiling (16, 32, 64, 128, 256, 512).]
Figure 4: F1 scores for models with training sets automatically annotated with ratio ceilings 16–512, by amount of available data.
5. Discussion
We used an automatically annotated training set to train a PURE NER deep learning model to
de-identify names, streets and locations in Danish narrative clinical text with a recall of 93.43%,
precision of 86.10%, and F1 score of 89.62%. Non-identifiers were most often misclassified as names, which may be caused by greater variability in names than in streets and locations.
We took a similar approach as Pantazos et al. [12] to identify ambiguous identifiers through
matching of identifiers to a database of non-identifiers. While, for de-identification, they deleted
records of ambiguous identifiers that occurred less than 200 times, we trained a deep learning
model from an automatically annotated training set to de-identify ambiguous identifiers. For the
automatic annotation, we handled ambiguous identifiers by calculating the ratio between the
rate of occurrence in the corpus and the rate of occurrence in the population for every identifier.
This method allowed an individual assessment of whether each ambiguous identifier should be annotated as an identifier in the training data, increasing the chance of model generalization. The ratio ceiling also allowed us to balance the quality and amount of training data. Analyzing the ratio ceiling, we
found that when less than 8,000 sentences were available, the extra samples provided by a lower
ratio ceiling became more important than using the ratio ceiling that gave the highest quality
of the training data. Lower ratio ceilings produced more training data because more ambiguous
identifiers were considered non-identifiers resulting in fewer ambiguous sentences that had to
be discarded from the training set.
We saw an increase in F1 from dictionary-based de-identification to annotating a training set
with the dictionary-based method, training a NER model, and de-identifying with the trained
model. This showed that the model generalized from the training data to better classify the
ambiguous identifiers that the dictionary-based approach could not differentiate, and achieve
better performance than simply using the dictionaries to directly annotate text. This is supported
by the model correctly de-identifying 84% of words that occurred in the test set both as identifier
and non-identifier. Only 50% of these words were de-identified by the dictionary-based method.
5.1. Limitations
It is a limitation of the study that the data came only from Odense University Hospital, while the ratios were calculated using rates of occurrence in the entire population of Denmark.
Future work includes de-identification of the rest of the HIPAA Safe Harbor identifiers since
there is no guarantee that the presented methods will generalize to other identifiers. Since
this study used lowercased data because only a lowercased Danish BERT base was available,
exploring performance when keeping the case of training data is also part of future work.
6. Conclusions
We trained a NER deep learning model using automatically annotated data to de-identify names,
streets, and locations in Danish narrative clinical text with recall 93.43%, precision 86.10%, and
F1 89.62%. A model trained on data annotated with a dictionary-based method can generalize
and surpass the performance of the dictionary-based method. A ratio ceiling of 512 works best
for Danish narrative clinical text when more than 8,000 sentences are available.
The automatic de-identification method presented in this study can be adapted to all languages
and domains if lists of identifiers and non-identifiers are available. Apart from the lists, the
method does not need any external data as the input data to the de-identification model is used
to train the model itself. This makes the method particularly useful for low-resource languages
where annotated datasets and trained models for de-identification of specific identifier types
are not always publicly available.
References
[1] V. E. Valkhoff, P. M. Coloma, G. M. Masclee, R. Gini, F. Innocenti, F. Lapi, M. Molokhia,
M. Mosseveld, M. S. Nielsson, M. Schuemie, F. Thiessard, J. van der Lei, M. C. Sturken-
boom, G. Trifirò, Validation study in four health-care databases: upper gastrointesti-
nal bleeding misclassification affects precision but not magnitude of drug-related up-
per gastrointestinal bleeding risk, Journal of Clinical Epidemiology 67 (2014) 921–931.
URL: https://www.sciencedirect.com/science/article/pii/S0895435614000845. doi:10.1016/j.jclinepi.2014.02.020.
[2] L. R. Øie, M. A. Madsbu, C. Giannadakis, A. Vorhaug, H. Jensberg, Ø. Salvesen,
S. Gulati, Validation of intracranial hemorrhage in the Norwegian patient registry, Brain and Behavior 8 (2018) e00900. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/brb3.900. doi:10.1002/brb3.900.
[3] J. Delekta, S. M. Hansen, K. S. AlZuhairi, C. S. Bork, A. M. Joensen, The validity of the
diagnosis of heart failure (I50.0-I50.9) in the Danish National Patient Register, Dan Med J 65
(2018).
[4] T. L. Higgins, A. Deshpande, M. D. Zilberberg, P. K. Lindenauer, P. B. Imrey, P.-C.
Yu, S. D. Haessler, S. S. Richter, M. B. Rothberg, Assessment of the Accuracy of
Using ICD-9 Diagnosis Codes to Identify Pneumonia Etiology in Patients Hospitalized
With Pneumonia, JAMA Network Open 3 (2020) e207750. URL: https://doi.org/10.1001/jamanetworkopen.2020.7750. doi:10.1001/jamanetworkopen.2020.7750.
[5] N. Wabe, L. Li, R. Lindeman, J. J. Post, M. R. Dahm, J. Li, J. I. Westbrook, A. Georgiou,
Evaluation of the accuracy of diagnostic coding for influenza compared to laboratory
results: the availability of test results before hospital discharge facilitates improved coding
accuracy, BMC Medical Informatics and Decision Making 21 (2021) 168. URL: https://doi.org/10.1186/s12911-021-01531-9. doi:10.1186/s12911-021-01531-9.
[6] GDPR, Regulation (EU) 2016/679 (General Data Protection Regulation). URL: https://gdpr-info.eu/.
[7] HIPAA, Health Insurance Portability and Accountability Act of 1996 (HIPAA), Public Law 104-191. URL: https://www.hhs.gov/hipaa/for-professionals/index.html.
[8] B. A. Beckwith, R. Mahaadevan, U. J. Balis, F. Kuo, Development and evaluation of an open
source software tool for deidentification of pathology reports, BMC Medical Informatics
and Decision Making 6 (2006) 12.
[9] I. Neamatullah, M. M. Douglass, L.-W. H. Lehman, A. Reisner, M. Villarroel, W. J. Long,
P. Szolovits, G. B. Moody, R. G. Mark, G. D. Clifford, Automated de-identification of
free-text medical records, BMC Medical Informatics and Decision Making 8 (2008) 32.
[10] E. Chazard, C. Mouret, G. Ficheur, A. Schaffar, J.-B. Beuscart, R. Beuscart, Proposal and
evaluation of fasdim, a fast and simple de-identification method for unstructured free-
text clinical records, International Journal of Medical Informatics 83 (2014) 303–312.
URL: https://www.sciencedirect.com/science/article/pii/S1386505613002463. doi:https:
//doi.org/10.1016/j.ijmedinf.2013.11.005 .
[11] S.-Y. Shin, Y. R. Park, et al., A de-identification method for bilingual clinical texts of various note types, Journal of Korean Medical Science 30 (2015) 7–15. URL: http://www.e-sciencecentral.org/articles/?scid=1022920. doi:10.3346/jkms.2015.30.1.7.
[12] K. Pantazos, S. Lauesen, S. Lippert, Preserving medical correctness, readabil-
ity and consistency in de-identified health records, Health Informatics Journal 23 (2017) 291–303. URL: https://doi.org/10.1177/1460458216647760. doi:10.1177/1460458216647760.
[13] V. Menger, F. Scheepers, L. M. van Wijk, M. Spruit, Deduce: A pattern matching
method for automatic de-identification of dutch medical text, Telematics and In-
formatics 35 (2018) 727–736. URL: https://www.sciencedirect.com/science/article/pii/S0736585316307365. doi:10.1016/j.tele.2017.08.002.
[14] H. Fabregat, A. Duque, J. Martinez-Romo, L. Araujo, De-identification through named
entity recognition for medical document anonymization, Proceedings of the Iberian
Languages Evaluation Forum (IberLEF 2019) (2019).
[15] K. Kajiyama, H. Horiguchi, T. Okumura, M. Morita, Y. Kano, De-identifying free text of
japanese electronic health records, Journal of Biomedical Semantics 11 (2020) 11.
[16] L. Lange, H. Adel, J. Strötgen, Closing the gap: Joint de-identification and concept
extraction in the clinical domain, in: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, Association for Computational Linguistics,
Online, 2020, pp. 6945–6952. URL: https://aclanthology.org/2020.acl-main.621. doi:10.18653/v1/2020.acl-main.621.
[17] J. L. Leevy, T. M. Khoshgoftaar, F. Villanustre, Survey on RNN and CRF models for
de-identification of medical free text, Journal of Big Data 7 (2020) 73.
[18] I. Pérez-Díez, R. Pérez-Moraga, A. López-Cerdán, J.-M. Salinas-Serrano, M. d. la Iglesia-
Vayá, De-identifying spanish medical texts - named entity recognition applied to radiology
reports, Journal of Biomedical Semantics 12 (2021) 6.
[19] R. Catelli, V. Casola, G. De Pietro, H. Fujita, M. Esposito, Combining contextualized
word representation and sub-document level analysis through bi-lstm+crf architecture
for clinical de-identification, Knowledge-Based Systems 213 (2021) 106649. URL: https://www.sciencedirect.com/science/article/pii/S0950705120307784. doi:10.1016/j.knosys.2020.106649.
[20] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997)
1735–1780. doi:10.1162/neco.1997.9.8.1735 .
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/
paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[22] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, M. Esposito, Crosslingual named
entity recognition for clinical de-identification applied to a covid-19 italian data set, Applied
Soft Computing 97 (2020) 106779. URL: https://www.sciencedirect.com/science/article/pii/S1568494620307171. doi:10.1016/j.asoc.2020.106779.
[23] R. Catelli, F. Gargiulo, E. Damiano, M. Esposito, G. De Pietro, Clinical de-identification
using sub-document analysis and electra, in: 2021 IEEE International Conference on
Digital Health (ICDH), 2021, pp. 266–275. doi:10.1109/ICDH52753.2021.00050 .
[24] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, M. Esposito, A novel covid-19
data set and an effective deep learning approach for the de-identification of italian medical
records, IEEE Access 9 (2021) 19097–19110. doi:10.1109/ACCESS.2021.3054479 .
[25] K. Murugadoss, A. Rajasekharan, B. Malin, V. Agarwal, S. Bade, J. R. Anderson, J. L.
Ross, W. A. Faubion, J. D. Halamka, V. Soundararajan, S. Ardhanari, Building a best-in-
class automated de-identification tool for electronic health records through ensemble
learning, Patterns 2 (2021) 100255. URL: https://www.sciencedirect.com/science/article/pii/S2666389921000817. doi:10.1016/j.patter.2021.100255.
[26] C. Meaney, W. Hakimpour, S. Kalia, R. Moineddin, A comparative evaluation of transformer
models for de-identification of clinical text data, 2022. URL: https://arxiv.org/abs/2204.07056. doi:10.48550/arXiv.2204.07056.
[27] A. J. McMurry, B. Fitch, G. Savova, I. S. Kohane, B. Y. Reis, Improved de-identification
of physician notes through integrative modeling of both public and private medical text,
BMC Medical Informatics and Decision Making 13 (2013) 112.
[28] Z. Jian, X. Guo, S. Liu, H. Ma, S. Zhang, R. Zhang, J. Lei, A cascaded approach for
chinese clinical text de-identification with less annotation effort, Journal of Biomedi-
cal Informatics 73 (2017) 76–83. URL: https://www.sciencedirect.com/science/article/pii/S1532046417301776. doi:10.1016/j.jbi.2017.07.017.
[29] Y. Kim, P. Heider, S. Meystre, Ensemble-based methods to improve de-identification of
electronic health record narratives, AMIA Annu Symp Proc 2018 (2018) 663–672.
[30] P. Richter-Pechanski, S. Riezler, C. Dieterich, De-Identification of german medical admis-
sion notes, Stud Health Technol Inform 253 (2018) 165–169.
[31] Danish Language Council, Retskrivningsordbogen, 4th edition, 2017, including 8 digital
issues, Danish Language Council, Bogense, Denmark, 2012.
[32] Z. Zhong, D. Chen, A frustratingly easy approach for entity and relation extraction, in:
Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Association for Computa-
tional Linguistics, Online, 2021, pp. 50–61. URL: https://aclanthology.org/2021.naacl-main.5. doi:10.18653/v1/2021.naacl-main.5.
[33] E. Soysal, J. Wang, M. Jiang, Y. Wu, S. Pakhomov, H. Liu, H. Xu, CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines, Journal of the American Medical Informatics Association 25 (2017) 331–336. URL: https://doi.org/10.1093/jamia/ocx132. doi:10.1093/jamia/ocx132.
[34] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
A. Medical Abbreviations
Table 2
Sources of medical abbreviations
Description Link Deep link
List of recognized abbreviations and symbols used in Hospital Sønderjylland skoletube.dk https://www.skoletube.dk/download.php?key=MGVhNTVjOTFkZDdiYjIyNDU0ZGI
Symbols and abbreviations used at the laboratory, Hospital Sønderjylland regionsyddanmark.dk https://shs.regionsyddanmark.dk/wm272641
Abbreviations and symbols in patient-related data regionsjaelland.dk http://dok.regionsjaelland.dk/view.aspx?DokID=217238
Definitions and abbreviations, Department of Clinical Biochemistry, Herlev and Gentofte Hospital gentoftehospital.dk https://www.gentoftehospital.dk/afdelinger-og-klinikker/klinisk-biokemisk-afdeling/metodeblade/Documents/Definitioner%20og%20forkortelser%20%E2%80%93%20Klinisk%20Biokemisk%20Afdeling_2020_11_03.pdf
Abbreviations and symbols, psychiatry in Region Nordjylland pri.rn.dk https://pri.rn.dk/Assets/15764/Forkortelser-og-symboler-gaeldende-for-sygeplejefaglige-optegnelser.pdf
Medical abbreviations medicinskefagudtryk.dk http://medicinskefagudtryk.dk/sidebar/forkortelser/
Ordinary medical abbreviations medviden.dk https://www.medviden.dk/vaerktoejer/andre/almindelige-medicinske-forkortelser/
Guidelines for journal writing gyldendal.dk http://guga.gyldendal.dk/Munksgaard/~/media/Munksgaard/Medicinsk%20Fagsprog%202/Retningslinjer.ashx
Abbreviations and designations medicin.dk https://pro.medicin.dk/Artikler/Artikel/215
Abbreviations used in Department R, FMB docplayer.dk https://docplayer.dk/5008471-Forkortelser-anvendt-i-afd-r-s-instrukser-og-vejledninger-120204-fmb.html
Clinical biochemistry test results pri.rn.dk https://pri.rn.dk/Sider/10307.aspx
Family tree abbreviations pri.rn.dk https://pri.rn.dk/Sider/29580.aspx
Abbreviations and symbols in the eye speciality pri.rn.dk https://pri.rn.dk/Sider/17827.aspx
Accepted abbreviations at vascular surgery department pri.rn.dk https://pri.rn.dk/Sider/17293.aspx
Renal abbreviations pri.rn.dk https://pri.rn.dk/Sider/16695.aspx
Heart Lung surgery abbreviations pri.rn.dk https://pri.rn.dk/Sider/10530.aspx
Intensive Therapy abbreviations docplayer.dk https://docplayer.dk/11055021-Hyppigt-anvendte-forkortelser-og-termer-i-intensiv-terapi.html
B. Results Table
Table 3
Distribution of identifiers in each of the training sets automatically annotated with different ratio
ceilings and the resulting model and dictionary-based F1 scores.
Ratio ceiling | Sentences with identifiers | Sentences without identifiers | Total sentences | Dictionary-based F1 (%) | Model F1 (%) | Name tags | Street tags | Location tags | Total tags
1 | 21,343 | 21,343 | 42,686 | 33.45 | 43.86 | 26,775 | 1,015 | 4,309 | 32,099
2 | 21,447 | 21,447 | 42,894 | 48.80 | 65.17 | 29,003 | 1,128 | 3,870 | 34,001
4 | 21,131 | 21,131 | 42,262 | 56.02 | 72.79 | 28,596 | 1,217 | 3,781 | 33,594
8 | 21,185 | 21,185 | 42,370 | 61.52 | 79.19 | 28,388 | 1,437 | 3,749 | 33,574
16 | 20,923 | 20,923 | 41,846 | 64.56 | 82.21 | 27,988 | 1,508 | 3,803 | 33,299
32 | 20,597 | 20,597 | 41,194 | 69.11 | 85.52 | 27,342 | 1,660 | 3,707 | 32,709
64 | 19,750 | 19,750 | 39,500 | 71.05 | 87.36 | 26,382 | 2,130 | 3,533 | 32,045
128 | 19,354 | 19,354 | 38,708 | 72.19 | 87.73 | 25,963 | 2,248 | 3,446 | 31,657
256 | 18,058 | 18,058 | 36,116 | 74.02 | 89.10 | 24,191 | 2,391 | 3,138 | 29,720
512 | 16,485 | 16,485 | 32,970 | 74.43 | 89.62 | 22,277 | 2,396 | 2,897 | 27,570
1,024 | 15,490 | 15,490 | 30,980 | 74.42 | 88.24 | 21,001 | 2,323 | 2,652 | 25,976
2,048 | 14,246 | 14,246 | 28,492 | 74.42 | 88.25 | 19,617 | 2,290 | 2,475 | 24,382
4,096 | 13,041 | 13,041 | 26,082 | 73.97 | 87.85 | 18,127 | 2,205 | 2,332 | 22,664
8,192 | 12,424 | 12,424 | 24,848 | 73.82 | 86.35 | 17,450 | 2,196 | 2,244 | 21,890
16,384 | 11,205 | 11,205 | 22,410 | 72.84 | 87.48 | 15,826 | 2,133 | 1,931 | 19,890
32,768 | 10,278 | 10,278 | 20,556 | 72.63 | 87.06 | 14,732 | 2,090 | 1,756 | 18,578
65,536 | 8,879 | 8,879 | 17,758 | 69.77 | 85.34 | 12,848 | 1,915 | 1,462 | 16,225
131,072 | 8,482 | 8,482 | 16,964 | 68.98 | 86.01 | 12,330 | 1,903 | 1,380 | 15,613
262,144 | 8,307 | 8,307 | 16,614 | 68.46 | 85.37 | 12,295 | 1,893 | 1,291 | 15,479
C. Ambiguous Performance
Table 4
A comparison of the model and dictionary-based performance on words that occurred both as non-
identifiers and identifiers in the test set. Only rows where there is a difference in performance are
shown. The total is calculated over all words.
Word | Non-identifiers: dictionary-based % (total) | Non-identifiers: model % (total) | Identifiers: dictionary-based % (total) | Identifiers: model % (total)
per | 100% (32) | 91% (32) | 57% (7) | 100% (7)
hans | 100% (25) | 88% (25) | 20% (5) | 100% (5)
maria | 100% (4) | 75% (4) | 38% (13) | 100% (13)
ringe | 100% (11) | 100% (11) | 0% (3) | 100% (3)
plads | 100% (9) | 100% (9) | 0% (1) | 100% (1)
rask | 100% (8) | 88% (8) | 50% (2) | 100% (2)
bak | 100% (1) | 0% (1) | 100% (4) | 100% (4)
bo | 100% (1) | 100% (1) | 50% (4) | 100% (4)
do | 100% (3) | 100% (3) | 100% (1) | 0% (1)
tønder | 100% (1) | 100% (1) | 0% (3) | 100% (3)
hammer | 100% (1) | 0% (1) | 100% (2) | 100% (2)
rene | 100% (1) | 100% (1) | 50% (2) | 100% (2)
slagelse | 100% (2) | 0% (2) | 0% (1) | 100% (1)
hammel | 100% (1) | 100% (1) | 0% (1) | 100% (1)
land | 100% (1) | 100% (1) | 0% (1) | 100% (1)
langeland | 100% (1) | 0% (1) | 0% (1) | 100% (1)
stokke | 100% (1) | 100% (1) | 0% (1) | 100% (1)
Total | 96% (296) | 92% (296) | 50% (88) | 84% (88)