Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

Thomas Vakili and Hercules Dalianis
Department of Computer and Systems Sciences, Stockholm University
P.O. Box 7003, SE-164 07 Kista, Sweden
{thomas.vakili, hercules}@dsv.su.se

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine whether there is a correlation between the degree of privacy leakage and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.

1 Introduction

Modern language models have a vast number of parameters, which is the source of their impressive capabilities. However, their size also gives rise to a number of problems. Among these is the problem of accidentally memorizing sensitive information from their training data (Bender et al. 2021). Avoiding memorization is especially important when training on sensitive data such as electronic patient records, as these contain sensitive information about the identity of patients. Accidental memorization of such information puts patients' identities and other sensitive information at risk of being leaked.

This is not a purely theoretical risk. In fact, Carlini et al. (2020) successfully mounted a training data extraction attack on GPT-2. This attack produced many instances of clearly memorized passages from the training data, containing telephone numbers, addresses, and names of actual living persons.

Based on a methodology from Lehman et al. (2021), we mount a training data extraction attack on the clinical BERT¹ model that they release. Their results suggest that generating sensitive data from a BERT model is difficult, especially in comparison to more generative models such as GPT-2 (Carlini et al. 2020). However, their samples were generated using a simple sampling technique, resulting in sentences of low linguistic quality.

Our goal is to strengthen these results by using more advanced sampling techniques which produce higher-quality generations. In this way, we show that the lack of sensitive information in the generated data is not simply a result of the linguistic qualities of the samples. We argue that BERT's poor performance in text generation is, from a privacy perspective, a feature and not a bug.

¹ BERT is short for Bidirectional Encoder Representations from Transformers (Devlin et al. 2019).

2 Language Models

The language model used in this study is a BERT model trained by Lehman et al. (2021) using pseudonymized MIMIC-III data. It is based on the BERT architecture (Devlin et al. 2019) and is a masked language model, which is trained to correctly predict a masked token using the right and left contexts surrounding it. BERT models are among the latest and best-performing language models, and several such models are being used in the health domain (Lee et al. 2019; Huang, Altosaar, and Ranganath 2020).

Given a masked token x_mask in a sentence X, the objective is to learn the probability distribution over a vocabulary V such that:

    x_mask = argmax_{w ∈ V} P(w | X \ x_mask)    (1)

This sets masked language models apart from autoregressive language models. Autoregressive models are instead trained to predict the next token x_{i+1} based solely on the previous tokens in the sequence:

    x_{i+1} = argmax_{w ∈ V} P(w | x_1, x_2, ..., x_i)    (2)
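To make the masked objective in equation (1) concrete, the short sketch below queries an off-the-shelf BERT model for the most likely replacement of a [MASK] token. It is illustrative only: the public bert-base-uncased checkpoint and the invented example sentence stand in for the clinical model studied here, which is not openly distributed; the Hugging Face transformers library is used for convenience.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Public checkpoint used as a stand-in for the clinical BERT model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    text = "The patient was admitted with severe [MASK] pain."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, vocabulary_size)

    # Position of the [MASK] token and the argmax over the vocabulary, as in equation (1).
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_id = logits[0, mask_pos].argmax(dim=-1)
    print(tokenizer.decode(predicted_id))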
3 Related Research

Modern language models are very large. For example, the small version of BERT consists of 110 million parameters (Devlin et al. 2019). This makes BERT and other large model architectures vulnerable to various types of privacy attacks. This section provides an overview of the most common attacks before focusing on the main topic of this article: training data extraction.

3.1 Membership Inference Attacks

Shokri et al. (2017) and Nasr, Shokri, and Houmansadr (2019) describe how membership inference attacks can be used to reveal whether or not a data point was part of a model's training data. They show how this can be carried out both in a white-box setting (where the model's parameters are available) and in a black-box setting (where the model can only be queried). They show that this attack can successfully be used against a range of different models and datasets. However, none of these seem to focus on unstructured natural language data.

The white-box attack described in Nasr, Shokri, and Houmansadr (2019) shows that a model can be trained to infer membership using the outputs of the last layer or the gradients provided by the loss function. There are a variety of attacks, some requiring access to a subset of the training data, while others do not require any access to actual training data.

Lehman et al. (2021) attack a clinical BERT model trained on pseudonymized MIMIC-III data by adding multi-layer perceptron and logistic regression classifiers to probe the BERT model. They trained the classifiers to discern whether the model had been trained on datapoints containing sensitive data such as names, medical conditions, and combinations thereof. They were unable to recover links between patients and their conditions using this method. On the other hand, experiments focusing on names indicate a certain degree of memorization of patient names.

3.2 Unmasking Pseudonymized Training Data

If a language model M has been trained on a dataset D, then there is a risk that the model has memorized certain sensitive details. If this dataset is pseudonymized to create a non-sensitive dataset D′, then an adversary with access to M and D′ may be able to reconstruct some of the original data from D.

Such an attack was attempted by Nakamura et al. (2020). Sentences were selected from a clinical dataset which contained a patient's first and last names. A BERT model trained on the non-pseudonymized dataset was then used to calculate the probability of predicting the correct first and last names in the sentences. The resulting probabilities were small, and the authors conclude that BERT is not susceptible to this kind of attack.

However, the probability distributions emitted by deep neural networks are known to be inaccurate (Holtzman et al. 2020; Guo et al. 2017). Thus, estimating the risk of re-identifying a person using these probabilities is likely to be inaccurate.

3.3 Training Data Extraction

Attacks need not be limited to simply inferring whether or not a datapoint was part of a model's training data. Carlini et al. (2020) demonstrate that it is possible to extract training data from the language model GPT-2² (Radford et al. 2019). They do this by implementing an attack that extracts sentences identical to sentences in the training corpus. A number of these memorized sentences contain specific details that are very unlikely to be generated by chance.

This shows that GPT-2 and other language models can be prone to accidentally memorizing datapoints from their training data, which may lead to privacy leaks. Furthermore, the aforementioned attack can be performed in a black-box setting and does not require direct access to the weights of the model.

However, GPT-2 is an autoregressive language model. These models have an obvious way of generating data: from left to right. Masked language models like BERT, on the other hand, have no such obvious generation strategies. Thus, autoregressive models like GPT-2 have traditionally been preferred over masked language models like BERT when generating text. Due to this difference, it is not obvious whether autoregressive models like GPT-2 are disproportionately affected by this vulnerability, or to what extent masked language models share this problem.

Lehman et al. (2021) perform a related attack using the BERT model mentioned previously. They generate a large number of sentences and examine the degree to which they contain information linking patients with their conditions. Their results indicate that the degree of privacy leakage is low.

However, the sentences are of poor linguistic quality due to the simple sampling technique used. In the following sections, we will describe more sophisticated ways of sampling from BERT and evaluate how these techniques impact the level of privacy leakage and the quality of the sentences.

² GPT-2 is an abbreviation of Generative Pre-trained Transformer 2.

4 Generating Text using Masked Language Models

Although autoregressive language models have been favoured for text generation, recent studies have provided strategies for generating coherent text from masked language models as well. Wang and Cho (2019) implement and evaluate a generation strategy based on Gibbs sampling (Geman and Geman 1984), which results in reasonably coherent outputs. Another strategy, described by Ghazvininejad et al. (2019), first predicts all masked tokens at once. It then iteratively refines the output by re-masking the least likely predicted tokens. This approach is successfully applied to machine translation.

Besides deciding which tokens to unmask, one must also provide a method for sampling from the predicted unmasked tokens. Wang and Cho (2019) randomly sample from all possible tokens weighted by their predicted probabilities. Holtzman et al. (2020) show that this can result in incoherent text and instead provide a method they call nucleus sampling. This sampling method only considers the subset of tokens that constitute the bulk of the probability mass: the nucleus. Recalling equation (1) and given a target probability mass p, we sample from the smallest subset V′ of tokens w ∈ V such that:

    Σ_{w ∈ V′} P(w | X \ x_mask) ≥ p    (3)

Nucleus sampling is shown to produce text that, according to a variety of metrics, has similar properties to human-produced text. Holtzman et al. (2020) show that this strategy produces higher quality results than other popular techniques, such as the top-k sampling method. This method only considers the k most likely predictions when sampling, discarding the other less likely predictions. Nucleus sampling is similar in that it only considers the most likely predictions. However, nucleus sampling does not have a fixed k. The cut-off used to control the diversity of the samples is instead determined dynamically using the parameter p.

Lehman et al. (2021) perform a training data extraction attempt by sampling from the same clinical BERT model used in this study. They generate text by sampling from the top-40 candidate tokens when they unmask each token. However, results from Holtzman et al. (2020) show that this is likely too strict a value for k and that other sampling configurations may lead to better results.
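As an illustration of the two truncation strategies above, the sketch below filters a predicted token distribution with a fixed k (top-k sampling) and with a dynamic nucleus cut-off chosen according to equation (3). It is a minimal sketch in plain PyTorch; the function names and the choice to renormalize the truncated distribution are ours and are not taken from the cited implementations.

    import torch

    def top_k_filter(probs, k):
        # Keep only the k most probable tokens and renormalize.
        values, indices = torch.topk(probs, k)
        filtered = torch.zeros_like(probs)
        filtered[indices] = values
        return filtered / filtered.sum()

    def nucleus_filter(probs, p):
        # Keep the smallest set of tokens whose cumulative mass reaches p (equation 3),
        # then renormalize so the result is a valid distribution.
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        cutoff = int((cumulative < p).sum().item()) + 1  # size of the nucleus V'
        filtered = torch.zeros_like(probs)
        filtered[sorted_indices[:cutoff]] = sorted_probs[:cutoff]
        return filtered / filtered.sum()

    # Sampling a token id from a filtered distribution:
    # token_id = torch.multinomial(nucleus_filter(probs, p=0.95), num_samples=1)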
5 Experiments and Results

This article uses a version of MIMIC-III (Johnson et al. 2016) and a clinical BERT model trained on this corpus³. MIMIC-III is a corpus containing a wide range of patient-related information that has been anonymized. In this article, a subset of MIMIC-III containing clinical notes and diagnoses is used. The anonymous placeholders have been replaced with realistic pseudonyms, and the dataset consists of 1,247,291 clinical notes related to 27,906 patients. This pseudonymized dataset and the model trained on it were made available by Lehman et al. (2021).

³ In Lehman et al. (2021) this model is referred to as Regular Base.

5.1 Generating Memorized Information

Techniques modeled on those described by Carlini et al. (2020) were employed to determine whether or not the clinical BERT model is susceptible to training data extraction attacks. A key difference, however, is how we sample from our non-autoregressive language model.

As described in Section 4, there is no obvious way of sampling from a masked language model. Instead, a variety of strategies are employed to extract text from the clinical BERT model. Tokens are selected using top-k sampling (k = 1000) and nucleus sampling (p = 0.99 and p = 0.95), as Holtzman et al. (2020) have shown these configurations to be effective when sampling from autoregressive models. The token to unmask is selected randomly, and each generated sequence is 100 tokens long.

50,000 samples are generated using each strategy. First, each sequence is initialized as fully masked or using a prompt⁴. In all cases, we then run a burn-in period (Johansen 2010) of 500 iterations to encourage a diverse set of outputs. Each initialized sequence is then processed for 1,000 iterations using one of the sampling methods.

⁴ This prompt was used in 30% of the batches and was either [CLS] mr or [CLS] ms, which was the same setup used by Lehman et al. (2021).

We compare our results with the samples generated by Lehman et al. (2021). Their 500,000 sentences were generated from the same model using a burn-in period of 250 iterations, followed by 250 iterations using the top-k sampling method with k = 40.
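The procedure above can be summarized as a Gibbs-style loop: start from a (mostly) masked sequence, repeatedly pick a random position, re-mask it, and resample it from the model's predicted distribution, switching from unfiltered sampling to top-k or nucleus sampling after the burn-in period. The sketch below is our own minimal reading of that procedure, not the authors' released code; batching, the [CLS] mr / [CLS] ms prompting, and the exact behaviour during burn-in are simplified assumptions, and nucleus_filter refers to the earlier sketch.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    def generate_sequence(model, tokenizer, filter_fn, seq_len=100, burn_in=500, iterations=1000):
        # Start from a fully masked sequence wrapped in [CLS] ... [SEP].
        ids = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
        ids[0, 0] = tokenizer.cls_token_id
        ids[0, -1] = tokenizer.sep_token_id

        for step in range(burn_in + iterations):
            pos = torch.randint(1, seq_len - 1, (1,)).item()  # random token to re-predict
            masked = ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(input_ids=masked).logits[0, pos]
            probs = torch.softmax(logits, dim=-1)
            if step >= burn_in:
                probs = filter_fn(probs)  # e.g. top-k (k=1000) or nucleus (p=0.95 or 0.99)
            ids[0, pos] = torch.multinomial(probs, num_samples=1)

        return tokenizer.decode(ids[0], skip_special_tokens=True)

    # Example usage (bert-base-uncased as a stand-in for the clinical model):
    # tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    # text = generate_sequence(model, tokenizer, lambda pr: nucleus_filter(pr, 0.95))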
5.2 Sensitive Data in the Generated Samples

Each set of generated samples was processed in the same manner as done by Lehman et al. (2021) to ensure comparability. An NER tagger (Honnibal et al. 2020) was used to locate the few thousand sentences that contained names (first names or last names) associated with a patient in the pseudonymized MIMIC-III corpus. Then, every such sentence was further processed to determine if it mentioned a condition associated with the named patient. The set of conditions associated with the patients was determined by processing the clinical notes using MedCAT (Kraljevic et al. 2021) in conjunction with the ICD-9 codes assigned to each clinical note.

Finding Conditions  Some sentences with names contained conditions irrelevant to the patient. If most of the patient-condition associations in the generated corpora are false, then the signal from finding a name and a condition in the same sentence is unreliable for determining from what condition a patient suffers. The prevalence of such false associations was therefore measured by counting them; a sketch of this counting step is given below.

Table 1 shows the results of this processing. There is a slight increase in the proportion of sentences containing a name and a matching condition. At the same time, the Name + Wrong condition column shows that the percentage of sentences containing a name and a condition not associated with a patient bearing the name is slightly larger for all sampling techniques.

It is important to note that the conditions found using MedCAT vary in their specificity. Figure 2 plots the percentage of all found conditions constituted by the ten most common conditions. The top ten most common conditions explain a majority of the found conditions. This holds for the texts generated by Lehman et al. (2021) and by us, as well as for the pseudonymized MIMIC-III corpus. Many of these conditions are very vague and general. Finding a possible link between a name and the condition pain, for example, does not reveal very much information.

                            First name   Last name   Name + Condition   Name + Wrong condition
    Lehman et al. (2021)    0.94%        3.14%       23.53%             28.33%
    k = 1000                1.04%        3.61%       24.06%             28.28%
    p = 0.99                1.28%        3.76%       24.72%             28.25%
    p = 0.95                1.10%        3.81%       25.51%             29.33%

Table 1: The First name and Last name columns show the proportion of sentences containing a first or last name. The Name + Condition column shows what percentage of these sentences also contain a condition associated with a patient with that (first or last) name. Similarly, the Name + Wrong condition column shows the percentage where the condition is not associated with the patient.

[Figure 2: The figure plots the most common conditions in the texts generated by Lehman et al. (2021), our nucleus text (p = 0.95), and MIMIC-III. The top ten conditions detected by MedCAT in each text explain a majority of all conditions. Many of them are vague and general, like edema or pain.]
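The name/condition counting described under Finding Conditions could be carried out along the lines of the sketch below. It is illustrative only: en_core_web_sm stands in for the spaCy NER model, the mapping from patient names to their conditions (built in this study with MedCAT and ICD-9 codes) is assumed to be precomputed and passed in as a plain dictionary, and condition matching is reduced to simple substring search.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # stand-in for the NER tagger used in the paper

    def count_condition_matches(sentences, patient_conditions):
        # patient_conditions: dict mapping a lower-cased patient name to a set of
        # lower-cased condition strings (assumed precomputed, e.g. with MedCAT).
        all_conditions = set().union(*patient_conditions.values())
        with_name = correct = wrong = 0
        for sentence in sentences:
            doc = nlp(sentence)
            names = {ent.text.lower() for ent in doc.ents if ent.label_ == "PERSON"}
            names = {n for n in names if n in patient_conditions}
            if not names:
                continue
            with_name += 1
            mentioned = {c for c in all_conditions if c in sentence.lower()}
            if any(mentioned & patient_conditions[name] for name in names):
                correct += 1   # name co-occurs with a condition of a matching patient
            elif mentioned:
                wrong += 1     # name co-occurs only with conditions of other patients
        return with_name, correct, wrong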
Detecting Names  Furthermore, Lehman et al. (2021) found that their results likely contained many false positives due to the ambiguous nature of some names. The samples generated in this study show a similar pattern. For example, approximately 10% of the sentences deemed to be associated with a patient and a condition were selected on the basis of containing the name (or word) Max.

The set of names detected in the generated sentences constitutes a small portion of the total collection of names found in the pseudonymized MIMIC-III corpus. Table 2 shows the percentages of all such names detected in the sentences generated by Lehman et al. (2021) and by us.

The vast majority of all names are not detected at all. This is only partly due to the vastly larger size of the MIMIC-III corpus. More likely, it is due to the aforementioned overrepresentation of ambiguous names like Max. Many of the names found in the sentences are not part of the MIMIC-III corpus and have likely been learned during the earlier pre-training of the BERT base model.

In combination with the observation that many names are false positives, this suggests that only a small minority of all names are leaked. However, there are examples of likely memorizations, and Figure 1 illustrates such a case.

                            Percentage of names detected
    Lehman et al. (2021)    10.1%
    k = 1000                3.27%
    p = 0.99                4.25%
    p = 0.95                2.40%

Table 2: Lehman et al. (2021) generate the largest number of sentences (500,000 sentences), and 10.1% of the names in the pseudonymized MIMIC-III corpus can be detected in their sentences. The largest proportion of names detected in our sentences is the 4.25% found in the 50,000 sentences generated using nucleus sampling with p = 0.99.

[Figure 1: A few examples from a clinical note that the model seems to have memorized. The name (i.e. "Coleman") and the condition (e.g. "myclonic jerking") are highlighted in yellow and green respectively.]

5.3 Metrics for Assessing Linguistic Quality

The quality of a given corpus of generated text is not a well-defined property. Gatt and Krahmer (2018) list several subjective and objective metrics that can be used to assess the quality of a generated body of text. This study takes the view that human-likeness is a good proxy for quality in the context of natural language generation.

The human-likeness of the generated samples was assessed by computing a series of metrics and comparing them to a gold standard corpus of human-produced text. The corpus used as the gold standard was the pseudonymized MIMIC-III corpus which the clinical BERT model was trained to model. Using a more general corpus would make less sense in this context, because the clinical BERT model is specifically trained to learn the characteristics of clinical notes, which differ significantly from more general forms of writing.

Similarly to Holtzman et al. (2020), we calculated two diversity metrics, the Self-BLEU (Zhu et al. 2018) and the shape of the Zipf distribution (Piantadosi 2014), as well as the repetitiveness of the texts, which captures the fluency⁵. The quality of the generated samples is determined by comparing the metrics calculated from the generated samples with those of the gold standard.

⁵ The perplexity is left out as there is no consensus on how to calculate it for masked language models, and the alternatives are very expensive to calculate (Salazar et al. 2019).

Self-BLEU is a metric of diversity that measures how similar each sentence in a corpus is to the rest of the corpus. Zhu et al. (2018), who first proposed the metric, calculate it by averaging together the BLEU of every sentence compared to the rest of the corpus. Due to the size of our generated corpora, we calculate the Self-BLEU slightly differently. As was done by Holtzman et al. (2020), the Self-BLEU is calculated using a random subset S′ of |S′| = 1,000 sentences from the larger corpus S:

    Self-BLEU = (1 / |S′|) Σ_{s ∈ S′} [ Σ_{r ∈ S \ {s}} BLEU(s, r) / (|S| − 1) ]    (4)

The Zipf distribution is a statistical distribution based on Zipf's law, which states that there is a relationship between a word's rank r in a frequency list of a corpus and its frequency f(r):

    f(r) ∝ 1 / r^s_zipf    (5)

This relationship can be used to estimate s_zipf, which can then be used to compare the rank-frequency distributions of different corpora.
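For reference, the sketch below computes the two metrics essentially as defined in equations (4) and (5). The use of NLTK's sentence-level BLEU and of a least-squares fit in log-log space to estimate s_zipf are our own choices; the exact implementation used in this study is not reproduced here, and this direct transcription of equation (4) is quadratic in the corpus size rather than optimized.

    import random
    from collections import Counter

    import numpy as np
    from nltk.translate.bleu_score import sentence_bleu

    def self_bleu(corpus, subset_size=1000, max_n=4):
        # Equation (4): average pairwise BLEU of a random subset S' against the corpus S.
        subset = random.sample(corpus, min(subset_size, len(corpus)))
        weights = tuple([1.0 / max_n] * max_n)
        total = 0.0
        for s in subset:
            others = [r for r in corpus if r is not s]  # exclude the sentence itself
            pair_scores = [sentence_bleu([r.split()], s.split(), weights=weights) for r in others]
            total += sum(pair_scores) / len(pair_scores)
        return total / len(subset)

    def zipf_exponent(corpus):
        # Equation (5): fit f(r) ~ 1 / r**s_zipf by linear regression in log-log space.
        counts = Counter(token for sentence in corpus for token in sentence.split())
        freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
        return -slope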
5.4 Measuring the Quality of the Generated Samples

Every collection of generated samples was analyzed to determine the quality of the generations. Table 3 and Figure 3 show that the methods used in this study result in generated samples that are closer to the MIMIC-III corpus.

The exception is the small number of repetitions, which are absent from the datasets used for comparison. The MIMIC-III data is human-produced, so it is not surprising that it does not contain any repetitions. The other discrepancies are likely due to the larger number of iterations used in this study as compared to the 500 iterations used in Lehman et al. (2021), which leaves some masked tokens in their generated samples.

[Figure 3: Rank-frequency distribution for the human gold standard (MIMIC-III) as well as the generated samples. The distribution of the samples generated in Lehman et al. (2021) has a tail of unnaturally frequent words which is absent in the gold standard and in our more advanced generations.]

                            [MASK]      Repetitions   bleu-4   bleu-5   s_zipf
    MIMIC-III               N/A         0%            0.399    0.298    1.05
    Lehman et al. (2021)    5.54%       0%            0.251    0.116    1.39
    p = 0.99                1.91e-3%    0.12%         0.433    0.253    1.22
    p = 0.95                1.91e-3%    0.12%         0.485    0.306    1.26
    k = 1000                5.75e-3%    0.11%         0.435    0.246    1.23

Table 3: Text quality metrics for each corpus of text. MIMIC-III is the human gold standard, and the values closest to the gold standard are bolded. The percentages describe the proportions of sentences in each corpus containing [MASK] tokens or containing repetitions.

6 Discussion

This study has given us insights into the complicated area of protecting privacy in training data represented in language models. One suggestion in the research community is to use homomorphic encryption (Parmar et al. 2014; Al Badawi et al. 2020) for the data and models. However, it seems that using homomorphically encrypted models is currently too complicated for users.

A more straightforward way to protect the privacy of persons in the training data is to pseudonymize it before training. Both Berg, Chomutare, and Dalianis (2019) and Berg, Henriksson, and Dalianis (2020) build NER taggers on clinical data that has been pseudonymized. They find that, while this decreases the performance of the NER taggers, it does so to an acceptable degree. These taggers can be used to build automatic de-identification systems that can make training datasets less sensitive, as shown by Dalianis and Berg (2021). However, no such system can achieve perfect recall. Thus, this approach is analogous to a weak form of differential privacy where noise in the form of pseudonyms is added to the training data.

The clinical BERT model used in this article is trained on clinical data, but uses a BERT model pre-trained on non-sensitive data as its basis. This is good from a privacy perspective, as it means that names that are emitted when sampling from the model are of uncertain origin. Detecting a name in the output is thus a weaker signal, as the name might simply be memorized from the first phase of training on non-sensitive data. However, Gu et al. (2021) show that pre-training with only medical data can yield stronger results, suggesting that this approach may become more prevalent in the future.
Further research into extracting training data from BERT models trained solely on sensitive data would shed light on the potential risks of this approach. The model in this article is also uncased, meaning that it is only trained on lowercase tokens. This means that it has a harder time distinguishing entities that are normally capitalized, like names, from other words. Investigating the impact of not lowercasing the data would be interesting, since this is a design choice that may not be suitable for languages where casing is important.

More robust metrics for measuring privacy leakage from training data extraction attacks would also be of use. The metrics used in this article and by Lehman et al. (2021) strongly suggest that detecting a link between a patient's name and a condition is very difficult. A very small number of samples contain any such possible associations, and many of these are likely to be false positives. This is due both to the ambiguity of many of the detected names and to it being slightly more likely to find a condition not associated with the named patient.

It is also unclear what risks are acceptable from a legal perspective. Regulations such as the GDPR have strict requirements to avoid the risk of identification. At the same time, the GDPR also contains language stating that "the costs of and the amount of time required for identification" (European Commission 2018) should be taken into consideration when making risk assessments. Clarifications from legal scholars are necessary for these and other results in the privacy domain to be contextualized and applicable to real applications.

7 Conclusions

The sampling methods used in this article show a significant improvement regarding the linguistic quality of the samples, as shown in Table 3. At the same time, Table 1 shows that the prevalence of patients and their conditions within the generated samples is stable. This suggests that privacy leakage is not strongly correlated with the quality of the sampling techniques.

Nucleus sampling, first described as a technique for sampling from the autoregressive model GPT-2 (Holtzman et al. 2020), is also shown to be an effective technique for sampling from the masked language model BERT. Further research into how to sample quality text from masked language models is an interesting topic, but our research indicates that advances in that direction do not have significant privacy implications.

It cannot be ruled out that other sampling techniques, regardless of their linguistic quality, may be able to extract training data more effectively. Carlini et al. (2020) showed that the risk of an adversary successfully extracting training data from GPT-2 is significant. Our results, together with those of Lehman et al. (2021), strongly suggest that the risk of successfully sampling sensitive data from a BERT-based model is much smaller when compared to GPT-2.

Acknowledgments

A special thanks to Sarthak Jain and Eric Lehman for their patient assistance with reproducing their experiments from Lehman et al. (2021) and for making their data available to us. We are also grateful to the DataLEASH project for funding this research work.

References

Al Badawi, A.; Hoang, L.; Mun, C. F.; Laine, K.; and Aung, K. M. M. 2020. PrivFT: Private and fast text classification with homomorphic encryption. IEEE Access 8: 226544–226556.

Bender, E. M.; Gebru, T.; McMillan-Major, A.; and Shmitchell, S. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.

Berg, H.; Chomutare, T.; and Dalianis, H. 2019. Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 118–125.

Berg, H.; Henriksson, A.; and Dalianis, H. 2020. The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 1–11.

Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, U.; et al. 2020. Extracting Training Data from Large Language Models. arXiv preprint arXiv:2012.07805.

Dalianis, H.; and Berg, H. 2021. HB Deid - HB De-identification tool demonstrator. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 467–471.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019).

European Commission. 2018. Recital 26 - Not applicable to anonymous data. URL https://gdpr.eu/recital-26-not-applicable-to-anonymous-data/.

Gatt, A.; and Krahmer, E. 2018. Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61: 65–170. doi:10.1613/jair.5477.

Geman, S.; and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6): 721–741.

Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-Predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.

Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; and Poon, H. 2021. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. arXiv:2007.15779 [cs]. URL http://arxiv.org/abs/2007.15779.

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On Calibration of Modern Neural Networks. arXiv:1706.04599 [cs]. URL http://arxiv.org/abs/1706.04599.

Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations. URL https://openreview.net/forum?id=rygGQyrFvH.

Honnibal, M.; Montani, I.; Van Landeghem, S.; and Boyd, A. 2020. spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303. URL https://doi.org/10.5281/zenodo.1212303.

Huang, K.; Altosaar, J.; and Ranganath, R. 2020. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv:1904.05342 [cs]. URL http://arxiv.org/abs/1904.05342.

Johansen, A. 2010. Markov Chain Monte Carlo. In Peterson, P.; Baker, E.; and McGaw, B., eds., International Encyclopedia of Education (Third Edition), 245–252. Oxford: Elsevier. ISBN 978-0-08-044894-7. doi:10.1016/B978-0-08-044894-7.01347-6.

Johnson, A. E. W.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3(1): 160035. doi:10.1038/sdata.2016.35.

Kraljevic, Z.; Searle, T.; Shek, A.; Roguski, L.; Noor, K.; Bean, D.; Mascio, A.; Zhu, L.; Folarin, A. A.; Roberts, A.; Bendayan, R.; Richardson, M. P.; Stewart, R.; Shah, A. D.; Wong, W. K.; Ibrahim, Z.; Teo, J. T.; and Dobson, R. J. B. 2021. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artificial Intelligence in Medicine 117: 102083. doi:10.1016/j.artmed.2021.102083.

Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; and Kang, J. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics btz682. doi:10.1093/bioinformatics/btz682.

Lehman, E.; Jain, S.; Pichotta, K.; Goldberg, Y.; and Wallace, B. C. 2021. Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Nakamura, Y.; Hanaoka, S.; Nomura, Y.; Hayashi, N.; Abe, O.; Yada, S.; Wakamiya, S.; and Aramaki, E. 2020. KART: Privacy Leakage Framework of Language Models Pre-trained with Clinical Records. arXiv:2101.00036 [cs]. URL http://arxiv.org/abs/2101.00036.

Nasr, M.; Shokri, R.; and Houmansadr, A. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP), 739–753. IEEE.

Parmar, P. V.; Padhar, S. B.; Patel, S. N.; Bhatt, N. I.; and Jhaveri, R. H. 2014. Survey of various homomorphic encryption algorithms and schemes. International Journal of Computer Applications 91(8).

Piantadosi, S. T. 2014. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21(5): 1112–1130. doi:10.3758/s13423-014-0585-6.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI blog 1(8): 9.

Salazar, J.; Liang, D.; Nguyen, T. Q.; and Kirchhoff, K. 2019. Masked language model scoring. arXiv preprint arXiv:1910.14659.

Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), 3–18. IEEE.

Wang, A.; and Cho, K. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094.

Zhu, Y.; Lu, S.; Zheng, L.; Guo, J.; Zhang, W.; Wang, J.; and Yu, Y. 2018. Texygen: A Benchmarking Platform for Text Generation Models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, 1097–1100. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3209978.3210080.