Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations

Thomas Vakili and Hercules Dalianis
Department of Computer and Systems Sciences, Stockholm University
P.O. Box 7003, SE-164 07 Kista, Sweden
{thomas.vakili, hercules}@dsv.su.se

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Language models may be trained on data that contain personal information, such as clinical data. Such sensitive data must not leak for privacy reasons. This article explores whether BERT models trained on clinical data are susceptible to training data extraction attacks. Multiple large sets of sentences generated from the model with top-k sampling and nucleus sampling are studied. The sentences are examined to determine the degree to which they contain information associating patients with their conditions. The sentence sets are then compared to determine whether there is a correlation between the degree of privacy leakage and the linguistic quality attained by each generation technique. We find that the relationship between linguistic quality and privacy leakage is weak and that the risk of a successful training data extraction attack on a BERT-based model is small.

1 Introduction

Modern language models have a vast number of parameters, which is the source of their impressive capabilities. However, their size also gives rise to a number of problems. Among these is the problem of accidentally memorizing sensitive information from their training data (Bender et al. 2021). Avoiding memorization is especially important when training on sensitive data such as electronic patient records, as these contain sensitive information about the identity of patients. Accidental memorization of such information puts patients' identities and other sensitive information at risk of being leaked.

This is not a purely theoretical risk. In fact, Carlini et al. (2020) successfully mounted a training data extraction attack on GPT-2. This attack produced many instances of clearly memorized passages from the training data, containing telephone numbers, addresses, and names of actual living persons.

Based on a methodology from Lehman et al. (2021), we mount a training data extraction attack on the clinical BERT¹ model that they release. Their results suggest that generating sensitive data from a BERT model is difficult, especially in comparison to more generative models such as GPT-2 (Carlini et al. 2020). However, their samples were generated using a simple sampling technique, resulting in sentences of low linguistic quality.

Our goal is to strengthen these results by using more advanced sampling techniques which produce higher-quality generations. In this way, we show that the lack of sensitive information in the generated data is not simply a result of the linguistic qualities of the samples. We argue that BERT's poor performance in text generation is, from a privacy perspective, a feature and not a bug.

¹ BERT is short for Bidirectional Encoder Representations from Transformers (Devlin et al. 2019).

2 Language Models

The language model used in this study is a BERT model trained by Lehman et al. (2021) using pseudonymized MIMIC-III data. It is based on the BERT architecture (Devlin et al. 2019) and is a masked language model, which is trained to correctly predict a masked token using the right and left contexts surrounding it. BERT models are among the latest and best-performing language models, and several such models are being used in the health domain (Lee et al. 2019; Huang, Altosaar, and Ranganath 2020).

Given a masked token x_mask in a sentence X, the objective is to learn the probability distribution over a vocabulary V such that:

    x_mask = argmax_{w ∈ V} P(w | X \ x_mask)    (1)

This sets masked language models apart from autoregressive language models. Autoregressive models are instead trained to predict the next token x_{i+1} based solely on the previous tokens in the sequence:

    x_{i+1} = argmax_{w ∈ V} P(w | x_1, x_2, ..., x_i)    (2)
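To make the masked objective in equation (1) concrete, the short sketch below queries an off-the-shelf BERT model for the most likely replacement of a [MASK] token. It is illustrative only: the public bert-base-uncased checkpoint and the invented example sentence stand in for the clinical model studied here, which is not openly distributed; the Hugging Face transformers library is used for convenience.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Public checkpoint used as a stand-in for the clinical BERT model.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    text = "The patient was admitted with severe [MASK] pain."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, vocabulary_size)

    # Position of the [MASK] token and the argmax over the vocabulary, as in equation (1).
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_id = logits[0, mask_pos].argmax(dim=-1)
    print(tokenizer.decode(predicted_id))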
3 Related Research

Modern language models are very large. For example, the small version of BERT consists of 110 million parameters (Devlin et al. 2019). This makes BERT and other large model architectures vulnerable to various types of privacy attacks. This section provides an overview of the most common attacks before focusing on the main topic of this article: training data extraction.

3.1 Membership Inference Attacks

Shokri et al. (2017) and Nasr, Shokri, and Houmansadr (2019) describe how membership inference attacks can be used to reveal whether or not a data point was part of a model's training data. They show how this can be carried out both in a white-box setting (where the model's parameters are available) and in a black-box setting (where the model can only be queried). They show that this attack can successfully be used against a range of different models and datasets. However, none of these seem to focus on unstructured natural language data.

The white-box attack described in Nasr, Shokri, and Houmansadr (2019) shows that a model can be trained to infer membership using the outputs of the last layer or the gradients provided by the loss function. There are a variety of attacks, some requiring access to a subset of the training data, while others do not require any access to actual training data.

Lehman et al. (2021) attack a clinical BERT model trained on pseudonymized MIMIC-III data by adding multi-layer perceptron and logistic regression classifiers to probe the BERT model. They trained the classifiers to discern whether the model had been trained on datapoints containing sensitive data such as names, medical conditions, and combinations thereof. They were unable to recover links between patients and their conditions using this method. On the other hand, experiments focusing on names indicate a certain degree of memorization of patient names.

3.2 Unmasking Pseudonymized Training Data

If a language model M has been trained on a dataset D, then there is a risk that the model has memorized certain sensitive details. If this dataset is pseudonymized to create a non-sensitive dataset D′, then an adversary with access to M and D′ may be able to reconstruct some of the original data from D.

Such an attack was attempted by Nakamura et al. (2020). Sentences were selected from a clinical dataset which contained a patient's first and last names. A BERT model trained on the non-pseudonymized dataset was then used to calculate the probability of predicting the correct first and last names in the sentences. The resulting probabilities were small, and the authors conclude that BERT is not susceptible to this kind of attack.

However, the probability distributions emitted by deep neural networks are known to be inaccurate (Holtzman et al. 2020; Guo et al. 2017). Thus, estimating the risk of re-identifying a person using these probabilities is likely to be inaccurate.

3.3 Training Data Extraction

Attacks need not be limited to simply inferring whether or not a datapoint was part of a model's training data. Carlini et al. (2020) demonstrate that it is possible to extract training data from the language model GPT-2² (Radford et al. 2019). They do this by implementing an attack that extracts sentences identical to sentences in the training corpus. A number of these memorized sentences contain specific details that are very unlikely to be generated by chance.

This shows that GPT-2 and other language models can be prone to accidentally memorizing datapoints from their training data, which may lead to privacy leaks. Furthermore, the aforementioned attack can be performed in a black-box setting and does not require direct access to the weights of the model.

However, GPT-2 is an autoregressive language model. These models have an obvious way of generating data: from left to right. Masked language models like BERT, on the other hand, have no such obvious generation strategies. Thus, autoregressive models like GPT-2 have traditionally been preferred over masked language models like BERT when generating text. Due to this difference, it is not obvious whether autoregressive models like GPT-2 are disproportionately affected by this vulnerability, or to what extent masked language models share this problem.

Lehman et al. (2021) perform a related attack using the BERT model mentioned previously. They generate a large number of sentences and examine the degree to which they contain information linking patients with their conditions. Their results indicate that the degree of privacy leakage is low.

However, the sentences are of poor linguistic quality due to the simple sampling technique used. In the following sections, we will describe more sophisticated ways of sampling from BERT and evaluate how these techniques impact the level of privacy leakage and the quality of the sentences.

² GPT-2 is an abbreviation of Generative Pre-trained Transformer 2.

4 Generating Text using Masked Language Models

Although autoregressive language models have been favoured for text generation, recent studies have provided strategies for generating coherent text from masked language models as well. Wang and Cho (2019) implement and evaluate a generation strategy based on Gibbs sampling (Geman and Geman 1984), which results in reasonably coherent outputs. Another strategy, described by Ghazvininejad et al. (2019), first predicts all masked tokens at once. It then iteratively refines the output by re-masking the least likely predicted tokens. This approach is successfully applied to machine translation.

Besides deciding which tokens to unmask, one must also provide a method for sampling from the predicted unmasked tokens. Wang and Cho (2019) randomly sample from all possible tokens weighted by their predicted probabilities. Holtzman et al. (2020) show that this can result in incoherent text and instead provide a method they call nucleus sampling. This sampling method only considers the subset of tokens that constitute the bulk of the probability mass: the nucleus. Recalling equation (1) and given a target probability mass p, we sample from the smallest subset V′ of tokens w ∈ V such that:

    Σ_{w ∈ V′} P(w | X \ x_mask) ≥ p    (3)

Nucleus sampling is shown to produce text that, according to a variety of metrics, has similar properties to human-produced text. Holtzman et al. (2020) show that this strategy produces higher quality results than other popular techniques, such as the top-k sampling method. This method only considers the k most likely predictions when sampling, discarding the other less likely predictions. Nucleus sampling is similar in that it only considers the most likely predictions. However, nucleus sampling does not have a fixed k. The cut-off used to control the diversity of the samples is instead determined dynamically using the parameter p.

Lehman et al. (2021) perform a training data extraction attempt by sampling from the same clinical BERT model used in this study. They generate text by sampling from the top-40 candidate tokens when they unmask each token. However, results from Holtzman et al. (2020) show that this is likely too strict a value for k and that other sampling configurations may lead to better results.
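As an illustration of the two truncation strategies above, the sketch below filters a predicted token distribution with a fixed k (top-k sampling) and with a dynamic nucleus cut-off chosen according to equation (3). It is a minimal sketch in plain PyTorch; the function names and the choice to renormalize the truncated distribution are ours and are not taken from the cited implementations.

    import torch

    def top_k_filter(probs, k):
        # Keep only the k most probable tokens and renormalize.
        values, indices = torch.topk(probs, k)
        filtered = torch.zeros_like(probs)
        filtered[indices] = values
        return filtered / filtered.sum()

    def nucleus_filter(probs, p):
        # Keep the smallest set of tokens whose cumulative mass reaches p (equation 3),
        # then renormalize so the result is a valid distribution.
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        cutoff = int((cumulative < p).sum().item()) + 1  # size of the nucleus V'
        filtered = torch.zeros_like(probs)
        filtered[sorted_indices[:cutoff]] = sorted_probs[:cutoff]
        return filtered / filtered.sum()

    # Sampling a token id from a filtered distribution:
    # token_id = torch.multinomial(nucleus_filter(probs, p=0.95), num_samples=1)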
5 Experiments and Results

This article uses a version of MIMIC-III (Johnson et al. 2016) and a clinical BERT model trained on this corpus³. MIMIC-III is a corpus containing a wide range of patient-related information that has been anonymized. In this article, a subset of MIMIC-III containing clinical notes and diagnoses is used. The anonymous placeholders have been replaced with realistic pseudonyms, and the dataset consists of 1,247,291 clinical notes related to 27,906 patients. This pseudonymized dataset and the model trained on it were made available by Lehman et al. (2021).

³ In Lehman et al. (2021) this model is referred to as Regular Base.

5.1 Generating Memorized Information

Techniques modeled on those described by Carlini et al. (2020) were employed to determine whether or not the clinical BERT model is susceptible to training data extraction attacks. A key difference, however, is how we sample from our non-autoregressive language model.

As described in Section 4, there is no obvious way of sampling from a masked language model. Instead, a variety of strategies are employed to extract text from the clinical BERT model. Tokens are selected using top-k sampling (k = 1000) and nucleus sampling (p = 0.99 and p = 0.95), as Holtzman et al. (2020) have shown these configurations to be effective when sampling from autoregressive models. The token to unmask is selected randomly, and each generated sequence is 100 tokens long.

50,000 samples are generated using each strategy. First, each sequence is initialized as fully masked or using a prompt⁴. In all cases, we then run a burn-in period (Johansen 2010) of 500 iterations to encourage a diverse set of outputs. Each initialized sequence is then processed for 1,000 iterations using one of the sampling methods.

⁴ This prompt was used in 30% of the batches and was either [CLS] mr or [CLS] ms, which was the same setup used by Lehman et al. (2021).

We compare our results with the samples generated by Lehman et al. (2021). Their 500,000 sentences were generated from the same model using a burn-in period of 250 iterations, followed by 250 iterations using the top-k sampling method with k = 40.
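The procedure above can be summarized as a Gibbs-style loop: start from a (mostly) masked sequence, repeatedly pick a random position, re-mask it, and resample it from the model's predicted distribution, switching from unfiltered sampling to top-k or nucleus sampling after the burn-in period. The sketch below is our own minimal reading of that procedure, not the authors' released code; batching, the [CLS] mr / [CLS] ms prompting, and the exact behaviour during burn-in are simplified assumptions, and nucleus_filter refers to the earlier sketch.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    def generate_sequence(model, tokenizer, filter_fn, seq_len=100, burn_in=500, iterations=1000):
        # Start from a fully masked sequence wrapped in [CLS] ... [SEP].
        ids = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
        ids[0, 0] = tokenizer.cls_token_id
        ids[0, -1] = tokenizer.sep_token_id

        for step in range(burn_in + iterations):
            pos = torch.randint(1, seq_len - 1, (1,)).item()  # random token to re-predict
            masked = ids.clone()
            masked[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(input_ids=masked).logits[0, pos]
            probs = torch.softmax(logits, dim=-1)
            if step >= burn_in:
                probs = filter_fn(probs)  # e.g. top-k (k=1000) or nucleus (p=0.95 or 0.99)
            ids[0, pos] = torch.multinomial(probs, num_samples=1)

        return tokenizer.decode(ids[0], skip_special_tokens=True)

    # Example usage (bert-base-uncased as a stand-in for the clinical model):
    # tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    # text = generate_sequence(model, tokenizer, lambda pr: nucleus_filter(pr, 0.95))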
5.2 Sensitive Data in the Generated Samples

Each set of generated samples was processed in the same manner as done by Lehman et al. (2021) to ensure comparability. An NER tagger (Honnibal et al. 2020) was used to locate the few thousand sentences that contained names (first names or last names) associated with a patient in the pseudonymized MIMIC-III corpus. Then, every such sentence was further processed to determine if it mentioned a condition associated with the named patient. The set of conditions associated with the patients was determined by processing the clinical notes using MedCAT (Kraljevic et al. 2021) in conjunction with the ICD-9 codes assigned to each clinical note.

Finding Conditions  Some sentences with names contained conditions irrelevant to the patient. If most of the patient-condition associations in the generated corpora are false, then the signal from finding a name and a condition in the same sentence is unreliable for determining from what condition a patient suffers. The prevalence of such false associations was therefore measured by counting them; a sketch of this counting step is given below.

Table 1 shows the results of this processing. There is a slight increase in the proportion of sentences containing a name and a matching condition. At the same time, the Name + Wrong condition column shows that the percentage of sentences containing a name and a condition not associated with a patient bearing the name is slightly larger for all sampling techniques.

It is important to note that the conditions found using MedCAT vary in their specificity. Figure 2 plots the percentage of all found conditions constituted by the ten most common conditions. The top ten most common conditions explain a majority of the found conditions. This holds for the texts generated by Lehman et al. (2021) and by us, as well as for the pseudonymized MIMIC-III corpus. Many of these conditions are very vague and general. Finding a possible link between a name and the condition pain, for example, does not reveal very much information.

                            First name   Last name   Name + Condition   Name + Wrong condition
    Lehman et al. (2021)    0.94%        3.14%       23.53%             28.33%
    k = 1000                1.04%        3.61%       24.06%             28.28%
    p = 0.99                1.28%        3.76%       24.72%             28.25%
    p = 0.95                1.10%        3.81%       25.51%             29.33%

Table 1: The First name and Last name columns show the proportion of sentences containing a first or last name. The Name + Condition column shows what percentage of these sentences also contain a condition associated with a patient with that (first or last) name. Similarly, the Name + Wrong condition column shows the percentage where the condition is not associated with the patient.

[Figure 2: The figure plots the most common conditions in the texts generated by Lehman et al. (2021), our nucleus text (p = 0.95), and MIMIC-III. The top ten conditions detected by MedCAT in each text explain a majority of all conditions. Many of them are vague and general, like edema or pain.]
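The name/condition counting described under Finding Conditions could be carried out along the lines of the sketch below. It is illustrative only: en_core_web_sm stands in for the spaCy NER model, the mapping from patient names to their conditions (built in this study with MedCAT and ICD-9 codes) is assumed to be precomputed and passed in as a plain dictionary, and condition matching is reduced to simple substring search.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # stand-in for the NER tagger used in the paper

    def count_condition_matches(sentences, patient_conditions):
        # patient_conditions: dict mapping a lower-cased patient name to a set of
        # lower-cased condition strings (assumed precomputed, e.g. with MedCAT).
        all_conditions = set().union(*patient_conditions.values())
        with_name = correct = wrong = 0
        for sentence in sentences:
            doc = nlp(sentence)
            names = {ent.text.lower() for ent in doc.ents if ent.label_ == "PERSON"}
            names = {n for n in names if n in patient_conditions}
            if not names:
                continue
            with_name += 1
            mentioned = {c for c in all_conditions if c in sentence.lower()}
            if any(mentioned & patient_conditions[name] for name in names):
                correct += 1   # name co-occurs with a condition of a matching patient
            elif mentioned:
                wrong += 1     # name co-occurs only with conditions of other patients
        return with_name, correct, wrong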
Detecting Names  Furthermore, Lehman et al. (2021) found that their results likely contained many false positives due to the ambiguous nature of some names. The samples generated in this study show a similar pattern. For example, approximately 10% of the sentences deemed to be associated with a patient and a condition were selected on the basis of containing the name (or word) Max.

The set of names detected in the generated sentences constitutes a small portion of the total collection of names found in the pseudonymized MIMIC-III corpus. Table 2 shows the percentages of all such names detected in the sentences generated by Lehman et al. (2021) and by us.

The vast majority of all names are not detected at all. This is only partly due to the vastly larger size of the MIMIC-III corpus. More likely, it is due to the aforementioned overrepresentation of ambiguous names like Max. Many of the names found in the sentences are not part of the MIMIC-III corpus and have likely been learned during the earlier pre-training of the BERT base model.

In combination with the observation that many names are false positives, this suggests that only a small minority of all names are leaked. However, there are examples of likely memorizations, and Figure 1 illustrates such a case.

                            Percentage of names detected
    Lehman et al. (2021)    10.1%
    k = 1000                3.27%
    p = 0.99                4.25%
    p = 0.95                2.40%

Table 2: Lehman et al. (2021) generate the largest number of sentences (500,000 sentences), and 10.1% of the names in the pseudonymized MIMIC-III corpus can be detected in their sentences. The largest proportion of names detected in our sentences is the 4.25% found in the 50,000 sentences generated using nucleus sampling with p = 0.99.

[Figure 1: A few examples from a clinical note that the model seems to have memorized. The name (i.e. "Coleman") and the condition (e.g. "myclonic jerking") are highlighted in yellow and green respectively.]

5.3 Metrics for Assessing Linguistic Quality

The quality of a given corpus of generated text is not a well-defined property. Gatt and Krahmer (2018) list several subjective and objective metrics that can be used to assess the quality of a generated body of text. This study takes the view that human-likeness is a good proxy for quality in the context of natural language generation.

The human-likeness of the generated samples was assessed by computing a series of metrics and comparing them to a gold standard corpus of human-produced text. The corpus used as the gold standard was the pseudonymized MIMIC-III corpus which the clinical BERT model was trained to model. Using a more general corpus would make less sense in this context, because the clinical BERT model is specifically trained to learn the characteristics of clinical notes, which differ significantly from more general forms of writing.

Similarly to Holtzman et al. (2020), we calculated two diversity metrics, the Self-BLEU (Zhu et al. 2018) and the shape of the Zipf distribution (Piantadosi 2014), as well as the repetitiveness of the texts, which captures the fluency⁵. The quality of the generated samples is determined by comparing the metrics calculated from the generated samples with those of the gold standard.

⁵ The perplexity is left out as there is no consensus on how to calculate it for masked language models, and the alternatives are very expensive to calculate (Salazar et al. 2019).

Self-BLEU is a metric of diversity that measures how similar each sentence in a corpus is to the rest of the corpus. Zhu et al. (2018), who first proposed the metric, calculate it by averaging together the BLEU of every sentence compared to the rest of the corpus. Due to the size of our generated corpora, we calculate the Self-BLEU slightly differently. As was done by Holtzman et al. (2020), the Self-BLEU is calculated using a random subset S′ of |S′| = 1,000 sentences from the larger corpus S:

    Self-BLEU = (1 / |S′|) Σ_{s ∈ S′} [ Σ_{r ∈ S \ {s}} BLEU(s, r) / (|S| − 1) ]    (4)

The Zipf distribution is a statistical distribution based on Zipf's law, which states that there is a relationship between a word's rank r in a frequency list of a corpus and its frequency f(r):

    f(r) ∝ 1 / r^s_zipf    (5)

This relationship can be used to estimate s_zipf, which can then be used to compare the rank-frequency distributions of different corpora.
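For reference, the sketch below computes the two metrics essentially as defined in equations (4) and (5). The use of NLTK's sentence-level BLEU and of a least-squares fit in log-log space to estimate s_zipf are our own choices; the exact implementation used in this study is not reproduced here, and this direct transcription of equation (4) is quadratic in the corpus size rather than optimized.

    import random
    from collections import Counter

    import numpy as np
    from nltk.translate.bleu_score import sentence_bleu

    def self_bleu(corpus, subset_size=1000, max_n=4):
        # Equation (4): average pairwise BLEU of a random subset S' against the corpus S.
        subset = random.sample(corpus, min(subset_size, len(corpus)))
        weights = tuple([1.0 / max_n] * max_n)
        total = 0.0
        for s in subset:
            others = [r for r in corpus if r is not s]  # exclude the sentence itself
            pair_scores = [sentence_bleu([r.split()], s.split(), weights=weights) for r in others]
            total += sum(pair_scores) / len(pair_scores)
        return total / len(subset)

    def zipf_exponent(corpus):
        # Equation (5): fit f(r) ~ 1 / r**s_zipf by linear regression in log-log space.
        counts = Counter(token for sentence in corpus for token in sentence.split())
        freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
        return -slope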
5.4 Measuring the Quality of the Generated Samples

Every collection of generated samples was analyzed to determine the quality of the generations. Table 3 and Figure 3 show that the methods used in this study result in generated samples that are closer to the MIMIC-III corpus.

The exception is the small number of repetitions, which are absent from the datasets used for comparison. The MIMIC-III data is human-produced, so it is not surprising that it does not contain any repetitions. The other discrepancies are likely due to the larger number of iterations used in this study as compared to the 500 iterations used in Lehman et al. (2021), which leaves some masked tokens in their generated samples.

[Figure 3: Rank-frequency distribution for the human gold standard (MIMIC-III) as well as the generated samples. The distribution of the samples generated in Lehman et al. (2021) has a tail of unnaturally frequent words which is absent in the gold standard and in our more advanced generations.]

                            [MASK]      Repetitions   bleu-4   bleu-5   s_zipf
    MIMIC-III               N/A         0%            0.399    0.298    1.05
    Lehman et al. (2021)    5.54%       0%            0.251    0.116    1.39
    p = 0.99                1.91e-3%    0.12%         0.433    0.253    1.22
    p = 0.95                1.91e-3%    0.12%         0.485    0.306    1.26
    k = 1000                5.75e-3%    0.11%         0.435    0.246    1.23

Table 3: Text quality metrics for each corpus of text. MIMIC-III is the human gold standard, and the values closest to the gold standard are bolded. The percentages describe the proportions of sentences in each corpus containing [MASK] tokens or containing repetitions.

6 Discussion

This study has given us insights into the complicated area of protecting privacy in training data represented in language models. One suggestion in the research community is to use homomorphic encryption (Parmar et al. 2014; Al Badawi et al. 2020) for the data and models. However, it seems that using homomorphically encrypted models is currently too complicated for users.

A more straightforward way to protect the privacy of persons in the training data is to pseudonymize it before training. Both Berg, Chomutare, and Dalianis (2019) and Berg, Henriksson, and Dalianis (2020) build NER taggers on clinical data that has been pseudonymized. They find that, while this decreases the performance of the NER taggers, it does so to an acceptable degree. These taggers can be used to build automatic de-identification systems that can make training datasets less sensitive, as shown by Dalianis and Berg (2021). However, no such system can achieve perfect recall. Thus, this approach is analogous to a weak form of differential privacy where noise in the form of pseudonyms is added to the training data.

The clinical BERT model used in this article is trained on clinical data, but uses a BERT model pre-trained on non-sensitive data as its basis. This is good from a privacy perspective, as it means that names that are emitted when sampling from the model are of uncertain origin. Detecting a name in the output is thus a weaker signal, as the name might simply be memorized from the first phase of training on non-sensitive data. However, Gu et al. (2021) show that pre-training with only medical data can yield stronger results, suggesting that this approach may become more prevalent in the future.
Further research into extracting training data from BERT models trained solely on sensitive data would shed light on the potential risks of this approach. The model in this article is also uncased, meaning that it is only trained on lowercase tokens. This means that it has a harder time distinguishing entities that are normally capitalized, like names, from other words. Investigating the impact of not lowercasing the data would be interesting, since this is a design choice that may not be suitable for languages where casing is important.

More robust metrics for measuring privacy leakage from training data extraction attacks would also be of use. The metrics used in this article and by Lehman et al. (2021) strongly suggest that detecting a link between a patient's name and a condition is very difficult. A very small number of samples contain any such possible associations, and many of these are likely to be false positives. This is due both to the ambiguity of many of the detected names and to it being slightly more likely to find a condition not associated with the named patient.

It is also unclear what risks are acceptable from a legal perspective. Regulations such as the GDPR have strict requirements to avoid the risk of identification. At the same time, the GDPR also contains language stating that "the costs of and the amount of time required for identification" (European Commission 2018) should be taken into consideration when making risk assessments. Clarifications from legal scholars are necessary for these and other results in the privacy domain to be contextualized and applicable to real applications.

7 Conclusions

The sampling methods used in this article show a significant improvement regarding the linguistic quality of the samples, as shown in Table 3. At the same time, Table 1 shows that the prevalence of patients and their conditions within the generated samples is stable. This suggests that privacy leakage is not strongly correlated with the quality of the sampling techniques.

Nucleus sampling, first described as a technique for sampling from the autoregressive model GPT-2 (Holtzman et al. 2020), is also shown to be an effective technique for sampling from the masked language model BERT. Further research into how to sample quality text from masked language models is an interesting topic, but our research indicates that advances in that direction do not have significant privacy implications.

It cannot be ruled out that other sampling techniques, regardless of their linguistic quality, may be able to extract training data more effectively. Carlini et al. (2020) showed that the risk of an adversary successfully extracting training data from GPT-2 is significant. Our results, together with those of Lehman et al. (2021), strongly suggest that the risk of successfully sampling sensitive data from a BERT-based model is much smaller when compared to GPT-2.

Acknowledgments

A special thanks to Sarthak Jain and Eric Lehman for their patient assistance with reproducing their experiments from Lehman et al. (2021) and for making their data available to us. We are also grateful to the DataLEASH project for funding this research work.

References

Al Badawi, A.; Hoang, L.; Mun, C. F.; Laine, K.; and Aung, K. M. M. 2020. PrivFT: Private and fast text classification with homomorphic encryption. IEEE Access 8: 226544–226556.

Bender, E. M.; Gebru, T.; McMillan-Major, A.; and Shmitchell, S. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.

Berg, H.; Chomutare, T.; and Dalianis, H. 2019. Building a De-identification System for Real Swedish Clinical Text Using Pseudonymised Clinical Text. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), 118–125.

Berg, H.; Henriksson, A.; and Dalianis, H. 2020. The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 1–11.

Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, U.; et al. 2020. Extracting Training Data from Large Language Models. arXiv preprint arXiv:2012.07805.

Dalianis, H.; and Berg, H. 2021. HB Deid - HB De-identification tool demonstrator. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 467–471.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019).

European Commission. 2018. Recital 26 - Not applicable to anonymous data. URL https://gdpr.eu/recital-26-not-applicable-to-anonymous-data/.

Gatt, A.; and Krahmer, E. 2018. Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61: 65–170. doi:10.1613/jair.5477.

Geman, S.; and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6): 721–741.

Ghazvininejad, M.; Levy, O.; Liu, Y.; and Zettlemoyer, L. 2019. Mask-Predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.

Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; and Poon, H. 2021. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. arXiv:2007.15779 [cs]. URL http://arxiv.org/abs/2007.15779.

Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On Calibration of Modern Neural Networks. arXiv:1706.04599 [cs]. URL http://arxiv.org/abs/1706.04599.

Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations. URL https://openreview.net/forum?id=rygGQyrFvH.

Honnibal, M.; Montani, I.; Van Landeghem, S.; and Boyd, A. 2020. spaCy: Industrial-strength Natural Language Processing in Python. doi:10.5281/zenodo.1212303. URL https://doi.org/10.5281/zenodo.1212303.

Huang, K.; Altosaar, J.; and Ranganath, R. 2020. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv:1904.05342 [cs]. URL http://arxiv.org/abs/1904.05342.

Johansen, A. 2010. Markov Chain Monte Carlo. In Peterson, P.; Baker, E.; and McGaw, B., eds., International Encyclopedia of Education (Third Edition), 245–252. Oxford: Elsevier. ISBN 978-0-08-044894-7. doi:10.1016/B978-0-08-044894-7.01347-6.

Johnson, A. E. W.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3(1): 160035. doi:10.1038/sdata.2016.35.

Kraljevic, Z.; Searle, T.; Shek, A.; Roguski, L.; Noor, K.; Bean, D.; Mascio, A.; Zhu, L.; Folarin, A. A.; Roberts, A.; Bendayan, R.; Richardson, M. P.; Stewart, R.; Shah, A. D.; Wong, W. K.; Ibrahim, Z.; Teo, J. T.; and Dobson, R. J. B. 2021. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artificial Intelligence in Medicine 117: 102083. doi:10.1016/j.artmed.2021.102083.

Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; and Kang, J. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics btz682. doi:10.1093/bioinformatics/btz682.

Lehman, E.; Jain, S.; Pichotta, K.; Goldberg, Y.; and Wallace, B. C. 2021. Does BERT Pretrained on Clinical Notes Reveal Sensitive Data? In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Nakamura, Y.; Hanaoka, S.; Nomura, Y.; Hayashi, N.; Abe, O.; Yada, S.; Wakamiya, S.; and Aramaki, E. 2020. KART: Privacy Leakage Framework of Language Models Pre-trained with Clinical Records. arXiv:2101.00036 [cs]. URL http://arxiv.org/abs/2101.00036.

Nasr, M.; Shokri, R.; and Houmansadr, A. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP), 739–753. IEEE.

Parmar, P. V.; Padhar, S. B.; Patel, S. N.; Bhatt, N. I.; and Jhaveri, R. H. 2014. Survey of various homomorphic encryption algorithms and schemes. International Journal of Computer Applications 91(8).

Piantadosi, S. T. 2014. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21(5): 1112–1130. doi:10.3758/s13423-014-0585-6.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI blog 1(8): 9.

Salazar, J.; Liang, D.; Nguyen, T. Q.; and Kirchhoff, K. 2019. Masked language model scoring. arXiv preprint arXiv:1910.14659.

Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), 3–18. IEEE.

Wang, A.; and Cho, K. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094.

Zhu, Y.; Lu, S.; Zheng, L.; Guo, J.; Zhang, W.; Wang, J.; and Yu, Y. 2018. Texygen: A Benchmarking Platform for Text Generation Models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, 1097–1100. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3209978.3210080.