<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual ICD-10 Code Assignment with Transformer Architectures using MIMIC-III Discharge Summaries</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <institution>Department of Computer Science</institution>
          ,
          <institution>University of Applied Sciences and Arts Dortmund (FHDO)</institution>
          ,
          <addr-line>Emil-Figge Str., Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Medical Informatics</institution>
          ,
          <addr-line>Biometry and Epidemiology</addr-line>
          ,
          <institution>University Hospital Essen</institution>
          ,
          <addr-line>Essen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we present the participation of the FHDO Biomedical Computer Science Group (BCSG) in Task 1 of the CLEF eHealth challenge 2020 on the automatic assignment of ICD-10 codes (CIE-10 in the Spanish translation) to clinical case studies. Training data has been augmented with documents from the Medical Information Mart for Intensive Care (MIMIC-III), a critical care database. ICD-10 CM General Equivalence Mappings (GEMs) were subsequently used to convert the codification from ICD-9 to ICD-10. Recent state-of-the-art Transformer-based models, such as BioBERT and ClinicalBERT, are compared to the Generalized Autoregressive Pretraining for Language Understanding (XLNet) model. Finally, the apriori algorithm has been applied to build association rules by finding frequent item sets. An ensemble of BioBERT and XLNet achieved a mean Average Precision (MAP) score of 0.259 (0.306 for the subset of codes only present in the training and validation sets).</p>
      </abstract>
      <kwd-group>
        <kwd>BioBERT</kwd>
        <kwd>MIMIC-III</kwd>
        <kwd>Conversion</kwd>
        <kwd>Apriori</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>ICD codes serve as a billing mechanism in the Electronic Health Record (EHR) and can be used for automatic semantic indexing of clinical documents, but also to facilitate decision support systems that aim to help clinical coders by suggesting a relevant subset of potential codes for selection. The problem can be described as a mapping from natural language free-texts to medical concepts such that, given a new document, the system can assign multiple codes to it.</p>
      <p>
        In terms of application in the biomedical field, Bidirectional Encoder Representations from Transformers (BERT) has only recently been used for ICD code assignment tasks, such as classifying German animal experiments in CLEF eHealth 2019 [
        <xref ref-type="bibr" rid="ref14 ref3">3,27,25</xref>
        ]. While it has proven to work well on assigning a smaller subset of ICD codes, it is uncertain how Transformer architecture models perform on arbitrarily long clinical text and in solving extreme multi-label classification problems with a high average number of assigned codes per document.
      </p>
      <p>
        CLEF eHealth tracks have featured the classification of multilingual clinical documents using ICD codes since 2016 [22,23,24,25]. This work enriches training data with the Medical Information Mart for Intensive Care (MIMIC-III) database and compares BERT-based models with XLNet [
        <xref ref-type="bibr" rid="ref19">32</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        A hierarchy-based approach with Support Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], using the 'is-a' relationship between ICD-9 codes to model label dependencies, has been an early approach to ICD coding [
        <xref ref-type="bibr" rid="ref13">26</xref>
        ]. The hierarchy-based classifier surpassed the flat SVM, which did not consider code dependencies. Other approaches identified label density and label noise as useful features [
        <xref ref-type="bibr" rid="ref16">29</xref>
        ], while others empirically evaluated the simultaneous occurrence of labels [16].
      </p>
      <p>
        ML-NET [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] followed the hierarchy-based approach and extended the coding of documents. Its deep neural network contains an additional network for estimating the number of labels: instead of separating relevant from irrelevant labels by a threshold value, a network for predicting the number of labels was built using the document vector as input.
      </p>
      <p>
        Baumel et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] evaluated four different models for ICD code assignment using data from the MIMIC-II and MIMIC-III data sets. They presented a continuous bag-of-words model [19] (CBOW), a convolutional neural network, an SVM one-versus-all model and a bidirectional gated-recurrent unit model with hierarchical attention (HA-GRU).
      </p>
      <p>Another proposed model is a code-wise attention network [21], where attention mechanisms are used to extract n-grams from the text that are influential in predicting each code.</p>
      <p>
        Unified Medical Language System (UMLS) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] mapping and word embeddings have been shown to be effective for text classification in the biomedical domain and improved results in automatic ICD coding [
        <xref ref-type="bibr" rid="ref15">28</xref>
        ]. The embeddings were selected by sequentially mapping discharge summaries to UMLS biomedical concepts in an approach to enrich word representations and to eliminate variations caused by tense, abbreviations and/or spelling mistakes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>For training data, two different sources were used: the official CodiEsp dataset3 with manually generated ICD-10 codifications, and the MIMIC-III database, which uses the older ICD-9 classification system and maps codes to discharge summaries [15]. After exploring additional resources, such as the abstracts collected from Lilacs and Ibecs4, the MIMIC-III database was selected as the main data source for augmentation, because it seemed to be the most similar database compared to the CodiEsp corpus: its free-text narrative documents describe hospital courses and carry a high average number of manually assigned codes per document, coming from real-world EHRs. With the decision to use the MIMIC-III dataset for augmentation, it was also decided to focus on the English translations of the CodiEsp corpus. A key difference between the two data sources is that the codification for CodiEsp is a semantic mapping of concepts, where the assigned codes do not have to be based on medical outcome. For example, a negative serum test (as seen in Listing 1.1) still results in appropriately assigned ICD-10 codes for CodiEsp, whereas it would not appear in MIMIC-III.</p>
      <sec id="sec-3-1">
        <title>Listing 1.1. Excerpt of CodiESP Document with id S0211-69952009000500014-1,</title>
        <p>showing results of a blood serum test and its codification (Assigned Codes List: r80.9, r20.2, b19.20, b19.10, r23.8, r60.0, r10.9, r19.7, m25.50, l98.9, b20).
[...]
On physical examination: blood pressure 104/76 mmHg, BMI 27, minimal edema in lower limbs and papules in elbows and arms. Blood count and coagulation were normal, creatinine 0.9 mg/dl, total cholesterol 238 mg/dl, triglycerides 104 mg/dl, total protein 6.5 g/dl and albumin 3.6 g/dl. Anticardiolipin antibodies: Serology against HBV, HCV and HIV was negative.
[...]</p>
        <sec id="sec-3-1-1">
          <title>CodiEsp Corpus</title>
          <p>The CodiEsp corpus consists of 1,000 clinical case studies manually selected by a practicing physician and a clinical documentarian [20]. The training and development dataset comprises 750 documents with an average of 11.09 codes assigned per document. The test set contains 250 documents and was provided together with an additional collection of more than 2,000 documents (background set) to prevent manual corrections. The CodiEsp training and development dataset contains 26,696 unique tokens, with an average of 301 tokens and 19 sentences per document. It contains 2,557 distinct codes in total, of which 363 unseen codes appear in the test set, as seen in Figure 1 (a). 68.24 % of the codes are explainable with the CodiEsp training and development dataset.
3 https://doi.org/10.5281/zenodo.3625746, last accessed 2020-07-17
4 https://doi.org/10.5281/zenodo.3606625, last accessed 2020-07-17</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>MIMIC-III Corpus</title>
          <p>The MIMIC-III database comprises de-identified records from Beth Israel Deaconess Medical Center intensive care unit (ICU) stays, collected between 2001 and 2012. It contains 59,652 discharge summaries with an average of 11.48 codes assigned per document. It has 119,171 unique tokens with 1,947 tokens and 112 sentences on average. The dataset is in principle very well suited but has some characteristics that need to be adapted. The coding system is ICD-9, which has to be converted to ICD-10 to match the CodiEsp codification. In addition, the dataset only contains summaries of intensive care unit stays, which on average exceed the maximum input length of the Transformer architectures. After conversion, the dataset contains 5,447 distinct codes, as seen in Figure 1 (b).</p>
          <p>
            Segmentation For BERT [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] models, the maximum length of a sequence after tokenizing is 512, resulting in an effective limit of 510 tokens for the input layer after subtracting the [CLS] and [SEP] tokens. Because MIMIC-III discharge summaries have an average length of 1,947 tokens (see Table 1), with only 11.67 % of all documents not exceeding 510 tokens, the data has to be truncated in order to fit into the Transformer model.
          </p>
          <p>
            A simple approach, as proposed by Sun et al. [
            <xref ref-type="bibr" rid="ref17">30</xref>
            ], would be to only use the first 510 tokens (head-only) or the last 510 tokens (tail-only) of a document, but neither seems appropriate for truncating clinical text without losing relevant information.
          </p>
          <p>When inspecting the summaries, even though they are free-text narratives, a fixed structure was identified in most of the documents: they usually start with a Chief Complaint followed by a historical background section, which may include History of Present Illness, Past Medical History, Social History and Family History. Within Diagnostics and Pertinent Results, the structure is no longer as consistent and different sections appear, which depend more on the individual case. From the middle towards the end of the documents there is a section called Brief Hospital Course, which summarizes the ICU stay, followed by discharge condition instructions and/or follow-up instructions.</p>
          <p>In early experiments, the effect of using different segments was evaluated. It was found that using the first 510 tokens (head-only) of discharge summaries decreased performance compared to using the last 510 tokens (tail-only). It can be assumed that this is because the background history, which comes at the top of the documents, is not as relevant to the clinical coding as the narrative of the actual present hospital course. It was decided to remove content up to the Brief Hospital Course section and sequentially use the remaining document up to whatever fits into 510 tokens. 7,822 documents where this section was not present were omitted, resulting in a 13 % loss of data. Descriptive statistics of the segmented corpus can be seen in Table 1.</p>
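          <p>To make the segmentation step concrete, the following minimal sketch (function and variable names are illustrative, not from the original implementation; any compatible tokenizer can be substituted) removes content up to the Brief Hospital Course section and clips the remainder to 510 word pieces:</p>
          <preformat>
import re
from typing import Optional

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def segment_summary(text: str, max_tokens: int = 510) -> Optional[str]:
    """Keep the document from 'Brief Hospital Course' on, clipped to max_tokens."""
    match = re.search(r"Brief Hospital Course", text, flags=re.IGNORECASE)
    if match is None:
        return None  # such documents were omitted, as described above
    tail = text[match.start():]
    # 512 minus the [CLS] and [SEP] tokens leaves 510 word pieces
    tokens = tokenizer.tokenize(tail)
    return tokenizer.convert_tokens_to_string(tokens[:max_tokens])
          </preformat>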
        </sec>
      </sec>
      <table-wrap id="tab1">
        <label>Table 1.</label>
        <caption>
          <p>Descriptive statistics of the corpora. MIMIC-III* denotes the segmented MIMIC-III corpus.</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>MIMIC-III</th><th>MIMIC-III*</th><th>CodiESP Train Dev</th></tr>
          </thead>
          <tbody>
            <tr><td>Number of records with ICD code</td><td>59,652</td><td>51,830</td><td>750</td></tr>
            <tr><td>Number of unique tokens</td><td>1,091,025</td><td>276,500</td><td>26,696</td></tr>
            <tr><td>Number of bigrams</td><td>10,609,279</td><td>2,846,377</td><td>114,846</td></tr>
            <tr><td>Number of trigrams</td><td>27,814,651</td><td>7,873,155</td><td>180,650</td></tr>
            <tr><td>Avg. number of tokens / record</td><td>1,947</td><td>427</td><td>301</td></tr>
            <tr><td>Avg. number of sentences / record</td><td>112</td><td>39</td><td>19</td></tr>
            <tr><td>Avg. number of labels / record</td><td>11.48</td><td>11.45</td><td>11.09</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-2">
        <title>ICD-9 Code Conversion with General Equivalence Mappings</title>
          <p>
            ICD-9 codes of the MIMIC-III database have been converted to ICD-10 using the publicly available ICD-10 CM General Equivalence Mappings (GEMs) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Turer et al. assessed the reliability of conversion between ICD-9 and ICD-10 and found that manual coding from the forward GEMs and backward GEMs was reproducible in 85.2 % and 90.4 % of cases, respectively [
            <xref ref-type="bibr" rid="ref18">31</xref>
            ].
          </p>
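          <p>A minimal sketch of this conversion step, assuming the forward GEM file has been obtained as a whitespace-separated text file of ICD-9 source codes, ICD-10 target codes and mapping flags (the parsing details are assumptions for illustration):</p>
          <preformat>
from collections import defaultdict

def load_gems(path: str) -> dict:
    """Parse a forward GEM file into an ICD-9 -> set-of-ICD-10 mapping."""
    mapping = defaultdict(set)
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 2:
                icd9, icd10 = parts[0], parts[1]
                mapping[icd9].add(icd10)
    return mapping

def convert_codes(icd9_codes, gems) -> list:
    """Convert one document's codes, dropping codes without a mapping."""
    converted = set()
    for code in icd9_codes:
        converted.update(gems.get(code.replace(".", ""), set()))
    return sorted(converted)
          </preformat>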
          <p>Data Selection Because of the different data sources and MIMIC-III being limited to ICU cases, both datasets have been compared in terms of their distinct code subsets. As seen in Figure 1 (b), the MIMIC-III data contains 4,156 unique ICD-10 codes that are not present in the CodiEsp train, development, and test set. These codes are less generic, apply to the ICU cases and are not covered by the smaller CodiEsp corpus. To make the data augmentation more practical, only documents where 50 % or more of the assigned codes are present in the Top 100, Top 250 or Top 500 frequent codes of the CodiEsp training and development set were used (the impact on training size can be seen in Table 3). Only discharge summaries containing the Brief Hospital Course section were selected by using a regular expression match, resulting in 51,830 out of 59,652 available documents.</p>
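          <p>The 50 % criterion can be expressed in a few lines; this sketch assumes each MIMIC-III document carries its converted ICD-10 code set and that top_codes holds the Top 100, 250 or 500 frequent CodiEsp codes (names are illustrative):</p>
          <preformat>
def select_documents(documents, top_codes, min_overlap=0.5):
    """Keep documents where at least half the assigned codes are frequent CodiEsp codes."""
    frequent = set(top_codes)
    selected = []
    for doc in documents:
        codes = set(doc["icd10_codes"])
        if codes and len(codes.intersection(frequent)) / len(codes) >= min_overlap:
            selected.append(doc)
    return selected
          </preformat>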
          <p>The amount of available augmentation data increases when the Top frequent code set is enlarged, because the matching rule requiring that a document has 50 % or more of its codes in that set becomes less strict, so more MIMIC-III documents end up in the training data. Increasing the augmentation data in this way increases recall but reduces precision (see Table 4). A good compromise was to create a model that is able to predict the Top 100 frequent codes in CodiEsp.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
      <sec id="sec-4-1">
        <title>Transformer architecture and BERT</title>
        <p>
          BERT and Transformer [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] have proven to be extremely effective in many downstream natural language processing (NLP) tasks. While this works well on assigning a smaller subset of ICD codes [
          <xref ref-type="bibr" rid="ref14 ref3">3,27</xref>
          ], it is uncertain how BERT models can cope with clinical texts of arbitrary length and with extreme multi-label classification problems with a high average number of assigned codes per document.
        </p>
        <p>Fig. 1. Venn diagrams showing the distribution of the number of distinct ICD-10 codes for different datasets and subsets: (a) CodiEsp Train Dev and Test distribution; (b) MIMIC-III and CodiEsp Train Dev and Test set; (c) MIMIC-III with 50 % in Top 100 CodiEsp and Test set; (d) MIMIC-III with 50 % in Top 100 CodiEsp and Train Dev and Test set.</p>
        <p>Though the MIMIC-III augmentation does not fit into the token limitation without clipping documents, the Transformer architecture offers innovations that are practical for the classification of clinical text. The WordPiece tokenizer allows words that are outside the vocabulary to be represented by word pieces instead of simply being assigned to an unknown token, which is why it was selected for the first tests. This feature is particularly useful for discharge summaries, as spelling mistakes and non-standard abbreviations are common.</p>
        <p>
          Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) [18] and ClinicalBERT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] share the same architecture but are pre-trained on large-scale biomedical corpora. BioBERT has been pre-trained on PubMed abstracts5 and PMC6 full-text articles. Bio ClinicalBERT7 is an extended model that was additionally pre-trained on all notes from MIMIC-III (880M words). The Bio ClinicalBERT model was selected because of this larger pre-training corpus.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>XLNet</title>
        <p>
          The recently proposed Generalized Autoregressive Pretraining for Language Understanding (XLNet) model [
          <xref ref-type="bibr" rid="ref19">32</xref>
          ] is an autoregressive language model (LM). Although BERT and XLNet have many similarities, there are some differences that need to be explained. Here, autoregressive means that XLNet makes use of the Transformer-XL [14] to capture information from previous sequences in order to process the current sequence, achieving the regressive effect at the sequence level. XLNet uses relative position encoding and a permutation LM, factorizing the output over all possible permutations.
        </p>
        <p>The permutation effect is limited to words which are "attended" to. This is done by changing the attention mask prior to the attention softmax while keeping track of the positional information in a sequence. For example, during pre-training, to predict a token t, the attention mask is set to minimum values for tokens that appear after position i &gt; t. Only the tokens before and including t in the current factorization are used to compute the attention. The advantage is that the tokens that come before t change with each permutation, but their positions within the sequence are kept constant, allowing XLNet to capture bidirectional context.</p>
        <p>
          XLNet implements Multi-head attention slightly differently from BERT, where it is known that the layer generates a query Q, a key K, and a value V projection of each word in the input sentence. For each query Q, the Multi-head attention layer uses K to compute an attention score for each value vector V and then sums the value vectors into a single representation using the attention weights [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
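        <p>For reference, the scaled dot-product attention underlying each head can be written in a few lines. This is the standard formulation from the Transformer literature, not code from the system described here:</p>
        <preformat>
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # masked positions get a very low score before the softmax,
        # analogous to the XLNet permutation masking described above
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # weighted sum of the value vectors
        </preformat>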
        <p>For XLNet, linear layers are used to map the input to the Multi-head attention layer directly. This maps the input into smaller subspaces whose dimensions add up to the original dimension, as known from BERT. This allows each word to attend more to other words and not only to itself, which results in a final, richer representation of each word.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Preprocessing</title>
        <p>Because the discharge summaries were de-identified free-text narratives, additional pre-processing steps were taken to convert them into a sequence of sentences, removing all numbers and name placeholders. Leading and trailing spaces, quotations and semicolons have also been removed. For the CodiESP corpus, no pre-processing was applied.
5 https://pubmed.ncbi.nlm.nih.gov/, last accessed 2020-07-17
6 https://www.ncbi.nlm.nih.gov/pmc/, last accessed 2020-07-17
7 https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT, last accessed 2020-07-17</p>
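        <p>A sketch of these cleaning steps using regular expressions (the exact patterns are assumptions; MIMIC-III de-identification placeholders have the form [** ... **]):</p>
        <preformat>
import re

def clean_summary(text: str) -> str:
    """Remove numbers, name placeholders, quotations and semicolons."""
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)  # de-identification placeholders
    text = re.sub(r"\d+", " ", text)              # all numbers
    text = re.sub(r"[\"';]", " ", text)           # quotations and semicolons
    return re.sub(r"\s+", " ", text).strip()      # leading/trailing spaces
        </preformat>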
      </sec>
      <sec id="sec-4-4">
        <title>Training the models</title>
        <p>
          The experiments were done with the PyTorch-transformers implementations of BERT and XLNet8. The overall end-to-end training process can be seen in Figure 2. The models were fine-tuned on all layers without freezing. As proposed by the original papers [
          <xref ref-type="bibr" rid="ref19 ref9">9,32</xref>
          ], Adam [17] was used in early experiments as the optimizer, but was then replaced by the Layerwise Adaptive Large Batch (LAMB) optimizer [
          <xref ref-type="bibr" rid="ref20">33</xref>
          ], because it resulted in a slightly reduced training time. The hyperparameters have been selected and optimized based on the development set performance. Using a learning rate of 7e-4 or 6e-4 resulted in the best scores, though the Transformer models react very sensitively to the learning rate, as different settings often led to poor results.
        </p>
        <p>Different warmup schedules were tried, but had no impact on the results. Between the cased and uncased versions of BERT, it was found that overall the uncased version works slightly better, although the difference is very small. For XLNet, only a cased version is available. The base version of XLNet was preferred over the large version due to computational expense. The training batch size was 8 for XLNet and 16 for the BERT models. To produce the ranking of the codes, binary cross-entropy with logits was used to obtain a confidence value for each ICD-10 code during inference. The codes were then ordered by confidence and cut off with a threshold of t = 0.4. The prediction pipeline of the BERT model, including the association rules, is shown in Figure 3.</p>
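        <p>A condensed sketch of this objective and the threshold-based ranking (data loading and the training loop are omitted; the label count and model name are placeholders, and any LAMB implementation can be substituted for the optimizer):</p>
        <preformat>
import torch
from transformers import AutoModelForSequenceClassification

NUM_CODES = 100  # e.g. the Top 100 frequent CodiEsp codes
model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=NUM_CODES)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy with logits

# inside the training loop (batch size 8 for XLNet, 16 for BERT):
#   logits = model(input_ids, attention_mask=attention_mask).logits
#   loss = criterion(logits, multi_hot_labels.float())

def rank_codes(logits, code_names, threshold=0.4):
    """Order the predicted codes by confidence and cut off at t = 0.4."""
    probs = torch.sigmoid(logits)
    ranked = sorted(zip(code_names, probs.tolist()), key=lambda pair: -pair[1])
    return [(code, p) for code, p in ranked if p >= threshold]
        </preformat>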
      </sec>
      <sec id="sec-4-5">
        <title>Apriori Association Rules</title>
        <p>
          The apriori algorithm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] has been used to find frequent itemsets in a list of transactions, but has recently also been used to find association rules and label co-occurrences in clinical text, such as in autopsy reports [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Association rules can be obtained with the support and confidence parameters, where the support of a set of items is the probability that this set occurs in a transaction. Confidence refers to the likelihood that an item B will also be purchased when item A is purchased. It can be calculated by dividing the number of transactions where A and B are bought together by the total number of transactions where A is bought. To identify and explore co-occurrences, a low min support value (0.02) has been used on the CodiEsp train and development set. The resulting apriori association rules, as seen in Figure 6, have been plotted with the arulesviz [13] R package. The graph shows 59 rules.
        </p>
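        <p>Support and confidence as defined above can also be computed with an off-the-shelf Python apriori implementation; the following sketch uses the mlxtend package (the visualization in this work was done with R/arulesviz, so this is an illustration, not the original tooling):</p>
        <preformat>
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# each transaction is the list of ICD-10 codes assigned to one document
transactions = [["b19.10", "b19.20", "r50.9"], ["r59.0", "r50.9", "r52"]]
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# a low min_support, as above, to identify and explore co-occurrences
frequent = apriori(onehot, min_support=0.02, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.3)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
        </preformat>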
        <p>One example of a relation is hepatitis B and C, as shown by the rule that connects b19.10 and b19.20. When exploring the data, it was found that this rule refers to serology tests, which often include test results for different viruses, such as hepatitis B and C. An example can be seen in Listing 1.1. Another confident rule is that localized enlarged lymph nodes (r59.0 and r59.9) link to unspecified fever (r50.9), which then links to unspecified pain (r52). As such rules should be covered by the trained model, not many different rulesets have been tested and added during inference.
8 https://github.com/huggingface/transformers, last accessed 2020-07-17</p>
        <p>Fig. 2. Overall end-to-end training process: pre-training data (BioBERT: PubMed abstracts; Bio ClinicalBERT: PubMed abstracts and MIMIC-III discharge summaries; XLNet: BooksCorpus and English Wikipedia), fine-tuning on the official CodiEsp data and matching MIMIC-III documents, and a classifier with Sigmoid and BCELoss. Fig. 3. Prediction pipeline producing label probabilities with threshold t = 0.4.</p>
        <p>However, the 11-ruleset seen in Figure 5 improved the mean Average Precision (MAP) results on the development set by between 0 % and 1.2 %, depending on the model, and was therefore added to the final submission where codes were missed. The submission guideline requires that the prediction is ordered by confidence. Because the predicted confidence cannot be compared with apriori support or confidence values, and because the confidence of the primary model was not high enough, the association rule codes were added at the end, ranked by highest level of support.</p>
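        <p>A small sketch of this post-processing step (the rule and prediction data structures are assumptions for illustration):</p>
        <preformat>
def append_rule_codes(predicted, rules):
    """Append association-rule consequents after the model's ranked predictions.

    predicted: codes already ordered by model confidence.
    rules: (antecedent_codes, consequent_code, support) tuples.
    """
    present = set(predicted)
    fired = [(support, consequent)
             for antecedents, consequent, support in rules
             if set(antecedents).issubset(present) and consequent not in present]
    # ranked by highest level of support, appended at the end
    return predicted + [code for _, code in sorted(fired, reverse=True)]
        </preformat>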
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>Figure 4 shows experimental runs on the development set for the tested models with different pre-trained embeddings and different frequent Top code subsets. This results in different enriched training data and also in a different number of labels a model is able to predict. A comparison of how many documents end up in the training data can be seen in Table 3. The final best results on the development set for each model can be seen in Table 2.</p>
      <p>While the F1-Score is superior for models that can only predict the Top 50 frequent codes, the MAP score penalises this behaviour on the full set, because not only the classification but also the positional ranking is taken into consideration. When matching the Top 50 most frequent codes with MIMIC-III, there is not enough data available for augmentation (363 additional documents). Starting with the Top 100 most frequent codes, improvements coming from the additional data can be seen. The augmentation improves the reported MAP score by 0.097 (0.128 F1) for the XLNet model. Increasing the training data further increases recall, but decreases precision.</p>
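      <p>For clarity, the per-document average precision whose mean over all documents gives the MAP score is sketched below; this is the standard formulation rather than the official evaluation script:</p>
      <preformat>
def average_precision(ranked_codes, gold_codes):
    """Reward gold codes that are placed early in the predicted ranking."""
    hits, precision_sum = 0, 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in gold_codes:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(gold_codes) if gold_codes else 0.0

# MAP is the mean of average_precision over all evaluated documents
      </preformat>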
      <p>The final test set results for evaluation were reported by the task organisers and can be seen in Table 4. On the test set, the Bio ClinicalBERT model achieved the overall best performance for a single model with a MAP score of 0.259. XLNet on Top 100 frequent codes achieved the best performance in precision.</p>
      <p>When the gold standard for the test set was released, it was evaluated how many of the unseen codes would have been explainable by keeping the remaining annotated codes of each MIMIC-III document within the training data (knowledge discovery). Figure 1 (d) shows that for the Top 100 most frequent codes training set, 56 distinct unseen codes would have been explainable. Here, a small performance improvement can be expected, but it is noteworthy that only a few of the codes were seen more than once in the test data (76 appearances in total). Because they were unseen before, it can be assumed that these are codes with rare appearances. It can be concluded that more resources are needed to be able to explain the full code set.</p>
      <p>Fig. 4. F1-Score over training steps on the development set for the tested models and frequent Top code subsets: BERT base mimic Top 50, BERT base Top 50, XLNet mimic Top 50, XLNet mimic Top 100, XLNet Top 50, XLNet Top 100 and Bio ClinicalBERT Top 100.</p>
      <table-wrap id="tab2">
        <label>Table 2.</label>
        <caption>
          <p>Final best results on the development set for each model.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>MAP</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>XLNet base cased + MIMIC-III - Top 50</td><td>0.232</td><td>0.608</td></tr>
            <tr><td>XLNet base cased - Top 50</td><td>0.216</td><td>0.602</td></tr>
            <tr><td>BERT base uncased + MIMIC-III - Top 50</td><td>0.143</td><td>0.47</td></tr>
            <tr><td>BERT base uncased - Top 50</td><td>0.165</td><td>0.372</td></tr>
            <tr><td>XLNet base cased + MIMIC-III - Top 100</td><td>0.247</td><td>0.432</td></tr>
            <tr><td>XLNet base - Top 100</td><td>0.15</td><td>0.304</td></tr>
            <tr><td>Bio Clinical BERT + MIMIC-III - Top 100</td><td>0.244</td><td>0.361</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab3">
        <label>Table 3.</label>
        <caption>
          <p>Training data and model sizes for the different Top frequent code subsets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Training Data Size</th><th>Model Size</th></tr>
          </thead>
          <tbody>
            <tr><td>XLNet mimic 500</td><td>19,484 documents</td><td>459.78M</td></tr>
            <tr><td>XLNet mimic 250</td><td>10,754 documents</td><td>459.03M</td></tr>
            <tr><td>XLNet mimic 100</td><td>3,286 documents</td><td>458.58M</td></tr>
            <tr><td>Bio Clinical BERT 100</td><td>3,286 documents</td><td>423.43M</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Fig. 5. Graph for 11 ICD-10 apriori association rules. Size: min support (0.03), min confidence (0.3); Color: lift (1.393-20.294).</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This work compared BERT-based models with XLNet and evaluated the effect of enriching training data with documents from MIMIC-III. It was found that the MIMIC-III augmentation with code conversion improved the results compared to using only the stock dataset. The apriori algorithm has been applied to build and explore association rules by finding frequent item sets. The 11-ruleset improved the mean Average Precision (MAP) results on the development set by between 0 % and 1.2 %.</p>
      <p>Among the submitted models, the ensemble of BioBERT and XLNet achieved the highest mean Average Precision (MAP) score of 0.259 (0.306 for the subset of codes only present in the train and validation sets). In terms of single model performance, the Bio ClinicalBERT model achieved the overall best performance. XLNet, even though pre-trained on generic text, has the highest precision value on the test set and the overall best performance on the development set.</p>
      <p>Though the models are still far from achieving good results on the full label set, the task has been very challenging, with many possible labels given only a relatively small dataset. It was found that even the large MIMIC-III database is not able to cover all unseen codes, so it can be concluded that more resources are needed to be able to explain the full code set.</p>
      <p>In future work, XLNet's attention should be further evaluated, because the sequence dependency on the hidden states of previous sequences can be adjusted by a memory length hyper-parameter. It will be interesting to tune this parameter and see its impact, but also to test how a domain-specific XLNet model performs when pre-trained on large biomedical data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). vol. 1215, pp. 487-499 (1994)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 72-78. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/W19-1909</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum (2019)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: case study on ICD code assignment. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence (2018)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267-D270 (2004)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Butler, R.R.: ICD-10 general equivalence mappings: Bridging the translation gap from ICD-9. Journal of AHIMA 78(9), 84-86 (2007)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Clark, K., Khandelwal, U., Levy, O., Manning, C.: What Does BERT Look at? An Analysis of BERT's Attention. In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 276-286 (2019). https://doi.org/10.18653/v1/W19-4828</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273-297 (1995)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Du, J., Chen, Q., Peng, Y., Xiang, Y., Tao, C., Lu, Z.: ML-Net: multilabel classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association 26(11), 1279-1285 (2019). https://doi.org/10.1093/jamia/ocz085</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Duarte, F., Martins, B., Pinto, C.S., Silva, M.J.: Deep neural models for ICD-10 coding of death certificates and autopsy reports in free-text. Journal of Biomedical Informatics 80, 64-77 (2018). https://doi.org/10.1016/j.jbi.2018.02.011</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth Evaluation Lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Neveol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS Volume 12260 (2020)</mixed-citation>
      </ref>
      <ref id="ref-13">
        <mixed-citation>13. Hahsler, M., Chelluboina, S.: arulesViz: Visualizing association rules and frequent itemsets. R package version 0.1-5 (2012)</mixed-citation>
      </ref>
      <ref id="ref-14">
        <mixed-citation>14. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 328-339 (2018). https://doi.org/10.18653/v1/P18-1031</mixed-citation>
      </ref>
      <ref id="ref-15">
        <mixed-citation>15. Johnson, A.E., Pollard, T.J., Shen, L., Li-Wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), 1-9 (2016)</mixed-citation>
      </ref>
      <ref id="ref-16">
        <mixed-citation>16. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine 65 (2015). https://doi.org/10.1016/j.artmed.2015.04.007</mixed-citation>
      </ref>
      <ref id="ref-17">
        <mixed-citation>17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)</mixed-citation>
      </ref>
      <ref id="ref-18">
        <mixed-citation>18. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (2019). https://doi.org/10.1093/bioinformatics/btz682</mixed-citation>
      </ref>
      <ref id="ref-19">
        <mixed-citation>19. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient Estimation of Word Representations in Vector Space. In: 1st International Conference on Learning Representations (ICLR). vol. abs/1301.3781. Scottsdale, Arizona, USA (2013)</mixed-citation>
      </ref>
      <ref id="ref-20">
        <mixed-citation>20. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estape, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)</mixed-citation>
      </ref>
      <ref id="ref-21">
        <mixed-citation>21. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable Prediction of Medical Codes from Clinical Text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1101-1111 (2018). https://doi.org/10.18653/v1/N18-1100</mixed-citation>
      </ref>
      <ref id="ref-22">
        <mixed-citation>22. Neveol, A., Cohen, K.B., Grouin, C., Hamon, T., Lavergne, T., Kelly, L., Goeuriot, L., Rey, G., Robert, A., Tannier, X., Zweigenbaum, P.: Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR Workshop Proceedings 1609, 28-42 (2016)</mixed-citation>
      </ref>
      <ref id="ref-23">
        <mixed-citation>23. Neveol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in English and French. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2017)</mixed-citation>
      </ref>
      <ref id="ref-24">
        <mixed-citation>24. Neveol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2018)</mixed-citation>
      </ref>
      <ref id="ref-25">
        <mixed-citation>25. Neves, M.L., Butzke, D., Dorendahl, A., Leich, N., Hummel, B., Schonfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 multilingual information extraction. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2019)</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>26. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 21(2), 231-237 (2014)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>27. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum (2019)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>28. Schäfer, H., Friedrich, C.M.: UMLS mapping and Word embeddings for ICD code assignment using the MIMIC-III intensive care database. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp. 6089-6092. IEEE (2019)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>29. Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science 292, 135-151 (2013)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>30. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? In: China National Conference on Chinese Computational Linguistics. pp. 194-206. Springer (2019)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>31. Turer, R.W., Zuckowsky, T.D., Causey, H.J., Rosenbloom, S.T.: ICD-10-CM Crosswalks in the primary care setting: assessing reliability of the GEMs and reimbursement mappings. Journal of the American Medical Informatics Association 22(2), 417-425 (2015)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>32. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized Autoregressive Pretraining for Language Understanding. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019. pp. 5754-5764. Vancouver, BC, Canada (2019)</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>33. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., Hsieh, C.J.: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In: International Conference on Learning Representations (ICLR) (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>