Multilingual ICD-10 Code Assignment with Transformer Architectures using MIMIC-III Discharge Summaries

FHDO Biomedical Computer Science Group (BCSG)

Henning Schäfer1 [0000-0002-4123-0406] and Christoph M. Friedrich1,2 [0000-0001-7906-0038]

1 Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge Str. 42, 44227 Dortmund, Germany
hesch024@stud.fh-dortmund.de, christoph.friedrich@fh-dortmund.de
2 Institute for Medical Informatics, Biometry and Epidemiology, University Hospital Essen, Essen, Germany

Abstract. In this work, we present the participation of the FHDO Biomedical Computer Science Group (BCSG) in the CLEF eHealth 2020 challenge, Task 1, on automatic assignment of ICD-10 codes (CIE-10 in the Spanish translation) to clinical case studies. Training data has been augmented with documents from the Medical Information Mart for Intensive Care (MIMIC-III), a critical care database. ICD-10-CM General Equivalence Mappings (GEMs) were subsequently used to convert the codification from ICD-9 to ICD-10. Recent state-of-the-art Transformer-based models, such as BioBERT and ClinicalBERT, are compared to the Generalized Autoregressive Pretraining for Language Understanding (XLNet) model. Finally, the apriori algorithm has been applied to build association rules by finding frequent item sets. An ensemble of BioBERT and XLNet achieved a mean Average Precision (MAP) score of 0.259 (0.306 for the subset of codes only present in the training and validation sets).

Keywords: BioBERT · MIMIC-III · Apriori · XLNet · ICD-10 Code Conversion

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

This paper describes the participation of the FHDO Biomedical Computer Science Group (BCSG) in the Conference and Labs of the Evaluation Forum (CLEF) eHealth 2020 Task 1, Subtask 1, on Multilingual Information Extraction (IE), which focuses on International Statistical Classification of Diseases (ICD) coding for clinical textual data in Spanish [20,12]. Diagnostic codes serve as a billing mechanism in the Electronic Health Record (EHR) and can be used for automatic semantic indexing of clinical documents, but also to facilitate decision support systems that help clinical coders by suggesting a relevant subset of potential codes for selection. The problem can be described as a mapping from natural language free texts to medical concepts such that, given a new document, the system assigns multiple codes to it.

In terms of application in the biomedical field, Bidirectional Encoder Representations from Transformers (BERT) has only recently been used for ICD code assignment tasks, such as classifying German animal experiments in CLEF eHealth 2019 [3,27,25]. While it has proven to work well on assigning a smaller subset of ICD codes, it is uncertain how Transformer architecture models perform on arbitrarily long clinical text and on extreme multi-label classification problems with a high average number of assigned codes per document. CLEF eHealth tracks have featured the classification of multilingual clinical documents using ICD codes since 2016 [22,23,24,25]. This work enriches the training data with the Medical Information Mart for Intensive Care (MIMIC-III) database and compares BERT-based models with XLNet [32].
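To make this multi-label formulation concrete, the following minimal sketch (illustrative only, not part of the submitted system) turns per-document code lists into multi-hot training targets using scikit-learn; the toy documents and code lists are hypothetical:

# Illustrative sketch only: the ICD coding task framed as multi-label
# classification, where each document maps to a set of codes.
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy examples in the spirit of the CodiEsp annotations.
documents = [
    "Patient presents with fever and localized enlarged lymph nodes ...",
    "Serology against HBV and HCV was positive ...",
]
codes_per_document = [
    ["r50.9", "r59.0"],    # fever, localized enlarged lymph nodes
    ["b19.10", "b19.20"],  # hepatitis B and C
]

mlb = MultiLabelBinarizer()
targets = mlb.fit_transform(codes_per_document)  # multi-hot matrix, one column per code
print(mlb.classes_)  # ['b19.10' 'b19.20' 'r50.9' 'r59.0']
print(targets)       # [[0 0 1 1], [1 1 0 0]]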
2 Related Work

A hierarchy-based approach with Support Vector Machines (SVM) [8], using the 'is-a' relationship between ICD-9 codes to model label dependencies, was an early approach to ICD coding [26]. The hierarchy-based classifier surpassed the flat SVM, which did not consider code dependencies. Other approaches identified label density and label noise as useful features [29], while others empirically evaluated the simultaneous occurrence of labels [16]. ML-Net [10] followed the hierarchy-based approach and extended the coding of documents. Its deep neural network contains an additional network for estimating the number of labels: instead of separating relevant from irrelevant labels by a threshold value, a network for predicting the number of labels was built using the document vector as input.

Baumel et al. [4] evaluated four different models for ICD code assignment using data from the MIMIC-II and MIMIC-III data sets: a continuous bag-of-words (CBOW) model [19], a convolutional neural network, an SVM one-versus-all model, and a bidirectional gated recurrent unit model with hierarchical attention (HA-GRU). Another proposed model is a code-wise attention network [21], where attention mechanisms are used to extract n-grams from the text that are influential in predicting each code.

Unified Medical Language System (UMLS) [5] mappings and word embeddings have been shown to be effective for text classification in the biomedical domain and improved results in automatic ICD coding [28]. The embeddings were selected by sequentially mapping discharge summaries to UMLS biomedical concepts in order to enrich word representations and to eliminate variations caused by tense, abbreviations and/or spelling mistakes.

3 Dataset

For training data, two different sources were used: the official CodiEsp dataset (https://doi.org/10.5281/zenodo.3625746, last accessed 2020-07-17) with manually generated ICD-10 codifications, and the MIMIC-III database, which uses the older ICD-9 classification system and whose codes are mappable to discharge summaries [15]. After exploring additional resources, such as the abstracts collected from Lilacs and Ibecs (https://doi.org/10.5281/zenodo.3606625, last accessed 2020-07-17), the MIMIC-III database was selected as the main data source for augmentation because it appears to be the most similar to the CodiEsp corpus: its free-text narrative documents describe hospital courses and carry a high average number of manually assigned codes per document coming from real-world EHRs. With the decision to use the MIMIC-III dataset for augmentation, it was also decided to focus on the English translations of the CodiEsp documents. A key difference between the two data sources is that the codification for CodiEsp is a semantic mapping of concepts, where the assigned codes do not have to be based on medical outcome. For example, a negative serum test (as seen in Listing 1.1) still results in appropriately assigned ICD-10 codes in CodiEsp, whereas it would not appear in MIMIC-III.

Listing 1.1. Excerpt of CodiEsp document with id S0211-69952009000500014-1, showing results of a blood serum test and codification (assigned codes: r80.9, r20.2, b19.20, b19.10, r23.8, r60.0, r10.9, r19.7, m25.50, l98.9, b20).

[...] On physical examination: blood pressure 104/76 mmHg, BMI 27, minimal edema in lower limbs and papules in elbows and arms. Blood count and coagulation were normal, creatinine
0.9 mg/dl, total cholesterol 238 mg/dl, triglycerides 104 mg/dl, total protein 6.5 g/dl and albumin 3.6 g/dl. Anticardiolipin antibodies: Serology against HBV, HCV and HIV was negative. [...]

3.1 CodiEsp Corpus

The CodiEsp corpus consists of 1,000 clinical case studies manually selected by a practicing physician and a clinical documentarian [20]. The training and development dataset comprises 750 documents with an average of 11.09 codes assigned per document. The test set contains 250 documents and was provided together with an additional collection of more than 2,000 documents (background set) to prevent manual corrections. The CodiEsp training and development dataset contains 26,696 unique tokens, with an average of 301 tokens and 19 sentences per document. It contains 2,557 distinct codes in total, of which 363 unseen codes appear in the test set, as seen in Figure 1 (a). 68.24 % of the codes are explainable with the CodiEsp training and development dataset.

3.2 MIMIC-III Corpus

The MIMIC-III database comprises de-identified records from Beth Israel Deaconess Medical Center intensive care unit (ICU) stays, collected between 2001 and 2012. It contains 59,652 discharge summaries with an average of 11.48 codes assigned per document and 119,171 unique tokens, with 1,947 tokens and 112 sentences per document on average. The dataset is in principle very well suited but has some characteristics that need to be adapted. The coding system is ICD-9, which has to be converted to ICD-10 to match the CodiEsp codification. In addition, the dataset only contains summaries of intensive care unit stays, which on average exceed the maximum sequence length available for Transformer architectures. After conversion, the dataset contains 5,447 distinct codes, as seen in Figure 1 (b).

Segmentation. For BERT [9] models, the maximum length of a sequence after tokenizing is 512, resulting in an effective limit of 510 tokens for the input layer after subtracting the [CLS] and [SEP] tokens. Because MIMIC-III discharge summaries have an average length of 1,947 tokens (see Table 1), with only 11.67 % of all documents not exceeding 510 tokens, the data has to be truncated in order to fit into the Transformer model. A simple approach, as proposed by Sun et al. [30], would be to use only the first 510 tokens (head-only) or the last 510 tokens (tail-only) of a document, but neither seems appropriate for truncating clinical text without losing relevant information. When inspecting the summaries, even though they are free-text narratives, a fixed structure was identified in most of the documents: they usually start with a Chief Complaint followed by a historical background section, which may include History of Present Illness, Past Medical History, Social History and Family History. Within the Diagnostics and Pertinent Results part, the structure is no longer as consistent and different sections appear, which depend more on the individual case. From the middle towards the end of the documents there is a section called Brief Hospital Course, which summarizes the ICU stay, followed by discharge condition and/or follow-up instructions. In early experiments, the effect of using different segments was evaluated.
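As an illustration of these truncation variants, the following sketch (assuming the Hugging Face transformers tokenizer; not the original experiment code) shows head-only and tail-only truncation to 510 tokens:

# Sketch of head-only vs. tail-only truncation to 510 tokens (510 + [CLS] + [SEP] = 512).
# Assumes the Hugging Face transformers library; not the authors' original code.
from transformers import BertTokenizer

MAX_BODY_TOKENS = 510  # 512 minus the [CLS] and [SEP] special tokens

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def truncate(text: str, strategy: str = "tail") -> list:
    """Return at most 510 WordPiece tokens from the head or the tail of a document."""
    tokens = tokenizer.tokenize(text)
    if strategy == "head":
        kept = tokens[:MAX_BODY_TOKENS]
    else:  # "tail": keep the end of the document, e.g. the hospital-course narrative
        kept = tokens[-MAX_BODY_TOKENS:]
    return ["[CLS]"] + kept + ["[SEP]"]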
It was found that using the first 510 tokens (head-only) of discharge summaries decreased performance compared to using the last 510 tokens (tail-only). It can be assumed that this is because the background history, which comes at the top of the documents, is not as relevant to the clinical coding as the narrative of the actual hospital course. It was therefore decided to remove content up to the Brief Hospital Course section and sequentially use the remaining document up to whatever fits into 510 tokens. 7,822 documents in which this section was not present were omitted, resulting in a 13 % loss of data. Descriptive statistics of the segmented corpus can be seen in Table 1.

Table 1. MIMIC-III and CodiEsp training and development dataset descriptive statistics. (*) denotes the segmented corpus.

                                     MIMIC-III   MIMIC-III*   CodiEsp Train Dev
Number of records with ICD code        59,652       51,830                  750
Number of unique tokens             1,091,025      276,500               26,696
Number of bigrams                  10,609,279    2,846,377              114,846
Number of trigrams                 27,814,651    7,873,155              180,650
Avg. number of tokens / record          1,947          427                  301
Avg. number of sentences / record         112           39                   19
Avg. number of labels / record          11.48        11.45                11.09

ICD-9 Code Conversion with General Equivalence Mappings. ICD-9 codes of the MIMIC-III database have been converted to ICD-10 using the publicly available ICD-10-CM General Equivalence Mappings (GEMs) [6]. Turer et al. assessed the reliability of conversion between ICD-9 and ICD-10 and found that manual coding from the forward GEMs and backward GEMs was reproducible by 85.2 % and 90.4 %, respectively [31].

Data Selection. Because of the different data sources and MIMIC-III being limited to ICU cases, both datasets have been compared in terms of their distinct code subsets. As seen in Figure 1 (b), the MIMIC-III data contains 4,156 unique ICD-10 codes that are not present in the CodiEsp train, development, and test sets. These codes are less generic, apply to the ICU cases, and are not covered by the smaller CodiEsp corpus. To make the data augmentation more practical, only documents where 50 % or more of the assigned codes are present among the Top 100, Top 250 or Top 500 most frequent codes of the CodiEsp training and development set were used (the impact on training size can be seen in Table 3). In addition, only discharge summaries containing the Brief Hospital Course section were selected, using a regular expression match, resulting in 51,830 out of 59,652 available documents. The amount of available augmentation data increases when the Top frequent code threshold is raised, because the matching rule (a document must have 50 % or more of its codes in that set) becomes less strict, so more MIMIC-III documents end up in the training data. Increasing the augmentation data in that way increases recall but reduces precision (see Table 4). A good compromise was to create a model that predicts the Top 100 frequent codes in CodiEsp; a sketch of this selection rule is given below.

Fig. 1. Venn diagrams showing the distribution of the number of distinct ICD-10 codes for different datasets and subsets: (a) CodiEsp Train Dev and Test Set; (b) MIMIC-III and CodiEsp Train Dev and Test Set; (c) MIMIC-III with 50 % in Top 100 CodiEsp and Test Set; (d) MIMIC-III with 50 % in Top 100 CodiEsp and Train Dev and Test Set.
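The selection rule described above can be sketched as follows; the function and variable names are illustrative, not taken from the original pipeline, and it assumes the MIMIC-III notes and their GEM-converted ICD-10 codes are already loaded:

import re
from collections import Counter

# Illustrative selection of MIMIC-III documents for augmentation; names are hypothetical.
# codiesp_code_lists: list of code lists from the CodiEsp train/dev set.
# mimic_docs: list of (text, icd10_codes) pairs after GEM conversion.
def select_augmentation_docs(mimic_docs, codiesp_code_lists, top_n=100, min_overlap=0.5):
    # Top-N most frequent codes in the CodiEsp training and development data.
    code_counts = Counter(code for codes in codiesp_code_lists for code in codes)
    top_codes = {code for code, _ in code_counts.most_common(top_n)}

    selected = []
    for text, codes in mimic_docs:
        # Keep only discharge summaries that contain a Brief Hospital Course section.
        if not re.search(r"brief hospital course", text, flags=re.IGNORECASE):
            continue
        # Keep documents where at least 50 % of the assigned codes are in the Top-N set.
        if codes and sum(c in top_codes for c in codes) / len(codes) >= min_overlap:
            selected.append((text, codes))
    return selected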
4 Methods

4.1 Transformer Architecture and BERT

BERT and the Transformer architecture [9] have proven to be extremely effective in many downstream natural language processing (NLP) tasks. While BERT works well on assigning a smaller subset of ICD codes [3,27], it is uncertain how such models perform on clinical texts of arbitrary length and on extreme multi-label classification problems with a high average number of assigned codes per document.

Though the MIMIC-III augmentation does not fit into the token limitation without clipping documents, the Transformer architecture offers useful properties for the classification of clinical text. The WordPiece tokenizer allows words outside the vocabulary to be represented by word pieces instead of simply being mapped to an unknown token, which is why it was selected for the first tests. This feature is particularly useful for discharge summaries, as spelling mistakes and non-standard abbreviations are common. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) [18] and ClinicalBERT [2] have the same architecture but are pre-trained on large-scale biomedical corpora. BioBERT has been pre-trained on PubMed abstracts (https://pubmed.ncbi.nlm.nih.gov/, last accessed 2020-07-17) and PMC (https://www.ncbi.nlm.nih.gov/pmc/, last accessed 2020-07-17) full-text articles. Bio ClinicalBERT (https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT, last accessed 2020-07-17) is an extended model that was additionally pre-trained on all notes from MIMIC-III (880M words). The Bio ClinicalBERT model was selected because of this larger pre-training.

4.2 XLNet

The recently proposed Generalized Autoregressive Pretraining for Language Understanding (XLNet) model [32] is an autoregressive language model (LM). Although BERT and XLNet have many similarities, there are some differences that need to be explained. Here, autoregressive means that XLNet makes use of Transformer-XL [14] to capture information from previous sequences in order to process the current sequence, achieving the regressive effect at the sequence level. XLNet uses relative position encoding and a permutation LM, factorizing the output over all possible permutations. The permutation effect is limited to words that are "attended" to. This is done by changing the attention mask prior to the attention softmax while keeping track of the positional information in a sequence. For example, during pre-training, to predict a token t, the attention mask is set to its minimum value for tokens that appear after t (positions i > t); only the tokens before and including t in the current factorization order are used to compute the attention. The advantage is that the tokens that come before t change with each permutation, while their positions within the sequence are kept constant, allowing XLNet to capture bidirectional context.

XLNet implements multi-head attention, which differs slightly from the one in BERT, where it generates a query Q, a key K, and a value V projection of each word in the input sentence. For each query Q, the multi-head attention layer uses K to compute an attention score for each value vector V and then sums the value vectors into a single representation using the attention weights [7]. For XLNet, linear layers are used to map the input to the multi-head attention layer directly. This maps the input into smaller spaces whose dimensions add up to the original dimension, as known from BERT. This allows each word to attend more to other words and not only to itself, which results in a final richer representation of each word.
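To make the query/key/value description concrete, the following simplified single-head scaled dot-product attention sketch in PyTorch shows how attention scores weight the value vectors; it is a textbook-style illustration, not the exact BERT or XLNet implementation:

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head attention sketch: Q, K, V have shape (seq_len, d_k).

    Each query is scored against all keys; the softmax weights are then used to
    sum the value vectors into one representation per position. Masked positions
    (as used, e.g., by XLNet's permutation masks) receive a large negative score
    so that their softmax weight is effectively zero.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                   # (seq_len, d_k)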
4.3 Preprocessing

Because the discharge summaries are de-identified free-text narratives, additional pre-processing steps were taken to convert them into a sequence of sentences, removing all numbers and name placeholders. Leading and trailing spaces, quotation marks and semicolons were also removed. For the CodiEsp corpus, no pre-processing was applied.

4.4 Training the Models

The experiments were done with the PyTorch-transformers implementations of BERT and XLNet (https://github.com/huggingface/transformers, last accessed 2020-07-17). The overall end-to-end training process can be seen in Figure 2. The models were fine-tuned on all layers without freezing. As proposed by the original papers [9,32], Adam [17] was used in early experiments as the optimizer, but it was then replaced by the Layerwise Adaptive Large Batch (LAMB) optimizer [33] because it resulted in a slightly reduced training time. The hyperparameters were selected and optimized based on the development set performance. A learning rate of 7e-4 or 6e-4 resulted in the best scores, though the Transformer models react very sensitively to the learning rate, as different settings often led to poor results. Different warmup schedules were tried but had no impact on the results. Of the two BERT versions, cased and uncased, the uncased version worked slightly better overall, although the difference is very small. For XLNet, only a cased version is available. The base version of XLNet was preferred over the large version due to computational expense. The training batch size was 8 for XLNet and 16 for the BERT models. To produce the ranking of the codes, binary cross-entropy with logits was used to obtain a confidence value for each ICD-10 code during inference. The codes were then ordered by confidence and cut off at a threshold of t = 0.4. The prediction pipeline of the BERT model, including the association rules, is shown in Figure 3.

Fig. 2. Workflow of the end-to-end training process: unannotated pre-training on large corpora and annotated fine-tuning with combined resources (CodiEsp and MIMIC-III). The weights from the pre-training phase were transferred.

Fig. 3. Inference for BERT models with apriori association rules. The text input is classified with a sigmoid output layer and binary cross-entropy loss to obtain probabilities for all available ICD codes. The confidence must be at least t = 0.4 (threshold) for a code to count as a positive prediction.

4.5 Apriori Association Rules

The apriori algorithm [1] has been used to find frequent itemsets in a list of transactions, but has recently also been used to find association rules and label co-occurrences in clinical text, such as in autopsy reports [11]. Association rules can be obtained with the support and confidence parameters, where the support of a set of items is the probability that this set of items occurs in a transaction. Confidence refers to the likelihood that an item B is also purchased when item A is purchased; it can be calculated by dividing the number of transactions in which A and B are bought together by the total number of transactions in which A is bought. To identify and explore co-occurrences, a low minimum support value (0.02) was used on the CodiEsp train and development set. The resulting apriori association rules, as seen in Figure 6, were plotted with the arulesViz [13] R package. The graph shows 59 rules. One example of a relation is hepatitis B and C, as shown by the rule that connects b19.10 and b19.20. When exploring the data, it was found that this rule stems from serology tests, which often include results for different viruses, such as hepatitis B and C. An example can be seen in Listing 1.1; a small sketch of how support and confidence can be computed for such rules follows below.
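As a small, self-contained illustration of these support and confidence definitions (the actual rules were mined with the R packages mentioned above), support and confidence for a candidate rule A -> B can be computed directly from the per-document code lists:

# Minimal illustration of support and confidence for a rule A -> B over the
# per-document code lists ("transactions"); the actual rules were mined in R.
def support(code_lists, itemset):
    return sum(itemset.issubset(codes) for codes in code_lists) / len(code_lists)

def confidence(code_lists, antecedent, consequent):
    return support(code_lists, antecedent | consequent) / support(code_lists, antecedent)

# Hypothetical toy transactions: hepatitis B (b19.10) and C (b19.20) often co-occur
# because serology panels report results for both viruses together.
transactions = [{"b19.10", "b19.20", "r50.9"}, {"b19.10", "b19.20"}, {"r50.9", "r52"}, {"i10"}]
print(support(transactions, {"b19.10", "b19.20"}))       # 0.5
print(confidence(transactions, {"b19.10"}, {"b19.20"}))  # 1.0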
Another confident rule is that localized enlarged lymph nodes (r59.0 and r59.9) link to unspecified fever (r50.9), which in turn links to unspecified pain (r52). As such rules should already be covered by the trained model, not many different rule sets were tested and added during inference. However, the 11-rule set shown in Figure 5 improved the mean Average Precision (MAP) results on the development set by 0 % to 1.2 %, depending on the model, and its codes were therefore added to the final submission when the model missed them. The submission guideline requires that the prediction is ordered by confidence. Because the predicted confidence cannot be compared with apriori support or confidence values, and because the confidence of the primary model was not high enough, the association rule codes were added at the end of the ranking, ordered by highest support.
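The following sketch illustrates how such a ranked prediction could be assembled from the model outputs and the rule codes; it is an illustrative reconstruction under the assumptions described above (threshold t = 0.4, ranking by confidence, rule codes appended and ordered by support), not the exact submission code:

import torch

def rank_codes(logits, code_names, rule_codes_by_support, threshold=0.4):
    """Build a ranked code list: sigmoid confidences above the threshold first
    (highest confidence first), then association-rule codes ordered by support.

    logits: 1-D tensor of raw model outputs, one per ICD-10 code.
    rule_codes_by_support: codes triggered by apriori rules, already sorted by support.
    """
    probs = torch.sigmoid(logits)
    ranked = sorted(
        ((p.item(), code) for p, code in zip(probs, code_names) if p >= threshold),
        reverse=True,
    )
    prediction = [code for _, code in ranked]
    # Rule-based codes cannot be compared with model confidences, so they go last.
    prediction += [c for c in rule_codes_by_support if c not in prediction]
    return prediction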
5 Results and Discussion

Figure 4 shows experimental runs on the development set for the tested models with different pre-trained embeddings and different frequent Top code subsets. This results in differently enriched training data and also in a different number of labels a model is able to predict. A comparison of how many documents end up in the training data can be seen in Table 3. The final best results on the development set for each model can be seen in Table 2. While the F1-score is superior for models that can only predict the Top 50 frequent codes, the MAP score penalises this behaviour on the full set, because not only the classification but also the positional ranking is taken into consideration. When matching the Top 50 most frequent codes with MIMIC-III, there is not enough data available for augmentation (363 additional documents). Starting with the Top 100 most frequent codes, improvements from the additional data can be seen: the augmentation improves the reported MAP score by 0.097 (0.128 F1) for the XLNet model. Increasing the training data further increases recall but decreases precision. The final test set results were reported by the task organisers and can be seen in Table 4.

On the test set, the ensemble achieved the overall best performance with a MAP score of 0.259; among single models, the Bio ClinicalBERT model performed best. XLNet on the Top 100 frequent codes achieved the best precision. When the gold standard for the test set was released, it was evaluated how many of the unseen codes would have been explainable by keeping the remaining annotated codes of each MIMIC-III document within the training data (knowledge discovery). Figure 1 (d) shows that for the Top 100 most frequent codes training set, 56 distinct unseen codes would have been explainable. Here, a small performance improvement could be expected, but it is noteworthy that only a few of these codes were seen more than once in the test data (76 appearances in total). Because they were unseen before, it can be assumed that these are codes with rare appearances. It can be concluded that more resources are needed to be able to explain the full code set.

Fig. 4. Experimental runs on the development set for the tested models with different pre-trained embeddings and different frequent Top code subsets. The F1-score is reported at intervals of 500 steps during training.

Table 2. Results of the evaluation performed on the development set. MAP and F1-score are reported; bold indicates the best result for a category.

Model                                      MAP    F1
XLNet base cased + MIMIC-III - Top 50      0.232  0.608
XLNet base cased - Top 50                  0.216  0.602
BERT base uncased + MIMIC-III - Top 50     0.143  0.47
BERT base uncased - Top 50                 0.165  0.372
XLNet base cased + MIMIC-III - Top 100     0.247  0.432
XLNet base - Top 100                       0.15   0.304
Bio Clinical BERT + MIMIC-III - Top 100    0.244  0.361

Table 3. Model size comparison for the final submission.

Model                    Training Data Size   Model Size
XLNet mimic 500          19,484 documents     459.78M
XLNet mimic 250          10,754 documents     459.03M
XLNet mimic 100          3,286 documents      458.58M
Bio Clinical BERT 100    3,286 documents      423.43M

Table 4. Results of the final evaluation performed by the task organisers, reporting MAP, Precision (P), Recall (R) and F1 scores. (* Cat) denotes that the score has been computed for categories determined as the first three digits of a code. (* Codes) denotes that the score has been computed for the subset of codes only present in the train and validation sets. (*) denotes the ensemble of Bio ClinicalBERT mimic 100 and XLNet mimic 100. Bold indicates the best result for a category. (BERT†) denotes that the Bio ClinicalBERT version was used.

Model                      MAP    MAP Codes  P      R      F1     F1 Codes  F1 Cat
BERT† mimic 100 apriori    0.242  0.288      0.375  0.285  0.324  0.352     0.373
XLNet BERT† ensemble*      0.259  0.306      0.407  0.287  0.337  0.367     0.387
XLNet mimic 100 apriori    0.231  0.275      0.457  0.244  0.318  0.351     0.366
XLNet mimic 250 apriori    0.21   0.244      0.342  0.28   0.308  0.334     0.366
XLNet mimic 500 apriori    0.128  0.149      0.235  0.215  0.225  0.243     0.276

Fig. 5. Graph of 11 ICD-10 apriori association rules (nodes: b19.10, b19.20, i10, r59.0, b99.9, r59.9, e11.9, r11.10, r50.9, d72.829, r52, r10.9). Node size: support (0.031-0.069); colour: lift (1.393-20.294); minimum support 0.03, minimum confidence 0.3.
Fig. 6. Graph of 59 ICD-10 apriori association rules (nodes: i51.9, i25.9, r11.0, r60.0, r60.9, r11.10, r19.7, l53.9, r06.00, b99.9, b19.20, k75.9, r10.9, r63.0, d72.829, b20, b19.10, r50.9, r52, r53.1, r59.9, e11.9, r59.0, i96, i10, r69, n18.9, r63.4, i82.90, n28.9, c64.9, n28.89, l98.9, d64.9). Node size: support (0.02-0.069); colour: lift (1.393-33.088); minimum support 0.02, minimum confidence 0.3.

6 Conclusions

This work compared BERT-based models with XLNet and evaluated the effect of enriching the training data with documents from MIMIC-III. The MIMIC-III augmentation with code conversion was able to improve the results compared to using only the stock dataset. The apriori algorithm was applied to build and explore association rules by finding frequent item sets; the 11-rule set improved the mean Average Precision (MAP) results on the development set by between 0 % and 1.2 %. Among the submitted models, the ensemble of BioBERT and XLNet achieved the highest MAP score of 0.259 (0.306 for the subset of codes only present in the train and validation sets). In terms of single-model performance, the Bio ClinicalBERT model achieved the best results, while XLNet, even though pre-trained on generic text, has the highest precision on the test set and the overall best performance on the development set.

Though the models are still far from achieving good results on the full label set, the task has been very challenging, with many possible labels and only a relatively small dataset. The large MIMIC-III database was not able to cover all unseen codes, so it can be concluded that more resources are needed to explain the full code set. In future work, XLNet's attention should be further evaluated, because the dependency on the hidden states of previous sequences can be adjusted by a memory length hyperparameter. It will be interesting to tune this parameter and to see its impact, but also to test how a domain-specific XLNet model performs when pre-trained on large biomedical data.

References

1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). vol. 1215, pp. 487–499 (1994)
2. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 72–78. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/W19-1909
3. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum (2019)
4. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: case study on ICD code assignment. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence (2018)
5. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267–D270 (2004)
6. Butler, R.R.: ICD-10 general equivalence mappings: Bridging the translation gap from ICD-9. Journal of AHIMA 78(9), 84–86 (2007)
7. Clark, K., Khandelwal, U., Levy, O., Manning, C.: What Does BERT Look at? An Analysis of BERT's Attention.
In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 276–286 (2019). https://doi.org/10.18653/v1/W19-4828
8. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)
10. Du, J., Chen, Q., Peng, Y., Xiang, Y., Tao, C., Lu, Z.: ML-Net: multi-label classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association 26(11), 1279–1285 (2019). https://doi.org/10.1093/jamia/ocz085
11. Duarte, F., Martins, B., Pinto, C.S., Silva, M.J.: Deep neural models for ICD-10 coding of death certificates and autopsy reports in free-text. Journal of Biomedical Informatics 80, 64–77 (2018). https://doi.org/10.1016/j.jbi.2018.02.011
12. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth Evaluation Lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS Volume 12260 (2020)
13. Hahsler, M., Chelluboina, S.: arulesViz: Visualizing association rules and frequent itemsets. R package version 0.1-5 (2012)
14. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 328–339 (2018). https://doi.org/10.18653/v1/P18-1031
15. Johnson, A.E., Pollard, T.J., Shen, L., Li-Wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), 1–9 (2016)
16. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine 65 (2015). https://doi.org/10.1016/j.artmed.2015.04.007
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
18. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (2019). https://doi.org/10.1093/bioinformatics/btz682
19. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient Estimation of Word Representations in Vector Space. In: 1st International Conference on Learning Representations (ICLR). vol. abs/1301.3781. Scottsdale, Arizona, USA (2013)
20. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at the CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
21. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable Prediction of Medical Codes from Clinical Text.
In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1101–1111 (2018). https://doi.org/10.18653/v1/N18-1100
22. Névéol, A., Cohen, K.B., Grouin, C., Hamon, T., Lavergne, T., Kelly, L., Goeuriot, L., Rey, G., Robert, A., Tannier, X., Zweigenbaum, P.: Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR Workshop Proceedings 1609, 28–42 (2016)
23. Névéol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in English and French. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2017)
24. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2018)
25. Neves, M.L., Butzke, D., Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 multilingual information extraction. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2019)
26. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 21(2), 231–237 (2014)
27. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum (2019)
28. Schäfer, H., Friedrich, C.M.: UMLS mapping and Word embeddings for ICD code assignment using the MIMIC-III intensive care database. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp. 6089–6092. IEEE (2019)
29. Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science 292, 135–151 (2013)
30. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? In: China National Conference on Chinese Computational Linguistics. pp. 194–206. Springer (2019)
31. Turer, R.W., Zuckowsky, T.D., Causey, H.J., Rosenbloom, S.T.: ICD-10-CM Crosswalks in the primary care setting: assessing reliability of the GEMs and reimbursement mappings. Journal of the American Medical Informatics Association 22(2), 417–425 (2015)
32. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized Autoregressive Pretraining for Language Understanding. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019. pp. 5754–5764. Vancouver, BC, Canada (2019)
33. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., Hsieh, C.J.: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In: International Conference on Learning Representations (ICLR) (2020)