Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1

Mario Sänger*, Leon Weber*, Madeleine Kittner*, and Ulf Leser

Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, Germany
{saengema,weberple,kittner,leser}@informatik.hu-berlin.de

* These authors contributed equally.

Abstract. In this paper we present our contribution to the CLEF eHealth 2019 challenge, Task 1. The task involves the automatic annotation of German non-technical summaries of animal experiments with ICD-10 codes. We approach the task as a multi-label classification problem and leverage the multi-lingual version of the BERT text encoding model [6] to represent the summaries. The model is extended by a single output layer that produces probabilities for the individual ICD-10 codes. In addition, we make use of extra training data from the German Clinical Trials Register and ensemble several model instances to improve the overall performance of our approach. We compare our model with five baseline systems, including a dictionary matching approach and single-label SVM and BERT classification models. Experiments on the development set highlight the advantage of our approach over the baselines, with an improvement of 3.6%. Our model achieves the overall best performance in the challenge, reaching an F1 score of 0.80 in the final evaluation.

Keywords: ICD-10 Classification · German Animal Experiments · Multi-label Classification · Multi-lingual BERT Encodings

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Biomedical natural language processing (NLP) aims to support biomedical researchers, health professionals in their daily clinical routine, as well as patients and the public searching for disease-related information. A large part of biomedical NLP focuses on the extraction of biomedical concepts from scientific publications or the classification of such documents with respect to biomedical concepts. In the past, biomedical NLP has advanced strongly for biomedical and clinical documents in English [7]. Non-English biomedical NLP lags behind, since the availability of annotated corpora and other resources (e.g. dictionaries and ontologies for biomedical concepts) in non-English languages is limited.

Since 2015, the CLEF eHealth community has addressed this issue by organising shared tasks on non-English or multilingual information extraction. Since 2016, the CLEF eHealth shared tasks [13-15] have included the classification of clinical documents according to the International Classification of Diseases and Related Health Problems (ICD-10) [17]. More precisely, the task has been the assignment of ICD-10 codes to death certificates in French, English, Hungarian and Italian. Among the best performing teams in 2018, the task has been treated as a multi-label classification problem or as sequence-to-sequence prediction leveraging neural networks [2]. Other well-performing systems were based on a supervised learning system using multi-layer perceptrons and a One-vs-Rest (OVR) strategy supplemented with IR methods [1], or an ensemble model for ICD-10 code prediction utilising word embeddings created on the training data as well as on language-specific Wikipedia articles [9].
In 2019, the CLEF eHealth Evaluation Task 1 focuses on the assignment of ICD-10 codes to health-related, non-technical summaries of animal experiments in German [10, 16]. According to the laws of the European Union, each member state has to publish a comprehensible, non-technical summary (NTS) of each authorised research project involving laboratory animals to provide greater transparency and increase the protection of animal welfare. In Germany, the web-based database AnimalTestInfo (https://www.animaltestinfo.de/) houses and publishes planned animal studies to inform researchers and the public. To improve analysis of the database, the summaries submitted in 2014 and 2015 (roughly 5,300) were labelled by human experts according to the German version of the ICD-10 classification system (https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2016/) in [4]. Based on this pilot study, further documents added to the database have been labelled and used to conduct this year's CLEF eHealth challenge. The task is to explore the automatic assignment of ICD-10 codes to the animal experiments, i.e. given a non-technical summary, to predict the ICD-10 codes of the diseases investigated in the study.

We treat the task as a multi-label classification problem and apply the multi-lingual BERT model [6], which recently achieved state-of-the-art results in eleven different NLP tasks [12]. The model is extended by a single output layer to produce probabilities for individual ICD-10 codes. Since training data for this task is sparse, we also use summaries of clinical trials conducted in Germany published by the German Clinical Trials Register (GCTR). We compare our model with five baseline systems including a dictionary matching approach and single-label SVM and BERT classification models. The implementation of our models is available as open source software at https://github.com/mariosaenger/wbi-clef19x.

2 Method

Here we describe the corpora, the used terminologies and the classification models we apply to the task.

2.1 Corpora and Terminologies

The lab organisers provided a corpus of 8,385 German non-technical summaries of animal experiments (NTS) originating from the AnimalTestInfo database. For each experiment, a short title is given, followed by a description of the expected benefits as well as the pressures on and damages to the animals. Furthermore, strategies to prevent unnecessary harm to the animals and to improve animal welfare are described. Each summary was labelled by experts using the German version of the ICD-10 classification system. Depending on the level of detail of the summary, different levels (e.g. chapter, group) of the ICD-10 ontology are used to annotate the experiment. About two-thirds of the experiments are labelled with exactly one disease and 10% with multiple diseases; the remainder have no annotated disease. For each disease, the complete path in the ICD-10 ontology, i.e. up to two parent groups and the chapter of the annotated disease, is given. About two-thirds of the summaries are annotated with 2-level paths (e.g. I | B50-B64), 20% with 3- or 4-level paths (e.g. IV | E70-E90 | E10-E14 or II | C00-C97 | C00-C75 | C15-C26) and less than 1% of the summaries are annotated with chapters only (e.g. VI). The data set is divided into a stratified train and development split (7,543 / 842) at document level. For the final evaluation, a hold-out set of 407 experiments is used by the organisers.
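To make the annotation format concrete, the following is a minimal sketch of how such path annotations could be parsed into a flat label set. The pipe-separated notation follows the examples above; the semicolon as separator between multiple annotated diseases and the function name are our own assumptions, not part of the official corpus format.

```python
# Minimal sketch: turn ICD-10 path annotations such as
# "II | C00-C97 | C00-C75 | C15-C26" into a flat set of labels.
# The ";" separator for multiple annotated diseases is an assumption.

def parse_icd10_paths(annotation: str) -> set:
    labels = set()
    for path in annotation.split(";"):
        path = path.strip()
        if not path:  # summaries without any annotated disease
            continue
        labels.update(node.strip() for node in path.split("|"))
    return labels

print(parse_icd10_paths("II | C00-C97 | C00-C75 | C15-C26"))
# {'II', 'C00-C97', 'C00-C75', 'C15-C26'}
```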
In addition to the provided data set, we use information from the German Clinical Trials Register (GCTR, https://www.drks.de/drks_web/setLocale_EN.do). The GCTR provides access to basic information (e.g. trial title, short description, studied health condition, inclusion and exclusion criteria) of clinical trials conducted in Germany and is also annotated with ICD-10 codes. We downloaded all trials available through the GCTR website. For each trial we make use of the title as well as the scientific and lay language summary. We use the chapter and all (sub-)groups up to the third level of the ontology of the given ICD-10 codes describing the studied health condition as labels for the trial, analogous to the ICD-10 coding in the NTS data set. In this way we are able to extend the training set by 7,615 documents carrying 18,263 ICD-10 codes. The ICD-10 codes of each study in the GCTR data set relate to the ICD-10 version valid at the time of publication of the study. We did not adjust for any differences (e.g. potentially missing ICD-10 codes) with respect to the 2016 version used for the NTS corpus. The two data sets almost fully overlap with regard to the considered health problems: of the 233 distinct ICD-10 codes occurring in the complete NTS corpus, 226 (97%) are mentioned in GCTR too. Moreover, 27 further ICD-10 codes are introduced through the additional data set. Table 1 summarises the used corpora.

Table 1. Overview of the used data sets. The non-technical animal experiment summaries (NTS) are provided by the task organisers. Furthermore, we build a second data set based on the German Clinical Trials Register (GCTR).

                #Documents   #ICD-10 codes   #ICD-10 codes (distinct)
NTS Train          7,543         15,251              230
NTS Dev              842          1,682              156
GCTR               7,615         18,263              253

2.2 BERT for multi-label classification

Our approach for the task is based on the BERT language model [6]. BERT is a text encoding model that recently achieved state-of-the-art results in many different NLP tasks [12]. It is a neural network based on the transformer architecture of [19], which was pre-trained using two different language modelling tasks: masked language modelling and next sentence prediction. Specifically, we use the multilingual version of BERT-Base (https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) that has been pre-trained on Wikipedia dumps of 104 different languages including German. Given a sequence of tokens t_1, ..., t_L, BERT first subdivides the tokens into subword tokens using WordPiece [21], yielding a new (usually longer) sequence s_1, ..., s_N. It then produces vector representations e_1, ..., e_N ∈ R^768 for each subword token and one vector c ∈ R^768 which is not tied to a specific token. BERT supports sequence lengths of up to 512 subword tokens. We represent each animal experiment by taking as many subword tokens as possible from the title and the description of expected benefits and pressures of the summary text as model input. Following [6], we employ c as a representation of the whole token sequence.

We treat the assignment of ICD-10 codes as a one-versus-rest multi-label classification problem [5], i.e. as |Y| independent binary classification tasks, where Y is the set of all ICD-10 codes occurring in the training set. For each task, an example is used as a positive example if it carries the respective label, while all other examples are used as negative examples. The only connection between the individual classification tasks is the BERT encoder, which is shared between all tasks and receives parameter updates from all of them.
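Concretely, this one-versus-rest formulation amounts to one multi-hot target vector per summary over all codes seen in training. A minimal sketch with scikit-learn; the label sets shown are illustrative, not taken from the corpus:

```python
# One-vs-rest target encoding: one binary task per ICD-10 code in the
# training set, i.e. a multi-hot vector per document.
from sklearn.preprocessing import MultiLabelBinarizer

train_label_sets = [
    {"I", "B50-B64"},                  # 2-level path
    {"IV", "E70-E90", "E10-E14"},      # 3-level path
    set(),                             # summary without any annotated disease
]

mlb = MultiLabelBinarizer()
targets = mlb.fit_transform(train_label_sets)   # shape: (n_documents, |Y|)
print(list(mlb.classes_))   # the |Y| codes, i.e. the individual binary tasks
print(targets)
```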
We use a single output layer W ∈ R^{768×|Y|} to compute the output probabilities per class as σ(c · W), where σ is the element-wise sigmoid function, and use binary cross-entropy as loss. We implement our model in PyTorch [18] using the pytorch-pretrained-BERT implementation of BERT (https://github.com/huggingface/pytorch-pretrained-BERT) and use the included modified version of Adam [11] for optimization. We train our model for 60 epochs on a single Nvidia V100 GPU, which takes about nine hours. In principle, it would also be possible to train and evaluate the model using only CPUs, but that would take considerably more time.

We train multiple model instances using different random seeds and ensemble their predictions. Ensembling of multiple neural network models has been shown to be beneficial in several NLP tasks [6]. We ensemble the models in two ways: (1) by averaging the predictions of the different model instances and (2) by learning a logistic regression classifier based on the model outputs on the development set. We denote the two ensembling variants as BERT multi-label Avg and BERT multi-label LogReg. Note that, because BERT multi-label LogReg is trained on the development set, the resulting scores on this data are no longer a reliable estimate of out-of-sample performance and can only be fairly compared to the other approaches on the development set.
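As a concrete illustration of the model head, the following is a minimal PyTorch sketch of the output layer and loss described above. The BERT encoder is abstracted away (any module producing the 768-dimensional vector c would do), the class and variable names are ours rather than the released code, and BCEWithLogitsLoss is used as the numerically stable equivalent of an explicit sigmoid followed by binary cross-entropy. The default of 233 labels mirrors the number of distinct codes in the NTS corpus; the actual |Y| depends on the training configuration.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Single output layer W over the pooled BERT vector c (Section 2.2)."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 233):
        super().__init__()
        self.output = nn.Linear(hidden_size, num_labels)  # W in R^{768 x |Y|}

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        return self.output(c)  # logits; the sigmoid is folded into the loss below

head = MultiLabelHead()
loss_fn = nn.BCEWithLogitsLoss()  # element-wise sigmoid + binary cross-entropy

c = torch.randn(4, 768)        # pooled BERT representations of a batch of summaries
targets = torch.zeros(4, 233)  # multi-hot ICD-10 target vectors
targets[0, [3, 17]] = 1.0      # e.g. first summary annotated with two codes

loss = loss_fn(head(c), targets)
loss.backward()
```

Thresholding the resulting sigmoid probabilities (e.g. at 0.5) would then yield the final set of predicted codes; the exact decision rule is not spelled out in the paper, so that value should be read as an assumption.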
2.3 Baselines

To gain better insights into the performance level of our approach, we compare it with five different baseline methods. First, we implement a dictionary matching approach. For this we take the concept descriptions of all codes listed in the ICD-10 ontology as well as all given synonyms, and search for occurrences of these terms in the title and goals (lines 1 and 2) of an animal trial summary. Dictionary matching is performed by indexing all ICD-10 concepts using Apache Solr 7.5.0 (https://lucene.apache.org/solr/) and applying exact and fuzzy matching. Each ICD-10 concept is linked to its related path up to the chapter level, which is used for annotation. All concepts matched by the dictionary are reported as results. We do not perform any further post-processing such as sorting out overlapping ICD-10 paths.

For the other baselines we transform the task into (1) a group-level or (2) a sub-group-level classification problem, i.e. we use the label on the second level of the ICD-10 hierarchy (e.g. for II | C00-C97 | C00-C75 we use C00-C97) or the deepest label (e.g. for II | C00-C97 | C00-C75 we use C00-C75), respectively, as gold standard for a given trial summary. In both cases, for instances with multiple codes originating from different branches of the ICD-10 ontology we use the first label as gold standard. Moreover, we add a special no-class label to support documents without any annotated ICD-10 code. We investigate two different classification methods for these tasks, Support Vector Machines (SVM) and the BERT sequence classification model [6]. For the former, we build TF-IDF vectors as input representation for the trial summaries. For the latter, the model architecture is equivalent to our multi-label model, except that the final linear layer computes a softmax over the classes of the classification task and hence a (single-class) cross-entropy loss is applied for training. For both classification baselines, we augment the predictions of the models according to the ICD-10 hierarchy, e.g. if a group-level model predicts C00-C97 we automatically add the parent chapter (in this case II) to the prediction.

3 Results & Discussion

3.1 Experimental setup

We use the training split of the provided corpus as well as the documents from the GCTR data set to train our multi-label model as well as all baseline models. For the BERT multi-label and the SVM classification models we perform hyperparameter optimisation and select the best model of each approach based on the development set performance. For the SVM models, we follow [8] and test {2^-5, 2^-3, ..., 2^15} as values for the C parameter. The best scores are reached with C = 2 and C = 0.5 for group-level and sub-group-level classification, respectively. For our BERT multi-label approach we only tune the learning rate. We evaluate the values {5e-5, 4e-5, 3e-5, 2e-5, 1e-5} and find that 4e-5 achieves the highest scores. We omit hyperparameter tuning for the BERT classification baselines due to time constraints and therefore use the default parameter settings of the model, i.e. a learning rate of 5e-5.

As described in Section 2.2, we train eight model instances of our approach using different random seeds and ensemble them. The two ensemble variants are built (a) by averaging the predictions of the two best model instances and (b) by learning a logistic regression classifier based on the output of the three models with the highest scores. The latter is trained on the output of the individual model instances on the development set. We opted for these settings based on preliminary experiments on the training and development set. To gain insights into the effectiveness of the additional data, we evaluate each model (except for the ensemble models) in two data configuration settings: with and without the additional texts from the GCTR data set (see Section 2.1). We use the provided evaluation script and report precision, recall and F1 scores as evaluation metrics.
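The following is a rough sketch of the two ensembling variants, (a) probability averaging and (b) logistic regression stacking on the development-set outputs. The array shapes, the 0.5 decision threshold and the fallback for codes that never vary on the development set are our assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_average(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """probs: (n_models, n_documents, n_labels) sigmoid outputs -> binary predictions."""
    return (probs.mean(axis=0) >= threshold).astype(int)

def ensemble_logreg(dev_probs: np.ndarray, dev_targets: np.ndarray,
                    test_probs: np.ndarray) -> np.ndarray:
    """Stack one logistic regression per ICD-10 code on the models' dev-set outputs."""
    n_models, _, n_labels = dev_probs.shape
    predictions = np.zeros(test_probs.shape[1:], dtype=int)
    for j in range(n_labels):
        y = dev_targets[:, j]
        if y.min() == y.max():  # code never (or always) present on dev: fall back to averaging
            predictions[:, j] = ensemble_average(test_probs[:, :, [j]])[:, 0]
            continue
        clf = LogisticRegression().fit(dev_probs[:, :, j].T, y)   # (n_documents, n_models)
        predictions[:, j] = clf.predict(test_probs[:, :, j].T)
    return predictions
```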
3.2 Development results

Table 2 shows the results of all evaluated models on the provided development set, both with and without the additional data from GCTR as training data. Using only the provided training data, the best single model performance is reached by the BERT sub-group baseline model with an F1 score of 0.778. Almost the same performance is reached by our BERT multi-label approach (0.776). However, the latter offers a clearly better performance if the provided training set is extended by the GCTR samples (0.810 vs. 0.782). This represents an improvement of 0.028 (+3.6%) in terms of F1. For both baseline classification methods the sub-group models outperform the group-based variants, as is to be expected. For the SVM, the performance increases from 0.655 to 0.717 (+9.5%) when using sub-group labels instead of group labels. For the BERT model the performance increases by 10.4%, from 0.705 to 0.778. Interestingly, the BERT group-level model performs nearly on par with the sub-group-level SVM model. This is especially noteworthy as we do not perform hyperparameter optimisation for BERT group / sub-group, but do so for the corresponding SVM models. This highlights the effectiveness and suitability of the BERT model for this task, since in general SVMs offer competitive performance for document classification problems [20].

The dictionary matching cannot compete with the machine learning based solutions. Even though the matching of the concept terms against the trial summaries provides the highest recall (0.894) of all evaluated approaches, the precision of the approach is very low (0.416) due to many false positives. In particular, the approach often predicts incorrect chapter annotations, for instance chapter XXI 681 times. This is due to the broad and general topics of the chapters and their descriptions, e.g. XXI is about "Factors influencing health status and contact with health services".

Comparing the configurations with and without the GCTR documents, the performance increases (at least slightly) for all considered models. Improvements range from 0.5% (SVM sub-group) to 1.9% (BERT group) for the baseline systems with respect to their variants without the additional data. In contrast, the multi-label model benefits more strongly from the extended training set (+3.2%). The overall best performance is achieved by ensembling the best BERT multi-label models. In both ensembling variants the model reaches an F1 score of 0.815, an increase of 0.6% over the single model.

Table 2. Evaluation results of our model (last three rows) and the five baseline approaches (first five rows) on the provided development set. We report precision, recall and F1 scores in two data scenarios: (left) using only the provided training data and (right) using documents from the German Clinical Trials Register as additional training instances. *: Ensembling trained on the development set.

                             NTS data             NTS + GCTR data
Model                      P      R      F1       P      R      F1
Dictionary matching      0.416  0.894  0.568      -      -      -
SVM group                0.778  0.565  0.655    0.813  0.554  0.659
SVM sub-group            0.804  0.646  0.717    0.815  0.653  0.725
BERT group               0.810  0.624  0.705    0.820  0.640  0.719
BERT sub-group           0.811  0.748  0.778    0.833  0.737  0.782
BERT multi-label         0.901  0.747  0.776    0.834  0.788  0.810
BERT multi-label Avg       -      -      -      0.850  0.782  0.815
BERT multi-label LogReg    -      -      -      0.808* 0.822* 0.815*

3.3 Development predictions

We further analysed the predictions made by the different approaches. Figure 1 (left) compares the true positives of our BERT multi-label model with those of the SVM and BERT sub-group baselines (all trained with the GCTR corpus as additional training data). We exclude the dictionary matching baseline from this investigation, since the approach predicts far too many codes and thereby distorts the picture. In total, 1,422 of the 1,682 gold standard ICD-10 codes are identified by at least one of the three methods. This corresponds to 84.5% of the complete development data set. The intersection of all three methods consists of 1,001 true positives, which represents 70.4% of all correctly identified codes. Additionally, 1,240 (87.2%) labels are predicted by two of the three methods. Furthermore, 110 true positives are exclusively identified by our multi-label approach, which constitutes 7.7% of all correctly found codes. In contrast, 98 codes (6.9%) were predicted by (at least) one of the two classification baselines but not detected by our BERT multi-label approach. We tried to investigate the differences between the multi-label and the classification models but could not identify a clear (error) pattern.

We also perform this investigation using the best ensembled version of our approach (BERT multi-label LogReg). Figure 1 (right) highlights the results of this comparison. Through the ensembling we are able to correctly identify 20 additional labels.
Moreover, 38 ICD-10 codes that were previously predicted exclusively by the classification baselines are now also detected by the multi-label approach. However, when interpreting these figures one has to keep in mind that the logistic regression model that ensembles the predictions of the individual model instances is trained on the development set and may therefore paint an over-optimistic picture.

Fig. 1. Comparison of the predicted true positive ICD-10 codes of the evaluated models. On the left, the best (single) instance of our BERT multi-label model is contrasted with the best SVM and BERT classification baselines. The diagram on the right shows the changes when using the best ensemble model of our approach (BERT multi-label LogReg).

3.4 Test results

Table 3 shows the results of the final evaluation performed by the task organisers. Every team was allowed to submit up to three runs. We submitted three different runs: the best single model instance (according to the development results) of BERT multi-label (WBI-run1) as well as the Avg- and LogReg-ensembles (WBI-run2 and WBI-run3). All models are trained on the GCTR-extended data.

Table 3. Results of the final evaluation performed by the task organisers. They report precision, recall and F1 scores. We submitted three runs: BERT multi-label (WBI-run1), Avg-ensemble (WBI-run2) and LogReg-ensemble (WBI-run3). Our models achieve the best performance in the challenge. Bold figures highlight the highest value per column.

Team        Run     P      R      F1
DEMIR       run1   0.46   0.50   0.48
            run2   0.49   0.44   0.46
            run3   0.46   0.49   0.48
IMS-UNIPD   run1   0.00   0.00   0.00
            run2   0.009  0.50   0.017
            run3   0.10   0.05   0.07
MLT-DFKI           0.64   0.86   0.73
SSN-NLP     run1   0.19   0.27   0.22
            run2   0.19   0.27   0.23
            run3   0.13   0.34   0.36
TALP-UPC           0.37   0.35   0.36
WBI         run1   0.83   0.77   0.80
            run2   0.84   0.74   0.79
            run3   0.80   0.78   0.79

The overall best performance is accomplished by the single BERT multi-label model (WBI-run1), which achieves an F1 score of 0.80 with a slightly higher precision (0.83) than recall (0.77). Comparing this model with the two ensembling variants, all three perform almost on par and merely realise slightly different precision-recall trade-offs. The Avg-ensemble of the best models (run2) predicts more conservatively, reaching the highest precision (0.84) of all evaluated models, but offers a lower recall. In contrast, the LogReg-ensemble provides well-balanced precision and recall scores. Moreover, the final evaluation scores are virtually the same as the development scores. However, no positive effect can be observed from ensembling multiple models (at least in the way considered here).

Comparing our method with the other submissions, our model outperforms the other teams' approaches by a large margin. The second best team (MLT-DFKI) reaches a higher recall (0.86) than our multi-label model (0.77). However, their approach has a lower precision compared to our model (0.64 vs. 0.83), which allows our model to achieve a 9.6% higher F1 score.

4 Conclusion

This paper presents our contribution to Task 1 of the CLEF eHealth 2019 challenge. The task addresses the automatic assignment of ICD-10 codes to German non-technical summaries of animal experiments. We approach the task as a multi-label classification problem and leverage the multi-lingual version of BERT [6] to represent the summaries. We extend the model with a single output layer to predict probabilities for each ICD-10 code.
Furthermore, we utilise additional data from the German Clinical Trials Register to build an extended training data set and thereby improve the overall performance of the approach. The evaluation results highlight the advantage of our proposed approach: our model achieves the highest performance figures of all submissions with an F1 score of 0.80. Moreover, experiments on the development set illustrate that the model outperforms several strong classification baselines by a large margin.

There are several research questions worth investigating following this work. Due to the multi-lingual nature of the used BERT encoding model, it would be interesting to evaluate our approach in a cross-lingual setup, e.g. by applying the learned model to non-German clinical documents or animal trial summaries. For this purpose we want to use the data from the previous editions of the CLEF eHealth challenges, i.e. Italian, English, French and Hungarian death certificates. This is especially interesting because of the different text format of the certificates: they are much shorter than the animal experiment summaries and contain many abbreviations of medical terms. It is an open question how well our trained model can be transferred to this type of text. Furthermore, we also plan to inspect other approaches to the task, e.g. modelling the task as a question-answering problem. Recently, versions of BERT trained on English biomedical literature have been published [12, 3]. It would be worthwhile to investigate whether an extension of such models to multi-lingual biomedical texts would improve results further.

Acknowledgments

Leon Weber acknowledges the support of the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References

1. Almagro, M., Montalvo, S., de Ilarraza, A.D., Pérez, A.: Mamtra-med at CLEF eHealth 2018: A combination of information retrieval techniques and neural networks for ICD-10 coding of death certificates
2. Atutxa, A., Casillas, A., Ezeiza, N., Goenaga, I., Fresno, V., Gojenola, K., Martinez, R., Oronoz, M., Perez-de Vinaspre, O.: IxaMed at CLEF eHealth 2018 Task 1: ICD10 coding with a sequence-to-sequence approach. CLEF (2018)
3. Beltagy, I., Cohan, A., Lo, K.: SciBERT: Pretrained contextualized embeddings for scientific text. arXiv preprint arXiv:1903.10676 (2019)
4. Bert, B., Dörendahl, A., Leich, N., Vietze, J., Steinfath, M., Chmielewska, J., Hensel, A., Grune, B., Schönfelder, G.: Rethinking 3R strategies: Digging deeper into AnimalTestInfo promotes transparency in in vivo biomedical research. PLoS Biology 15(12), e2003217 (2017)
5. Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Habibi, M., Weber, L., Neves, M., Wiegandt, D.L., Leser, U.: Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14), i37-i48 (2017)
8. Hsu, C.W., Chang, C.C., Lin, C.J., et al.: A practical guide to support vector classification (2003)
9. Jeblee, S., Budhkar, A., Milic, S., Pinto, J., Pou-Prom, C., Vishnubhotla, K., Hirst, G., Rudzicz, F.: Toronto CL at CLEF 2018 eHealth Task 1: Multi-lingual ICD-10 coding using an ensemble of recurrent and convolutional neural networks
10.
Kelly, L., Suominen, H., Goeuriot, L., Neves, M., Kanoulas, E., Li, D., Azzopardi, L., Spijker, R., Zuccon, G., Scells, H., Palotti, J.: Overview of the CLEF eHealth evaluation lab 2019. In: Cappellato, L., Ferro, N., Losada, D.E., Müller, H. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Lecture Notes in Computer Science. Springer, Berlin Heidelberg, Germany (2019)
11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
12. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746 (2019)
13. Névéol, A., Goeuriot, L., Kelly, L., Cohen, K., Grouin, C., Hamon, T., Lavergne, T., Rey, G., Robert, A., Tannier, X., et al.: Clinical information extraction at the CLEF eHealth evaluation lab 2016. In: Proceedings of CLEF 2016 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS (2016)
14. Névéol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 multilingual information extraction task overview: ICD10 coding of death certificates in English and French. In: CLEF (Working Notes) (2017)
15. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikán, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 multilingual information extraction task overview: ICD10 coding of death certificates in French, Hungarian and Italian. In: CLEF 2018 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS (2018)
16. Neves, M., Butzke, D., Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 Multilingual Information Extraction. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Lecture Notes in Computer Science. Springer, Berlin Heidelberg, Germany (2019)
17. World Health Organization: The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines. Geneva: World Health Organization (1992)
18. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998-6008 (2017)
20. Wang, S., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. pp. 90-94. Association for Computational Linguistics (2012)
21. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)