<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF eHealth 2019 Multilingual Information Extraction Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antje Dorendahl</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nora Leich</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benedikt Hummel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilbert Schönfelder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Grune</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charité - Universitätsmedizin Berlin, Institute of Clinical Pharmacology and Toxicology</institution>
          ,
          <addr-line>Charitéplatz 1, 10117 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR)</institution>
          ,
          <addr-line>Diedersdorfer Weg 1, 12277, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Non-technical summaries (NTSs) of animal experiments can be valuable resources for fostering more transparency in research conducted with animals and for better informing the community about this topic. The NTSs of planned animal experiments in Germany are publicly available and have been manually annotated with ICD-10 codes. We used these data when organizing the Multilingual Information Extraction Task (Task 1) of the CLEF eHealth challenge. For the development phase, we released a training dataset containing more than 8,000 NTSs and their corresponding codes (if any were assigned). For the test phase, we released 407 unseen NTSs for which the participants were to submit the predictions made by their systems. The best performing system obtained a precision (P), recall (R), and f-measure (FM) of 0.83, 0.77, and 0.80, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Document indexing</kwd>
        <kwd>ICD-10 codes</kwd>
        <kwd>summaries of animal experiments</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Non-technical summaries (NTSs) are short descriptions of the planned animal
experiments to be carried out in a country and are stipulated when requesting
permission for an experiment. The European Union (EU) requires the member
states to collect these summaries and to make them available to the community in
order to foster more transparency in animal research [12]. The German Federal
Institute for Risk Assessment (BfR, its acronym in German) publishes the
German NTSs online in the AnimalTestInfo database.</p>
      <p>
        These NTSs are regularly manually annotated with ICD-10 codes to
identify the diseases that are the focus of the planned experiments.
Indexing the NTSs using terms from standard terminologies provides additional
information on the research goals of the animal experiments and supports a
detailed analysis of the data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>We utilized our annotated NTSs in the scope of a shared task in the CLEF
eHealth challenge. Our shared task aimed to evaluate systems for the automatic
detection of ICD-10 codes in German NTSs (Task 1: https://clefehealth.imag.fr/?page_id=26).
For this purpose, we utilized the manually annotated data for building training,
development, and test datasets, which were to be used by the participants in the
shared task. Previous editions of CLEF eHealth addressed similar tasks, such as
the extraction of ICD-10 codes from death certificates in English and French [8]
and, in the following year, in French, Italian, and Hungarian [9].</p>
      <p>The remainder of the paper is structured as follows: we describe the details of the
shared task in Section 2 and the participating teams and systems in Section 3. We
present the baselines that we developed in Section 4 and the results obtained
by the participants and the baselines in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Details of the Shared Task</title>
      <p>In this section we describe the details of the challenge, including the schedule of the
event, the data that we released, and the evaluation that we carried out.</p>
      <p>Schedule. We released the training data, which is split into a training and a
development dataset, to the participants on February 1st, 2019. For three months,
the participants could utilize this data for training, tuning, and evaluating their
systems. We released the official test set on May 6th, 2019. The participants had
one week to process the test data and prepare the submission files, which had to
be uploaded to the submission system by May 13th, 2019. Each team was
allowed to submit up to three runs for their systems, i.e., different configurations
or approaches that they experimented with during the development period. Manual
(human-annotated) approaches were not allowed in the shared task.</p>
      <p>Data. Our training data consisted of a set of 8,386 manually annotated NTSs,
which was split into two datasets: 7,544 NTSs for the training dataset and 842
NTSs for the development dataset. For the test set, we released a collection of
407 unseen NTSs, i.e., NTSs that were not included in the training data. Each NTS
is divided into sections, namely, title, objectives, benefits, harms, replacement,
reduction, and refinement.</p>
      <p>Evaluation. We evaluated the predictions returned by the participating systems
based on an automatic and a manual approach. We automatically evaluated the
submissions based on the standard metrics of precision (P), recall (R), and
f-measure (FM), utilizing the Python script that we released to the
participants during the shared task (https://github.com/mariananeves/clef19ehealth-task1).
For the manual validation, one of our annotators manually checked a total of
100 NTSs originating from false positives (FPs) and false negatives (FNs)
returned by the best runs. We randomly selected 25 FPs and 25 FNs from the best
run of each of the two best-scoring teams, thus a total of 100 NTSs. During the
manual validation, our expert checked whether the wrong predictions (FP or FN)
were indeed false.</p>
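The automatic evaluation described above can be sketched as a micro-averaged computation over the gold and predicted code sets. The snippet below is only an illustrative reimplementation under that assumption, not the released evaluation script, and `micro_prf` is our own helper name.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and f-measure.
    gold/pred: dicts mapping a document id to a set of ICD-10 codes."""
    tp = fp = fn = 0
    for doc_id in gold:
        g = gold[doc_id]
        p = pred.get(doc_id, set())
        tp += len(g & p)   # codes correctly predicted
        fp += len(p - g)   # predicted but not in the gold standard
        fn += len(g - p)   # gold codes missed by the system
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fm = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, fm
```

A document with no predicted codes simply contributes its gold codes as false negatives, which is why `pred.get(doc_id, set())` defaults to the empty set.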
    </sec>
    <sec id="sec-3">
      <title>Teams and systems</title>
      <p>
        We received 14 submissions from six teams from a total of six
countries, as summarized in Table 1. We present a summary of each team and their
systems below.
      </p>
      <p>
        DEMIR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The DEMIR team developed an approach based on two phases. In
the first phase, they utilized the Elasticsearch tool to perform k-Nearest Neighbor
(kNN) and threshold-Nearest Neighbor (tNN) retrieval. In the second phase, the codes
were selected from the top-ranked ones using two majority voting approaches based on
either the pre-defined top M codes or the similarity scores of the corresponding
NTSs. The team submitted three runs, namely, kNN based on k=5 and M=2
(run1), tNN based on T=30 and M=3 (run2), and tNN based on T=80 and an
adaptive M (run3).
      </p>
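The majority-voting step of such a two-phase approach can be sketched as below, assuming the code sets of the most similar NTSs have already been retrieved (e.g. by Elasticsearch); `knn_majority_codes` is a hypothetical helper, not the team's actual code.

```python
from collections import Counter

def knn_majority_codes(neighbor_codes, m):
    """neighbor_codes: list of ICD-10 code sets, one per retrieved
    nearest-neighbour NTS. Returns the M codes that occur most often
    among the neighbours (the pre-defined top-M voting variant)."""
    votes = Counter(code for codes in neighbor_codes for code in codes)
    return [code for code, _ in votes.most_common(m)]
```

With k=5 and M=2, for instance, the two codes most frequent among the five retrieved summaries would be assigned to the test NTS.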
      <p>
        IMS-UNIPD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The IMS-UNIPD team experimented with three
probabilistic Naïve Bayes (NB) classifiers, following the same approach that they used
in previous editions of the Multilingual Information Extraction Task in CLEF
eHealth. All models were based on a two-dimensional representation of
probabilities. They submitted three runs based on the three NB classifiers, namely,
Bernoulli (run1), Multinomial (run2), and Poisson (run3).
      </p>
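As an illustration of the Bernoulli variant (a minimal from-scratch sketch with Laplace smoothing, not the team's actual implementation), a per-code binary classifier can be trained and applied as follows:

```python
import math

def train_bernoulli_nb(docs, labels):
    """Train a per-code Bernoulli NB classifier with Laplace smoothing.
    docs: list of token sets; labels: list of 0/1 (code absent/present).
    Assumes both classes occur in the training data."""
    vocab = sorted(set().union(*docs))
    model = {"vocab": vocab}
    for c in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        prior = math.log(n / len(docs))
        # smoothed P(word present | class)
        probs = {w: (sum(1 for d in class_docs if w in d) + 1) / (n + 2)
                 for w in vocab}
        model[c] = (prior, probs)
    return model

def predict_bernoulli_nb(model, doc):
    """Return 1 if the code should be assigned to the tokenized NTS."""
    scores = {}
    for c in (0, 1):
        prior, probs = model[c]
        scores[c] = prior + sum(
            math.log(probs[w]) if w in doc else math.log(1.0 - probs[w])
            for w in model["vocab"])
    return max(scores, key=scores.get)
```

The Bernoulli model scores both the presence and the absence of each vocabulary word, which distinguishes it from the Multinomial variant, where only observed word counts contribute.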
      <sec id="sec-3-1">
        <p>
          MLT-DFKI [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The MLT-DFKI team tried a variety of approaches, such as
Convolutional Neural Networks (CNNs) and attention models, which are
usually used for Neural Machine Translation (NMT), among others. They obtained
the best results when relying on Bidirectional Encoder Representations from
Transformers (BERT) and, more specifically, on BioBERT, which was trained
on biomedical documents [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Since this approach is available for the
English language, the team first had to automatically translate the NTSs using
the Google Translate API. The team submitted only one run.
        </p>
        <p>
          SSN-NLP [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The SSN-NLP team developed a multi-layer Recurrent Neural
Network (RNN) with Long Short-Term Memory (LSTM) as the recurrent unit.
They experimented with two attention mechanisms, namely Normed Bahdanau
(NB) and Scaled Luong (SL), and with requiring a minimum number of
occurrences of a code as generated by the model. They submitted three runs,
namely, NB attention and a minimum of two occurrences (run1), SL attention and
a minimum of two occurrences (run2), and SL attention, a minimum of two
occurrences, and all codes if no code is repeated more than once (run3).
        </p>
        <p>TALP-UPC. The TALP-UPC team developed a simple semi-supervised
system based on Machine Translation and Named Entity Recognition (NER). In a
first step, the "Benefits" section was translated into English using the Amazon
Translate API (https://aws.amazon.com/translate/). For NER, they used MetaMap
(https://metamap.nlm.nih.gov/, online batch submission system) and considered
only the ICD-10 vocabulary source. After the identification of the entities
(codes), their parents in the ICD-10 hierarchy were also added to the
prediction list.</p>
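The parent-expansion step can be sketched as a walk up a code-to-parent map. The map below is only an illustrative fragment of the ICD-10 chapter/group hierarchy, and `expand_with_parents` is our own helper name, not the team's code.

```python
# Illustrative fragment of the ICD-10 hierarchy: group -> parent group/chapter.
ICD10_PARENT = {
    "C00-C14": "C00-C97",  # lip/oral cavity/pharynx -> malignant neoplasms
    "C00-C97": "II",       # malignant neoplasms -> chapter II (neoplasms)
    "D10-D36": "II",       # benign neoplasms -> chapter II
}

def expand_with_parents(codes, parent_map):
    """Add every ancestor of each predicted code to the prediction list."""
    expanded = set(codes)
    for code in codes:
        cur = code
        while cur in parent_map:
            cur = parent_map[cur]
            expanded.add(cur)
    return expanded
```

A single entity-level prediction such as C00-C14 would thus also yield its group C00-C97 and chapter II.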
        <p>
          WBI [11]. The WBI team utilized a multilingual BERT text encoding model
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and, as additional training data, German clinical trials
(from https://www.drks.de/drks_web/) also annotated with
ICD-10 codes. They also experimented with training various instances of the
models and ensembling the predictions based on their average or on a logistic
regression classifier. The team submitted three runs, namely, BERT multi-label
(run1), an ensemble based on the average (run2), and an ensemble based on logistic
regression (run3).
        </p>
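The averaging ensemble can be sketched as follows, assuming each trained model instance outputs a per-code probability; the helper name and the 0.5 decision threshold are our own assumptions, not details reported by the team.

```python
def ensemble_average(prob_runs, threshold=0.5):
    """prob_runs: list of dicts {code: probability}, one per trained
    model instance. Averages the per-code probabilities over all
    instances and keeps the codes whose average reaches the threshold."""
    codes = set().union(*prob_runs)
    n = len(prob_runs)
    avg = {c: sum(run.get(c, 0.0) for run in prob_runs) / n for c in codes}
    return {c for c, p in avg.items() if p >= threshold}
```

The logistic-regression variant would instead feed the per-instance probabilities as features into a second-level classifier rather than averaging them.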
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Baseline Approaches</title>
      <p>We developed some baseline systems to compare the results from the
participants to a simple text classification approach. The automatic classification of
NTSs according to ICD-10 codes is a multi-class and multi-label
problem. It is multi-class because the ICD-10-GM-2016 ontology contains a
total of 270 codes (up to level 4) that could potentially be assigned to an NTS, and it is
multi-label because more than one code can be assigned to each NTS.</p>
      <sec id="sec-4-3">
        <p>We considered only supervised learning approaches based on our training data,
i.e., codes that do not appear in the training data cannot be identified by our
baseline approaches. Given that it is a multi-label problem, during the training
phase, one classifier is trained on the training dataset for each of the 270 codes
(if training data is available). During the test phase, for the development and
test datasets, each of the above classifiers is used to decide on the assignment
of the corresponding code to each NTS. All documents were pre-processed using
the standard tokenization and TF-IDF functionality available in the Python
scikit-learn library (https://scikit-learn.org/stable/). We considered two types
of experiments, one using all sections of the summaries and one using only
the title and the benefits sections.</p>
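The TF-IDF weighting can be sketched from scratch as below. This mirrors the scikit-learn `TfidfVectorizer` step in spirit only: the raw `tf * log(N/df)` formula shown here omits the smoothing and length normalization that the library applies by default.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (tokenized NTS sections).
    Returns one dict of tf-idf weights per document."""
    n = len(docs)
    df = Counter()                 # document frequency per token
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)          # raw term frequency
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors
```

A token that occurs in every document gets weight zero, which is why uninformative boilerplate words contribute nothing to the classifiers.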
        <p>We followed the approaches based on Support Vector Machines (SVMs) that
were previously utilized for the MIMIC II dataset [10]. The authors proposed
flat and hierarchical SVMs, in which the hierarchical structure of the ICD-10
terminology is considered in the latter. Both SVM algorithms were based on
the SVM implementation available in the Python scikit-learn library, and the
differences between the two approaches are described below.</p>
        <p>Flat. This approach does not make use of the hierarchical structure of the
terminology, neither when building the classifiers nor when classifying the NTSs
from the test set. For the flat SVM approach, we built one classifier for each
code based on the totality of the summaries in the training dataset, i.e., for each
code, the positive training examples were the NTSs that contained the particular
code, while the negative examples were all NTSs that did not contain the code.
Therefore, the classifiers were trained on very unbalanced data for those codes
that occur only seldom in our training data.</p>
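The one-vs-rest split described above can be sketched as follows; `one_vs_rest_datasets` is an illustrative helper, not the released baseline code.

```python
def one_vs_rest_datasets(ntss, annotations, codes):
    """Build the flat training set for each code: positives are the NTSs
    annotated with the code, negatives are all remaining NTSs.
    ntss: list of documents; annotations: list of code sets, aligned."""
    datasets = {}
    for code in codes:
        labels = [1 if code in anno else 0 for anno in annotations]
        datasets[code] = (ntss, labels)
    return datasets
```

For a rare code, almost all labels come out 0, which is exactly the class imbalance noted above.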
        <p>Hierarchical. In this approach, we consider the four levels of the hierarchy of
the ICD-10 ontology, as considered in our manual annotations of the NTSs. The
classifiers related to codes on level 1 were trained on the whole training data,
in which the positive examples were the ones that contained the particular code
and the negative examples were the ones that did not contain the code. Therefore,
the classifiers for level 1 are no different from the ones built in the flat approach
for these same codes. For the next levels, the classifier for a particular code was
only trained on the NTSs which belonged to the corresponding parent code. For
instance, the classifier for code C00-C97 (level 2) was trained on all NTSs that
were assigned to chapter II. The positive examples were the ones assigned to code
C00-C97, while the negative examples were all the others assigned to chapter
II but not to C00-C97, for instance, those that belong to the other codes in this
chapter, such as D00-D09, D10-D36, or D37-D48. Therefore, each classifier has
a different number of training examples, but a more balanced one with regard
to the proportion of positive and negative examples, in comparison to the flat
approach.</p>
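The restriction to the parent's NTSs can be sketched as below, using the C00-C97 / chapter II example from the text; `hierarchical_training_set` is an illustrative helper, not the released baseline code.

```python
def hierarchical_training_set(ntss, annotations, code, parent):
    """Training set for `code` in the hierarchical approach: only NTSs
    annotated with `parent` are kept; among those, positives contain
    `code` and negatives are the parent's remaining NTSs."""
    docs, labels = [], []
    for nts, codes in zip(ntss, annotations):
        if parent in codes:                  # restrict to the parent's NTSs
            docs.append(nts)
            labels.append(1 if code in codes else 0)
    return docs, labels
```

Compared to the flat split, the negatives here are only the sibling codes' NTSs, which yields the more balanced class proportions mentioned above.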
      </sec>
      <sec id="sec-4-4">
        <title>Results</title>
        <p>In this section we present the results obtained by the runs submitted by the
participating teams and by our baseline systems.
As described in Section 2, one expert manually validated a random sample of
100 FPs and FNs from the best runs of the two best-scoring teams, namely,
run1 from WBI and the only run submitted by team MLT-DFKI. The FNs and
FPs were automatically detected by our evaluation script (cf. Section 2) with
regard to our gold standard test set. Below we discuss the errors that we found
in our gold standard.</p>
        <p>FNs. Among the 25 NTSs from run1 of the WBI team, our expert found seven
NTSs in which a total of 14 FN codes were wrong (cf. Table 3). These were
not codes missed by the run, but rather codes that had been mistakenly assigned to
the NTSs in our gold standard. The same seven NTSs also contained 12 wrong
FN codes detected for the run from team MLT-DFKI. Curiously, even though
we randomly selected the FN codes, both runs had practically the same FN
codes from the same seven NTSs.</p>
        <p>FPs. Among the 25 NTSs from run1 of the WBI team, our expert found 12 NTSs
in which a total of 22 FP codes were wrong (cf. Table 4). These were codes
that the expert judged as correct but that were not originally included in our
gold standard. For the run from team MLT-DFKI, our expert judged as correct
predictions just nine codes from four NTSs, out of the total of 25 NTSs that were
manually evaluated.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We presented the first corpus of non-technical summaries (NTSs) of animal
experiments for the German language. We annotated the NTSs with ICD-10
codes and utilized the data in the scope of a shared task in the CLEF eHealth
challenge. Runs from two of the participants obtained an f-measure above 0.80
and outperformed our baseline systems. The results obtained by the
participants show that automating this task is indeed feasible, for instance,
for the development of a semi-automatic system to support the experts in the
manual annotation of the NTSs.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>We would like to thank all participants for their interest in our task, and
Felipe Soares for providing a description of his team's system. We would like to
acknowledge the Australian National University for supporting the submission
Web site in EasyChair.</p>
      <p>8. Névéol, A., Anderson, R.N., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G.,
Robert, A., Zweigenbaum, P.: CLEF eHealth 2017 multilingual information
extraction task overview: ICD10 coding of death certificates in English and French.
In: Proc of CLEF eHealth Evaluation lab. Dublin, Ireland (September 2017)
9. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier,
L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information
Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian
and Italian. In: Proc of CLEF eHealth Evaluation lab. Avignon, France (September 2018)
10. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F.,
Elhadad, N.: Diagnosis code assignment: models and evaluation metrics.
Journal of the American Medical Informatics Association 21(2), 231-237
(2014). https://doi.org/10.1136/amiajnl-2013-002159
11. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal
Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In:
CLEF (Working Notes) (2019)
12. Taylor, K., Rego, L., Weber, T.: Recommendations to improve the EU
non-technical summaries of animal experiments. ALTEX - Alternatives to animal
experimentation 35(2), 193-210 (Apr 2018). https://doi.org/10.14573/altex.1708111,
https://www.altex.org/index.php/altex/article/view/90</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Ahmed</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Arıbas</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Alpkocak</surname>, <given-names>A.</given-names></string-name>:
          <article-title>DEMIR at CLEF eHealth 2019: Information Retrieval based Classification of Animal Experiment Summaries</article-title>.
          <source>In: CLEF (Working Notes)</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Amin</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Neumann</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Dunfield</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Vechkaeva</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Chapman</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Wixted</surname>, <given-names>M.</given-names></string-name>:
          <article-title>MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT</article-title>.
          <source>In: CLEF (Working Notes)</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Bert</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Dorendahl</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Leich</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Vietze</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Steinfath</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Chmielewska</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Hensel</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Grune</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Schönfelder</surname>, <given-names>G.</given-names></string-name>:
          <article-title>Rethinking 3R strategies: Digging deeper into AnimalTestInfo promotes transparency in in vivo biomedical research</article-title>.
          <source>PLOS Biology</source>
          <volume>15</volume>(<issue>12</issue>),
          <fpage>1</fpage>-<lpage>20</lpage>
          (<year>2017</year>). https://doi.org/10.1371/journal.pbio.2003217
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Chang</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name>:
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>.
          CoRR abs/1810.04805 (<year>2018</year>), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.M.</given-names></string-name>:
          <article-title>Classification of Animal Experiments: A Reproducible Study. IMS Unipd at CLEF eHealth Task 1</article-title>.
          <source>In: CLEF (Working Notes)</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kayalvizhi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Deep Learning Approach for Semantic Indexing of Animal Experiments Summaries in German Language</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Lee</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Yoon</surname>, <given-names>W.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>So</surname>, <given-names>C.H.</given-names></string-name>,
          <string-name><surname>Kang</surname>, <given-names>J.</given-names></string-name>:
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>.
          CoRR abs/1901.08746 (<year>2019</year>), http://arxiv.org/abs/1901.08746
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>