=Paper=
{{Paper
|id=Vol-2125/invited_paper_18
|storemode=property
|title=CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian
|pdfUrl=https://ceur-ws.org/Vol-2125/invited_paper_18.pdf
|volume=Vol-2125
|authors=Aurélie Névéol,Aude Robert,Francesco Grippo,Claire Morgand,Chiara Orsi,Laszlo Pelikan,Lionel Ramadier,Grégoire Rey,Pierre Zweigenbaum
|dblpUrl=https://dblp.org/rec/conf/clef/NeveolRGMOPRRZ18
}}
==CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian==
Aurélie Névéol<sup>1</sup>, Aude Robert<sup>2</sup>, Francesco Grippo<sup>3</sup>, Claire Morgand<sup>2</sup>, Chiara Orsi<sup>3</sup>, László Pelikán<sup>4</sup>, Lionel Ramadier<sup>1</sup>, Grégoire Rey<sup>2</sup>, and Pierre Zweigenbaum<sup>1</sup>

<sup>1</sup> LIMSI, CNRS, Université Paris-Saclay, Orsay, France, firstname.lastname@limsi.fr

<sup>2</sup> INSERM-CépiDc, Le Kremlin-Bicêtre, France, firstname.lastname@inserm.fr

<sup>3</sup> ISTAT, Italy, frgrippo@istat.it and chiara.orsi@istat.it

<sup>4</sup> KSH, Hungary, laszlo.pelikan@ksh.hu

Abstract. This paper reports on Task 1 of the 2018 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of the ShARe/CLEF eHealth evaluation labs. The task continued with coding of death certificates, as introduced in CLEF eHealth 2016. This large-scale classification task consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). The languages offered for the task this year were French, Hungarian and Italian. Participant systems were evaluated against a blind reference standard of 11,932 death certificates in the French dataset, 21,176 certificates in the Hungarian dataset and 3,618 certificates in the Italian dataset using Precision, Recall and F-measure. In total, fourteen teams participated: 14 teams submitted runs for the French dataset, 5 submitted runs for the Hungarian dataset and 6 for the Italian dataset. For death certificate coding, the highest performance was 0.838 F-measure for French, 0.9627 for Hungarian and 0.9524 for Italian.

Keywords: Natural Language Processing; Entity Linking; Text Classification; French; Biomedical Text

===1 Introduction===
This paper describes an investigation of information extraction and normalization (also called “entity linking”) from French, Hungarian and Italian-language health documents conducted as part of the CLEF eHealth 2018 lab [1].
The task addressed is the automatic coding of death certificates using the International Classification of Diseases, 10th revision (ICD10) [2]. This is an essential task in epidemiology: the determination of causes of death feeds directly into the production of national death statistics. In turn, the analysis of causes of death at a global level informs public health policies. In continuity with previous years, the methodology applied is the shared task model [3].

Over the past five years, CLEF eHealth offered challenges addressing several aspects of clinical information extraction (IE), including named entity recognition, normalization [4–7] and attribute extraction [8]. Initially, the focus was on a widely studied type of corpus, namely written English clinical text [4, 8]. Starting in 2015, the lab’s IE challenge evolved to address lesser studied corpora, including biomedical texts in a language other than English, i.e., French [5]. This year, we continue to offer a shared task based on a large set of gold standard annotated corpora in French, with a coding task that required normalized entity extraction at the sentence level. We also provided an equivalent dataset in Hungarian, and a synthetic dataset for the same task in Italian.

The significance of this work comes from the observation that challenges and shared tasks have had a significant role in advancing Natural Language Processing (NLP) research in the clinical and biomedical domains [9, 10], especially for the extraction of named entities of clinical interest and entity normalization.

One of the goals for this shared task is to foster research addressing multiple languages for the same task in order to encourage the development of multilingual and language adaptation methods. This year’s lab suggests that the task of coding can be addressed reproducibly with comparable performance in several European languages without relying on translation.
Furthermore, a global method addressing three languages at once opened interesting perspectives for multilingual clinical NLP [11].

===2 Material and Methods===
In the CLEF eHealth 2018 Evaluation Lab Task 1, three datasets were used. The French dataset was supplied by the CépiDc<sup>1</sup>, the Hungarian dataset was supplied by KSH<sup>2</sup> and the Italian dataset was supplied by ISTAT<sup>3</sup>. All three datasets refer to the International Classification of Diseases, tenth revision (ICD10), a reference classification of about 14,000 diseases and related concepts managed by the World Health Organization and used worldwide to register causes of death and reasons for hospital admissions. Further details on the datasets, tasks and evaluation metrics are given below.

<sup>1</sup> Centre d’épidémiologie sur les causes médicales de décès, Unité Inserm US10, http://www.cepidc.inserm.fr/.

<sup>2</sup> Központi Statisztikai Hivatal, https://www.ksh.hu/.

<sup>3</sup> Istituto nazionale di statistica, http://www.istat.it/.

====2.1 Datasets====
The CépiDc corpus was provided by the French institute for health and medical research (INSERM) for the task of ICD10 coding in CLEF eHealth 2018 (Task 1). It consists of free text death certificates collected electronically from physicians and hospitals in France over the period of 2006–2015 [12].

The KSH-HU corpus was provided by the Hungarian central statistical office (KSH). It consists of a sample of randomly extracted free text death certificates collected from doctors in Hungary for the year of death 2016. There is no electronic certification in this country, so in contrast to the French corpus, this corpus contains only deaths reported using paper forms (and then transcribed electronically).

The ISTAT-IT corpus was provided by the Italian national institute of statistics (ISTAT). To better preserve confidentiality, the corpus was fabricated based on real data. Indeed, the fake certificates were created from authentic death certificates corresponding to different years of coding.
The lines of a synthetic document each came from a different certificate, while ensuring topical coherence and preserving the chain of causes of death (line 1 of a synthetic certificate was created using line 1 of a real certificate). The coherence of age, sex and reported causes was also preserved. The synthetic certificates were then coded as if they reported a real death for 2016. To summarize, this synthetic corpus provides a realistic simulation of the language and terminology found in Italian death certificates, together with official coding. Up to 90 percent of the corpus contains terminology completely recognized by the Italian dictionary, but it also offers examples of language that cannot be automatically recognized by the Italian system: linguistic variants, new expressions and spelling mistakes in the text, for instance. A characteristic of the Italian dictionary is the poverty of labels associated with the ICD-10 codes for external causes (including certificates reporting surgery), which must be reviewed manually by the coding team.

Dataset excerpts. Death certificates are standardized documents filled in by physicians to report the death of a patient. The content of the medical information reported in a death certificate and subsequent coding for public health statistics follows complex rules described in a document that was supplied to participants [12]. Tables 1, 2 and 3 present excerpts of the corpora that illustrate the heterogeneity of the data that participants had to deal with. While some of the text lines were short and contained a term that could be directly linked to a single ICD10 code (e.g., “choc septique”), other lines could contain non-diacritized text (e.g., “peritonite...” missing the diacritic on the first “e”) or abbreviations (e.g., “BPCO” instead of “broncopneumopatia cronica ostruttiva”).
Other challenges included run-on narratives or mixed text alternating between upper case non-diacritized text and lower-case diacritized text.

{| class="wikitable"
|+ Table 1. A sample document from the CépiDC French Death Certificates Corpus: the raw causes (Raw) and computed causes (Computed) are aligned into line-level mappings to ICD codes (Aligned). English translations for each raw line follow: 1: septic shock; 2: colon perforation leading to stercoral peritonitis; 3: Acute Respiratory Distress Syndrome; 4: multiple organ failure; 5: HBP: High Blood Pressure.
! type !! line !! text !! normalized text !! ICD codes
|-
| Raw || 1 || choc septique || || -
|-
| Raw || 2 || peritonite stercorale sur perforation colique || || -
|-
| Raw || 3 || Syndrome de détresse respiratoire aiguë || || -
|-
| Raw || 4 || defaillance multivicerale || || -
|-
| Raw || 5 || HTA || || -
|-
| Computed || 1 || || defaillance multivicerale || R57.9
|-
| Computed || 2 || || syndrome détresse respiratoire aiguë || J80.0
|-
| Computed || 3 || || choc septique || A41.9
|-
| Computed || 4 || || peritonite stercorale || K65.9
|-
| Computed || 5 || || perforation colique || K63.1
|-
| Computed || 6 || || hta || I10.0
|-
| Aligned || 1 || choc septique || choc septique || A41.9
|-
| Aligned || 2 || peritonite stercorale sur perforation colique || peritonite stercorale || K65.9
|-
| Aligned || 2 || peritonite stercorale sur perforation colique || perforation colique || K63.1
|-
| Aligned || 3 || Syndrome de détresse respiratoire aiguë || syndrome détresse respiratoire aiguë || J80.0
|-
| Aligned || 4 || defaillance multivicerale || défaillance multiviscérale || R57.9
|-
| Aligned || 5 || HTA || hta || I10.0
|}

{| class="wikitable"
|+ Table 2. One sample document from the Hungarian corpus (KSH-HU Death Certificates Corpus). English translations for each raw line follow: 1: respiratory failure; 3: bacterial pneumonia; 4: pulmonary bronchitis, hepatic metastasis, cerebral metastasis.
! type !! line !! text !! ICD codes
|-
| Raw || 1 || légzési elégt || -
|-
| Raw || 3 || bakt tgy || -
|-
| Raw || 4 || tüdõ hörgõ rd, máj áttét,agy áttét || -
|-
| Computed || 1 || || J968
|-
| Computed || 3 || || J159
|-
| Computed || 4 || || C349
|-
| Computed || 4 || || C787
|-
| Computed || 4 || || C793
|}

{| class="wikitable"
|+ Table 3. One sample document from the Italian corpus (ISTAT-IT Death Certificates Corpus). English translations for each raw line follow: 1: neoplastic cachexia; 2: atrial fibrillation with rapid ventricular response; 3: cardio-circulatory decompensation, respiratory decompensation; 4: pulmonary neoplasia; 6: sigmoid resection for neoplasia, COPD (Chronic Obstructive Pulmonary Disease), hypothyroidia.
! type !! line !! text !! ICD codes
|-
| Raw || 1 || CACHESSIA NEOPLASTICA || -
|-
| Raw || 2 || FA AD ELEVATA RISPOSTA VENTRICOLARE || -
|-
| Raw || 3 || SCOMPENSO CARDIOCIRCOLATORIO, SCOMPENSO RESPIRATORIO || -
|-
| Raw || 4 || NEOPLASIA POLMONARE || -
|-
| Raw || 6 || RESEZIONE DEL SIGMA PER NEOPLASIA , BPCO , IPOTIROIDISMO || -
|-
| Computed || 1 || || C809
|-
| Computed || 2 || || I489
|-
| Computed || 2 || || I471
|-
| Computed || 3 || || I516
|-
| Computed || 3 || || J988
|-
| Computed || 4 || || C349
|-
| Computed || 6 || || Y836
|-
| Computed || 6 || || D48
|-
| Computed || 6 || || J448
|-
| Computed || 6 || || E0399
|}

Descriptive statistics. Table 4 presents statistics for the specific data sets provided to participants. For two of the languages, the dataset construction was time-oriented in order to reflect the practical use case of coding death certificates, where historical data is available to train systems that can then be applied to current data to assist with new document curation. For French, the training set covered the 2006–2014 period, and the test set came from 2015. For Hungarian, data was only available for the year 2016, but the training and test sets were nonetheless divided chronologically during that year. While the French dataset offers more documents spread over a nine-year period, it also reflects changes in the coding rules and practices over the period. In contrast, the Hungarian dataset is smaller but more homogeneous. The Italian dataset was fabricated from de-identified original death certificates to further preserve patient confidentiality.

{| class="wikitable"
|+ Table 4. Descriptive statistics of the Death Certificates datasets in French, Hungarian and Italian. Tokens were counted using the linux wc -w command.
! !! French training (2006–2014) !! French test (2015) !! Hungarian training (2016) !! Hungarian test (2016) !! Italian training (2016) !! Italian test (2016)
|-
| Certificates || 125,384 || 11,931 || 84,703 || 21,176 || 14,502 || 3,618
|-
| Lines || 368,065 || 34,918 || 324,266 || 81,291 || 49,825 || 12,602
|-
| Tokens || 1,250,232 || 84,091 || 666,839 || 167,507 || 666,839 || 167,507
|-
| Total ICD codes || 509,103 || 48,948 || 392,020 || 98,264 || 60,955 || 15,789
|-
| Unique ICD codes || 3,723 || 1,806 || 3,124 || 2,011 || 1,443 || 903
|-
| Unique unseen ICD codes || - || 70 || - || 202 || - || 100
|}

Dataset format. In compliance with the World Health Organization (WHO) international standards, death certificates comprise two parts: Part I is dedicated to the reporting of diseases related to the main train of events leading directly to death, and Part II is dedicated to the reporting of contributory conditions not directly involved in the main death process.<sup>4</sup> According to WHO recommendations, the completion of both parts is free of any automatic assistance that might influence the certifying physician. The processing of death certificates, including ICD10 coding, is performed independently of physician reporting. In France, Hungary and Italy, coding of death certificates is performed within 18 months of reporting using the IRIS system [13].

In the course of coding practice, the data is stored in different files: a file that records the native text entered in the death certificates (referred to as ‘raw causes’ hereafter) and a file containing the result of ICD code assignment (referred to as ‘computed causes’ hereafter). The ‘computed causes’ file may contain normalized text that supports the coding decision and can be used in the creation of dictionaries for the purpose of coding assistance.

We found that the formatting of the data into raw and computed causes made it difficult to directly relate the codes assigned to original death certificate texts. This makes the datasets more suitable for approaching the coding problem as a text classification task at the document level rather than a named entity recognition and normalization task.
We have reported separately on the challenges presented by the separation of data into raw and computed causes, and proposed solutions to merge the French data into a single ‘aligned’ format, relying on the normalized text supplied with the French raw causes [14]. Table 1 presents a sample of a French death certificate in ‘raw’ and ‘aligned’ format. It illustrates the challenge of alignment: line 2 in the raw file (“péritonite stercorale sur perforation colique”) has to be mapped to line 4 (“peritonite stercorale”, code K65.9) and line 5 (“perforation colique”, code K63.1) in the computed file.

<sup>4</sup> As can be seen in the sample documents, the line numbering in the raw causes file may (Table 2) or may not (Table 1) be the same in the computed causes file. In some cases, the ordering in the computed causes file was changed to follow the causal chain of events leading to death.

Data files. Table 5 presents a description of the files that were provided to the participants: training (train) files were distributed at the end of February 2018; test files (test, with no gold standard) were distributed at test time (at the end of April 2018); and the gold standard for test files (test+g in aligned format, test, computed in raw format) was disclosed to the participants after the test phase (in May 2018) so that participants could reproduce the performance measures announced by the organizers.

{| class="wikitable"
|+ Table 5. Data files. Files after the dashed lines are test files; files after the dotted lines contain the gold test data. L = language (fr = French, hu = Hungarian, it = Italian).
! Format !! L. !! Split !! Type !! Year !! File name
|-
| Aligned || fr || train || aligned || 2006–2012 || AlignedCauses 2006-2012.csv
|-
| Aligned || fr || train+g || aligned || 2006–2012 || AlignedCauses 2006-2012full.csv
|-
| Aligned || fr || train || aligned || 2013 || AlignedCauses 2013.csv
|-
| Aligned || fr || train+g || aligned || 2013 || AlignedCauses 2013full.csv
|-
| Aligned || fr || train || aligned || 2014 || AlignedCauses 2014.csv
|-
| Aligned || fr || train+g || aligned || 2014 || AlignedCauses 2014full.csv
|-
| Aligned || fr || test || aligned || 2015 || AlignedCauses 2015F 1.csv
|-
| Aligned || fr || test || list || 2015 || GoldStandardFR2008 IDs.out
|-
| Aligned || fr || test+g || aligned || 2015 || AlignedCauses 2015 full 2018 UTF8 filtered 1m commonRaw.csv
|-
| Raw || fr || train || raw || 2006–2012 || CausesBrutes FR 2006–2012.csv
|-
| Raw || fr || train || ident || 2006–2012 || Ident FR training.csv
|-
| Raw || fr || train+g || computed || 2006–2012 || CausesCalculees FR 2006–2012.csv
|-
| Raw || fr || train || raw || 2013 || CausesBrutes FR 2013.csv
|-
| Raw || fr || train || ident || 2013 || Ident FR 2013.csv
|-
| Raw || fr || train+g || ident || 2013 || Ident FR 2013 full.csv
|-
| Raw || fr || train+g || computed || 2013 || CausesCalculees FR 2013.csv
|-
| Raw || fr || train || raw || 2014 || CausesBrutes FR 2014.csv
|-
| Raw || fr || train || ident || 2014 || Ident FR 2014.csv
|-
| Raw || fr || train+g || ident || 2014 || Ident FR 2014 full.csv
|-
| Raw || fr || train+g || computed || 2014 || CausesCalculees FR 2014.csv
|-
| Raw || fr || test || raw || 2015 || CausesBrutes FR 2015F 1.csv
|-
| Raw || fr || test || ident || 2015 || Ident FR 2015F 1.csv
|-
| Raw || fr || test || list || 2015 || GoldStandardFR2008 IDs.out
|-
| Raw || fr || test+g || computed || 2015 || CausesCalculees 2015 full 2018 UTF8 filtered 1m commonRaw.csv
|-
| Raw || hu || train || raw || 2016 || CausesBrutes HU 1.csv
|-
| Raw || hu || train || ident || 2016 || Ident HU 1.csv
|-
| Raw || hu || train+g || computed || 2016 || CausesCalculees HU 1.csv
|-
| Raw || hu || test || raw || 2016 || CausesBrutes HU 2.csv
|-
| Raw || hu || test || ident || 2016 || Ident HU 2.csv
|-
| Raw || hu || test+g || computed || 2016 || CausesCalculees HU 2.csv
|-
| Raw || it || train || raw || 2016 || CausesBrutes IT 1.csv
|-
| Raw || it || train || ident || 2016 || Ident IT 1.csv
|-
| Raw || it || train+g || computed || 2016 || corpus/CausesCalculees IT 1.csv
|-
| Raw || it || test || raw || 2016 || CausesBrutes IT 2.csv
|-
| Raw || it || test || ident || 2016 || Ident IT 2.csv
|-
| Raw || it || test+g || computed || 2016 || CausesCalculees IT 2.csv
|}

====2.2 ICD10 coding task====
The coding task consisted of mapping lines in the death certificates to one or more relevant codes from the International Classification of Diseases, tenth revision (ICD10). For the raw datasets, codes were assessed at the certificate level. For the aligned dataset, codes were assessed at the line level.

====2.3 Evaluation metrics====
System performance was assessed by the usual metrics of information extraction: precision (Formula 1), recall (Formula 2) and F-measure (Formula 3; specifically, we used β = 1).
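As an illustration of how these three measures combine, the following minimal Python sketch computes micro-averaged precision, recall and F-measure over a set of scoring units (certificates for the raw datasets, lines for the aligned dataset). It is not the official evaluation tool. The function name, the dictionary-based input layout, the multiset treatment of repeated codes and the choice to return 0.0 for an undefined ratio are all assumptions made for this example.

```python
from collections import Counter


def precision_recall_f(gold, predicted, beta=1.0):
    """Micro-averaged precision, recall and F-measure.

    gold, predicted: dict mapping a unit id (hypothetical layout: a
    certificate id, or a certificate/line pair) to a list of ICD10 codes.
    """
    tp = fp = fn = 0
    for unit in set(gold) | set(predicted):
        gold_codes = Counter(gold.get(unit, []))
        pred_codes = Counter(predicted.get(unit, []))
        # Multiset intersection: a code counts as a true positive at most
        # as many times as it occurs in the reference (assumption).
        matched = sum((gold_codes & pred_codes).values())
        tp += matched
        fp += sum(pred_codes.values()) - matched
        fn += sum(gold_codes.values()) - matched

    # Returning 0.0 when a ratio is undefined is a choice of this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_measure = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = {"c1": ["A41.9", "K65.9", "K63.1"], "c2": ["I10.0"]}
    pred = {"c1": ["A41.9", "K65.9"], "c2": ["I10.0", "J80.0"]}
    print(precision_recall_f(gold, pred))
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall, which is the setting used in this task.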
<math>\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \qquad (1)</math>

<math>\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \qquad (2)</math>

<math>\text{F-measure} = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}} \qquad (3)</math>

Results were computed using two perl scripts, one for the raw datasets (in French, Hungarian and Italian) and one for the aligned dataset (in French only). The evaluation tools were supplied to task participants along with the training data. Measures were computed for all causes in the datasets, i.e. the evaluation covered all ICD codes in the test datasets.

For the raw datasets, matches (true positives) were counted for each ICD10 full code supplied that matched the reference for the associated document. For the aligned dataset, matches (true positives) were counted for each ICD10 full code supplied that matched the reference for the associated document line.

This year, we also experimented with a secondary metric, which consisted in computing recall over the primary causes of death. In death certificate coding, once all the relevant causes of death have been identified in all certificate lines, the chain of events leading to the death is analyzed to yield one single primary cause of death, which is central to national statistics reporting. This primary cause was available to us for the French and Italian datasets. Primary recall was therefore computed as the number of certificates where the primary cause was retrieved by systems over the total number of certificates.

===3 Results===
Participating teams included between one and nine team members and resided in Algeria (team techno), Canada (team TorontoCL), China (teams ECNU and WebIntelligentLab), France (teams APHP, IAM, ISPED), Germany (team WBI), Italy (team UNIPD), Spain (teams IxaMed, SINAI and UNED), Switzerland (team SIB) and the United Kingdom (team KCL).

For the Hungarian raw dataset, we received 9 official runs from 5 teams. For the Italian raw dataset, we received 12 official runs from 7 teams.
For the French raw dataset, we received 18 official runs from 12 teams. We also received three additional non-official runs from 2 teams, including one run implementing corrections for a faulty official run. For the French aligned dataset, we received 16 official runs from 8 teams. We also received three additional non-official runs from 2 teams, including one run implementing corrections for a faulty official run.

====3.1 Methods implemented in the participants’ systems====
Participants relied on a diverse range of approaches, including classification methods (often leveraging neural networks), information retrieval techniques and dictionary matching accommodating different levels of lexical variation. Most participants (12 teams out of 14) used the dictionaries that were supplied as part of the training data, as well as other medical terminologies and ontologies (at least one team).

ECNUica. The methods implemented by the ECNUica team [15] combine statistical machine learning and symbolic algorithms to solve the ICD10 coding task. First, they use regular expression matching to map test data to ICD10 codes. Then, to handle the data for which no ICD10 code was matched, they use attributes such as gender and age in the corpus as features to train random forest and XGBoost models. All the data is thereby classified into 26 categories (A to Z), and rule-based and similarity computation methods are used to match the classified data with the training data. Finally, they obtain the specific ICD10 codes of the test data.

ECSTRA-APHP. The ECSTRA-APHP team [16] cast the task as a machine learning problem involving the prediction of the ICD10 codes (categorical variable) from the raw text transformed into word embeddings. They rely on a probabilistic convolutional neural network for classification.
They train a CNN that uses multiple filters (with varying window sizes) to obtain multiple features on top of word vectors obtained as the first hidden layer of the classification itself. Due to the very weak representation of some ICD codes, they complete the prediction with a dictionary-based lexical matching classifier, which relies on word recognition from a knowledge base built from several available dictionaries on the French ICD10 classification: the second volume of ICD, the Orphanet thesaurus, French SNOMED CT, and the CépiDC dictionaries provided for the challenge.

IAM-ISPED. The method used by the IAM ISPED team [17] is a dictionary-based approach. It uses the terms of a terminology (ICD10) to assign ICD10 codes to each text line. The program has a typo detection module based on the Levenshtein distance and a synonym expansion module (e.g., Ins => Insuffisance). Runs 1 and 2 differ in the terms used: in run 2, all the terms of the column “Standard text” in the AlignedCauses files (2006-2012; 2013; 2014) were used, which corresponded to 42,439 terms and 3,539 codes; in run 1, the terms of run 2 and the terms in the “Dictionnaire2015.csv” file were used, which corresponded to 148,447 terms and 6,392 codes. The source code of the program will be released.

IMS-UNIPD. Team UNIPD [18] aimed to implement 1) a minimal rule-based expert system to translate acronyms, 2) a binary weighting approach to retrieve the items in the dictionary most similar to the portion of the death certificate, and 3) a basic approach to select the class with the highest weight.

IxaMed. The IxaMed group [19] approached automatic ICD10 coding for French, Italian and Hungarian with a neural model that tries to map the input text snippets to the output ICD10 codes.
Their solution does not make assumptions about the content of the input and output data, treating them by means of a machine learning approach that assigns a set of labels to any input line. The solution is language-independent, in the sense that treating a new language only needs a set of (input, output) examples, making no use of language-specific information apart from terminological resources such as ICD10 dictionaries, when available.

KCL-Health-NLP. The KCL-Health-NLP team [20] employed a document-level encoder-decoder neural approach. The convolutional encoder operates at the character level. The decoder is recurrent. For French, they contrast the use of raw text only with raw text combined with string-matched ICD codes. The string matching approach relies on the dictionaries provided, and uses a word n-gram (1-5) representation (ignoring diacritics, including stemming and removal of stopwords) to search for matches. For Italian, they take advantage of language-independent character-level characteristics and contrast results with and without pre-training using the French data. External resources are not used.

LSI-UNED. The LSI-UNED team [21] submitted two runs for each raw dataset. A supervised learning system (run 2) was implemented using multilayer perceptrons and a One-vs-Rest (OVR) strategy. The models were trained with the training data and the CépiDC dictionaries, estimating the frequency of terms weighted with Bi-Normal Separation (BNS). Additionally, this approach was supplemented with IR methods in a second system (run 1). To this end, the bias was limited by generating learning models only for the ICD-10 codes that appear more than 100 times in the training dataset. The diseases left unclassified by these models are used to build queries that are submitted to search engines indexing code descriptions.

SIB-BITEM.
The BITEM-SIB team [22] leveraged the large size and textual nature of the training data by investigating an instance-based learning approach. The 360,000 annotated sentences contained in the training data were indexed with a standard search engine. Then, the k-Nearest Neighbors of an input sentence were exploited in order to infer potential codes through majority voting. A dictionary-based approach was also used for directly mapping codes in sentences, and both approaches were linearly combined.

SINAI. The SINAI team [23] built a system based on Natural Language Processing (NLP) techniques to detect International Classification of Diseases (ICD10) codes using different machine learning algorithms. First, their system found all the possible ICD10 codes by counting how many words of each code description occur in the text. Next, several quality measures were calculated for these codes. With these metrics, different machine learning algorithms were trained, and finally the best model was selected for use in the system. Most of the techniques used are independent of the language, therefore the system is easily adaptable to other languages.

KR-ISPED. The SITIS-ISPED team [24] used a deep learning approach and relied on the training data supplied: they used OpenNMT-py, an open source framework for Neural Machine Translation (seq2seq), implemented in PyTorch. To transform diagnostics into ICD10 codes they use an encoder-decoder architecture, consisting of two recurrent neural networks combined with an attention mechanism. First, the diagnostics and their ICD10 codes are extracted from the csv files and then respectively split into a source text file and a target text file. This extraction is made by a simple bash program. In this way the data consists of parallel source (diagnosis) and target (ICD10 codes) data containing one sentence per line with words separated by a space. Then those data are split into two groups: one for training and one for validation.
Validation files are used to evaluate the convergence of the training process. For source files, a first preprocessing step converts upper case into lower case. Tokenization is applied to the source files and target files, which are used as input for the neural network. The encoder/decoder model consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. The encoder encodes the input sequence into a context vector, which is used by the decoder to generate the output sequence. The training process runs for 13 epochs and produces a model. From the test data provided by the CLEF organization, they extracted the diagnostics, preprocessed them and used the trained model to “translate” them into their respective ICD10 codes.

Techno. The techno team [25] developed a Naive Bayes (NB) classifier for information extraction from written text in CLEF eHealth 2018 Task 1. They used the NB classifier to generate a classification model. The evaluation of the proposed approach does not show good performance.

TorontoCL. The TorontoCL team [26] assigned ICD-10 codes to cause-of-death phrases in multiple languages by creating rich and relevant word embedding models. They train 100-dimensional word embeddings on the training data provided, as well as on language-specific Wikipedia corpora. They then use an ensemble model for ICD coding prediction, which includes n-gram matching of the raw text to the provided ICD dictionary followed by an ensemble of a convolutional neural network and a recurrent neural network encoder-decoder.

WBI. The contribution of the WBI team [11] focuses on the setup and evaluation of a baseline language-independent neural architecture as well as a simple, heuristic multi-language word embedding space. Their approach builds on two recurrent neural network models and models the extraction and classification of death causes as a two-step process.
First, they employ an LSTM-based sequence-to-sequence model to obtain a death cause from each death certificate line. Afterwards, a bidirectional LSTM model with an attention mechanism is used to assign the respective ICD-10 codes to the death cause description obtained. Both models represent words using pre-trained fastText word embeddings. Independently of the original language of a word, they represent it by looking up the word in the embedding models of the three languages and concatenating the obtained vectors to build a heuristic shared vector space.

WebIntelligentLab. The WebIntelligentLab team used a deep learning method, namely an LSTM with fully connected layers, that uses only the training data, with no dictionary or other external data.

Baseline. To provide a better assessment of the task difficulty and system performance, this year we offered results from a so-called frequency baseline, which consisted in assigning to a certificate line from the test set the top 2 most frequently associated ICD10 codes in the training and development sets, using case and diacritic insensitive line matching.

====3.2 System performance on death certificate coding====
Tables 6 to 9 present system performance on the ICD10 coding task for each dataset. Team IxaMed obtained the best performance in terms of F-measure for all datasets. However, we can note that the overall recall performance did not always align with the recall computed over primary causes of death (for French and Italian only).

===4 Discussion===
In this section, we discuss system performance as well as dataset composition and we highlight directions for future work.

====4.1 Natural Language Processing for assisting death certificates coding====
System performance generally far exceeded the baseline for all three languages. The best systems achieved high precision (.846 and above) as well as high recall (.597 for French, .955 for Hungarian and .945 for Italian).
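For reference, the frequency baseline against which systems are compared (Section 3.1) can be approximated with a few lines of Python. This is an illustrative sketch, not the organizers' implementation: the function names, the (text, code) pair input layout and the use of Unicode decomposition for diacritic removal are assumptions, and ties between equally frequent codes are broken by first occurrence in the training data.

```python
import unicodedata
from collections import Counter, defaultdict


def normalize(line):
    """Case- and diacritic-insensitive key for a certificate line."""
    decomposed = unicodedata.normalize("NFKD", line.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.strip()


def train_baseline(pairs):
    """pairs: iterable of (line text, ICD10 code) taken from training data."""
    counts = defaultdict(Counter)
    for text, code in pairs:
        counts[normalize(text)][code] += 1
    return counts


def predict(counts, line, top_n=2):
    """Return the top_n codes most frequently associated with this line."""
    seen = counts.get(normalize(line), Counter())
    return [code for code, _ in seen.most_common(top_n)]


if __name__ == "__main__":
    model = train_baseline([
        ("choc septique", "A41.9"),
        ("choc septique", "A41.9"),
        ("Choc Septique", "R57.9"),
        ("péritonite", "K65.9"),
    ])
    print(predict(model, "CHOC SEPTIQUE"))
    print(predict(model, "peritonite"))
```

A line that was never seen in training receives no code under this scheme, which is one reason such a baseline has limited recall.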
Similarly to last year, we observe a gap in recall performance between the raw and aligned versions of the French dataset, which suggests that there is value in performing the line alignment of the training data. We also note that the primary cause of death recall is higher on the aligned than on the raw format. Many systems offered higher primary cause of death recall than overall recall on the aligned dataset. Although no direct comparison is possible because the test sets were different, we can notice that the best performance from last year (.825 F-measure for French raw, .867 F-measure for French aligned by the LIMSI team [27]) remains ahead of this year's achievements.

The results of the submitting systems show consistent performance across languages for those that addressed more than one language. Of note, all systems but one set up a common architecture for the different languages, which then independently leveraged the resources available in each language (i.e. pre-processing, training corpus, dictionaries, external corpora used to create word embeddings, etc.). Only one team [11] attempted to develop a single system that could address all three languages, with varying success depending on the language. They also report that their method still has room for improvement as it currently handles the task as a classification problem that assigns exactly one code per death certificate line, which significantly limits the recall performance.

Overall, the level of performance achieved by participants this year shows great potential for assisting death certificate coders throughout Europe in their daily task.

Table 6. System performance for ICD10 coding on the French aligned test corpus in terms of Precision (P), recall (R), F-measure (F) and recall on Primary Cause of Death (Primary R). A horizontal dashed line marks where the frequency baseline performance falls. The top part of the table displays official runs, while the bottom part displays non-official and baseline runs.

French (Aligned)
Team                  P      R      F      Primary R
Official runs
IxaMed-run2           .841   .835   .838   .819
IxaMed-run1           .846   .822   .834   .814
IAM-run2              .794   .779   .786   .770
IAM-run1              .782   .772   .777   .757
SIB-TM                .763   .764   .764   .777
TorontoCL-run2        .810   .720   .762   .702
TorontoCL-run1        .815   .712   .760   .694
KCL-Health-NLP-run1   .787   .553   .649   .629
KCL-Health-NLP-run2   .769   .537   .632   .621
SINAI-run2            .733   .534   .618   .549
SINAI-run1            .725   .528   .611   .527
WebIntelligentLab     .673   .491   .567   .451
ECNUica-run1          .771   .437   .558   .526
ECNUica-run2          .771   .437   .558   .526
- - - - - - - - - - - - - - - - - - - - - - - - - - -
techno                .489   .356   .412   .410
KR-ISPED              .029   .020   .023   .029
average               .712   .581   .634   .589
median                .771   .545   .641   .621
Non-official runs
APHP-run1             .634   .600   .621   .653
APHP-run2             .794   .607   .688   .713
KR-ISPED-corrected    .665   .453   .539   .524
Frequency baseline    .452   .450   .451   .495

4.2 Limitations

Size of the French test set. The French test set initially distributed this year comprised 24,375 death certificates in the raw and aligned format. Owing to a bug in the selection process, only 11,931 certificates were present in both raw and aligned format. In order to make the results directly comparable between formats, system performance was eventually computed on the subset of 11,931 common certificates. Even though the final size of the test set is smaller than initially planned, we believe that it is still large enough to provide interesting insight on system performance for death certificate coding in French.

Table 7. System performance for ICD10 coding on the French raw test corpus in terms of Precision (P), recall (R), F-measure (F) and recall on Primary Cause of Death (Primary R). A horizontal dashed line marks where the frequency baseline performance falls. The top part of the table displays official runs, while the bottom part displays non-official and baseline runs.

French (Raw)
Team                  P      R      F      Primary R
Official runs
IxaMed-run1           .872   .597   .709   .579
IxaMed-run2           .877   .588   .704   .573
LSI-UNED-run1         .842   .556   .670   .535
LSI-UNED-run2         .879   .540   .669   .506
IAM-run2              .820   .560   .666   .555
IAM-run1              .807   .555   .657   .544
TorontoCL-run2        .842   .522   .644   .507
TorontoCL-run1        .847   .515   .641   .500
WebIntelligentLab     .702   .495   .580   .451
ECNUica-run1          .790   .456   .578   .530
KCL-Health-NLP-run1   .738   .405   .523   .430
KCL-Health-NLP-run2   .724   .394   .510   .421
ims-unipd             .653   .396   .493   .401
techno                .569   .286   .380   .349
WBI-run2              .512   .253   .339   .302
WBI-run1              .494   .246   .329   .293
- - - - - - - - - - - - - - - - - - - - - - - - - - -
KR-ISPED              .043   .021   .028   .015
ECNUica-run2          1.000  0.000  .000   .000
average               .723   .410   .507   .414
median                .798   .475   .579   .500
Non-official runs
APHP-run1             .668   .601   .633   .613
APHP-run2             .816   .607   .696   .713
KR-ISPED-corrected    .676   .323   .437   .377
Frequency baseline    .341   .201   .253   .221

Comparability across languages. Overall system performance seems to be higher on the Hungarian (average F-measure .803) and Italian (average F-measure .799) datasets, compared to French (raw average F-measure .507). However, the question of strict comparability across languages remains open because of the differences in nature between the datasets. The Italian dataset is a synthetic dataset fabricated using selected real data. It is possible that the selection process, in order to reach the consistency goals for the final synthetic certificates, yielded content that was somewhat more standard and easier to analyze. The Hungarian dataset was obtained from transcribed paper certificates.
It is possible that some of the natural language difficulties present in the original paper certificates (such as typos) were smoothed out during the transcription process, which was performed manually by contractors. The French dataset was obtained directly from electronic certification, which means that it contains the original text exactly as entered by doctors without any filtering of difficulties.

Table 8. System performance for ICD10 coding on the Hungarian raw test corpus in terms of Precision (P), recall (R) and F-measure (F).

Hungarian (Raw)
Team                  P      R      F
IxaMed-run2           .970   .955   .963
IxaMed-run1           .968   .954   .961
LSI-UNED-run2         .946   .911   .928
LSI-UNED-run1         .932   .922   .927
TorontoCL-run2        .922   .897   .910
TorontoCL-run1        .901   .887   .894
ims-unipd             .761   .748   .755
WBI-run2              .522   .388   .445
WBI-run1              .518   .384   .441
average               .827   .783   .803
median                .922   .897   .910
Frequency baseline    .243   .174   .202

Table 9. System performance for ICD10 coding on the Italian raw test corpus in terms of Precision (P), recall (R), F-measure (F) and primary cause of death recall (Primary R).

Italian (Raw)
Team                  P      R      F      Primary R
IxaMed-run1           .960   .945   .952   .705
IxaMed-run2           .945   .922   .934   .699
LSI-UNED-run1         .917   .875   .895   .666
LSI-UNED-run2         .931   .861   .895   .616
TorontoCL-run1        .908   .824   .864   .650
TorontoCL-run2        .900   .829   .863   .652
WBI-run2              .862   .689   .766   .715
WBI-run1              .857   .685   .761   .712
KCL-Health-NLP-run1   .746   .636   .687   .492
KCL-Health-NLP-run2   .725   .616   .666   .492
ims-unipd             .535   .484   .509   .375
average               .844   .761   .799   .616
median                .900   .824   .863   .652
Frequency baseline    .165   .172   .169   .071

The practice of writing death certificates in the three different countries may also generate notable differences in the writing style or depth of descriptions that impact the analysis. A further exploration of dataset characteristics in terms of number of typos, acronyms or token/type ratios could yield interesting insight on the comparability of data across languages.
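One of the dataset characteristics suggested above, the token/type ratio, could be computed along the following lines. This is only a sketch under stated assumptions: the regular-expression tokenizer, the lowercasing step, the function names and the sample lines are all hypothetical choices made for illustration, and no such analysis was run on the task data.

```python
import re
from collections import Counter


def tokenize(line: str):
    """Split a line into lowercase word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"\w+", line.lower())


def token_type_ratio(lines):
    """Return (n_tokens, n_types, tokens / types) for an iterable of text lines."""
    counts = Counter(tok for line in lines for tok in tokenize(line))
    n_tokens = sum(counts.values())  # running words
    n_types = len(counts)            # distinct word forms
    ratio = n_tokens / n_types if n_types else 0.0
    return n_tokens, n_types, ratio


# Invented example lines; not drawn from the French, Hungarian or Italian datasets.
sample = [
    "insuffisance cardiaque",
    "Insuffisance respiratoire aigue",
    "arret cardiaque",
]
n_tokens, n_types, ratio = token_type_ratio(sample)
print(n_tokens, n_types, round(ratio, 2))
```

Because this ratio grows with corpus size, a cross-language comparison would presumably need samples of equal size from each dataset; a higher value on same-sized samples would indicate more repetitive, and possibly more standardized, wording.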
5 Conclusion

We released a new set of death certificates to evaluate systems on the task of ICD10 coding in multiple languages. This is the fourth edition of a biomedical NLP challenge that provides large gold-standard annotated corpora in a language other than English. Results show that high performance can be achieved by NLP systems on the task of coding for death certificates in French, Hungarian and Italian. The level of performance observed shows that there is potential for integrating automated assistance in the death certificate coding workflow in all three languages. The corpus used and the participating team system results are an important contribution to the research community. The comparable corpora could be used for studies that go beyond the scope of the challenge, including a cross-country analysis of death certificate contents. In addition, the focus on three languages other than English (French, Hungarian and Italian) remains a rare initiative in the biomedical NLP community.

Acknowledgements

We want to thank all participating teams for their effort in addressing new and challenging tasks. The organization work for CLEF eHealth 2018 Task 1 was supported by the Agence Nationale pour la Recherche (French National Research Agency) under grant number ANR-13-JCJC-SIMI2-CABeRneT. The CLEF eHealth 2018 evaluation lab has been supported in part by the CLEF Initiative and Data61.

References

1. Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Azzopardi, L., Spijker, R., Li, D., Névéol, A., Ramadier, L., Robert, A., Zuccon, G., Palotti, J. Overview of the CLEF eHealth Evaluation Lab 2018. In: CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, September. (2018)
2. World Health Organization. ICD-10. International Statistical Classification of Diseases and Related Health Problems. 10th Revision. Volume 2. Instruction manual. 2011.
3. Névéol, A., Anderson, R.N., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Robert, A., Zweigenbaum, P.: CLEF eHealth 2017 multilingual information extraction task overview: ICD10 coding of death certificates in English and French. In: CLEF 2017 Online Working Notes. CEUR-WS (2017)
4. Suominen H, Salantera S, Velupillai S, Chapman WW, Savova G, Elhadad N, Pradhan S, South BR, Mowery DL, Jones GJF, Leveling J, Kelly L, Goeuriot L, Martinez D, Zuccon G. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In: Forner P, Müller H, Paredes R, Rosso P, Stein B (eds), Information Access Evaluation. Multilinguality, Multimodality, and Visualization. LNCS (vol. 8138):212-231. Springer, 2013
5. Goeuriot L, Kelly L, Suominen H, Hanlen L, Névéol A, Grouin C, Palotti J, Zuccon G. Overview of the CLEF eHealth Evaluation Lab 2015. In: Information Access Evaluation. Multilinguality, Multimodality, and Interaction. Springer, 2015
6. Kelly L, Goeuriot L, Suominen H, Névéol A, Palotti J, Zuccon G. (2016) Overview of the CLEF eHealth Evaluation Lab 2016. In: Fuhr N. et al. (eds) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol 9822. Springer, Cham
7. Lorraine Goeuriot, Liadh Kelly, Hanna Suominen, Aurélie Névéol, Aude Robert, Evangelos Kanoulas, Rene Spijker, João Palotti, and Guido Zuccon. CLEF 2017 eHealth Evaluation Lab Overview. CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, September, 2017.
8. Kelly L, Goeuriot L, Suominen H, Schreck T, Leroy G, Mowery DL, Velupillai S, Chapman WW, Martinez D, Zuccon G, Palotti J. Overview of the ShARe/CLEF eHealth Evaluation Lab 2014. In: Kanoulas E, Lupu M, Clough P, Sanderson M, Hall M, Hanbury A, Toms E (eds), Information Access Evaluation. Multilinguality, Multimodality, and Interaction. LNCS (vol. 8685):172-191. Springer, 2014
9. Chapman WW, Nadkarni PM, Hirschman L, D’Avolio LW, Savova GK, Uzuner O (2011). Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J Am Med Inform Assoc, 18(5):540-3
10. Huang CC, Lu Z (2015). Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform, 2015 May 1. pii: bbv024.
11. Ševa J, Sänger M, and Leser U (2018). WBI at CLEF eHealth 2018 Task 1: Language-independent ICD-10 coding using multi-lingual embeddings and recurrent neural networks. CLEF 2018 Online Working Notes. CEUR-WS
12. Pavillon G., Laurent F (2003). Certification et codification des causes médicales de décès. Bulletin Epidémiologique Hebdomadaire - BEH:134-138. http://opac.invs.sante.fr/doc_num.php?explnum_id=2065 (accessed: 2016-06-06)
13. Johansson LA, Pavillon G (2005). IRIS: A language-independent coding system based on the NCHS system MMDS. In WHO-FIC Network Meeting, Tokyo, Japan
14. Lavergne T, Névéol A, Robert A, Grouin C, Rey G, Zweigenbaum P. A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage. Proceedings of the Fifth Workshop on Building and Evaluating Ressources for Health and Biomedical Text Processing - BioTxtM2016. 2016.
15. Li M, Xu C, Wei T, Bao D, Lu N, and Yang J (2018). ECNU at 2018 eHealth Task1 Multilingual Information Extraction. CLEF 2018 Online Working Notes. CEUR-WS
16. Flicoteaux R (2018). ECSTRA-APHP @ CLEF eHealth2018-task 1: ICD10 Code Extraction from Death Certificates. CLEF 2018 Online Working Notes. CEUR-WS
17. Cossin S, Jouhet V, Mougin F, Diallo G, and Thiessard F (2018). IAM at CLEF eHealth 2018 : Concept Annotation and Coding in French Death Certificates. CLEF 2018 Online Working Notes. CEUR-WS
18. Di Nunzio GM (2018). Classification of ICD10 Codes with no Resources but Reproducible Code. IMS Unipd at CLEF eHealth Task 1. CLEF 2018 Online Working Notes. CEUR-WS
19. Atutxa A, Casillas A, Ezeiza N, Goenaga I, Fresno V, Gojenola K, Martinez R, Oronoz M and Perez-de-Viñaspre O (2018). IxaMed at CLEF eHealth 2018 Task 1: ICD10 Coding with a Sequence-to-Sequence approach. CLEF 2018 Online Working Notes. CEUR-WS
20. Ive J, Viani N, Chandran D, Bittar A, and Velupillai S (2018). KCL-Health-NLP@CLEF eHealth 2018 Task 1: ICD-10 Coding of French and Italian Death Certificates with Character-Level Convolutional Neural Networks. CLEF 2018 Online Working Notes. CEUR-WS
21. Almagro M, Montalvo S, Diaz de Ilarraza A, and Pérez A (2018). LSI UNED at CLEF eHealth 2018: A Combination of Information Retrieval Techniques and Neural Networks for ICD-10 Coding of Death Certificates. CLEF 2018 Online Working Notes. CEUR-WS
22. Gobeill J and Ruch P (2018). Instance-based learning for ICD10 categorization. CLEF 2018 Online Working Notes. CEUR-WS
23. Lopez-Úbeda P, Diaz-Galiano MC, Martin-Valdivia MT, and Ureña-López LA (2018). Machine learning to detect ICD10 codes in causes of death. CLEF 2018 Online Working Notes. CEUR-WS
24. Réby K, Cossin S, Bordea G, and Diallo G (2018). SITIS-ISPED in CLEF eHealth 2018 Task 1: ICD10 coding using Deep Learning. CLEF 2018 Online Working Notes. CEUR-WS
25. Bounaama R and El Amine Abderrahim M (2018). Tlemcen University at CELF eHealth 2018 Team techno: Multilingual Information Extraction - ICD10 coding. CLEF 2018 Online Working Notes. CEUR-WS
26. Jeblee S, Budhkar A, Milić S, Pinto J, Pou-Prom C, Vishnubhotla K, Hirst G, and Rudzicz F (2018). TorontoCL at the CLEF 2018 eHealth Challenge Task 1. CLEF 2018 Online Working Notes. CEUR-WS
27. Zweigenbaum P and Lavergne T (2017). Multiple methods for multi-class, multi-label ICD-10 coding of multi-granularity, multilingual death certificates. CLEF 2017 Online Working Notes. CEUR-WS