
Classification of ICD10 Codes with no Resources but Reproducible Code. IMS Unipd at CLEF eHealth Task 1

Giorgio Maria Di Nunzio

giorgiomaria.dinunzio@unipd.it
Dept. of Information Engineering, University of Padua

In this paper, we describe the second participation of the Information Management Systems (IMS) group at CLEF eHealth 2018 Task 1. In this task, participants are required to extract causes of death from multilingual death reports (French, Hungarian and Italian) and label them with the correct International Classification of Diseases (ICD10) code. We tackled this task by building on the reproducible code that we published last year, which produces a clean dataset that can be used to implement more sophisticated approaches.

Introduction

In this paper, we report the experimental results of the second participation of the IMS group in the CLEF eHealth Lab [5], in particular in Task 1: "Multilingual Information Extraction - ICD10 coding" [2]. This task consists of automatically labelling death certificates written in different languages (French, Hungarian, and Italian) with International Classification of Diseases (ICD10) codes.

The main goal of our participation in the task this year was to test the effectiveness of the reproducible code made available by [3], which builds a classification system that i) converts raw data into a cleaned dataset following a 'tidyverse' approach (footnote 1), ii) implements a set of manual rules to split sentences and translate medical acronyms, and iii) implements a lexicon based classification approach [1].

The contribution of our experiments to this task can be summarized as follows:
- a study of a reproducibility framework to explain each step of the pipeline from raw data to cleaned data;
- an evaluation of a classification system prepared for one language (French) and applied, without any additional training or changes to the source code, to two different languages (Hungarian and Italian).

We submitted three official runs, one for each language, and prepared a number of additional unofficial runs that we will evaluate and compare in order to study the change in performance when more information is added to the pipeline.

Footnote 1: https://www.tidyverse.org

Table 1. Expressions used to split a line (French): avec, sur, par, suite a un[e], dans un contexte de, apres, ",", ";", "/"

Method

In this section, we summarize the pipeline used in [3] that has been reproduced in this work for each run.

Pipeline for Data Cleaning

In order to produce a clean dataset, we followed the same pipeline for data ingestion and preparation for all the experiments:
- read a line of a death certificate;
- split the line according to the expressions listed in Table 1;
- remove extra white space (leading, trailing, internal);
- transform letters to lower case;
- remove punctuation;
- expand acronyms (if any);
- correct common patterns (if any).
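The first five steps of this pipeline can be illustrated with a minimal Python sketch. It is not the code of [3]: the function name `clean_line`, the regular expressions, and the choice to keep hyphens are assumptions made for illustration, and only a subset of the Table 1 expressions is used. Acronym expansion and pattern correction are left out.

```python
import re

# A subset of the French split expressions of Table 1 (illustrative choice).
SPLIT_EXPRESSIONS = ["dans un contexte de", "avec", "apres"]
SPLIT_PUNCTUATION = ",;/"

# Split on an expression surrounded by white space, or on a punctuation mark.
_SPLITTER = re.compile(
    r"\s+(?:" + "|".join(re.escape(e) for e in SPLIT_EXPRESSIONS) + r")\s+"
    + "|[" + re.escape(SPLIT_PUNCTUATION) + "]",
    flags=re.IGNORECASE,
)


def clean_line(line):
    """Split one certificate line into cleaned, lower-case segments."""
    cleaned = []
    for segment in _SPLITTER.split(line):
        segment = " ".join(segment.split())          # extra white space
        segment = segment.lower()                    # lower case
        # Assumption: punctuation becomes a space; hyphens are kept so that
        # tokens such as "5-ht" survive for a later acronym expansion step.
        segment = re.sub(r"[^\w\s-]", " ", segment)
        segment = " ".join(segment.split())          # collapse new gaps
        if segment:
            cleaned.append(segment)
    return cleaned


print(clean_line("Pneumopathie  infectieuse, avec Sepsis ; choc septique"))
# ['pneumopathie infectieuse', 'sepsis', 'choc septique']
```

White space is normalized a second time after punctuation removal so that the gaps left by removed characters are collapsed as well.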

Acronym Expansion. Acronym expansion is a crucial step to normalize data and make the death certificate clearer and more coherent with the ICD10 codes. For the French experiments, we used the original list of 1179 acronyms prepared with a semi-automated approach by [3].

We show the first ten acronym expansions in Table 2. We want to stress that this particular implementation of the expansion selects only the first choice in those cases where there is more than one choice (for example "aa"). Improving this step of the pipeline is part of our current work.
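The "first choice only" behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of [3]: the function names are hypothetical, the pairs are a handful of rows from Table 2, and expansion is applied token by token on white space.

```python
# (acronym, expansion) pairs in list order; a few rows from Table 2.
ACRONYM_PAIRS = [
    ("a1at", "alpha-1-antitrypsine"),
    ("a1at", "a1-antitrypsine"),
    ("aa", "aorte ascendante"),
    ("aa", "affection actuelle"),
    ("aaa", "anévrisme de l'aorte abdominale"),
]


def build_first_choice_table(pairs):
    """Keep only the first expansion listed for each acronym."""
    table = {}
    for acronym, expansion in pairs:
        table.setdefault(acronym, expansion)  # later duplicates are ignored
    return table


def expand_acronyms(segment, table):
    """Replace every token that is a known acronym with its expansion."""
    return " ".join(table.get(token, token) for token in segment.split())


table = build_first_choice_table(ACRONYM_PAIRS)
print(expand_acronyms("dissection aa", table))
# dissection aorte ascendante
```

With this strategy "aa" is always read as "aorte ascendante", even when the context calls for one of the other listed expansions, which is the limitation noted above.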

Classi cation

We used a simple unsupervised lexicon based approach to label each (segment of a) line of a death certificate [1]. The procedure to assign an ICD10 code, which does not require any training, is the following:

- for each term in the (segment of a) line, add one to the score of each ICD10 label that contains the term;
- for each (segment of a) line, compute the score of each ICD10 label;
- group the ICD10 labels that have the maximum score;
- assign the most frequent code within this group.

Table 2. First ten acronym expansions.

acronym   expansion
5-hiaa    acide 5-hydroxyindolacétique
5-ht      5-hydroxytryptamine
5-ht      sérotonine
a1at      alpha-1-antitrypsine
a1at      a1-antitrypsine
aa        aorte ascendante
aa        affection actuelle
aa        acide aminé
aa        antiarthrosique
aaa       anévrisme de l'aorte abdominale

Table 3. Example of the first three steps of the classification.

step           data
line           pneumopathie infectieuse lobaire inferieure droite
terms          pneumopathie, infectieuse, lobaire, inferieure, droite
ICD10 scores   J181 = 7, J13 = 1

The score of each label is the sum of the binary weights. In those cases where two or more classes have the same number of entries with the maximum score, the first class in the list is assigned by default. This is another part of the pipeline that requires more effort in order to improve the effectiveness of the classifier. In Table 3, we show an example of the first three steps, while in Table 4 we show the definitions of the ICD10 codes that received the highest score.
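The scoring and tie-breaking procedure can be illustrated with the following Python sketch, which uses the eight Table 4 entries as a toy dictionary. It is a reading of the description above rather than the code of [3]: the function names are hypothetical, and a real run would score the line against the full ICD10 dictionary.

```python
from collections import Counter

# Toy dictionary: the (ICD10 code, definition) entries of Table 4.
DICTIONARY = [
    ("J13", "pneumopathie franche lobaire inferieure"),
    ("J181", "pneumopathie commune lobaire inferieure"),
    ("J181", "pneumopathie infectieuse lobaire aigue"),
    ("J181", "pneumopathie infectieuse lobaire moyenne"),
    ("J181", "pneumopathie infectieuse lobaire superieure"),
    ("J181", "pneumopathie lobaire inferieure"),
    ("J181", "pneumopathie lobaire inferieure aigue"),
    ("J181", "pneumopathie lobaire inferieure bilaterale"),
]


def top_codes(segment, dictionary):
    """Count, per ICD10 code, the entries that reach the maximum score."""
    terms = segment.split()
    scores = []
    for code, definition in dictionary:
        definition_terms = set(definition.split())
        # Binary weights: add one for each term of the line found in the entry.
        scores.append((code, sum(1 for t in terms if t in definition_terms)))
    best = max((score for _, score in scores), default=0)
    if best == 0:
        return Counter()
    return Counter(code for code, score in scores if score == best)


def classify(segment, dictionary):
    """Assign the most frequent code among the top-scoring entries."""
    counts = top_codes(segment, dictionary)
    if not counts:
        return None
    # most_common keeps first-encountered order for equal counts, so a tie
    # is resolved in favour of the first class in the list.
    return counts.most_common(1)[0][0]


line = "pneumopathie infectieuse lobaire inferieure droite"
print(dict(top_codes(line, DICTIONARY)))  # {'J13': 1, 'J181': 7}
print(classify(line, DICTIONARY))         # J181
```

Every entry of the toy dictionary shares three terms with the line, so all eight entries reach the maximum score; J181 accounts for seven of them and J13 for one, which reproduces the last row of Table 3.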

Experiments and Results

We submitted three official runs, one for each language: French, Hungarian, and Italian. The idea of these experiments was to test the effectiveness of the original French ICD10 classifier on two new languages without any modification to the source code. That is, acronym expansion and sentence splitting are done using French resources. We used only the raw dataset for all the languages.

Table 4. Definitions of the ICD10 codes that received the highest score.

ICD10   definition
J13     pneumopathie franche lobaire inferieure
J181    pneumopathie commune lobaire inferieure
J181    pneumopathie infectieuse lobaire aigue
J181    pneumopathie infectieuse lobaire moyenne
J181    pneumopathie infectieuse lobaire superieure
J181    pneumopathie lobaire inferieure
J181    pneumopathie lobaire inferieure aigue
J181    pneumopathie lobaire inferieure bilaterale

The results of the three experiments are shown in Table 5. The French run performed sufficiently well, with results comparable to those presented in [3]. The F1 measure is close to the average of the results of all the participants in this task. This confirms that a solid clean dataset is a good starting point to build a classifier, even a simple classifier like the one we implemented.

The Hungarian and Italian results are, as we expected, worse than the average scores (much worse for Italian). However, the Hungarian dataset seems to have been in a sense "easier" than the Italian one, judging from our results in the two subtasks. We are going to investigate the reasons for this large difference in performance as future work. Another interesting fact is that, while for the French task Precision was much higher than Recall, for the Hungarian and Italian datasets these two measures seem more "balanced". This may suggest that a better acronym expansion and better sentence splitting may favour Precision over Recall.

Unofficial Runs

As part of current and future work, we have prepared a set of unofficial runs. A first set of runs studies the effect of an alternative weighting scheme, tf-idf instead of binary weighting. Another set of runs (for Hungarian and Italian) explores the effectiveness of splitting the sentence with the correct words (see Table 6) and of expanding acronyms with the appropriate language. More runs will be created with additional parameters concerning the multiple label assignment and a better acronym expansion algorithm.

At present, we have been able to evaluate the effectiveness of some combinations of these parameters. In particular, we tested the binary weighting approach against the tf-idf approach using the original French source code ('inappropriate' acronyms and sentence splitting); the results are shown in the first two rows of Table 7. These results confirm that for Hungarian and Italian the binary weighting approach performs better than tf-idf (the only language that showed some improvement in this task with the tf-idf weights was English [3]).
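To make the difference between the two weighting schemes concrete, the sketch below scores the example line of Table 3 against the Table 4 entries with both. The text does not specify the tf-idf variant that was used, so the formulation here is an assumption: each dictionary entry is treated as a document, term frequency is taken as 1, and idf(t) = log(N / df(t)). All names are hypothetical.

```python
import math
from collections import Counter

# Toy dictionary: the (ICD10 code, definition) entries of Table 4.
DICTIONARY = [
    ("J13", "pneumopathie franche lobaire inferieure"),
    ("J181", "pneumopathie commune lobaire inferieure"),
    ("J181", "pneumopathie infectieuse lobaire aigue"),
    ("J181", "pneumopathie infectieuse lobaire moyenne"),
    ("J181", "pneumopathie infectieuse lobaire superieure"),
    ("J181", "pneumopathie lobaire inferieure"),
    ("J181", "pneumopathie lobaire inferieure aigue"),
    ("J181", "pneumopathie lobaire inferieure bilaterale"),
]


def idf_weights(dictionary):
    """Assumed idf: log(N / df), with one document per dictionary entry."""
    n = len(dictionary)
    df = Counter()
    for _, definition in dictionary:
        df.update(set(definition.split()))
    return {term: math.log(n / count) for term, count in df.items()}


def top_codes(segment, dictionary, weights=None):
    """Count the top-scoring entries per code (binary if weights is None)."""
    terms = segment.split()
    scores = []
    for code, definition in dictionary:
        definition_terms = set(definition.split())
        score = sum(
            (1.0 if weights is None else weights[t])
            for t in terms
            if t in definition_terms
        )
        scores.append((code, score))
    best = max((score for _, score in scores), default=0.0)
    if best <= 0:
        return Counter()
    return Counter(c for c, s in scores if math.isclose(s, best))


line = "pneumopathie infectieuse lobaire inferieure droite"
print(dict(top_codes(line, DICTIONARY)))                           # {'J13': 1, 'J181': 7}
print(dict(top_codes(line, DICTIONARY, idf_weights(DICTIONARY))))  # {'J181': 3}
```

Under this assumed weighting, terms that occur in every entry ("pneumopathie", "lobaire") contribute nothing, so the rarer term "infectieuse" decides the ranking and only three J181 entries remain in the top group. Both schemes select J181 on this toy example; the sketch does not explain why binary weights performed better in the runs reported above.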

Then, we performed an experiment with binary weights and a 'correct' sentence splitting (see Table 6), with or without the French acronym expansion. Results are shown in the last two rows of Table 7. The use of a language-specific sentence splitting did not produce any significant change in the performance of the classifier. This is probably due to the fact that the Hungarian and Italian death certificates are much more structured (from a language standpoint) than the French ones. For example, we could rarely find complex sentences with words or terms listed in Table 6 in the Italian certificates. It seems that punctuation marks work sufficiently well for these two languages. Moreover, by removing the French acronym expansion, we obtained a slight improvement, because we removed the noise introduced by a module in the pipeline (the acronym expansion). In this case, results are better in terms of both Precision and Recall compared to the official runs.

Final remarks and Future Work

The aim of our second participation in the CLEF eHealth Task 1 was to test the reproducibility of the source code of the lexicon based classifier that was implemented the previous year. The performance of the French run was good, and we plan to use it as a baseline to build a new and improved classifier. The application of this classifier to two different languages gave interesting results: the result of the Hungarian run was surprisingly high and close to the average of the results of the participants. However, the high value of the median of F1 (close to 90%) suggests that this subtask may be easier than the French one. For the Italian run, we obtained a worse performance, the reasons for which we will investigate in a failure analysis.

As current and future work, we are studying:
- the adaptation of the pipeline to the two new languages (better sentence splitting and acronym expansion [4]);
- the possibility to include multiple acronym expansions;
- how to assign multiple labels to the same line (when scores are tied).

[1] Jacob Eisenstein. Unsupervised learning for lexicon-based classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3188-3194, 2017.

[2] Neveol, Robert A., Grippo, Morgand, Orsi, Pelikan, Ramadier, G. Rey, and Zweigenbaum. CLEF eHealth 2018 multilingual information extraction task overview: ICD10 coding of death certificates in French, Hungarian and Italian. In CLEF 2018 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS.org, September 2018.

[3] Giorgio Maria Di Nunzio, Federica Beghini, Federica Vezzani, and Genevieve Henrot. A lexicon based approach to classification of ICD10 codes. IMS Unipd at CLEF eHealth Task 1. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017, 2017.

[4] Borbala Siklosi and Attila Novak. Detection and expansion of abbreviations in Hungarian clinical notes. In Felix Castro, Alexander Gelbukh, and Miguel Gonzalez, editors, Advances in Artificial Intelligence and Its Applications, pages 318-328, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[5] Hanna Suominen, Liadh Kelly, Lorraine Goeuriot, Evangelos Kanoulas, Leif Azzopardi, Rene Spijker, Dan Li, Aurelie Neveol, Lionel Ramadier, Aude Robert, Guido Zuccon, and Joao Palotti, editors. Overview of the CLEF eHealth Evaluation Lab 2018. CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science. Springer, September 2018.