Classification of ICD10 Codes with no Resources but Reproducible Code. IMS Unipd at CLEF eHealth Task 1 Giorgio Maria Di Nunzio Dept. of Information Engineering – University of Padua giorgiomaria.dinunzio@unipd.it Abstract. In this paper, we describe the second participation of the Information Management Systems (IMS) group at CLEF eHealth 2018 Task 1. In this task, participants are required to extract causes of death from multilingual death reports (French, Hungarian and Italian) and label them with the correct International Classification Diseases (ICD10) code. We tackled this task by focusing on the reproducible code, that we published last year, which produces a clean dataset that can be used to implement more sophisticated approaches. 1 Introduction In this paper, we report the experimental results of the second participation of the IMS group to the CLEF eHealth Lab [5], in particular to Task 1: “Multilin- gual Information Extraction - ICD10 coding” [2]. This task consists in automat- ically labelling death certificates written in different languages (French, Hungar- ian, and Italian) with International Classification Diseases (ICD10) codes. The main goal of our participation to the task this year was to test the effectiveness of the reproducible code made available by [3] which builds a clas- sification system that i) converts raw data into a cleaned dataset following a ‘tidyverse’ approach1 , ii) implements a set of manual rules to split sentences and translate medical acronyms, and iii) implement a lexicon based classification approach [1]. The contribution of our experiments to this task can be summarized as fol- lows: – A study of a reproducibility framework to explain each step of the pipeline from raw data to cleaned data; – An evaluation of the application of a classification system prepared for a language (French) and applied without any additional training or changes to the source code to two different languages (Hungarian and Italian). We submitted three official runs, one for each language and prepared a num- ber of additional unofficial runs that we will evaluate and compare in order to study the change in performance when adding more information in the pipeline. 1 https://www.tidyverse.org Table 1. Expressions in French or punctuation marks used to split a line of a death certificate. French avec sur par suite à un[e] dans un contexte de après “,”, “;”, “/” 2 Method In this section, we summarize the pipeline used in [3] that has been reproduced in this work for each run. 2.1 Pipeline for Data Cleaning In order to produce a clean dataset, we followed the same pipeline for data ingestion and preparation for all the experiments: – read a line of a death certificate, – split the line according to the expression listed in Table 1; – remove extra white space (leading, trailing, internal); – transform letters to lower case; – remove punctuation; – expand acronyms (if any); – correct common patterns (if any). Acronym Expansion Acronym expansion is a crucial step to normalize data and make the death certificate clearer and more coherent with the ICD10 codes. For the French experiments, we used. the original list of 1179 acronyms prepared by a semi-automated approach by [3]. We show the first ten acronym expansions in Table 2. We want to stress the fact that this particular implementation of the expansion selects, in those cases where there is more than once choice (for example “aa”), only the first choice. This is part of our current work in order to improve this step of the pipeline. 2.2 Classification We used a simple unsupervised lexicon based approach to label each (segment of a) line of a death certificate [1]. The procedure to assign an ICD10 code that does not require any training is the following: Table 2. Acronym table (fist 10 rows) used to expand acronyms. acronym expansion 5-hiaa acide 5-hydroxyindolactique 5-ht 5-hydroxytryptamine 5-ht srotonine a1at alpha-1-antitrypsine a1at a1-antitrypsine aa aorte ascendante aa affection actuelle aa acide amin aa antiarthrosique aaa anvrisme de l’aorte abdominale Table 3. Example of classification of a line of a certificate. The definition of the ICD10 labels are shown in Table 4 step data line pneumopathie infectieuse lobaire inférieure droite terms pneumopathie, infectieuse, lobaire, inferieure, droite ICD10 scores J181 = 7, J13 = 1 – for each term in the (segment of a) line, sum one for each ICD10 label that contains the term, – for each (segment of a) line compute the score of each ICD10 label; – group the ICD10 labels that have the maximum score; – assign the most frequent code within this group. The score of each label is the sum of the binary weights. In those cases where two or more classes have the same number of entries with the maximum score, the first class in the list is assigned by default. This is another part of the pipeline that requires more effort in order to improve the effectiveness of the classifier. In Table 3, we show an example of the first three steps, while in Table 4 the definition of the ICD10 codes that received the highest score. 3 Experiments and Results We submitted three official runs, one for each language: French, Hungarian, and Italian. The idea of these experiments was to test the effectiveness of the original French ICD10 classifier on two new languages without any modification to the source code. That is, acronym expansion and sentence splitting are done using French resources. We used only the raw dataset for all the languages. Table 4. Example of definitions (translitterated) of ICD10 selected in Table 3 ICD10 definition J13 pneumopathie franche lobaire inferieure J181 pneumopathie commune lobaire inferieure J181 pneumopathie infectieuse lobaire aigue J181 pneumopathie infectieuse lobaire moyenne J181 pneumopathie infectieuse lobaire superieure J181 pneumopathie lobaire inferieure J181 pneumopathie lobaire inferieure aigue J181 pneumopathie lobaire inferieure bilaterale Table 5. Results for the official runs French Hungarian Italian Pre Rec F-1 Pre Rec F-1 Pre Rec F-1 runs 0.6534 0.3963 0.4933 0.7609 0.7482 0.7545 0.5353 0.4844 0.5086 baseline 0.3410 0.2005 0.2525 0.2425 0.1735 0.2023 0.1648 0.1723 0.1685 average 0.7228 0.4102 0.5066 0.8266 0.7830 0.8025 0.8441 0.7606 0.7992 median 0.7981 0.4750 0.5790 0.9221 0.8972 0.9095 0.8995 0.8239 0.8630 3.1 Official Runs The results of the three experiments are shown in Table 5. The French run performed sufficiently well, and comparable to the results presented in [3]. The F1 measure is close to the average of the results of all the participants in this task. This confirms that a solid clean dataset is a good starting point to build a classifier, even a simple classifier like the one we implemented. The Hungarian and Italian results are, as we expected, worse than the average scores (much worse for Italian). However, it seems that the Hungarian dataset was in a sense “easier” compared to the our results of our experiments in the Italian subtask. We are going to investigate the reasons for this large difference in performance as future work. Another interesting fact is that, while for the French task Precision was much higher than Recall, for the Hungarian and Italian dataset these two measures seem more “balanced”. This may suggest that a better acronym expansion and better sentence splitting may favour Precision over Recall. 3.2 Unofficial Runs As part of current and future work, we have prepared a set of unofficial runs. A first set of runs study the effect of an alternative weighting scheme, tf-idf instead of binary weighting, another set of runs (for Hungarian and Italian) explore the effectiveness of splitting the sentence with the correct words, see Table 6, as well as expand acronym with the appropriate language. More runs will be created Table 6. Expressions in Hungarian and Italian or punctuation marks used to split a line of a death certificate. Hungarian Italian a con tovább a causa di által per után a a seguito di összefüggésben conseguentemente a után dopo “,”, “;”, “/” “,”, “;”, “/” Table 7. Results for the unofficial runs, tf-idf vs binary weighting Hungarian Italian Pre Rec F-1 Pre Rec F-1 official 0.7609 0.7482 0.7545 0.5353 0.4844 0.5086 official + tf-idf 0.6870 0.6720 0.6794 0.3642 0.3195 0.3404 binary 0.7559 0.7478 0.7519 0.5353 0.4878 0.5104 binary w/o acronym exp 0.7729 0.7652 0.7690 0.5495 0.5061 0.5269 with additional parameters concerning the multiple label assignment and a better acronym expansion algorithm. At present time, we have been able to evaluate the effectiveness of some combinations of these parameters. In particular, we tested the binary weighting approach vs the tf-idf approach, using the original French source code (‘inap- propriate’ acronyms and sentence splitting), results are shown in the first two lines of Table 7. These results confirms that for Hungarian and Italian the binary weighting approach performs better than tf-idf (the only language that showed some improvement in this task with the tf-idf weights was English [3]) Then, we performed an experiment with binary weights and a ‘correct’ sen- tence splitting (see Table 6) with or without the French acronym expansion. Results are shown in the last two rows of Table 7. The fact that we used a lan- guage specific sentence splitting did not produce any significant change in the performance of the classifier. This is probably due to the fact that the Hungarian and Italian death certificates are much more structured (from a language stand- point) than French ones. For example, we could rarely find complex sentences with words or terms listed in Table 6 in the Italian certificates. It seems that punctuation marks work sufficiently well for these two languages. Moreover, by removing the French acronym expansion, we obtained a slight improvement due to the fact that we removed the noise introduced by a module in the pipeline (the acronym expansion). In this case, results are better in terms of both Precision and Recall compared to the official runs. 4 Final remarks and Future Work The aim of our second participation to the CLEF eHealth Task 1 was to test the reproducibility of the source code of the lexicon based classifier that was implemented the previous year. The performance of the French run was good and we consider to use it as a baseline to build a new and improved classifier. The application of this classifier to two different language gave interesting results: the results of the Hungarian run was surprisingly high and close to the average of the results of the participant. However, the high value of the median of F1 (close to 90%) suggests that this subtask may be easier than the French one. For the Italian run, we obtained a worse performance the reasons of which we will investigate in a failure analysis. As current and future work, we are studying – the adaptation of the pipeline to the two new languages (better split sentence and acronym expansion [4]); – the possibility to include multiple acronym expansions; – how to assign multiple labels to the same line (when scores are tied). References 1. Jacob Eisenstein. Unsupervised learning for lexicon-based classification. In Proceed- ings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pages 3188–3194, 2017. 2. A. Névéol, Robert A., F. Grippo, C. Morgand, C. Orsi, L. Pelikán, L. Ramadier, G. Rey, and P. Zweigenbaum. Clef ehealth 2018 multilingual information extraction task overview: Icd10 coding of death certificates in french, hungarian and italian. In CLEF 2018 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS.org, September 2018. 3. Giorgio Maria Di Nunzio, Federica Beghini, Federica Vezzani, and Geneviève Hen- rot. A lexicon based approach to classification of ICD10 codes. IMS unipd at CLEF ehealth task 1. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017., 2017. 4. Borbála Siklósi and Attila Novák. Detection and expansion of abbreviations in hun- garian clinical notes. In Félix Castro, Alexander Gelbukh, and Miguel González, ed- itors, Advances in Artificial Intelligence and Its Applications, pages 318–328, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. 5. Hanna Suominen, Liadh Kelly, Lorraine Goeuriot, Evangelos Kanoulas, Leif Az- zopardi, Rene Spijker, Dan Li, Aurélie Névéol, Lionel Ramadier, Aude Robert, Guido Zuccon, and Joao Palotti, editors. Overview of the CLEF eHealth Evalu- ation Lab 2018. CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science. Springer, September 2018.