<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Lexicon Based Approach to Classification of ICD10 Codes. IMS Unipd at CLEF eHealth Task 1</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>giorgiomaria.dinunzio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Beghini</string-name>
          <email>fede.beghini92@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Vezzani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Genevieve Henrot</string-name>
          <email>genevieve.henrot@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Linguistic and Literary Studies</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe the participation of the Information Management Systems (IMS) group at CLEF eHealth 2017 Task 1. In this task, participants are required to extract causes of death from death reports (in French and in English) and label them with the correct International Classification of Diseases (ICD10) code. We tackled this task by focusing on the replicability and reproducibility of the experiments and, in particular, on building a basic compact system that produces a clean dataset that can be used to implement more sophisticated approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we report the experimental results of the IMS group, which
participated for the first time in the CLEF eHealth Lab [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in particular in Task 1:
"Multilingual Information Extraction - ICD10 coding" [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This task consists
in labelling death certificate texts written in English or in French with
International Classification of Diseases (ICD10) codes. This work is usually performed by
experts in medicine; however, when large volumes of data need to be organized
and labelled, manual work is not only expensive but also time consuming, and it is
probably not feasible when hundreds of thousands of death certificates need to
be classified according to a taxonomy of thousands of codes. For this reason, a
possible solution is to approach this task from a machine learning
perspective and/or a natural language processing perspective by using syntactic and/or
semantic decision rules [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The main goal of our participation in this task was to build a reproducible
set of experiments for a system that i) converts raw data into a cleaned dataset,
ii) implements a set of manual rules to split sentences and translate medical
acronyms, and iii) implements a lexicon based classification approach with the
aim of building a sufficiently strong baseline (our initial objective was to achieve
a classifier with precision and recall equal to 0.5). We intentionally did not make
use of any machine learning approach to improve the accuracy of the
classification of death certificates; in fact, the main objective was to build a modular
system that can be easily enhanced in order to make use of the cleaned training
data available. For this purpose, we devised a pipeline for processing each death
certificate and producing a 'normalized' version of the text. Indeed, death
certificates are standardized documents filled in by physicians to report the death of
a patient, but the content of each document contains heterogeneous and noisy
data that participants had to deal with [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For example, some certificates
contain non-diacritized text, a mix of cases and diacritized text, acronyms and/or
abbreviations, and so on.
      </p>
      <p>The main points of our contribution to this task can be summarized as
follows:
- a reproducibility framework to explain each step of the pipeline from raw
data to cleaned data;
- a minimal expert system based on rules to split sentences and translate
acronyms;
- experiments with different weighting approaches to retrieve the items in the
dictionary most similar to the portion of the death certificate;
- a simple classification approach to select the ICD code with the highest
weight.</p>
      <p>For this task, we submitted 2 official English runs plus 3 unofficial English
runs and 8 unofficial French runs.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>In this section, we describe the main aspects of our contribution: the software
used to build the reproducibility framework, the data cleaning pipeline, and the
classification approach.</p>
      <sec id="sec-2-1">
        <title>R Markdown for Reproducible Research</title>
        <p>
          The problem of reproducibility in Information Retrieval has been addressed by
many researchers in the field in recent years [
          <xref ref-type="bibr" rid="ref12 ref4 ref6">6, 4, 12</xref>
          ]. The main concerns for
reproducibility in IR are related to system runs; in fact, even if a researcher uses
the same datasets and the same open source software, there are many hidden
parameters that make the full reproducibility of an experiment very difficult. For
this reason, there are important initiatives in the main IR conferences that
support this kind of activity (see for example the open source information retrieval
reproducibility challenge at SIGIR, https://github.com/lintool/IR-Reproducibility,
or the Reproducibility track at ECIR [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]) as
well as in the Natural Language Processing community [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          During the same time span, the Data Science community has questioned the
same issues (see http://www.nature.com/news/reproducibility-1.17552) and has
produced interesting solutions from a software point of view. The R Markdown
framework (http://rmarkdown.rstudio.com) is now considered one of the possible
solutions for documenting the results of an experiment and, at the same time,
reproducing each step of the experiment itself. Following the indications given by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we
developed the experimental framework in R and published the source code on
GitHub (https://github.com/gmdn/CLEF-eHealth-Task-1) to allow other
participants to reproduce our results.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Pipeline for Data Cleaning</title>
        <p>In order to produce a clean dataset, we implemented the following pipeline for
data ingestion and preparation for all the experiments:
- read a line of a death certificate;
- split the line according to the expressions listed in Table 1;
- remove extra white space (leading, trailing, internal);
- transform letters to lower case;
- remove punctuation;
- expand acronyms (if any);
- correct common patterns (if any).</p>
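        <p>As a rough illustration, the pipeline above can be sketched in Python. The splitting expression and the acronym table below are small hypothetical stand-ins: the actual expressions are listed in Table 1, and the actual acronym tables are described in the following paragraphs.</p>
        <preformat>
```python
import re

# Hypothetical splitting expressions standing in for those listed in Table 1.
SPLITTER = re.compile(r"\b(?:due to|with)\b")

# Hypothetical acronym table; the real tables are gathered from the Web
# and curated manually (see below).
ACRONYMS = {"mi": "myocardial infarction"}

def clean_line(line: str) -> list[str]:
    """Apply the cleaning pipeline to one line of a death certificate."""
    cleaned = []
    for seg in SPLITTER.split(line):               # split the line
        seg = " ".join(seg.split())                # remove extra white space
        seg = seg.lower()                          # transform to lower case
        seg = re.sub(r"[^\w\s]", "", seg)          # remove punctuation
        seg = " ".join(ACRONYMS.get(t, t) for t in seg.split())  # expand acronyms
        cleaned.append(seg)
    return cleaned
```
        </preformat>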
        <p>Acronym Expansion. Acronym expansion is a crucial step to normalize the data
and make the death certificates clearer and more coherent with the ICD10 codes.
For the English experiments, we used a manual approach to build the list of
expanded acronyms and an automatic approach that gathers acronyms from the
Web. For the French experiments, we automatically created a list of expanded
medical acronyms available on Wikipedia and then manually cleaned the same
list.</p>
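        <p>A minimal sketch of how such a table can be normalized, under the assumption that the scraped list is a sequence of (acronym, expansion) pairs; the entries below are hypothetical examples, and one row is kept per distinct variant, as described later for the French list:</p>
        <preformat>
```python
from collections import OrderedDict

# Hypothetical raw entries as they might be scraped from a Web list:
# an acronym can have several expansions, and duplicates may occur.
raw_entries = [
    ("AVC", "accident vasculaire cérébral"),
    ("AVC", "accident vasculaire cérébral"),   # duplicate: dropped
    ("IDM", "infarctus du myocarde"),
    ("IDM", "insuffisance du myocarde"),       # second variant: kept on its own row
]

def build_acronym_table(entries):
    """Deduplicate (acronym, expansion) pairs, one row per distinct variant."""
    seen = OrderedDict()
    for acronym, expansion in entries:
        seen[(acronym.lower(), expansion.lower())] = None
    return list(seen)

table = build_acronym_table(raw_entries)
```
        </preformat>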
        <p>Indeed, the automatic creation of a list of acronyms gathered from the
Web presents some problems:
- sometimes acronyms have more than one expansion, some of which do not
belong to the medical field;
- some entries contain more than one language, for example the English and/or
French and/or Latin expanded acronym;
- some others contain spelling mistakes.</p>
        <p>In order to deal with these issues, we referred to the ICD10 dictionary code list,
which contains a list of diseases and causes of death, to other French dictionaries
(Larousse, http://www.larousse.fr/dictionnaires/francais-monolingue; Le Tresor de
la Langue Francaise Informatise, http://atilf.atilf.fr/tlfi.htm), and to some reliable
websites (http://www.cnci.univ-paris5.fr/medecine/abreviations.html and
http://dictionnaire.doctissimo.fr/).</p>
        <p>Moreover, we removed the wrong definitions and the acronym expansions
written in English and in Latin, and we corrected the spelling mistakes
concerning some of the accents (especially on the grapheme &lt;e&gt;) and some typos
(e.g. "isoniazide" instead of "izoniazide"). Additionally, there were some variants that
differed only in the hyphen, e.g. broncho-pulmonaire/bronchopulmonaire,
antiagregant plaquettaire/anti-agregant plaquettaire. In these cases, we chose the
definition present in the ICD10 dictionary and, if both variants were present, we
entered the one with more occurrences on the Web.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Classification</title>
        <p>
          We used a simple unsupervised lexicon based approach to label each (segment
of a) line of a death certificate [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The procedure to assign an ICD10 code,
which does not require any training, is the following:
- for each (segment of a) line, compute the score of each entry of the dictionary;
- group the ICD10 codes that have the maximum score;
- assign the most frequent code within this group.
        </p>
        <p>
          The score of each entry is the sum of the weights of its terms, computed with either
binary weighting (term present or absent) or a term frequency - inverse document
frequency (Tf-Idf) approach [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In those cases where two or more classes have
the same number of entries with the maximum score, the first class in the list is
assigned by default.
        </p>
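        <p>The scoring and tie-breaking procedure can be sketched as follows. The dictionary entries and codes below are hypothetical, and the Tf-Idf variant uses a simple log(N/df) inverse document frequency as one plausible reading of the standard formulation; this is an illustration, not the exact implementation of our runs.</p>
        <preformat>
```python
import math
from collections import Counter

# Hypothetical dictionary: text of each entry with its ICD10 code.
DICTIONARY = [
    ("myocardial infarction", "I21"),
    ("acute myocardial infarction", "I21"),
    ("cerebral infarction", "I63"),
]

N = len(DICTIONARY)
# document frequency of each term across dictionary entries
df = Counter(t for text, _ in DICTIONARY for t in set(text.split()))

def score(segment: str, entry_text: str, weighting: str = "binary") -> float:
    """Sum the weights of the entry terms that occur in the segment."""
    seg_terms = set(segment.split())
    total = 0.0
    for term in entry_text.split():
        if term in seg_terms:
            if weighting == "binary":
                total += 1.0          # term present
            else:
                total += math.log(N / df[term])  # Tf is 1 in a short entry
    return total

def classify(segment: str, weighting: str = "binary") -> str:
    """Group top-scoring entries and assign the most frequent code among them."""
    scores = [(score(segment, text, weighting), code) for text, code in DICTIONARY]
    best = max(s for s, _ in scores)
    candidates = [code for s, code in scores if s == best]
    return Counter(candidates).most_common(1)[0][0]
```
        </preformat>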
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>In our experiments, we implemented:
1. a minimal expert system based on rules to translate acronyms, together with
2. a binary weighting approach or a Tf-Idf approach to retrieve the items in
the dictionary most similar to the portion of the death certificate, and
3. a lexicon based classification approach that selects the most frequent class
with the highest weight.</p>
      <p>We submitted two official runs for the English raw dataset. Then, we
submitted 3 unofficial English runs and 8 unofficial French runs (four for the raw
dataset and four for the aligned dataset).
For the two official English runs, we pre-processed the raw dataset in the
following way:
1. Read the first three fields of the American dictionary (DiagnosisText, Icd1,
Icd2, Icd3) and skip lines from 69328 to 69332, since there were some problems
with the data format, as shown below:
...</p>
      <p>LATE EFFECTS TRAUMATIC DUODENAL HEMATOMA;CTS TRAUMATIC ...
LATE EFFECTS TRAUMATIC DUODENUM HEMORRHAGE;FECTS TRAUMATIC ...
LATE EFFECTS TRAUMATIC ELBOW HEMATOMA; TRAUMATIC ELBOW HEMORRHAGE; ...
LATE EFFECTS TRAUMATIC EMPHYSEMATOUS BULLOUS DISEASE;;
LATE EFFECTS TRAUMATIC EMPHYSEMATOUS LUNG BLEB;
...
2. Index the dictionary using either binary weights or Tf-Idf weights;
3. Build a test run by reading (and cleaning) the causes brutes file and
- split the sentences according to the following set of patterns: "with", "due
to", "also due to", "that caused", "sec to", "on top of";
- expand each acronym using a table of manually curated acronyms;
4. Classify each line by assigning the ICD code with the highest score, if unique,
or the most frequent code if more than one code matches the line of the death
certificate.</p>
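      <p>The sentence splitting of step 3 can be sketched as a single alternation, with the longer patterns placed before their substrings so that, for example, "also due to" is matched before "due to"; this is a minimal sketch, not the exact code of our runs:</p>
      <preformat>
```python
import re

# The six patterns listed above, longest first so that "also due to"
# is not pre-empted by the shorter "due to".
PATTERNS = ["also due to", "due to", "that caused", "on top of", "sec to", "with"]
SPLITTER = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in PATTERNS) + r")\b")

def split_causes(line: str) -> list[str]:
    """Split one line of a death certificate into cause segments."""
    return [seg.strip() for seg in SPLITTER.split(line) if seg.strip()]
```
      </preformat>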
      <p>The acronyms were expanded by manually checking the acronyms
in the training data and building a table of expanded acronyms by means of the
Web page https://www.allacronyms.com/_medical.</p>
      <p>The results of the two runs, Unipd-run1 for the binary weighting approach
and Unipd-run2 for the Tf-Idf weighting approach, are reported in Table 2.</p>
      <p>The results of the binary weighting run were very close to our expectations,
that is, to classify correctly almost half of the ICD10 codes (both in terms of
Recall and Precision) by just cleaning and normalizing the data, without the
help of any expert in the field.</p>
      <p>The poor result of the Tf-Idf weighting approach on the second run was
unexpected. For this reason, we investigated this matter and, thanks to the
reproducibility approach, we were able to immediately spot two bugs in the
code: 1) we unintentionally selected the Tf weights instead of Tf-Idf during the
indexing phase; 2) more importantly, we made a mistake in the classification
code (step 4 in the above list) that prevented the algorithm from selecting the most
frequent code (it just assigned the first ICD code in the initial list of results).
For this reason, we decided to correct the code and submit a second version of
Tf-Idf as an unofficial run.</p>
      <sec id="sec-3-1">
        <title>Unofficial Runs</title>
        <p>We also submitted unofficial runs both for French and English with the same
original goal but a slightly different approach for the collection of acronyms and
the use of transliteration of French diacritics. In particular, we were interested in
automatically gathering medical acronyms from a Wikipedia page and manually
cleaning the table of expanded acronyms (removing, for example, duplicated entries,
entries with both the English and French versions, wrong diacritics, and so on).</p>
        <p>For the expansion of French acronyms, we used the Wikipedia page "Liste
d'abreviations en medecine" (https://fr.wikipedia.org/wiki/Liste_d%27abréviations_en_médecine),
which contains 1,059 acronyms. After a manual
cleaning of the broken/missing/duplicated entries, we produced a table of 1,179
expanded acronyms.</p>
        <p>The increase in the number of acronyms is due to the fact that the same
acronym often had several expansions relevant to the medical field. Indeed, we
decided to place each variant in a different row with the aim of providing a more
complete overview of medical terminology. Furthermore, we applied the same
procedure when two acronyms corresponded to the same expansion, keeping
both alternatives and positioning them in different rows. Finally, we decided to
remove the acronym expansions that were not relevant to the medical field.</p>
        <p>For the expansion of the English acronyms, we decided not to use the English
Wikipedia list of medical abbreviations, since it is much less informative
compared to the French version. Instead, we chose a public Web page that
contains 445 common medical abbreviations
(http://www.spinalcord.org/resource-center/askus/index.php?pg=kb.page&amp;id=1413).
For the English unofficial runs, we did not perform any manual corrections of the
table of expanded acronyms.</p>
        <p>The results for the unofficial English runs are reported in Table 3. The first
half of the table shows the results of the unofficial runs, while the second half
reports the official results for comparison.</p>
        <p>French Run Results. For the French dataset, we had to slightly change the
code that reads the aligned and the raw causes, since some lines (less than 1%
of the data) had some issues with the number of fields (more than expected)
and/or contained a semicolon in the death certificate (the semicolon being the
separating character of the fields). See the files available for the reproducibility
track for more details.</p>
        <p>A total of sixteen unofficial French runs were submitted: eight for the raw
dataset, eight for the aligned dataset. For each type of dataset we tried the
following settings:
- Unipd-run6 (raw), Unipd-run14 (aligned): binary weights, automatic
creation of expanded acronyms, without transliteration of diacritics;
- Unipd-run7 (raw), Unipd-run15 (aligned): binary weights, automatic
creation of expanded acronyms, with transliteration of diacritics;
- Unipd-run8 (raw), Unipd-run16 (aligned): binary weights, manually
curated expanded acronyms, without transliteration of diacritics;
- Unipd-run9 (raw), Unipd-run17 (aligned): binary weights, manually
curated expanded acronyms, with transliteration of diacritics;
- Unipd-run10 (raw), Unipd-run18 (aligned): Tf-Idf weights, automatic
creation of expanded acronyms, without transliteration of diacritics;
- Unipd-run11 (raw), Unipd-run19 (aligned): Tf-Idf weights, automatic
creation of expanded acronyms, with transliteration of diacritics;
- Unipd-run12 (raw), Unipd-run20 (aligned): Tf-Idf weights, manually
curated expanded acronyms, without transliteration of diacritics;
- Unipd-run13 (raw), Unipd-run21 (aligned): Tf-Idf weights, manually
curated expanded acronyms, with transliteration of diacritics.</p>
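        <p>The transliteration of diacritics used in the runs above can be implemented by Unicode decomposition; a minimal sketch (one common way to do it, not necessarily the exact code of our runs):</p>
        <preformat>
```python
import unicodedata

def transliterate(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```
        </preformat>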
        <p>The results for the unofficial French runs are reported in Table 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Final remarks and Future Work</title>
      <p>The aim of our participation was to implement a reproducible lexicon based
classifier that can be used as a baseline for further experiments. The performance
was sufficiently good, and in some cases the classifier achieved a classification
performance above 50% both for Recall and Precision, which was our initial
ideal threshold for a baseline.</p>
      <p>Moreover, the preliminary results of the experiments (official and unofficial)
have shown interesting differences between the English and French datasets:
- Tf-Idf works better for English, while binary weighting performs consistently
better for the French dataset;
- for the expansion of the acronyms, there seems to be a trade-off between
manual curation of data and quantity of data gathered from the Web; a lot of
noisy data is comparable to a small curated set (see for example Unipd-run3
and Unipd-run5). With lots of data, a round of manual curation allows for
small (if not negligible) improvements in terms of classification accuracy;
- for the French dataset, the normalization of diacritics was a key factor that
led to improvements of 10 percentage points over the non-normalized version.</p>
      <p>Before turning to a more complex system (based on a machine learning
approach), we will investigate other forms of data cleaning. In particular, we want
to further investigate the problem with diacritics and include an automatic
correction of misspelled words (very frequent in the dataset) based, for
example, on the Hamming distance to the words of the ICD10 codes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Kevin B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          , Jingbo Xia, Christophe Roeder, and Lawrence Hunter.
          <article-title>Reproducibility in natural language processing: A case study of two R libraries for mining PubMed/MEDLINE</article-title>
          . In
          <source>LREC 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language</source>
          , pages
          <fpage>6</fpage>
          -
          <lpage>12</lpage>
          . European Language Resources Association (ELRA),
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Dermouche</surname>
          </string-name>
          , Vincent Looten, Remi Flicoteaux, Sylvie Chevret, Julien Velcin, and
          <string-name>
            <given-names>Namik</given-names>
            <surname>Taright</surname>
          </string-name>
          .
          <article-title>ECSTRA-INSERM @ CLEF eHealth2016-task 2: ICD10 code extraction from death certificates</article-title>
          . In
          <source>Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Evora, Portugal, 5-8 September, 2016</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>68</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          .
          <article-title>Unsupervised learning for lexicon-based classification</article-title>
          . In
          <source>Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA</source>
          , pages
          <fpage>3188</fpage>
          -
          <lpage>3194</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          .
          <article-title>Reproducibility challenges in information retrieval evaluation</article-title>
          .
          <source>J. Data and Information Quality</source>
          ,
          <volume>8</volume>
          (
          <issue>2</issue>
          ):8:1-8:4, January
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          , Fabio Crestani,
          <string-name>
            <given-names>Marie-Francine</given-names>
            <surname>Moens</surname>
          </string-name>
          , Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello, editors.
          <source>Advances in Information Retrieval - 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20-23, 2016. Proceedings</source>
          , volume
          <volume>9626</volume>
          of Lecture Notes in Computer Science. Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          , Norbert Fuhr, Kalervo Jarvelin, Noriko Kando, Matthias Lippold, and
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Increasing reproducibility in IR: Findings from the Dagstuhl seminar on "Reproducibility of data-oriented experiments in e-science"</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>50</volume>
          (
          <issue>1</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>82</lpage>
          ,
          <year>2016</year>
          . http://sigir.org/files/forum/2016J/p068.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Gandrud</surname>
          </string-name>
          .
          <source>Reproducible Research with R and RStudio</source>
          . Chapman and Hall/CRC, second edition,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          , Liadh Kelly, Hanna Suominen, Aurelie Neveol, Aude Robert, Evangelos Kanoulas, Rene Spijker, João Palotti, and Guido Zuccon, editors.
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>
          . In
          <source>CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science</source>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Liadh</given-names>
            <surname>Kelly</surname>
          </string-name>
          , Lorraine Goeuriot, Hanna Suominen, Aurelie Neveol,
          <string-name>
            <given-names>João R. M.</given-names>
            <surname>Palotti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Zuccon</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF eHealth evaluation lab 2016</article-title>
          . In
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th International Conference of the CLEF Association, CLEF 2016, Evora, Portugal, September 5-8, 2016, Proceedings</source>
          , pages
          <fpage>255</fpage>
          -
          <lpage>266</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Prabhakar Raghavan, and Hinrich Schütze.
          <article-title>Scoring, term weighting, and the vector space model</article-title>
          . In
          <source>Introduction to Information Retrieval</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>123</lpage>
          . Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Aurelie</given-names>
            <surname>Neveol</surname>
          </string-name>
          , Robert N. Anderson, K. Bretonnel Cohen, Cyril Grouin, Thomas Lavergne, Gregoire Rey, Aude Robert, Claire Rondet, and
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          .
          <article-title>CLEF eHealth 2017 multilingual information extraction task overview: ICD10 coding of death certificates in English and French</article-title>
          . In
          <source>CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Aurelie</given-names>
            <surname>Neveol</surname>
          </string-name>
          , Kevin Cohen, Cyril Grouin, and
          <string-name>
            <given-names>Aude</given-names>
            <surname>Robert</surname>
          </string-name>
          .
          <article-title>Replicability of research in biomedical natural language processing: a pilot evaluation for a coding task</article-title>
          . In
          <source>Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis</source>
          , pages
          <fpage>78</fpage>
          -
          <lpage>84</lpage>
          , Austin, TX, November
          <year>2016</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>