=Paper=
{{Paper
|id=Vol-2125/paper_110
|storemode=property
|title=MAMTRA-MED at CLEF eHealth 2018: A Combination of Information Retrieval Techniques and Neural Networks for ICD-10 Coding of Death Certificates
|pdfUrl=https://ceur-ws.org/Vol-2125/paper_110.pdf
|volume=Vol-2125
|authors=Mario Almagro,Soto Montalvo,Arantza Díaz de Ilarraza,Alicia Pérez
|dblpUrl=https://dblp.org/rec/conf/clef/AlmagroMIP18
}}
==MAMTRA-MED at CLEF eHealth 2018: A Combination of Information Retrieval Techniques and Neural Networks for ICD-10 Coding of Death Certificates==
MAMTRA-MED at CLEF eHealth 2018: A Combination of Information Retrieval Techniques and Neural Networks for ICD-10 Coding of Death Certificates Mario Almagro1 , Soto Montalvo2 , A. Dı́az de Ilarraza3 , and A. Pérez4 1 Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain malmagro@lsi.uned.es 2 Universidad Rey Juan Carlos (URJC), Madrid 28933, Spain soto.montalvo@urjc.es 3 University of the Basque Country UPV/EHU (IXA NLP), Donostia-San Sebastian 20018, Spain a.diazdeilarraza@ehu.eus 4 University of the Basque Country UPV/EHU (IXA NLP), Bilbao 48013, Spain alicia.perez@ehu.eus Abstract. This paper describes the systems proposed by LSI UNED team in Task 1 of the CLEF eHealth 2018 challenge. The main objec- tive is the automatic coding of death certificates in French, Italian and Hungarian languages according to the ICD-10. This task has been tack- led through supervised learning methods such as neural networks, and techniques based on Information Retrieval (IR) systems. The first ap- proach has been implemented by training one model for each of the most frequent ICD-10 codes in the corpus. For this purpose, a bag-of-words approach has been applied using the TF-BNS value for terms contained in death certificate statements. As for the IR approach, Lucene has been used as a search engine, indexing dictionaries and the content of the death certificates in the training corpus. Finally, a combination of both methods has been proposed to balance precision and recall, using the IR system for diseases not classified by any learning model. Similar F1 scores are obtained on the test datasets of each language by supervised methods and the combined system giving the latter greater recall values. Keywords: ICD-10 Coding · ICD-10 Codes · Information Retrieval · Neural Networks · Deep Learning · CepiDC · Biomedical Text Mining · CLEF eHealth. 1 Introduction The amount of health text available in electronic format is immense —scientific papers, websites, forums, social networks or Electronic Health Records (EHRs)— so managing health information to support medical decisions is no easy task. An example of this complexity is the analysis of numerous clinical texts generated by health care centres, which requires a large amount of resources that are often unavailable. The 2018 CLEF eHealth Evaluation Lab [9] is intended to address these challenges through different tasks aimed at facilitating access to health information. Our proposals have focused on the resolution of the CLEF task [6] for the ICD-10 coding of death certificates in French, Italian and Hungarian. The classi- fication of clinical texts according to the International Classification of Diseases (ICD) is one of the most pressing problems in hospital management due to its statistical purposes for morbidity and mortality. The 10th version of this coding assigns a unique identifier of between 3 and 7 alphanumeric symbols to disorders, grouping together nearly 16,000 possible diagnoses with a wealth of nuances. As a possible resolution to the described task, different approaches supported by supervised learning and search engines have been proposed in this paper. Due to the nature of the ICD-10 codes, the data generally present a very biased distribution, with a small set of frequently occurring diagnoses [1]. The same distribution can be seen in the corpus used in this task, and therefore, it is considered the combination of approaches that could try to maximize precision —such as machine learning approximations— and methods that tend to max- imize recall —such as the search engines on which Information Retrieval (IR) systems are built— could be of great interest. These two aspects will give rise to joint proposals. 2 Related Work In general, the approaches used in the state of the art for the recommendation and assignment of ICD-10 codes can be divided into two groups: those based on medical language processing (MLP) and those based on classification techniques. The first ones use unsupervised techniques to find correspondences between the concepts in standard descriptions and health concepts identified through medical knowledge bases and ontologies in health documents. For example, Ning et al. [7] apply an example-based model generated from a Chinese terminology containing correspondences with 4-digit ICD-10 codes, thus taking advantage of the hierarchical structure in the standard coding. Chen et al. [2] explore semantic similarity by applying the Longest Common Subsequence (LCS) method to the diagnoses and names given by ICD-10 codes. Other systems following this trend have participated in previous versions of the CLEF task [3, 10]. On the other hand, the second approaches generate classifiers by using super- vised learning algorithms. Zweigenbaum and Lavergne [11] apply two classifiers: one trained with a set of EHRs, and the other trained with different medical dictionaries; Miftakhutdinov and Tutubalina [5] use word embeddings trained from a corpus of medical user opinions, along with recurrent neural networks to assign codes. At the same time, mixed approaches combining both methods can be found. For example, Seva et al. [8] use an IR approach to search for possible candidate ICD-10 codes in different dictionaries, along with several classifiers to filter them. Jatunarapit et al. [4] employ English corpus-based classifiers and a set of IR techniques to establish similarities with Thai terms. 3 Proposed Approach In this paper we have explored two alternative approaches for the assignment of ICD-10 codes to death certificates. On the one hand, a supervised approach is proposed through Vector Support Machines (SVMs) and neural networks that aims to take advantage of the training corpus by generating One-Vs-Rest (OVR) models for the most frequent ICD-10 codes. The dependence on examples makes this approach less robust in the face of the possibility of coding new diseases, given the immense number of ICD-10 codes with little or no representation in the corpus. For this reason, it is also proposed to complement learning models with an unsupervised approach based on IR techniques to achieve greater recall. The machine learning approach is based on the training of a binary model for each of the target ICD-10 codes, indicating the presence or absence of the code. As this is a multiclass and multilabel problem, the coding of diseases is carried out by grouping the results of all the binary classifiers. The particularization of a model for each ICD code allows the processing of the data adapted to each class, as will be seen later when applying the weighting with Bi-Normal Separation (BNS). To implement these models, different configurations have been developed with linear SVMs and Multi-Layer Perceptrons (MLPs). The proposed IR approach consists of a search engine in which information relating to codes has been indexed, both terminology from provided dictionaries and associated sentences from training data. In this way, the coding of death certificates is reduced to the generation of queries based on their terms, choosing the result with the highest score. As a drawback, retrieving a fixed number of results (in this case only one) implies the loss of the ability to adapt the number of codes assigned to a line in death certificates, which may contain several disorders. These two approaches yield different results. As expected in the experimen- tation, while the supervised approach achieves higher precision values, a hybrid method involving IR techniques ensures a better balance between the correct coding rate and the number of different codes capable of coding. 4 Experiments 4.1 Datasets The training data is organized in three separate corpus, one for each language: French, Italian and Hungarian. Although each corpus is structured in two modes —aligned, for line-level annotation, and raw, for document-level annotation—, line-level annotation is only available for the French corpus. For this reason, this proposal has focused only on coding at document level for all three languages. Each corpus has different metadata on diagnoses, dictionaries with equiva- lences between ICD-10 codes and terms, the text of death certificates and line- level equivalences between its processed content and ICD-10 codes. Table 1. Overall corpus statistics. The overlap of codes is intended to measure the percentage of the different types of ICD-10 disorders that are present in other corpora. Italian Hungarian French Total Number of certificates in training 14,502 84,702 65,843 165,047 Number of certificates in test 3,617 21,175 24,375 49,167 Number of ICD-10 codes 60,954 392,019 527,940 980,913 Number of unique ICD-10 codes 1,442 3,123 3,829 5,011 Overlapping with Italian codes 100% 34% 30% - Overlapping with Hungarian codes 73% 100% 57% - Overlapping with French codes 79% 70% 100% - A general summary of the amount of data grouped by corpus is given in Table 1. Although the Hungarian corpus contains more death certificates, the number of ICD-10 codes present in the French corpus far exceeds that of the rest. As can be seen, most of the ICD-10 codes in the Italian corpus are also present in some of the other corpora, with an average overlap of 76%. Given this overlapping, a large part of the results achieved on the Italian corpus could be considered extrapolable since the model is expected to behave similarly in at least the same codes. For this reason, the experimentation shown in this paper is only carried out on it, taking advantage of its lower volume. In terms of distribution, the frequency of codes follows a power law, with most of the entries corresponding to a small group of codes. This implies that a supervised approach alone has a more restrictive limit to improvement than other techniques. 4.2 Experimental Setting Regardless of the approach used, a common pre-processing has been developed. A lowercase conversion and accent removal has been applied, as well as a stop word filter and a stemming process for each language. Supervised approach Here the problem has been addressed through the com- bination of different binary classifiers, each one determining the presence or absence of a specific code. Due to the scarcity of data in training collections on some ICD-10 disorders, model generation has been limited to only those ICD codes that appear more than a certain number of times. With this it is under- stood that the rest of the ICD codes (those absent in the corpora or with little presence) cannot be represented by supervised models since data lack sufficient examples with which to abstract the corresponding patterns. In order to find the configuration that best suits this task, multilayer per- ceptrons with different numbers of neurons and hidden layers have been imple- mented, as well as variations of the rest of the hyperparameters. In addition, linear SVMs have been trained to compare the efficiency of both models in ICD- 10 coding. As for the input data, once the pre-processing has been applied, different tex- tual representations have been used. On the one hand, the models have been gen- erated with the frequency of terms weighted with Inverse Document Frequency (IDF) and Bi-Normal Separation (BNS) values. IDF is calculated at document level, determining the relevant terms based on the number of documents in which they appear. This may penalize those terms relevant to a class but too frequent. For this reason, the BNS feature is introduced in the experimentation, since it es- timates the representation of terms at class level, avoiding this type of error. This weight is defined as BN S = |Icdf ( P (W | class +) ) − Icdf ( P (W | class −) )|, where Icdf is the inverse cumulative distribution function, P (W | class +) is the probability of finding a word in the positive classes and P (W | class −) is the probability of finding a word in the negative classes. Based on both measures, n-grams of two and three words have been considered. On the other hand, a feature filtering has been performed using Chi-Square (χ̃2 ). In Table 2 different configurations implemented in the experimentation are presented, which include the different options mentioned above. The structure of the MLPs shown consists of 4 hidden layers and 80 neurons each. Table 2. Configurations for supervised learning models. The Bias parameter is the minimum frequency in the collection for considering an ICD code in the training pro- cess. System Models Bias Bag Of Words Measure Filtering S1 SVMs 10 Unigrams Tf-Idf 50 S2 SVMs 10 Unigrams Tf-Bns 50 S3 SVMs 10 Unigrams Tf-Bns 5000 S4 SVMs 10 Unigrams Tf-Bns 10 S5 MLPs 40 Unigrams Tf-Bns 1000 S6 MLPs 100 Unigrams Tf-Bns 1000 S7 MLPs 200 Unigrams Tf-Bns 1000 S8 MLPs 100 Bigrams Tf-Bns 1000 S9 MLPs 100 Trigrams Tf-Bns 1000 S10 MLPs 100 Trigrams Tf-Idf 1000 Unsupervised IR approach This approach uses Lucene as the search engine. To enrich the indexes, the diseases present in the CépiDC dictionaries have been added as well as the content of the death certificates in the training corpus for each language. The aim is to make each of the possible ICD-10 codes accessible through a set of descriptions and associated terms. There are fewer descriptions for less common codes, so it has been decided to remove duplicate descriptions during indexing to avoid penalties. In addition, it has been considered to in- clude the official description of codes as it appears in the ICD standard, taking advantage of the electronic versions provided by some governments. Each query has been generated from the terms contained in each line of the death certificates. As it is a multilabel problem, the number of classes assigned to a document line varies. Since Lucene’s output consists of a ranking of results, the evaluation has been carried out according to the number of results chosen for each query (1 or 2). The different configurations are presented in Table 3. Table 3. Configurations of the IR techniques based on Lucene. System Method Number of results Using Official Descriptions S11 IR 1 Yes S12 IR 2 Yes S13 IR 1 No S14 IR 2 No Combined approach The method based on the combination of supervised models together with search engine aims to take advantage of the effectiveness of the first ones with an increase in robustness for codes that are absent or hardly present in the training corpus. Since less common diseases —no learning model— should be left without ICD-10 code assigned after applying multiple trained models, it seems reasonable to use search engines only with those death certificate statements not classified by the supervised approach. Table 4 shows the combinations of learning model and IR system configurations that give the best results. Table 4. Configurations for the combination of methods. System Combination S15 S5 + S13 S16 S6 + S13 S17 S7 + S13 S18 S8 + S13 S19 S10 + S13 4.3 Results The results of the configurations are only shown in the Italian corpus, as this rep- resents to a large extent the type of disorders present in the other corpora. This choice is based on the lower number of certificates and the higher percentage of common diseases in other corpora. Different configurations have been evaluated using a k-fold cross-validation of 5 folds and a 94/6 split. The results are shown in Table 5. Table 5. Results of the different configurations on the Italian training dataset. Method System Precision Recall F1 score S1 0.91 0.59 0.71 S2 0.90 0.78 0.83 S3 0.87 0.48 0.63 S4 0.89 0.79 0.84 S5 0.94 0.88 0.90 Supervised approach S6 0.95 0.84 0.89 S7 0.95 0.78 0.86 S8 0.95 0.79 0.86 S9 0.94 0.74 0.83 S10 0.97 0.83 0.89 S11 0.71 0.58 0.64 S12 0.43 0.71 0.54 IR approach S13 0.75 0.61 0.67 S14 0.45 0.74 0.56 S15 0.92 0.87 0.89 S16 0.93 0.89 0.91 Mixed approach S17 0.92 0.89 0.90 S18 0.92 0.82 0.87 S19 0.94 0.87 0.90 The use of official descriptions in the IR system worsens both Precision and Recall, which could be an indication of how different the diagnoses in practice and descriptions in the standard are. Thus, although the use of official descriptions does not seem advisable in itself due to noise, it would be interesting to use synonyms to enrich them. The configuration with the highest F1 score is the combination of MLPs models to assign ICD-10 codes with frequencies greater than 100 occurrences on the training corpus, and search engines selecting only the result with the highest affinity. The models chosen have been trained with the Tf-BNS of the 1,000 most relevant features. Finally, the proposals S5 and S16 —called LSI UNED-run2 and LSI UNED- run1 respectively—- have been used on the official test dataset provided by each language. S6 has been chosen as the system offering the best results in the supervised approach. In [6] you can see the ranking of the task published. Our results are shown in Table 6. In principle, it appears from this data that combining IR techniques with supervised approaches decreases the Precision value to the same extent as it increases the Recall value, so the F1 score does not change. Nevertheless, in our opinion the combination of both approaches results in a more robust system compared to other possible distributions with a greater number of infrequent codes, so it would be preferable to a single approach based on Machine Learning. Table 6. Results of the S5 and S16 configurations on the test datasets. Dataset System Precision Recall F1 score S5 0.93 0.86 0.89 Italian S16 0.91 0.87 0.89 S5 0.94 0.91 0.92 Hungarian S16 0.93 0.92 0.92 S5 0.88 0.54 0.66 French S16 0.84 0.55 0.67 5 Conclusions and Future Work Different methods have been proposed for the automatic coding of diseases ac- cording to the ICD-10 standard. Although a supervised learning approach seems an appropriate solution at first glance, we understand that in a distribution as complex as the one presented by the data, it is necessary to extend these models with other techniques that offer greater coverage, such as the IR approach. The development of automatic systems for coding death certificates can pro- vide a major boost to health administrations in managing their resources. And to this end, the results published in the CLEF task seem promising. One of the main problems of natural language processing in health scope is multilingualism, since it is a very broad and specialized domain, and at the same time it requires a large amount of textual resources that do not yet exist for certain languages. Therefore, in the near future we hope to limit the dependence on these textual resources by improving IR techniques and advance in different ways of combining the methods described. Acknowledgements This work has been supported by the Spanish Ministry of Science and Innovation MAMTRA-MED Project (TIN2016-77820-C3-2-R). References 1. Almagro, M., Martı́nez, R., Fresno, V., Montalvo, S. (2018). Estudio preliminar de la anotación automática de códigos CIE-10 en informes de alta hospitalarios. Procesamiento del Lenguaje Natural, Revista no 60, pp. 45-52. DOI 10.26342/2018- 60-5. 2. Chen, Y., Lu, H., Li, L. (2017). Automatic ICD-10 coding algorithm using an im- proved longest common subsequence based on semantic similarity. In PloS one, vol. 12(3). 3. Ho-Dac, L. M., Fabre, C., Birski, A., Boudraa, I., Bourriot, A., Cassier, M., Del- venne, L., Garcia-Gonzalez, C., Kang, E., Piccinini, E., Rohrbacher, C., Séguier, A. (2017). LITL at CLEF eHealth2017: automatic classification of death reports. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. 4. Jatunarapit, P., Piromsopa, K., Charoeanlap, C. (2016). Development of thai text- mining model for classifying ICD-10 TM. In Proceedings of ECAI 2016, pages 1–6. 5. Miftakhutdinov, Z., Tutubalina, E. (2017). Kfu at clef ehealth 2017 task 1: Icd- 10 coding of english death certificates with recurrent neural networks. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. 6. Névéol, A., Robert, A., Grippo, F., Lavergne, T., Morgand, C., Orsi, C., Pelikán, L., Ramadier, L., Rey, G., Zweigenbaum, P. (2018). CLEF eHealth 2018 Multilin- gual Information Extraction task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. 7. Ning, W., Yu, M., Zhang, R. (2016). A hierarchical method to automatically en- code Chinese diagnoses through semantic similarity estimation. In BMC Medical Informatics and Decision Making, vol. 1:16–30. 8. Ševa, J., Kittner, M., Roller, R., Leser, U. (2017). Multi-lingual ICD-10 coding using a hybrid rule-based and supervised classification approach at CLEF eHealth 2017. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. 9. Suominen, Hanna, Kelly, Liadh, Goeuriot, Lorraine, Kanoulas, Evangelos, Az- zopardi, Leif, Spijker, Rene, Li, Dan, Névéol, Aurélie, Ramadier, Lionel, Robert, Aude, Palotti, Joao, Jimmy, Zuccon, Guido. (2018). Overview of the CLEF eHealth Evaluation Lab 2018. CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer. 10. van Mulligen, E. M., Afzal, Z., Akhondi, S. A., Vo, D., Kors, J. A. (2017). Erasmus MC at CLEF eHealth 2016: Concept Recognition and Coding in French Texts. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. 11. Zweigenbaum, P., Lavergne, T. (2017). Multiple methods for multi-class, multi- label ICD-10 coding of multi-granularity, multilingual death certificates. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.