
MAMTRA-MED at CLEF eHealth 2018: A Combination of Information Retrieval Techniques and Neural Networks for ICD-10 Coding of Death Certificates

Mario Almagro

malmagro@lsi.uned.es 0

Soto Montalvo

soto.montalvo@urjc.es 1

A. Díaz de Ilarraza

a.diazdeilarraza@ehu.eus 3

A. Pérez

alicia.perez@ehu.eus 2

0 Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain
1 Universidad Rey Juan Carlos (URJC), Madrid 28933, Spain
2 University of the Basque Country UPV/EHU (IXA NLP), Bilbao 48013, Spain
3 University of the Basque Country UPV/EHU (IXA NLP), Donostia-San Sebastián 20018, Spain

This paper describes the systems proposed by the LSI UNED team for Task 1 of the CLEF eHealth 2018 challenge. The main objective is the automatic coding of death certificates in French, Italian and Hungarian according to ICD-10. The task has been tackled with supervised learning methods, such as neural networks, and with techniques based on Information Retrieval (IR) systems. The first approach has been implemented by training one model for each of the most frequent ICD-10 codes in the corpus. For this purpose, a bag-of-words representation has been applied using the TF-BNS value of the terms contained in death certificate statements. As for the IR approach, Lucene has been used as the search engine, indexing dictionaries and the content of the death certificates in the training corpus. Finally, a combination of both methods has been proposed to balance precision and recall, using the IR system for diseases not classified by any learning model. The supervised methods and the combined system obtain similar F1 scores on the test datasets of each language, with the combined system achieving higher recall.

Keywords: ICD-10 Coding, ICD-10 Codes, Neural Networks, Deep Learning, CepiDC, CLEF eHealth

1 Introduction

The amount of health text available in electronic format is immense, spanning scientific papers, websites, forums, social networks and Electronic Health Records (EHRs), so managing health information to support medical decisions is no easy task. An example of this complexity is the analysis of the numerous clinical texts generated by health care centres, which requires a large amount of resources that are often unavailable. The 2018 CLEF eHealth Evaluation Lab [9] is intended to address these challenges through different tasks aimed at facilitating access to health information.

Our proposals focus on the CLEF task [6] on ICD-10 coding of death certificates in French, Italian and Hungarian. The classification of clinical texts according to the International Classification of Diseases (ICD) is one of the most pressing problems in hospital management, since the resulting codes feed morbidity and mortality statistics. The 10th revision of this classification assigns a unique identifier of between 3 and 7 alphanumeric symbols to each disorder, grouping together nearly 16,000 possible diagnoses with a wealth of nuances.

This paper proposes different approaches to the task, supported by supervised learning and search engines. Due to the nature of the ICD-10 codes, the data generally present a very skewed distribution, with a small set of frequently occurring diagnoses [1]. The same distribution can be seen in the corpus used in this task. We therefore consider it of great interest to combine approaches that tend to maximize precision, such as machine learning, with methods that tend to maximize recall, such as the search engines on which Information Retrieval (IR) systems are built. These two aspects give rise to joint proposals.

2 Related Work

In general, state-of-the-art approaches to the recommendation and assignment of ICD-10 codes can be divided into two groups: those based on medical language processing (MLP) and those based on classification techniques.

The first group uses unsupervised techniques to find correspondences between the concepts in standard descriptions and the health concepts identified in health documents through medical knowledge bases and ontologies. For example, Ning et al. [7] apply an example-based model generated from a Chinese terminology containing correspondences with 4-digit ICD-10 codes, thus taking advantage of the hierarchical structure of the standard coding. Chen et al. [2] explore semantic similarity by applying the Longest Common Subsequence (LCS) method to the diagnoses and the names given by ICD-10 codes. Other systems following this trend have participated in previous editions of the CLEF task [3, 10].

The second group, on the other hand, builds classifiers with supervised learning algorithms. Zweigenbaum and Lavergne [11] apply two classifiers: one trained with a set of EHRs, and the other trained with different medical dictionaries. Miftakhutdinov and Tutubalina [5] use word embeddings trained on a corpus of medical user opinions, along with recurrent neural networks, to assign codes.

Mixed approaches combining both methods can also be found. For example, Seva et al. [8] use an IR approach to search for candidate ICD-10 codes in different dictionaries, along with several classifiers to filter them. Jatunarapit et al. [4] employ English corpus-based classifiers and a set of IR techniques to establish similarities with Thai terms.

3 Proposed Approach

In this paper we have explored two alternative approaches for the assignment of ICD-10 codes to death certificates. On the one hand, we propose a supervised approach based on Support Vector Machines (SVMs) and neural networks, which takes advantage of the training corpus by generating One-Vs-Rest (OVR) models for the most frequent ICD-10 codes. Its dependence on examples makes this approach less robust when new diseases have to be coded, given the immense number of ICD-10 codes with little or no representation in the corpus. For this reason, we also propose to complement the learning models with an unsupervised approach based on IR techniques in order to achieve greater recall.

The machine learning approach is based on training a binary model for each of the target ICD-10 codes, indicating the presence or absence of the code. As this is a multiclass and multilabel problem, the coding of diseases is carried out by grouping the results of all the binary classifiers. Having a dedicated model for each ICD code allows the data processing to be adapted to each class, as will be seen later when applying the weighting with Bi-Normal Separation (BNS). To implement these models, different configurations have been developed with linear SVMs and Multi-Layer Perceptrons (MLPs).
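The grouping of binary outputs into a multilabel prediction can be illustrated with a minimal sketch. The names `predict_codes` and `toy_models` are hypothetical, and simple keyword rules stand in for the trained SVMs or MLPs, which are not reproduced here.

```python
from typing import Callable, Dict, Set

# Hypothetical stand-in for a trained binary model: it maps a statement
# to True (code present) or False (code absent).
BinaryModel = Callable[[str], bool]


def predict_codes(statement: str, models: Dict[str, BinaryModel]) -> Set[str]:
    """Return every ICD-10 code whose binary model fires on the statement."""
    return {code for code, model in models.items() if model(statement)}


# Toy keyword rules in place of real classifiers (illustrative only).
toy_models: Dict[str, BinaryModel] = {
    "I219": lambda text: "infarto" in text,
    "J189": lambda text: "polmonite" in text,
}

codes = predict_codes("infarto miocardico e polmonite", toy_models)
```

Because each model decides independently, a statement can receive zero, one or several codes, which is what makes the scheme multilabel.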

The proposed IR approach consists of a search engine in which information related to the codes has been indexed: both the terminology from the provided dictionaries and the associated sentences from the training data. In this way, the coding of death certificates is reduced to generating queries from their terms and choosing the result with the highest score. As a drawback, retrieving a fixed number of results (in this case only one) means losing the ability to adapt the number of codes assigned to a line of a death certificate, which may contain several disorders.
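The idea of indexing code descriptions and keeping only the top-ranked result can be sketched in a few lines. This is not Lucene and does not reproduce its scoring: the IDF-weighted term overlap below is a simplified stand-in, and `build_index` and `top_code` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Index = List[Tuple[str, Counter]]


def build_index(entries: List[Tuple[str, str]]) -> Tuple[Index, Dict[str, float]]:
    """Index (code, description) pairs and compute an IDF weight per term."""
    docs = [(code, Counter(text.split())) for code, text in entries]
    doc_freq = Counter(term for _, terms in docs for term in terms)
    idf = {t: math.log(1 + len(docs) / n) for t, n in doc_freq.items()}
    return docs, idf


def top_code(query: str, docs: Index, idf: Dict[str, float]) -> Optional[str]:
    """Return the code of the best-scoring entry, or None if nothing matches."""
    best_code, best_score = None, 0.0
    for code, terms in docs:
        score = sum(idf[t] * terms[t] for t in query.split() if t in terms)
        score /= math.sqrt(sum(terms.values()))  # mild length normalisation
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Toy dictionary entries (illustrative, already pre-processed).
docs, idf = build_index([
    ("I219", "infarto miocardico acuto"),
    ("J189", "polmonite"),
    ("I10", "ipertensione arteriosa"),
])
best = top_code("infarto miocardico", docs, idf)
```

Returning a single code per query mirrors the drawback noted above: a line mentioning two disorders would still receive only one code.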

These two approaches yield different results. As expected, while the supervised approach achieves higher precision values, a hybrid method involving IR techniques ensures a better balance between the rate of correct coding and the number of distinct codes the system is able to assign.

4 Experiments

4.1 Datasets

The training data is organized in three separate corpora, one for each language: French, Italian and Hungarian. Although each corpus is structured in two modes (aligned, for line-level annotation, and raw, for document-level annotation), line-level annotation is only available for the French corpus. For this reason, this proposal has focused only on coding at document level for all three languages.

Each corpus has different metadata on diagnoses, dictionaries with equivalences between ICD-10 codes and terms, the text of death certificates and line-level equivalences between its processed content and ICD-10 codes.

Table 1

                                      Italian   Hungarian    French     Total
Number of certificates in training     14,502      84,702    65,843   165,047
Number of certificates in test          3,617      21,175    24,375    49,167
Number of ICD-10 codes                 60,954     392,019   527,940   980,913
Number of unique ICD-10 codes           1,442       3,123     3,829     5,011
Overlapping with Italian codes           100%         34%       30%
Overlapping with Hungarian codes          73%        100%       57%
Overlapping with French codes             79%         70%      100%

Table 1 gives a general summary of the amount of data in each corpus. Although the Hungarian corpus contains more death certificates, the number of ICD-10 codes present in the French corpus far exceeds that of the others. As can be seen, most of the ICD-10 codes in the Italian corpus are also present in the other corpora, with an average overlap of 76%. Given this overlap, a large part of the results achieved on the Italian corpus can reasonably be extrapolated, since the models are expected to behave similarly on at least the shared codes. For this reason, the experimentation shown in this paper is carried out only on the Italian corpus, taking advantage of its smaller size.
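The overlap figures can be read as the share of one corpus's unique codes that also occur in another corpus. The sketch below assumes that reading. The function name `overlap_pct` and the toy code sets are hypothetical; only the 73% and 79% values come from the Italian column of Table 1.

```python
from typing import Set


def overlap_pct(codes_a: Set[str], codes_b: Set[str]) -> float:
    """Percentage of the codes in codes_a that also appear in codes_b."""
    return 100.0 * len(codes_a & codes_b) / len(codes_a)


# Toy code sets (illustrative, not taken from the corpora).
italian = {"I219", "J189", "I10", "C349"}
french = {"I219", "J189", "I10", "E119", "C509"}
toy_overlap = overlap_pct(italian, french)

# Average overlap of the Italian codes with the Hungarian (73%) and
# French (79%) corpora, as reported in Table 1.
average_overlap = (73 + 79) / 2
```

The measure is asymmetric: the same pair of corpora gives a different percentage depending on which code set is used as the denominator, which is why Table 1 is not symmetric.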

In terms of distribution, the frequency of codes follows a power law, with most of the entries corresponding to a small group of codes. This implies that a supervised approach alone has a lower ceiling for improvement than other techniques, since it can only cover the codes with enough examples.

4.2 Experimental Setting

Regardless of the approach used, a common pre-processing has been developed. Lowercase conversion and accent removal have been applied, as well as a stop word filter and a stemming process for each language.
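A minimal sketch of this pre-processing pipeline follows. The function names are hypothetical, the stop word list is a toy one, and the stemmer is left as a pluggable callable (identity by default) because the source does not name the stemmer used; a language-specific stemmer such as NLTK's `SnowballStemmer("italian").stem` could be passed in its place.

```python
import re
import unicodedata
from typing import Callable, List, Set


def strip_accents(text: str) -> str:
    """Remove diacritics by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(text: str, stop_words: Set[str],
               stem: Callable[[str], str] = lambda token: token) -> List[str]:
    """Lowercase, strip accents, drop stop words, then stem each token."""
    tokens = re.findall(r"\w+", strip_accents(text.lower()))
    return [stem(token) for token in tokens if token not in stop_words]


# Toy stop word list (illustrative only).
tokens = preprocess("Insufficienza cardiaca e Diabète", {"e"})
```

Accent removal is applied before stop word filtering here, so the stop word list is assumed to be accent-free as well.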

Supervised approach. Here the problem has been addressed by combining different binary classifiers, each one determining the presence or absence of a specific code. Due to the scarcity of training data for some ICD-10 disorders, model generation has been limited to those ICD codes that appear more than a certain number of times. The remaining ICD codes (those absent from the corpora or with little presence) cannot be represented by supervised models, since the data lack sufficient examples from which to abstract the corresponding patterns.
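Selecting the codes that receive a dedicated model amounts to a frequency cut-off over the training annotations. The sketch below assumes that every occurrence of a code counts towards its frequency; `frequent_codes` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_codes(certificate_codes: Iterable[List[str]], min_count: int) -> Set[str]:
    """Return the codes that occur more than min_count times in training."""
    counts = Counter(code for codes in certificate_codes for code in codes)
    return {code for code, n in counts.items() if n > min_count}


# Toy training annotations: one list of codes per certificate.
training = [["I219", "J189"], ["I219"], ["I219", "I10"]]
selected = frequent_codes(training, min_count=1)
```

Codes left out of `selected` are the ones that a complementary technique, such as the IR approach, would have to cover.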

In order to find the configuration that best suits this task, multilayer perceptrons with different numbers of neurons and hidden layers have been implemented, as well as variations of the rest of the hyperparameters. In addition, linear SVMs have been trained to compare the efficiency of both models in ICD-10 coding.

As for the input data, once the pre-processing has been applied, different textual representations have been used. On the one hand, the models have been generated with term frequencies weighted with Inverse Document Frequency (IDF) and Bi-Normal Separation (BNS) values. IDF is calculated at document level, determining the relevant terms based on the number of documents in which they appear. This may penalize terms that are relevant to a class but too frequent. For this reason, BNS is introduced in the experimentation, since it estimates the relevance of terms at class level, avoiding this type of error. This weight is defined as

$$\mathrm{BNS} = \left| \mathrm{Icdf}\big(P(W \mid \mathrm{class}^{+})\big) - \mathrm{Icdf}\big(P(W \mid \mathrm{class}^{-})\big) \right|,$$

where Icdf is the inverse cumulative distribution function of the standard normal distribution, $P(W \mid \mathrm{class}^{+})$ is the probability of finding a word in the positive class and $P(W \mid \mathrm{class}^{-})$ is the probability of finding a word in the negative class. Based on both measures, n-grams of two and three words have been considered. On the other hand, feature filtering has been performed using Chi-Square ($\chi^2$).

Table 2 presents the different configurations implemented in the experimentation, which cover the options mentioned above. The MLPs shown have 4 hidden layers of 80 neurons each.

Unsupervised IR approach. This approach uses Lucene as the search engine. To enrich the indexes, the diseases present in the CepiDC dictionaries have been added, as well as the content of the death certificates in the training corpus for each language. The aim is to make each of the possible ICD-10 codes accessible through a set of descriptions and associated terms. There are fewer descriptions for less common codes, so duplicate descriptions have been removed during indexing to avoid penalizing those codes. In addition, we have considered including the official description of each code as it appears in the ICD standard, taking advantage of the electronic versions provided by some governments.

Each query has been generated from the terms contained in each line of the death certificates. As this is a multilabel problem, the number of classes assigned to a document line varies. Since Lucene's output is a ranking of results, the evaluation has been carried out according to the number of results chosen for each query (1 or 2). The different configurations are presented in Table 3.

Combined approach. The method that combines supervised models with a search engine aims to take advantage of the effectiveness of the former while increasing robustness for codes that are absent or hardly present in the training corpus. Since statements describing less common diseases (those with no learning model) are expected to remain without an assigned ICD-10 code after the trained models have been applied, it seems reasonable to use the search engine only on those death certificate statements not classified by the supervised approach. Table 4 shows the combinations of learning model and IR system configurations that give the best results.

The results of the configurations are only shown for the Italian corpus, as it represents to a large extent the type of disorders present in the other corpora. This choice is based on its lower number of certificates and the high percentage of its diseases that also appear in the other corpora. The different configurations have been evaluated using 5-fold cross-validation and a 94/6 split. The results are shown in Table 5.
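The fallback logic of the combined approach, in which the search engine is consulted only when no supervised model fires, can be sketched as follows. All names are hypothetical, and the keyword rule and dictionary lookup are toy stand-ins for the trained models and the Lucene index.

```python
from typing import Callable, Dict, Optional, Set


def combined_codes(statement: str,
                   models: Dict[str, Callable[[str], bool]],
                   ir_search: Callable[[str], Optional[str]]) -> Set[str]:
    """Use the supervised models first and fall back to IR if none fires."""
    codes = {code for code, model in models.items() if model(statement)}
    if codes:
        return codes
    fallback = ir_search(statement)  # top-ranked code, or None if no hit
    return {fallback} if fallback is not None else set()


# Toy stand-ins: one keyword "model" and a lookup acting as the index.
toy_models = {"I219": lambda text: "infarto" in text}
toy_index = {"polmonite": "J189"}


def toy_search(statement: str) -> Optional[str]:
    """Return the code of the first indexed term found in the statement."""
    for term, code in toy_index.items():
        if term in statement:
            return code
    return None


supervised_hit = combined_codes("infarto miocardico", toy_models, toy_search)
ir_hit = combined_codes("polmonite bilaterale", toy_models, toy_search)
```

With this arrangement the IR system can only add codes for statements the models leave uncoded, so it cannot override a supervised decision.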

The use of official descriptions in the IR system worsens both precision and recall, which could indicate how much the diagnoses written in practice differ from the descriptions in the standard. Thus, although using official descriptions on their own does not seem advisable because of the noise they introduce, it would be interesting to enrich them with synonyms. The configuration with the highest F1 score combines MLP models, which assign the ICD-10 codes with more than 100 occurrences in the training corpus, with a search engine that selects only the highest-scoring result. The chosen models have been trained with the TF-BNS of the 1,000 most relevant features.
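The TF-BNS weighting used by the selected models can be sketched with the standard library alone. The function names are hypothetical, and the clipping value `eps` is an assumption: it keeps the rates away from 0 and 1, where the inverse normal CDF is undefined, and the source does not state how such cases were handled.

```python
from statistics import NormalDist
from typing import Dict

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bns(pos_with_term: int, pos_total: int,
        neg_with_term: int, neg_total: int, eps: float = 0.0005) -> float:
    """Bi-Normal Separation: |Icdf(P(W|class+)) - Icdf(P(W|class-))|."""
    def clipped_rate(count: int, total: int) -> float:
        # Assumed clipping so that inv_cdf never receives exactly 0 or 1.
        return min(max(count / total, eps), 1 - eps)

    p_pos = clipped_rate(pos_with_term, pos_total)
    p_neg = clipped_rate(neg_with_term, neg_total)
    return abs(_STD_NORMAL.inv_cdf(p_pos) - _STD_NORMAL.inv_cdf(p_neg))


def tf_bns(term_counts: Dict[str, int], weights: Dict[str, float]) -> Dict[str, float]:
    """Weight raw term frequencies by the per-class BNS value of each term."""
    return {t: tf * weights.get(t, 0.0) for t, tf in term_counts.items()}


# A term seen in 80 of 100 positive and 10 of 100 negative statements.
weight = bns(80, 100, 10, 100)
vector = tf_bns({"infarto": 2, "paziente": 1}, {"infarto": weight})
```

Since the weights depend on which code is treated as the positive class, each one-vs-rest model gets its own BNS table, in line with the per-code processing described for the supervised approach.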

Finally, the proposals S5 and S16 (called LSI UNED-run2 and LSI UNED-run1, respectively) have been run on the official test dataset provided for each language. S6 has been chosen as the system offering the best results in the supervised approach. The published ranking of the task can be found in [6]. Our results are shown in Table 6.

In principle, these data suggest that combining IR techniques with supervised approaches decreases precision to the same extent as it increases recall, so the F1 score does not change. Nevertheless, in our opinion the combination of both approaches results in a system that is more robust to other possible distributions with a greater number of infrequent codes, so it would be preferable to a single approach based on machine learning.

Different methods have been proposed for the automatic coding of diseases according to the ICD-10 standard. Although a supervised learning approach seems an appropriate solution at first glance, we understand that, in a distribution as complex as the one presented by the data, it is necessary to extend these models with other techniques that offer greater coverage, such as the IR approach.

The development of automatic systems for coding death certificates can provide a major boost to health administrations in managing their resources. To this end, the results published in the CLEF task seem promising.

One of the main problems of natural language processing in the health domain is multilingualism: the domain is very broad and specialized, and it requires a large amount of textual resources that do not yet exist for certain languages. Therefore, in the near future we hope to limit the dependence on these textual resources by improving IR techniques, and to advance in different ways of combining the methods described.

Acknowledgements

This work has been supported by the Spanish Ministry of Science and Innovation MAMTRA-MED Project (TIN2016-77820-C3-2-R).

References

1. Almagro, M., Martínez, R., Fresno, V., Montalvo, S. (2018). Estudio preliminar de la anotación automática de códigos CIE-10 en informes de alta hospitalarios. Procesamiento del Lenguaje Natural, Revista nº 60, pp. 45-52. DOI 10.26342/2018-60-5.

2. Chen, Y., Lu, H., Li, L. (2017). Automatic ICD-10 coding algorithm using an improved longest common subsequence based on semantic similarity. In PloS one, vol. 12(3).

3. Ho-Dac, L. M., Fabre, C., Birski, A., Boudraa, I., Bourriot, A., Cassier, M., Delvenne, L., Garcia-Gonzalez, C., Kang, E., Piccinini, E., Rohrbacher, C., Seguier, A. (2017). LITL at CLEF eHealth2017: automatic classification of death reports. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.

4. Jatunarapit, P., Piromsopa, K., Charoeanlap, C. (2016). Development of Thai text-mining model for classifying ICD-10 TM. In Proceedings of ECAI 2016, pages 1-6.

5. Miftakhutdinov, Z., Tutubalina, E. (2017). KFU at CLEF eHealth 2017 Task 1: ICD-10 coding of English death certificates with recurrent neural networks. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.

6. Névéol, A., Robert, A., Grippo, F., Lavergne, T., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P. (2018). CLEF eHealth 2018 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.

7. Ning, W., Yu, M., Zhang, R. (2016). A hierarchical method to automatically encode Chinese diagnoses through semantic similarity estimation. In BMC Medical Informatics and Decision Making, vol. 1: 16-30.

8. Seva, J., Kittner, M., Roller, R., Leser, U. (2017). Multi-lingual ICD-10 coding using a hybrid rule-based and supervised classification approach at CLEF eHealth 2017. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.

9. Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Azzopardi, L., Spijker, R., Li, D., Névéol, A., Ramadier, L., Robert, A., Palotti, J., Jimmy, Zuccon, G. (2018). Overview of the CLEF eHealth Evaluation Lab 2018. CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer.

10. van Mulligen, E. M., Afzal, Z., Akhondi, S. A., Vo, D., Kors, J. A. (2017). Erasmus MC at CLEF eHealth 2016: Concept Recognition and Coding in French Texts. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.

11. Zweigenbaum, P., Lavergne, T. (2017). Multiple methods for multi-class, multilabel ICD-10 coding of multi-granularity, multilingual death certificates. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS.