Fusion Methods for ICD10 Code Classification of Death Certificates in Multilingual Corpora

Mike Ebersbach

Robert Herms

robert.herms@cs.tu-chemnitz.de

Maximilian Eibl

maximilian.eibl@cs.tu-chemnitz.de

Chair Media Informatics, Chemnitz University of Technology, 09107 Chemnitz, Germany

In this working notes paper, we present our methodology and results for Task 1 of the CLEF eHealth Evaluation Lab 2017. This benchmark addresses information extraction from written text with a focus on unexplored language corpora, specifically English and French. The goal is to automatically assign ICD10 codes to the text content of death certificates. Our approach focuses on fusion methods in conjunction with support vector machines for ICD10 code classification. First, we composed a large-scale feature set comprising more than 40k features based on bag of words, bag of 2-grams, bag of 3-grams, latent Dirichlet allocation, and the ontologies of WordNet and UMLS. In the development phase, we evaluated three different methods: each feature type separately (no fusion), early feature-level fusion, and late fusion with the rules majority vote, maximum, and average. For the English test set, the best F-measure was 0.8187 using early fusion. For the two French test sets, we achieved 0.6692 and 0.7216 using late fusion with the average rule for bag of words and bag of 2-grams.

Keywords: Natural language processing, Clinical texts, ICD10 coding, Death certificates, Machine learning, Fusion

The amount of digital medical documents grows from year to year, which is a major challenge for data processing and management in clinical institutions. However, state-of-the-art technologies can assist workflows, including verbal handover supplemented with written material. For instance, the work of [1] applied automatic speech recognition to transform verbal clinical information into written free-text records. These records can then be structured by automatically identifying relevant text snippets (e.g., [2-4]). A further aspect in hospitals and clinical institutions is the assignment of ICD codes to reports of diseases, disorders, injuries, and other related health conditions. ICD, the International Classification of Diseases system, is published by the World Health Organisation (WHO). Some previous work has addressed the processing of medical text corpora in conjunction with ICD codes (e.g., [5-9]). In this context, the CLEF eHealth Evaluation Lab 2017 [10] aims to make it easier for patients and nurses to understand and access eHealth information. Task 1 (Multilingual Information Extraction - ICD10 coding) [11] of this benchmark addresses information extraction from written text with a focus on unexplored language corpora, specifically English and French. The goal of this task is to automatically assign ICD10 codes to the text content of death certificates. This challenge can be regarded as a classification task.

In this working notes paper, we present our methodology and results for Task 1 of the CLEF eHealth Evaluation Lab 2017. Our approach focuses on the investigation of fusion methods for multilingual text classification regarding ICD10 codes. Hence, we implemented different fusion techniques to evaluate which method leads to the best result in conjunction with support vector machines. First, we composed a large-scale feature set comprising more than 40k features based on the types bag of words, bag of 2-grams, bag of 3-grams, latent Dirichlet allocation [12], and the ontologies of WordNet [13] and UMLS [14]. In the development phase, we evaluated three different methods: each feature type separately (no fusion), early feature-level fusion, and late fusion with the rules majority vote, maximum, and average.

This paper is organized as follows: In the next section, we introduce the dataset. Our approach, including feature extraction and fusion methods, is presented in Section 3. In Section 4, the experimental setup and the evaluation results are described. Finally, we conclude the paper in Section 5 and give some future directions.

2 Dataset

The dataset is divided into two parts by language: the CepiDc corpus (French) and the CDC corpus (English). The documents comprise free-text descriptions of causes of death as reported by physicians in standardized forms. Each document was manually labeled with one or more ICD10 codes. Two different formats are considered, the so-called raw and aligned formats. For English, only the raw format is included, whereas the French version consists of the raw and the aligned format. Altogether, we used all three data subsets for the evaluation. The data is partitioned into training sets (English raw with 1,073 classes and French aligned with 3,232 classes), development sets (English raw with 663 classes and French aligned with 2,363 classes), and test sets (English raw, French raw, and French aligned).

3 Methods

In this work, ICD10 code assignment to the text content of death certificates is regarded as a classification task. Machine learning is performed using a support vector machine (SVM). Moreover, each language is treated separately, i.e., training, development, and testing are performed on the basis of the same language. The following subsections introduce the features and the applied fusion methods.

3.1 Feature Extraction

In the preprocessing phase, all terms were stemmed and transformed to lower case, and all special characters such as punctuation or brackets were removed. For the French dataset, we transformed typical suffixes to their English counterparts. Furthermore, infrequent terms were removed to reduce the number of features. Subsequently, the following features were extracted:

- Bag of n-grams: The tf-idf (term frequency - inverse document frequency) of the terms from all documents was calculated. Feature vectors were created for bag of words (about 9k features for the French corpus and about 2k for the English corpus), bag of 2-grams, and bag of 3-grams (both with about 14k features for the French and about 3k features for the English corpus).
- Latent Dirichlet allocation (LDA) features: Similarities between the documents were determined by assigning them to a preset number of topics. The confidence values of the topic assignments were used as features. For our experiments, we used 20 topics.
- WordNet features: Related terms of words in the documents were extracted to enrich the feature set with semantic information. In more detail, the first synonym and the first hypernym of a word (noun, verb, adjective, or adverb) as ranked by WordNet were added to the feature set. The search was then repeated on the hypernyms to find more general hypernyms, which were also added to the feature set. In summary, 2,784 features were extracted for the French dataset and 1,704 features for the English dataset.
- UMLS features: Semantic types of health vocabulary were extracted from the Unified Medical Language System (UMLS) using MetaMap [15]. There are 133 semantic types described in the UMLS. As not all types appear in the dataset, we considered a subset of 107 types (features). A feature vector was then created in which each feature represents the number of search results for a particular semantic type.
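As an illustration of the bag-of-n-grams features, the following is a minimal sketch in plain Python. The exact tf-idf variant, the tokenization, and the frequency cutoff used in our system are not specified here, so the formula (relative term frequency times log(N/df)), the `min_df` parameter, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_bag(docs, n=1, min_df=1):
    """Build one sparse tf-idf vector (a dict) per document.

    docs   : list of token lists (assumed already lower-cased and stemmed)
    n      : n-gram order (1 = bag of words, 2 = bag of 2-grams, ...)
    min_df : hypothetical cutoff; n-grams occurring in fewer documents
             are dropped to reduce the number of features
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for short docs
        vectors.append({
            term: (freq / total) * math.log(n_docs / df[term])
            for term, freq in c.items()
            if df[term] >= min_df
        })
    return vectors


docs = [
    "acute myocardial infarction".split(),
    "myocardial infarction".split(),
    "lung cancer".split(),
]
bow = tfidf_bag(docs, n=1)   # bag of words
bo2g = tfidf_bag(docs, n=2)  # bag of 2-grams
```

Each feature type yields its own list of sparse vectors, which keeps the types separable for the fusion experiments.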

3.2 Fusion

We implemented an analysis framework to investigate two fusion methods: early fusion to combine features before classification and late fusion to combine the outputs after classification. These fusion methods are illustrated in Fig. 1.

Early fusion is performed at the feature level. In this case, the feature vectors from different sources are concatenated into one large feature vector, which is then used for classification. As this vector consists of many features, training and classification time increase. However, a large-scale feature vector in conjunction with suitable learning methods can lead to considerably better performance. Furthermore, only one learning phase is needed.
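The concatenation step can be sketched as follows. This is an illustrative sketch rather than our actual framework: it assumes each feature type is available as a list of sparse vectors (dicts), and the function name `early_fusion` and the example values are hypothetical.

```python
def early_fusion(feature_sets):
    """Concatenate per-type sparse vectors into one vector per document.

    feature_sets: dict mapping a feature-type name to a list of sparse
                  vectors (dicts), one per document, in the same order.
    Keys are prefixed with the type name so that identically named
    features from different types occupy different dimensions.
    """
    n_docs = len(next(iter(feature_sets.values())))
    fused = [dict() for _ in range(n_docs)]
    for name, vectors in feature_sets.items():
        if len(vectors) != n_docs:
            raise ValueError("all feature types must cover the same documents")
        for doc_vec, vec in zip(fused, vectors):
            for key, value in vec.items():
                doc_vec[(name, key)] = value
    return fused


# Made-up feature values for two documents and three feature types.
features = {
    "bow": [{"infarction": 0.4}, {"cancer": 0.7}],
    "bo2g": [{"myocardial infarction": 0.9}, {"lung cancer": 0.8}],
    "lda": [{0: 0.95, 1: 0.05}, {0: 0.1, 1: 0.9}],
}
fused = early_fusion(features)
```

The fused vectors would then be passed to a single classifier, so only one model has to be trained.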

Late fusion (or decision-level fusion) combines the outputs after classification. This process predicts the final output by considering the individual labels (hard level) or scores (soft level) of the involved classifiers [16]. The following decision rules were used: majority vote (most represented class label), maximum (class label with the highest confidence), and average (class label with the highest averaged confidence).
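The three decision rules can be sketched in a few lines of Python. The sketch assumes that every classifier returns a dict mapping class labels to confidences on a comparable scale; how such confidences are derived from the SVM outputs is not covered, and the function names and example scores are hypothetical.

```python
from collections import Counter


def majority_vote(scores):
    """Hard-level rule: every classifier votes for its top-scoring label."""
    votes = Counter(max(s, key=s.get) for s in scores)
    return votes.most_common(1)[0][0]


def maximum_rule(scores):
    """Soft-level rule: the label with the single highest confidence."""
    pairs = [(label, conf) for s in scores for label, conf in s.items()]
    return max(pairs, key=lambda pair: pair[1])[0]


def average_rule(scores):
    """Soft-level rule: the label with the highest mean confidence."""
    labels = sorted({label for s in scores for label in s})
    return max(labels,
               key=lambda l: sum(s.get(l, 0.0) for s in scores) / len(scores))


# Made-up confidences of three classifiers for one document.
scores = [
    {"I219": 0.55, "I251": 0.45},  # e.g. bag of words
    {"I219": 0.60, "I251": 0.40},  # e.g. bag of 2-grams
    {"I219": 0.05, "I251": 0.95},  # e.g. bag of 3-grams
]
by_vote = majority_vote(scores)
by_max = maximum_rule(scores)
by_avg = average_rule(scores)
```

In this example, majority vote selects I219 (two of three votes), whereas maximum and average select I251, which shows that the rules can disagree on the same classifier outputs.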

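[Fig. 1: Early fusion (the feature types Bag of Words, Bag of 2-Grams, Bag of 3-Grams, LDA, UMLS, and WordNet are combined and passed to a single SVM) and late fusion (one SVM per feature type, with the outputs combined afterwards).]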

In this section, we describe the setup for the experiments. Afterwards, we report the results obtained using the six feature types and the fusion methods early feature-level fusion and late fusion. The system performance is assessed by precision, recall, and F-measure (F1) for ICD10 code assignment. For development, we used only the F1 score as a reference for the best methods.
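For reference, the three measures can be computed as in the following sketch. It assumes micro-averaging over (document, ICD10 code) pairs; the official evaluation of the task may differ in details, and the function name and the example data are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over (document id, code) pairs."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # correctly assigned codes
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up gold standard and system output for two documents.
gold = {("d1", "I219"), ("d1", "E119"), ("d2", "C349")}
pred = {("d1", "I219"), ("d2", "C349"), ("d2", "J189")}
p, r, f1 = precision_recall_f1(gold, pred)
```

Here two of the three predicted pairs are correct and two of the three gold pairs are found, so precision, recall, and F1 are all 2/3.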

Classification was performed using an SVM; the LIBLINEAR library [17] was used for model training. In the development phase, we optimized the complexity parameter C of the SVM classifier only for early fusion. The goal was to observe the generalization performance of the classifier. We used six different values of C (1, 0.1, 0.01, 0.001, 0.0001, 0.00001). The evaluation of the other methods was performed using complexity C = 1.

The two best-performing methods (no fusion, early fusion, or late fusion) of each dataset version (language) in the development phase were applied to the corresponding test set. A series of experiments was carried out for the automatic classification of ICD10 codes in medical text corpora. Table 1 summarizes the development set results for the English and French datasets. Although our criterion for selecting the best two methods of each language is the F1 score, the results for precision and recall are shown for comparison purposes.

In the feature type experiments without fusion, the best F1 results were obtained by bag of 2-grams (Bo2G) for both languages; 0.7694 for English and 0.7667 for French. In contrast, the highest recall measure for French (0.6796) was achieved with bag of words (BoW).

For early fusion, the best F1, precision, and recall measures were obtained using SVM complexity C = 1 for both languages (F1 is 0.7847 for English and 0.7549 for French). Values of C < 1 are too small, which results in over-generalization, i.e., underfitting of the SVM model.

The late fusion scheme was applied to all feature types. Additionally, the top three feature types were selected to investigate the results without features that have a low classification performance (the threshold is F1 = 0.7). As a consequence, the top three feature types are bag of words (BoW), bag of 2-grams (Bo2G), and bag of 3-grams (Bo3G). For the English language, the best F1 score is 0.7684 using BoW+Bo2G+Bo3G with majority vote. However, the best precision of 0.8807 was achieved using BoW+Bo2G+Bo3G and the average rule. In the case of French, BoW+Bo2G with the average rule was superior with an F1 score of 0.7775, whereas the best precision of 0.8931 was obtained using BoW+Bo2G+Bo3G and the maximum rule.

The two best-performing methods of each language in the development phase were then applied to the corresponding test sets. The results are shown in Table 2. The main evaluation reference for the task refers to all ICD10 codes. Additionally, external causes, characterized by the codes V01 to Y98, are considered as a secondary reference. In this case, the evaluation addresses a specific type of death, such as violent deaths, which are avoidable.
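The restriction to external causes can be illustrated with a small helper. This is a sketch only: it assumes that codes consist of a letter followed by digits, optionally with a dot (e.g. "X70" or "I21.9"), and the function name is hypothetical.

```python
def is_external_cause(code):
    """Return True if an ICD10 code lies in the range V01 to Y98."""
    # Reduce the code to its three-character category, e.g. "I21.9" -> "I21".
    category = code.strip().upper().replace(".", "")[:3]
    if len(category) != 3:
        return False
    if not (category[0].isalpha() and category[1:].isdigit()):
        return False
    # Letter plus two digits compares correctly as a plain string.
    return "V01" <= category <= "Y98"


codes = ["I21.9", "X70", "W19", "Y98", "Y99", "C349"]
external = [c for c in codes if is_external_cause(c)]
```

Filtering both the gold standard and the system output with such a helper before scoring would yield the secondary, external-cause evaluation.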

Regarding the English test set, the best method was early fusion, which achieved an F1 score of 0.8187 (all ICD codes) and 0.2914 (external causes). For the French test set, the highest F1 score was obtained using late fusion of BoW+Bo2G with the average rule (raw format: 0.6692 for all ICD codes and 0.4232 for external causes; aligned format: 0.7216 for all ICD codes and 0.4515 for external causes). However, the best results for the French test set are non-official, because they were submitted after the task deadline. Consequently, as shown in Table 2, the only official result for the French test set is obtained using the feature type Bo2G with an F1 score of 0.7191 (all ICD codes) and 0.4450 (external causes).

5 Conclusions

We presented our methodology for Task 1 of the CLEF eHealth Evaluation Lab 2017, where the goal is to automatically assign ICD10 codes to the text content of death certificates. The corpus comes in two language versions: English and French.

Our approach focuses on fusion methods in conjunction with support vector machines for ICD10 code classification. We composed a set of features based on bag of words, bag of 2-grams, bag of 3-grams, latent Dirichlet allocation, and the ontologies of WordNet and UMLS. Three different methods were evaluated: each feature type separately (no fusion), early feature-level fusion, and late fusion. For the English test set, the best F-measure was 0.8187 using early fusion. For the two French test sets, we achieved 0.6692 and 0.7216 using late fusion with the average rule for bag of words and bag of 2-grams.

However, further improvements could be achieved with additional knowledge bases and other appropriate features from the field of natural language processing. Moreover, the overall system could benefit from other machine learning methods such as artificial neural networks, Naive Bayes, or k-nearest neighbors. Finally, the fusion schemes can be optimized by weighting the inputs and by considering correlations between the inputs.

1. Herms, R., Richter, D., Eibl, M., Ritter, M.: Unsupervised language model adaptation using utterance-based web search for clinical speech recognition. CLEF 2015 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2015.

2. Song, Y., He, Y., Liu, H., Wang, Y., Hu, Q., He, L., Luo, G.: ECNU at 2016 eHealth Task 1: Handover Information Extraction. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

3. Quiroz, L., Mennes, L., Dehghani, M., Kanoulas, E.: Distributional Semantics for Medical Information Extraction. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

4. Ebersbach, M., Herms, R., Lohr, C., Eibl, M.: Wrappers for Feature Subset Selection in CRF-based Clinical Information Extraction. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

5. Dermouche, M., Looten, V., Flicoteaux, R., Chevret, S., Velcin, J., Taright, N.: ECSTRA-INSERM@CLEF eHealth2016-task 2: ICD10 code extraction from death certificates. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

6. Mottin, L., Gobeill, J., Mottaz, A., Pasche, E., Gaudinat, A., Ruch, P.: BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual Information Extraction. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

7. Zweigenbaum, P., Lavergne, T.: LIMSI ICD10 coding experiments on CepiDC death certificate statements. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

8. Cabot, C., Soualmia, L., Dahamna, B., Darmoni, S.: SIBM at CLEF eHealth Evaluation Lab 2016: Extracting Concepts in French Medical Texts with ECMT and CIMIND. CLEF 2016 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, 2016.

9. Lohr, C., Herms, R.: A Corpus of German Clinical Reports for ICD and OPS-based Language Modeling. Proceedings of the Controlled Language Applications Workshop (CLAW) at LREC 2016, pages 20-23, 2016.

10. Suominen, H., Kelly, L., Goeuriot, L., Neveol, A., Robert, A., Kanoulas, E., Spijker, R., Zuccon, G., Palotti, J.: Overview of the CLEF eHealth Evaluation Lab 2017. CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, September 2017.

11. Neveol, A., Anderson, R. N., Cohen, K. B., Grouin, C., Lavergne, T., Rey, G., Robert, A., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual Information Extraction task overview: ICD10 coding of death certificates in English and French. CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September 2017.

12. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan): 993-1022, 2003.

13. Miller, G.: WordNet: a lexical database for English. Communications of the ACM, 38(11): 39-41, ACM, 1995.

14. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1): D267-D270, Oxford Univ Press, 2004.

15. Aronson, A.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium, American Medical Informatics Association, 2001.

16. Castellano, G., Gunes, H., Peters, C., Schuller, B.: Multimodal affect recognition for naturalistic human-computer and human-robot interactions. The Oxford Handbook of Affective Computing, Oxford University Press, USA, 2014.

17. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug): 1871-1874, 2008.