=Paper=
{{Paper
|id=Vol-2125/paper_149
|storemode=property
|title=Instance-based Learning for ICD10 Categorization
|pdfUrl=https://ceur-ws.org/Vol-2125/paper_149.pdf
|volume=Vol-2125
|authors=Julien Gobeill,Patrick Ruch
|dblpUrl=https://dblp.org/rec/conf/clef/GobeillR18
}}
==Instance-based Learning for ICD10 Categorization==
Julien Gobeill 1,2 and Patrick Ruch 1,2

1 HES-SO / HEG Geneva, Information Sciences, Geneva, Switzerland
2 SIB Text Mining, Swiss Institute of Bioinformatics, Geneva, Switzerland
julien.gobeill@hesge.ch

Abstract. In the framework of the CLEF 2018 eHealth campaign, we investigated an instance-based approach for extracting ICD10 codes from death certificates. The 360,000 annotated sentences contained in the training data were indexed with a standard search engine. Then, the k-Nearest Neighbors (k-NN) generated from an input sentence were exploited in order to infer potential codes, thanks to majority voting. Compared to a standard dictionary-based approach, this simple and robust k-Nearest Neighbors algorithm achieved remarkably good performance (F-measure 0.79, +13% compared to our dictionary-based approach, +70% compared to the official baseline). This purely statistical approach uses no linguistic knowledge, and could a priori be applied to any language with similar performance levels. The combination of the k-NN with a dictionary-based approach is also a simple way to improve the categorization effectiveness of the system. The reported results are consistent with inter-rater agreements (79-80%) for diagnosis encoding as achieved by trained professional staff. Any significant improvement over this level should therefore be questioned.

Keywords: Information Extraction, Instance-based Learning, International Classification of Diseases.

1 Introduction

The SIB Text Mining group [1], at the Swiss Institute of Bioinformatics in Geneva, has a long history of participation in TREC and CLEF campaigns, including the TREC Genomics [2], TREC Medical Records [3], TREC Clinical Decision Support [4], ImageCLEF [5] and CLEF eHealth [6] tracks. In parallel, the group is currently involved in several translational medicine research projects, including MyHealthMyData (EU H2020 Programme), SVIP-O (Swiss Variant Interpretation Platform for Oncology, funded by the Swiss Personalized Health Network initiative, SPHN) and SPOP (Swiss Personalized Oncology and Pathology project, also funded by SPHN), three projects which aim at helping clinicians retrieve similar cases within clinical health records, including narratives and genome-associated data modalities (e.g. gene variants). The group has also led several local projects at the University Hospitals of Geneva [7].

One of these projects, in 2016, dealt with the automatic categorization of clinical records into descriptors from the International Classification of Diseases (ICD-10). In Swiss hospitals, ICD-10 codes are assigned a posteriori by trained curators to every episode of care, for medico-economic purposes. In this local project, the available dataset contained 5 years of clinical records (between 40,000 and 50,000 per year), along with their assigned ICD-10 codes. The goal was to learn from the training data how to automatically reproduce the human ICD-10 encoding. We investigated an approach based on instance-based learning: the k-Nearest Neighbors algorithm (kNN). In such approaches, training data are used as a Knowledge Base (KB). For any unseen record, the most similar records contained in the KB are retrieved, then their encoding is used in order to infer potential codes for the input record. The system obtained performances competitive with human curators (~80%), consistent with [7].
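At its core, this instance-based approach reduces to retrieval followed by voting. The following is a minimal Python sketch of the idea, not the actual implementation: retrieve_similar is a placeholder standing for the similarity search described in the Methods section, and the parameters k and n are those tuned in Section 3.

  from collections import Counter

  def retrieve_similar(text, k, knowledge_base):
      # Placeholder: return the k training instances most similar to `text`,
      # each as a (sentence, assigned_codes) pair. In our system, this role
      # is played by a search engine over the indexed training sentences.
      raise NotImplementedError

  def knn_codes(text, knowledge_base, k=10, n=2):
      # Infer codes for an unseen record by majority voting over its neighbors.
      neighbors = retrieve_similar(text, k, knowledge_base)
      votes = Counter(code for _, codes in neighbors for code in codes)
      # Keep every code assigned to at least n of the k nearest neighbors.
      return [code for code, count in votes.items() if count >= n]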
We also used this instance-based learning approach in the past on another categorization task related to biological curation: the goal was to reproduce the human curation of protein functions from scientific articles, using Gene Ontology (GO) concepts [8]. Like ICD-10 encoding, GO curation involves thousands of annotatable concepts, and large sets of already curated instances can be exploited as a Knowledge Base. We demonstrated that instance-based learning outperformed a standard dictionary-based approach, in which annotatable concepts are mapped in the input text. Moreover, the continual growth of the available training data made the instance-based learning approach more effective over time: the more the KB was populated, the more accurate the system was. Such an approach achieved top-performing results in the BioCreative challenge in 2016 [9].

We capitalized on this experience in order to participate in the CLEF eHealth 2018 campaign, Task 1: Multilingual Information Extraction – ICD10 coding [10,11]. We had a limited amount of time and effort to invest in this campaign, thus the simplicity and robustness of a kNN was seen as an asset. We limited our participation to the French aligned dataset. Yet, our approach is applicable to potentially all languages, without prior linguistic knowledge.

2 Methods

Data. The gathered French training data contained 360,000 sentences from death certificates, annotated with 500,000 ICD10 codes. As training instances were short (4.06 words on average), the initial plan to exploit our local Knowledge Base containing full clinical records was quickly discarded. The test set contained 70,633 sentences to encode.

Similarity search engine. The first step of the kNN algorithm, for treating an input instance, is to retrieve the k most similar instances in the Knowledge Base (the so-called k nearest neighbors). For this task, we deployed a standard search engine and indexed all training sentences as if they were individual documents. For Information Retrieval, we used the Terrier platform [12], with no stemming, no stop words, and an Okapi BM25 weighting scheme [13].

Score computation. For a given input, once the k most similar instances from the KB are retrieved, our system simply exploits their assigned ICD10 codes and uses majority voting: codes that are assigned at least n times are finally submitted. The hypothesis is that similar instances are more likely to share codes with the input text.

Additional dictionary-based module. In parallel, we exploited the manually curated ICD10 dictionary provided with the training data in order to map ICD10 concepts directly in the input sentence. A manually designed list of 40 stop words (such as cancer or maladie) was used to discard too general terms. A default score of m (between 1 and k) was assigned to mapped concepts in order to be combined with the kNN module.

Fig. 1. Global architecture of the system. In the machine learning module, the k records most similar to the input are retrieved from the Knowledge Base, which contains the training data; then, ICD10 codes assigned to these records are aggregated and selected if they are present at least n times. In the dictionary-based module, past diagnosis texts are searched in the input and obtain a score of m when they are mapped. The list of output codes is combined from both modules.
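The architecture of Fig. 1 can be pictured with the following hedged sketch. It reuses the retrieve_similar placeholder introduced above; the additive combination of neighbor votes and dictionary scores is our reading of the architecture, not a verbatim excerpt of the implementation. Default parameter values are the final settings reported in Section 3 (k=10, n=2, m=2).

  from collections import Counter

  def categorize(sentence, knowledge_base, dictionary, stop_words,
                 k=10, n=2, m=2):
      # Machine learning module: one vote per code per retrieved neighbor.
      scores = Counter()
      for _, codes in retrieve_similar(sentence, k, knowledge_base):
          scores.update(set(codes))
      # Dictionary-based module: a mapped ICD10 label contributes a default
      # score of m, unless the label is in the 40-term stop list.
      text = sentence.lower()
      for label, code in dictionary.items():
          if label not in stop_words and label in text:
              scores[code] += m
      # Codes reaching the threshold n are submitted.
      return sorted(code for code, score in scores.items() if score >= n)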
3 Results

Setting of parameters. We discarded from the training data a set of 3,600 sentences for setting the k and n parameters. Macro Precision, Recall and F-measure were computed in order to compare settings. Results are presented in Tables 1 to 3.

Table 1. Macro Precision with different k and n.

         n=2    n=4    n=6    n=8    n=10
  k=5    0.90   0.96   -      -      -
  k=10   0.80   0.89   0.94   0.96   -
  k=15   0.74   0.84   0.89   0.92   0.94
  k=20   0.68   0.79   0.85   0.89   0.91
  k=25   0.64   0.75   0.81   0.86   0.88

Table 2. Macro Recall with different k and n.

         n=2    n=4    n=6    n=8    n=10
  k=5    0.69   0.53   -      -      -
  k=10   0.76   0.68   0.61   0.53   -
  k=15   0.78   0.72   0.67   0.62   0.58
  k=20   0.80   0.75   0.70   0.66   0.63
  k=25   0.82   0.77   0.72   0.69   0.66

Table 3. Macro F-measure with different k and n.

         n=2    n=4    n=6    n=8    n=10
  k=5    0.78   0.68   -      -      -
  k=10   0.78   0.77   0.74   0.68   -
  k=15   0.76   0.78   0.77   0.74   0.72
  k=20   0.74   0.77   0.77   0.76   0.75
  k=25   0.72   0.76   0.77   0.76   0.76

Combination with the dictionary-based module. Taken alone, the dictionary-based approach achieves, on the same tuning set, performances of P 0.71, R 0.68 and FM 0.69. We combined both modules with different values of m, starting from k=10 and n=2 (P 0.80, R 0.76 and FM 0.78), and finally achieved, with m=2, performances of P 0.79, R 0.79 and FM 0.79. These final settings were used in order to compute runs with the official test set.

Official results. The SIB Text Mining group submission achieves performances of P 0.76, R 0.76 and FM 0.76. Our official FM performance represents an improvement of +70% above the baseline, +20% above the participants' mean, and +19% above the participants' median.
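For readers who wish to reproduce the tuning above, here is a minimal sketch of one plausible way to compute macro-averaged Precision, Recall and F-measure over a tuning set, averaging per sentence; this is our own assumption, and the exact definitions used by the official evaluation may differ.

  def macro_prf(gold, predicted):
      # `gold` and `predicted` map a sentence id to a set of ICD10 codes.
      precisions, recalls = [], []
      for sid, gold_codes in gold.items():
          pred_codes = predicted.get(sid, set())
          tp = len(gold_codes & pred_codes)
          precisions.append(tp / len(pred_codes) if pred_codes else 0.0)
          recalls.append(tp / len(gold_codes) if gold_codes else 1.0)
      p = sum(precisions) / len(precisions)
      r = sum(recalls) / len(recalls)
      f = 2 * p * r / (p + r) if p + r else 0.0
      return p, r, f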
4 Discussion

The data used for these experiments are somewhat cleaner than standard EHR reports, as they are significantly shorter. Basically, a realistic diagnosis encoding task would involve longer documents (surgery or anatomo-pathology reports, discharge letters, etc.). In the same spirit, potentially more than one report is generated by clinicians per episode of care, which is traditionally the time unit at which encoding is performed.

Further, it is important to question the stability of the data that were provided, and thus the stability of the resulting models. In particular, if we look at the historical data used by [7], it is estimated that temporal drifts, which are intrinsically associated with diagnosis encoding (e.g. revision of billing/encoding guidelines, annual updates of ICD-10 by WHO and national authorities, etc.), significantly reduce the validity of any generative model to a few months.

Fig. 2. Duration of the validity of data-driven categorization models. We see that the model generated with data acquired between quarter 1 of 2005 and quarter 2 of 2006 performs well on posterior cases from quarter 2 of 2006 until quarter 4 of the same year. Beyond that time span, the results drop significantly.

Furthermore, the inter-encoder agreement achieved by trained professionals in hospitals is in the range of 79-83%, see e.g. [7]. This score is the theoretical upper bound achievable by automatic systems for such tasks. Any higher score is therefore likely to be caused by data biases or over-fitting phenomena.

5 Conclusion

Our simple and robust approach, mostly based on instance-based learning, but also combined with dictionary-based mapping, achieves remarkable performances for extracting ICD10 codes from death certificate sentences. The best observed F-measure is 0.79, but different settings achieve a high level of Precision (P 0.93 and R 0.53 with k=10 and n=8), or a high level of Recall (R 0.82 and P 0.64 with k=25 and n=2). This purely statistical approach uses no linguistic knowledge, and could a priori be applied to any language with similar performances.

Acknowledgements

Results reported in this article have been partially supported by the HUG (University Hospitals of Geneva), a previous affiliation of the authors. They would not have been possible without the contribution of several HUG team members, including Drs. Robert Baud, Phedon Tahintzi, Claudine Bréant, Francois Borst, Rodolphe Meyer and Prof. Antoine Geissbühler.

References

1. SIB Text Mining group: http://bitem.hesge.ch/
2. Gobeill, J., Tbahriti, I., Ehrler, F., & Ruch, P. (2007). Vocabulary-driven passage retrieval for question-answering in genomics. In TREC.
3. Gobeill, J., Gaudinat, A., Ruch, P., Pasche, E., Teodoro, D., & Vishnyakova, D. (2011). BiTeM group report for TREC Medical Records Track 2011. In TREC.
4. Gobeill, J., Gaudinat, A., & Ruch, P. (2015). Exploiting incoming and outgoing citations for improving Information Retrieval in the TREC 2015 Clinical Decision Support Track. In TREC.
5. Gobeill, J., Ruch, P., & Zhou, X. (2008). Query and document expansion with Medical Subject Headings terms at medical ImageCLEF 2008. In Workshop of the Cross-Language Evaluation Forum for European Languages (pp. 736-743). Springer, Berlin, Heidelberg.
6. Mottin, L., Gobeill, J., Mottaz, A., Pasche, E., Gaudinat, A., & Ruch, P. (2016). BiTeM at CLEF eHealth Evaluation Lab 2016 Task 2: Multilingual Information Extraction. In CLEF (Working Notes) (pp. 94-102).
7. Ruch, P., Gobeill, J., Tbahriti, I., & Geissbühler, A. (2008). From episodes of care to diagnosis codes: automatic text categorization for medico-economic encoding. In AMIA Annual Symposium Proceedings (Vol. 2008, p. 636). American Medical Informatics Association.
8. Gobeill, J., Pasche, E., Vishnyakova, D., & Ruch, P. (2013). Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Database, 2013.
9. Mao, Y., Van Auken, K., Li, D., Arighi, C. N., McQuilton, P., Hayman, G. T., ... & Gobeill, J. (2014). Overview of the Gene Ontology task at BioCreative IV. Database, 2014, bau086.
10. Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Azzopardi, L., Spijker, R., Li, D., Névéol, A., Ramadier, L., Robert, A., Palotti, J., & Zuccon, G. (2018). Overview of the CLEF eHealth Evaluation Lab 2018. In CLEF 2018 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS). Springer, September 2018.
11. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikán, L., Ramadier, L., Rey, G., & Zweigenbaum, P. (2018). CLEF eHealth 2018 Multilingual Information Extraction task overview: ICD10 coding of death certificates in French, Hungarian and Italian. In CLEF 2018 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September 2018.
12. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., & Lioma, C. (2006). Terrier: A high performance and scalable information retrieval platform. In Proceedings of the OSIR Workshop (pp. 18-25).
13. Amati, G., & Van Rijsbergen, C. J. (2002). Probabilistic models for information retrieval based on divergence from randomness. ACM Transactions on Information Systems, 20(4), 357-389.