Terminology Extraction in Electronic Health Records. The ExaMode Project⋆ Stefano Marchesin1 , Giorgio Maria Di Nunzio1 and Gianmaria Silvello1 1 Department of Information Engineering, University of Padua, Italy Abstract Medical free-text records store a lot of useful information that can be exploited in developing computer- supported medicine. Nevertheless, extracting terminological knowledge from unstructured text is difficult because the volume of medical texts created every year keeps growing at a very fast pace and it is highly dependent on the language under examination. In this work, we present an initial study of a Natural Language Processing pipeline in order to extract terminological information from pathology reports and link this information to medical ontologies. Keywords Terminology Extraction, Electronic Health Records, Named Entity Recognition, Entity Linking 1. Introduction Modern medical specialties rely on clinical context for an accurate interpretation of medical data. The literature of medical imaging analysis shows an important trend where both the Electronic Health Records (EHR) and medical images are leveraged in an approach called ‘fusion paradigm’ for solving complex tasks that cannot be tackled by a single modality [1]. In fact, medical free-text records store a lot of useful information that can be exploited in developing computer-supported medicine [2]. These medical free-text records can also be produced by patients in the so-called patient-reported diagnosis [3]. This type of documents can reveal if a patient left a medical encounter knowing the diagnosis explained to them and can ultimately inform on whether there are language differences between training health care professionals and patients without medical training. However, extracting terminological knowledge from unstructured text is difficult for at least two reasons: firstly, the volume of medical texts created every year keeps growing at a very fast pace. The time required by clinicians to retrieve relevant information from such an amount of literature using standard systems is often prohibitive. Therefore, there has been a strong interest in Clinical Decision Support (CDS) systems [4, 5] designed to produce effective and timely 1st International Conference on “Multilingual Digital Terminology Today. Design, representation formats, and manage- ment systems”, June 16–17, 2022, Padua, Italy " stefano.marchesin@unipd.it (S. Marchesin); giorgiomaria.dinunzio@unipd.it (G. Di Nunzio); gianmaria.silvello@unipd.it (G. Silvello) ~ https://www.dei.unipd.it/~marches1 (S. Marchesin); https://www.dei.unipd.it/~dinunzio (G. Di Nunzio); https://www.dei.unipd.it/~silvello (G. Silvello)  0000-0003-0362-5893 (S. Marchesin); 0000-0001-9709-6392 (G. Di Nunzio); 0000-0003-4970-4554 (G. Silvello) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) information that can help clinicians in the decision making process for patient care.1 Secondly, the extraction of relevant terminological knowledge highly depends on the language (and not only on the specialized language). For example, in [6], the authors explore terminology related to the semantic field of terms that indicate or suggest the presence of implants in electronic medical records EHRs written in Swedish with techniques that are highly optimized (in a justified way) for that language, making that software only partially reusable for other languages. In this work, we focus on digital pathology, a specialized field that studies digital histopathol- ogy images to diagnose cancer cases and related diseases. In particular, we propose a Natural Language Processing (NLP) pipeline in order to extract terminological information from pathol- ogy reports and link this information to medical ontologies. We evaluate the effectiveness of this approach on entity linking and text classification tasks, considering different use-cases concerning different types of cancer. Moreover, an unsupervised multilingual hybrid knowledge extraction system that combines rule-based techniques with pre-trained deep neural models to extract knowledge from pathology reports will be assessed against the manual extraction of terms from the same medical reports. 2. Proposal This work has been carried out under the umbrella of the ExaMode project,2 an interdisciplinary Horizon 2020 project the goal of which is to design efficient methodologies to manage the vast amount of heterogeneous medical data produced every day in different forms, and to ensure easier, faster discovery and consultation with little human supervision. The ultimate goal of the ExaMode project is to have these resources adopted not only by specialists, but also by non-experts, so that the latter can achieve a better understanding of medical information. The dataset for this study is a selection of real medical reports produced in the Hospital of Catania, one of the project partners. These reports are divided into four pathologies, namely three types of cancer (cervix, colon and lung) and celiac disease for a total of 200 medical reports. The proposed method is based on a NLP pipeline that adopts a combination of pre-trained Named Entity Recognition (NER) models [7] and unsupervised Entity Linking (EL) methods [8] to extract key terms from the medical reports and to link them to the reference ontology. NER is the task of identifying and categorizing key information, or “entities”, within a text. An entity can be any multi-word term that consistently refers to the same concept. Each entity is therefore classified, i.e. “linked”, into a predefined category, such as disease or protein.3 Entity Linking is therefore the task of assigning unique meanings to entities mentioned in a text. Our approach proposes a combination of ad-hoc and similarity matching techniques to connect the extracted entities to unique concepts. Finally, the approach uses a set of rules to merge entities into multi-word terms. For example, the terms “colon” and “transverse” may be considered as separate entities in a text, while the “transverse colon” is the correct entity to be linked to the ontology. 1 https://www.trec-cds.org 2 https://www.examode.eu 3 We use italics to indicate categories. Acknowledgment This work was partially supported by the ExaMode Project, as a part of the European Union Horizon 2020 Program under Grant 825292. References [1] S.-C. Huang, A. Pareek, S. Seyyedi, I. Banerjee, M. P. Lungren, Fusion of medical imag- ing and electronic health records using deep learning: a systematic review and imple- mentation guidelines, npj Digital Medicine 3 (2020) 136. URL: https://doi.org/10.1038/ s41746-020-00341-z. doi:10.1038/s41746-020-00341-z. [2] A. G. Dobrakowski, A. Mykowiecka, M. Marciniak, W. Jaworski, P. Biecek, Interpretable segmentation of medical free-text records based on word embeddings, Journal of Intelligent Information Systems 57 (2021) 447–465. URL: https://doi.org/10.1007/s10844-021-00659-4. doi:10.1007/s10844-021-00659-4. [3] K. Gleason, M. R. Dahm, How patients describe their diagnosis compared to clinical documentation, Diagnosis 9 (2022) 250–254. URL: https://doi.org/10.1515/dx-2021-0070. doi:doi:10.1515/dx-2021-0070. [4] E. S. Berner, Clinical decision support systems : theory and practice, 3rd ed., Springer, 2016. [5] M. Agosti, G. Di Nunzio, S. Marchesin, G. Silvello, Medical retrieval using structured information extracted from knowledge bases, in: SEBD, volume 2400 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. [6] O. Jerdhaf, M. Santini, P. Lundberg, A. Karlsson, A. Jönsson, Focused terminology extraction for cpss the case of "implant terms" in electronic medical records, in: 2021 IEEE International Conference on Communications Workshops (ICC Workshops), 2021, pp. 1–6. doi:10.1109/ ICCWorkshops50388.2021.9473700. [7] A. Goyal, V. Gupta, M. Kumar, Recent named entity recognition and classification techniques: A systematic review, Computer Science Review 29 (2018) 21–43. URL: https://www.sciencedirect.com/science/article/pii/S1574013717302782. doi:https://doi. org/10.1016/j.cosrev.2018.06.001. [8] X. Liao, Z. Zhao, Unsupervised approaches for textual semantic annotation, a survey, ACM Comput. Surv. 52 (2019). URL: https://doi.org/10.1145/3324473. doi:10.1145/3324473.