Extraction of Medical Concepts from Italian Natural Language Descriptions (Discussion Paper) Patrizia Agnello1 , Silvia Maria Ansaldi1 , Fabio Azzalini2,3 , Giovanni Gangemi2 , Davide Piantella2 , Emanuele Rabosio3 and Letizia Tanca2 1 INAIL - Dipartimento Innovazioni Tecnologiche 2 Politecnico di Milano - Dipartimento di Elettronica, Informazione e Bioingegneria 3 Human Technopole - Center for Analysis, Decisions and Society Abstract In this paper we present a Natural Language Processing (NLP) pipeline to automatically extract medical concepts from a free text written in a language other than English. To do so, we use common NLP techniques and the metathesaurus of Unified Medical Language System (UMLS). Specifically, our goal is to automatically extract ontological concepts representing which part of the human body is injured and what is the nature of the injury, given an Italian textual description of a work accident. We start by partitioning the text into tokens and assigning to each token its part-of-speech, and then use an appro- priate tool to extract relevant concepts to be searched within UMLS. We tested our system on a public large repository containing textual descriptions of work accidents produced by INAIL. Experimental results confirm that our system is able to correctly extract relevant medical concepts from texts written in Italian. Keywords Ontology, EHR, NLP, Work accident 1. Introduction The term Electronic Health Records (EHRs) describes the concept of a comprehensive, cross- institutional, and longitudinal collection of healthcare data, trying to group the entire clinical life of a patient [1]. EHRs store information both in structured (e.g. diagnosis codes, laboratory results, etc) and unstructured (e.g. clinical notes, discharge summaries, etc.) formats. Unstruc- tured data usually contain a more complete, and broader, view of the patient’s conditions, as well as additional valuable information that would be difficult to represent in a structured manner (e.g. social history, special conditions, etc.). To leverage all the advantages of a systematic adoption of EHRs, many technical and non-technical requirements must be fulfilled [2], such as SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy " p.agnello@inail.it (P. Agnello); s.ansaldi@inail.it (S. M. Ansaldi); fabio.azzalini@polimi.it (F. Azzalini); giovanni.gangemi@mail.polimi.it (G. Gangemi); davide.piantella@polimi.it (D. Piantella); emanuele.rabosio@fht.org (E. Rabosio); letizia.tanca@polimi.it (L. Tanca)  0000-0003-0631-2120 (F. Azzalini); 0000-0003-1542-0326 (D. Piantella); 0000-0003-3722-7789 (E. Rabosio); 0000-0003-2607-3171 (L. Tanca) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) privacy, data security, portability, performance, maintainability, reliability, interoperability, and usability. In this work we focus on the natural-language texts included in EHRs, since unstructured data analytics is one of the most challenging task of EHRs automated analysis. Our main contributions are the development of a system capable of automatically extracting medical ontological concepts from an Italian natural language text, and the experimental evaluation of our system using a real-world dataset regarding work injuries. The paper is organized as follows. Section 1 gives an overview of the problem, Section 2 reviews the state of the art regarding medical concept extraction, Section 3 describes our methodology, Section 4 presents the experiments and, finally, Section 5 concludes the paper. 2. State of the Art Concept extraction from natural language texts related to clinical information consists of three phases [3]: (1.) Identification of concept mentions such as medications, drugs, anatomical parts, and diseases; (2.) Coreference resolution regarding relationships between different mentions referring to the same entity; and (3.) Extraction of relationships between concepts. We now briefly describe one of the most complete medical ontology system (UMLS) and two state-of-the-art frameworks for clinical concept extraction: cTAKES and MetaMap. 2.1. Unified Medical Language System The Unified Medical Language System (UMLS) [4] is a compendium of many controlled vocabu- laries in the biomedical sciences, produced and distributed by the National Library of Medicine (NLM). It also provides a mapping structure among these vocabularies and thus allows to trans- late among the various terminology systems. It can be therefore considered a comprehensive thesaurus and ontology of biomedical concepts. UMLS is composed by three modules: Metathesaurus, Semantic Network, and Specialist Lexicon. We now provide a brief explanation of each of these modules. 2.1.1. Metathesaurus The Metathesaurus of UMLS includes over one million biomedical concepts and five million concept names, enclosing many vocabularies such as ICD-10, SNOMED CT, MeSH, and more. The Metathesaurus is structured to facilitate the identification of synonyms between concepts, also in different languages ensured by leveraging hierarchical concept identifiers, in turn linked to: • Concepts (CUI): identifying the meaning to be expressed, it never changes over time, no matter the updates in the vocabularies. • Strings (SUI): each string representing a concept is assigned with a permanent string identifier. Any character variation (e.g. case sensitivity, punctuation, etc.) will result in a different SUI, for each language. A SUI can in principle be linked to more than one CUI. • Atoms (AUI): being building blocks of the Metathesaurus, atoms represent specific entries in the vocabularies included in UMLS. An AUI is linked to one and only one CUI. Figure 1: Hierarchical concept identifiers of UMLS, as reported in [5] • Lexical terms (LUI): a lexical term comprises different strings (i.e. SUI) that are lexical variants or minor variants. This layer is often used to reduce the computational complexity of exploration and for a more effective concept lookup. It is currently available for all the English strings, and only partially for other languages. 2.1.2. Semantic Network The Semantic Network provides a consistent categorization of all concepts represented in the Metathesaurus along with a set of useful relationships between these concepts. The network contains 133 semantic types and 54 relationships. Each concept in the Metathesaurus is assigned one or more semantic types, which are linked to one another through semantic relationships. The major semantic types are organisms, anatomical structures, biologic function, chemical, events, physical objects, etc. The possible relationships between semantic types range from simple ⟨is-a⟩ hierarchies to complex associations, such as ⟨physically related to⟩, ⟨spatially related to⟩, ⟨co-occurs with⟩. Relationships can be derived from associations already present in the vocabularies (intra- source relationships) or they can connect concepts from different vocabularies (inter-source relationships), including not only synonyms. Inter-source relationships enhance the integration of all the vocabularies present in UMLS and enable an easy exploration of the resulting ontology. A subset of the Semantic Network with ⟨is-a⟩ relationships is shown in Figure 2. 2.1.3. Specialist Lexicon Specialist Lexicon is a module of UMLS that addresses the high degree of variability in natural language words, allowing the abstraction of lexical variants. It covers general English lexicon and many biomedical terms, including syntactic, morphological, and orthographic information. Since only English words are covered by this module, we adopted a different approach for Italian Natural Language Processing, which we will describe in Section 3. Figure 2: A portion of UMLS Semantic Network 2.2. cTAKES Apache clinical Text Analysis and Knowledge Extraction System [6] (cTAKES, for short) is an open-source framework for knowledge extraction from clinical texts, exploiting NLP techniques including rule-based and machine learning approaches. It leverages a pipeline of six components. First of all the text is divided into sentences by the sentence boundary detector, a component which extends OpenNLP sentence detector [7]. Each sentence is then tokenized taking into consideration also context-specific occurrences (e.g. dates, time intervals, etc.). Each token is then normalized leveraging tools part of UMLS Specialist Lexicon (described in Section 2.1.3), in order to map each token in a normalized form with respect to many lexical properties (e.g. inflection, diacritics, symbols, stop words, etc.). Both the normalized and the original occurrences are maintained for further analysis. After a part-of-speech tagging, the named entity recognition annotator component performs a terminology-agnostic dictionary lookup on a subset of UMLS Metathesaurus (described in Section 2.1.1), searching all the noun-phrases identified and their respective unnormalized occurrences. 2.3. MetaMap MetaMap was developed by the National Library of Medicine (NLM) with the goal of mapping biomedical text to the UMLS Metathesaurus [8]. It relies on a pipeline similar to cTAKES, apart from the leveraging of relationships and hierarchical identifiers, present in UMLS Metathesaurus, to better identify synonyms and lexical variants of the tokenized texts. 2.4. Comparison between cTAKES and MetaMap A comparison between cTAKES and MetaMap has already been investigated in [9]: the results of the experiments proved that cTAKES slightly outperforms MetaMap, with the exception of texts in which abbreviations are present. It has been shown that abbreviations are quite common in natural language texts written by doctors and both tools have difficulties in correctly identify their correct meanings. With MetaMap, however, it is possible to partially solve this problem, specifying a list of strings that will be treated as special tokens. This possibility is Figure 3: The framework of our system not particularly investigated in the cited experimental comparison, and could be the subject of future studies. The main disadvantage of both cTAKES and MetaMap, with respect to our study, is that they are strongly English-centric, since they both rely on UMLS Specialist Lexicon which, as we already described in Section 2.1.3, fully covers only the English language. 3. Methodology We now present the methodology of our system. The final goal is to automatically extract ontological concepts representing which part of the human body is injured and what is the nature of the injury, given an Italian textual description of a work accident. Figure 3 displays the three phases of our workflow: Part-of-Speech (POS) Tagging, Keyphrase Extraction and Dictionary Lookup. The first phase receives as input a textual description representing the dynamic of an accident and gives as output a preprocessed and enriched representation of the input text. Specifically, Tint takes as input a raw text in Italian and performs a series of natural language processing operations. Tint (The Italian NLP Tool) [10] is an open-source Java-based pipeline for Natural Language Processing (NLP) in Italian. It is very fast and accurate, and implements most of the common linguistic tools, such as part-of-speech tagging and dependency parsing. This first phase is necessary since the next stage needs the text divided into tokens, lemmas and parts of speech in order to continue the execution. Example 1: Given the following accident description: “Erano in corso attività di produzione di acciaio. Mentre un agganciamento del nastro trasportatore vibrovaglio alla motopala, a causa di una manovra pericolosa rimaneva con le braccia in contrasto tra le due macchine decedendo per contusione al fegato”, the pipeline produces the tagged text visible in Table 1. The second phase receives as input the tagged text with lemmas and POS, and returns as output a new series of keyphrases ordered by importance and frequency. This step uses a tool called Keyphrase Digger (KD) [11] which analyzes the text file with tokens, lemmas and pos and returns as a result a new text file with a series of keyphrases ordered by importance and Table 1 A possible output of phase 1 Token POS Lemma Erano Verb essere in Preposition in corso Noun corso attività Noun attività di Preposition di produzione Noun produzione di Preposition di acciaio Noun acciaio frequency. Keyphrases are n-grams of different length, both single and multi-token expressions, which capture the main concepts of a given document [12]. Keyprhase extraction is essential to understand the topic covered in long text and has many applications, especially when integrated into pipelines, like this one, that perform more complex tasks. Example 2: After the second phase the pipeline extracted the following concepts: “produzione di acciaio”, “nastro trasportatore vibro”, “contusione al fegato”, “vaglio alla motopala” and “manovra pericolosa”. As can be seen from Example 2, not all the concepts extracted from the accident description contain information regarding the part of the body and the nature of the incident. The third and last phase, exploiting the Rest API provided by the UMLS, allows the system to query various databases and to discard all the concepts that do not relate to the medical field. This phase can be divided into two distinct sub-phases: • Concept lookup: we create a query that queries UMLS to get the medical concept. We use as input every possible combination of the keyphrases obtained in the previous step. • Semantic type lookup: after obtaining the medical concept, we check if it belongs to one of these four semantic types, which represent nature and location of an injury. Example 3: After the third phase the only keyphrase not discarded is “contusione al fegato”. 4. Experiments We tested our system on the accident descriptions contained in the InforMo dataset [13] made available by INAIL, a repository containing the results of a survey on mostly fatal accidents occurring during work time. This dataset contains 636 entries, each with detailed information on the incident. Concepts are extracted from the description written in natural language in the questionnaire (called dynamic) by those who compiled it. In the original questionnaire there is not always consistency between what is written in the dynamic section and the nature and location of the lesion’s attributes. As an example, we may find the concept “skull injury" in the dynamic of the accident, and “contusion" manually written as the nature of the injury. These two concepts might be considered as synonyms for someone who is filling out the questionnaire, but in an ontology they are two different concepts. To solve this problem, we decided to manually create a Golden Truth, analyzing all the dynamics to understand what could be extracted from them. For each text, our system must extract two concepts that will form a pair constituting the nature and location of an injury. What is extracted is compared with the golden truth to assess the accuracy of the framework. What we want to achieve is an exact extraction of the nature-location pair directly from the textual description of the accident. After analyzing each textual description, we have evidence to say that most of the times when a nature-location pair is present in the text, it is in the same period, so we analyze each period. If a couple is present in a period of the text we keep it, otherwise we delete the couple. This allowed to greatly reduce the extracted concepts while keeping the performance unchanged. To evaluate the performance of our pipeline we used recall, precision and F1-score, whose definitions are reported reported here: 𝑇𝑃 𝑇𝑃 Precision · Recall Recall := Precision := F1-Score := 2 · 𝑇𝑃 + 𝐹𝑁 𝑇𝑃 + 𝐹𝑃 Precision + Recall TP represents a correctly guessed nature-location pair, the FP instead are all those incorrect but still extracted, FN are those that are mistakenly not recognized as a match. In the UMLS query we can specify a parameter called “searchType" which can take two different values: “words" (by default) or “exact". With the first, a similarity search is carried out, resulting in the list of concepts most similar to the one given in input, ordered by decreasing similarity. With the second, on the other hand, a result is obtained only if the input word really exists in the database. We tested the system checking both of these parameters so as to understand which is the best one. In the Table 2 we list the results in the two cases. The results Table 2 Experimental results exact words Recall 0.90 0.85 Precision 0.51 0.20 F1-Score 0.65 0.32 show that the “exact" case is better than the second. A deeper analysis highlights that this it is due to the fact that in the second case many more concepts are extracted, most of which are quite different from the original one. 5. Conclusion and Future Work In this paper we presented a methodology for extracting medical concepts from accident de- scriptions written in natural language, specifically tailored for the Italian language. The system, still being in a preliminary phase, suffers from some limitations: (i) there is a strong dependence on UMLS and its provided APIs, this often makes the system pretty slow in its computation (ii) the experimental campaign carried out is pretty limited, this may cause problems in the applicability of the framework to other input sources. Acknowledgements This work has been funded by INAIL within the BRiC 2018, ID09 framework, project RECKON. References [1] B. Shickel, P. J. Tighe, A. Bihorac, P. Rashidi, Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis, IEEE journal of biomedical and health informatics 22 (2017) 1589–1604. [2] A. Hoerbst, E. Ammenwerth, Electronic health records, Methods Inf Med 49 (2010) 320–336. [3] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, et al., Clinical information extraction applications: a literature review, Journal of biomedical informatics 77 (2018) 34–49. [4] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270. doi:10.1093/nar/gkh061. [5] National Library of Medicine, UMLS reference manual, https://www.ncbi.nlm.nih.gov/ books/NBK9676/, 2021. Online; accessed 24-April-2021. [6] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, C. G. Chute, Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications, Journal of the American Medical Informatics Association 17 (2010) 507–513. [7] Apache Software Foundation, openNLP website, https://opennlp.apache.org/, 2021. Online; accessed 28-April-2021. [8] A. R. Aronson, F.-M. Lang, An overview of metamap: historical perspective and recent advances, Journal of the American Medical Informatics Association 17 (2010) 229–236. [9] R. Reátegui, S. Ratté, Comparison of metamap and ctakes for entity extraction in clinical notes, BMC medical informatics and decision making 18 (2018) 13–19. [10] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for nlp in italian, Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 10 (2018) 12. [11] G. Moretti, R. Sprugnoli, S. Tonelli, Digging in the dirt: Extracting keyphrases from texts with kd, CLiC it 198 (2015). [12] P. D. Turney, Learning algorithms for keyphrase extraction, Information retrieval 2 (2000) 303–336. [13] INAIL, Informo dataset, https://www.inail.it/sol-informo/, 2021. Online; accessed 28-April- 2021.