=Paper=
{{Paper
|id=Vol-1749/paper4
|storemode=property
|title=Relation Mining from Clinical Records
|pdfUrl=https://ceur-ws.org/Vol-1749/paper4.pdf
|volume=Vol-1749
|authors=Anita Alicante,Anna Corazza,Francesco Isgrò,Stefano Silvestri
|dblpUrl=https://dblp.org/rec/conf/clic-it/AlicanteCIS16
}}
==Relation Mining from Clinical Records==
Anita Alicante, Anna Corazza, Francesco Isgrò

Department of Electrical Engineering and Information Technologies (DIETI), Università di Napoli Federico II, via Claudio 21, 80125 Napoli, Italy

{anita.alicante|anna.corazza|francesco.isgro}@unina.it

Stefano Silvestri

Institute for High Performance Computing and Networking, ICAR-CNR, via P. Castellino, 111, 80131 Napoli, Italy

stefano.silvestri@icar.cnr.it

===Abstract===
We propose a system to extract entities and relations from a set of clinical records in Italian, based on two preceding works (Alicante et al., 2016b) and (Alicante et al., 2016a). This approach does not require annotated data and is based on existing domain lexical resources and unsupervised machine learning techniques.

===1 Introduction===
The digitization of medical documents in hospitals has produced plenty of information which should be adequately organized. While part of the material, mainly international scientific publications, is in English, increasingly more material is being created in the language of the country of the medical institution. The main part of the local language material is represented by patient records. They contain important information not only for preparing care plans or solving problems for the particular patient, but also for extracting statistics useful for research and for logistics administration.

Automatic processing of such repositories still cannot be straightforwardly applied. One of the principal issues to be solved is the automatic extraction of relevant information, usually consisting of entities and the relations connecting them (Alicante et al., 2016b). In the cited work, we extensively discuss a domain entity and relation recognition system for Italian. Such a step is at the basis of more sophisticated analyses, including semantics-based indexing of documents for improved retrieval, advanced query-based information extraction, and the application of ontology-based strategies for privacy protection.

General tools, such as TextPro (Pianta et al., 2008), are not adapted to technical domains such as the medical one, as they are trained on generic documents rather than domain-specific ones. Furthermore, many tools are available for English and only a few of them have been ported to Italian. Another problem to take into account is the occurrence, in clinical records, of typos and nonstandard abbreviations, in addition to the most usual acronyms. Last but not least, passing from text to knowledge processing raises tricky privacy problems. In fact, especially but not only in small hospitals, obscuring the patient names is not sufficient to hide their identity, as the medical information reported in records is often sufficient to reconstruct a precise profile of the patients.

Therefore, ad hoc solutions represent the only way to build effective applications to solve this kind of problem. For example, not only can domain entities and relations help identify potentially dangerous information, but ontological information can also be exploited to better protect patient privacy (Bonatti and Sauro, 2013). Again, ontology construction and population are based on entity and relation extraction.

Efforts to port systems to languages other than English require, first of all, the development of lexical resources for the considered language. However, these are not sufficient, because of the intrinsic differences between languages. A widely adopted way to tackle such difficulties is represented by machine learning approaches.

Although supervised approaches are usually more effective, they require large corpora of annotated data, which are quite expensive to obtain, as they require domain experts to invest time in a long and tedious annotation activity. In the medical domain, staff would have to invest part of their precious time in annotating data with information about the presence and the type of domain-relevant entities and relations in the records to be used for the training phase. Things would be much easier if domain experts were only required to check an automatically produced annotation. We therefore propose to integrate a knowledge-based and a text mining approach to develop an application which requires expert intervention only to check the medical and pharmaceutical labels associated with groups of relations.

In more detail, we propose here to integrate the systems discussed in (Alicante et al., 2016b) and in (Alicante et al., 2016a): the former adopts domain-dependent lexical resources to extract entities and unsupervised machine learning approaches to decide where relations occur in the text. The latter clusters and labels the extracted relations with an approach based on lexical semantics.

The paper is organized as follows: Section 2 details the implementation of the approach and Section 3 presents conclusions and future work.

===2 Proposed approach===
The proposed framework is composed of three modules, and its logical structure is depicted in Figure 1. The first one is devoted to the identification and classification of domain entities (i.e., medical and pharmaceutical entities), and exploits domain-related lexical resources and standard natural language tools. The second one performs relation extraction and is based on an unsupervised machine learning approach, namely clustering, to avoid the necessity of annotating data. A potential relation is hypothesized among all pairs of the entities identified in the preceding phase. Clustering is then applied to group similar entity pairs. Small clusters indicate the lack of repetitive patterns and are therefore considered as entity pairs which are not in relation to each other, while larger clusters are likely to correspond to different relation types. Relations are clustered and labeled using the approach proposed in (Alicante et al., 2016a). The decision about how a relation can be labeled is based only on the terms involved in the corresponding entity pair, without considering the context in which it occurs. In fact, this is complementary to the task of deciding whether two entities are related, which should be decided on the basis of the context where the two entities occur, as in (Alicante et al., 2016b). On the other hand, by considering only the two involved entities, we can only decide the type of a relation. To decide whether the relation is stated or negated, the context should also be considered in the analysis.

Figure 1: Architecture for Relation mining from clinical records.

The third module of the framework is based on Word Embeddings (WEs) (Mikolov et al., 2013), which represent the words involved in each entity with a real-valued array. The most interesting characteristic of WEs is that the mutual position of words in a metric space strongly depends on their meanings, so that words having similar semantics have a large similarity when this is computed, for example, by cosine similarity. Embeddings can be automatically built from a large collection of unannotated text with a very efficient algorithm. Therefore, they can be easily applied to any language, in our case Italian, provided that enough texts are available. We used documents extracted from Wikipedia for training. In particular, we considered pages flagged as Medicine, Biology and Pharmacy in Italian. For the extraction, we used CatScan v3.0 (https://tools.wmflabs.org/catscan2/catscan2.php), the Wikipedia Export tool (https://en.wikipedia.org/wiki/Special:Export) and Wikiextractor (medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py).

For each entity, we then consider the embeddings corresponding to each token. As shown in (Paperno and Baroni, 2016), a good representation for a string of words is given by the sum of the corresponding WEs. However, as we do not want such a representation to depend on the string length, we normalize the sum by the number of words involved in the entity, obtaining the average or centroid of the corresponding WEs. Each pair of entities occurring in the same sentence represents a possible candidate for a relation. We therefore build the feature vector for each entity pair by juxtaposing the average vectors of the two entities, and input this representation into a k-means clustering (Manning et al., 2008; Shalev-Shwartz and Ben-David, 2014).

====2.1 Input Preprocessing====
The text processed by our system is extracted from anonymized medical records, in the form of plain text encoded in UTF-8. The text includes a small set of special characters, used as delimiters and/or formatters. The largest part of these medical records has been produced by an HL7-compatible information system. At the end of each medical record, there is often an ICD9M (International Standard for Encoding and Classifying Diseases) disease code, which we disregard together with the rest of the structured part of the records.

The text is initially preprocessed to extract the textual parts from the medical records and to get rid of non-textual characters. The plain text produced by this preprocessing step is passed to the natural language processing suite TextPro to perform tokenization, sentence splitting, PoS tagging and lemmatization.

====2.2 Entity Extraction====
Entity extraction is crucial for our analysis, and a specific module has been implemented with the goal of extracting entities which are relevant for the application domain: biomedical and pharmaceutical entities in our case. The module follows a pattern matching approach by identifying each occurrence of a number of PoS patterns in the input text as a candidate to be further analysed. Afterwards, for each token occurring in the identified pattern, we search for matches of the corresponding lemma in the dictionaries. In the case of multi-word expressions, when several patterns apply to overlapping strings of tokens, we apply a greedy approach by choosing the longest one matching the input.

The output is produced following the TextPro format, that is, a line for each token and a column [...] files are enriched with the information about Medical and Pharmaceutical entities obtained from the dictionaries provided by UMLS (Unified Medical Language System, http://www.nlm.nih.gov/research/umls) and PRB (Pharmaceutical Reference Book, officially maintained by Agenzia Italiana del Farmaco). This information is labeled as MED for the medical entities and FAR for the pharmaceutical ones (the whole entity tag list is shown in Table 1).

{| class="wikitable"
|+ Table 1: List of medical sub-categories
! Description !! Label
|-
| Medical || MED
|-
| Pharmaceutical || FAR
|-
| Anatomy || ANA
|-
| Organisms || ORG
|-
| Diseases || MAL
|-
| Chemicals and Drugs || CHE
|-
| Technical medical equipment || TEC
|-
| Psychology and Psychiatric || PSI
|-
| Biology || BIO
|-
| Natural Sciences || NAT
|-
| Anthropology and Social Science || SOC
|-
| Technology, Industry and Agriculture || IND
|-
| Humanities || UMA
|-
| Computer Science || INF
|-
| Groups of People || GRU
|-
| Health care || ASS
|-
| Characteristics of Publication || PUB
|-
| Locations || LOC
|}

In addition to a label indicating whether the entity is medical (MED) or pharmaceutical (FAR), we also add to each medical entity annotation the sub-categories included in the UMLS database in correspondence to the dictionary entry. The list of sub-category labels is summarized in Table 1. A side effect of such sub-categorization is that the number of potential relations increases, while it becomes possible to find more specific relations.

====2.3 Relation Clustering====
We apply the k-means approach to identify groups of relations of the same type appearing in the data set. Each pair of entities occurring in the same sentence identifies a potential relation, therefore all possible entity pairs must be considered. We then apply a clustering algorithm to the set of all the potential relations identified. We disregard all entity pairs belonging to clusters having a size smaller than a given threshold.

We then concentrate on the remaining entity pairs, which are likely to represent actual relations, and semantically cluster them. The approach proposed for this is structured in three main modules: Feature Construction, Clustering, and Cluster Labeling. The first module builds a feature vector based on WEs for each relation candidate; to do this, it first constructs a WE dictionary by using a large collection of unannotated texts, in our case extracted from Wikipedia. This module is based on word2vec (Mikolov et al., 2013); the software is freely available at https://code.google.com/p/word2vec/. For the feature vector length we chose 500, which is the default choice, and set the minimum word count to 3, to exclude the less frequent words from the dictionary, obtaining a set of 260,680 vectors.

After that, the k-means clustering is applied to the set of feature vectors obtained by the first module. For every entity pair we construct a Feature Vector (FV) starting from the WE of each word involved. Each entity can be composed of one or more words, as for example conati di vomito (retching): in this case, for each entity, we take the average of the WEs of the words composing the entity. Finally, we concatenate the FVs of the two entities, obtaining an FV of 1,000 entries.

The clustering algorithm is then applied to the FV data set by means of the C Clustering library (de Hoon et al., 2004), a fast C implementation of the k-means algorithm. As k-means is characterized by a random initial choice of the seeds, we repeated each run 10 times, always choosing the best solution. We considered the cosine similarity, choosing a number of clusters equal to 40, which seemed a reasonable choice given the results from the experiments in (Alicante et al., 2016b) and in (Alicante et al., 2016a).

Finally, to label each cluster we ordered the pairs in each cluster according to their cosine similarity to the cluster centroid: the first four pairs are then chosen to characterize the cluster.

As discussed above, each FV can be partitioned in two parts: the first half corresponds to the first entity in the pair, the second one to the other. Such a partition is consistently maintained during the whole processing. In the computation of centroids in the k-means clustering algorithm as well, the former half of each centroid derives from the average of the former half of the involved FVs and therefore corresponds to the first entity. Correspondingly, the latter half of each centroid vector only depends on the second entity of each involved pair.

The choice of the cluster to which a given item is assigned is based on the cosine similarity. Its computation can be divided in three parts: the dot product of the parts of the two FVs corresponding to the first entity, the same for the second entity, and finally the normalization with respect to the whole FV. Therefore, the evaluation of the cosine similarity is based on a trade-off between how similar the first entities are and how similar the second entities are in each pair. In other words, the pairs chosen as labels are actual entity pairs which are similar to the (abstract) cluster representative, corresponding to the centroid.

===3 Conclusions and future work===
In this paper we presented a system for the extraction of information from clinical records in Italian. A first part of the system aims to extract domain-relevant entities from medical reports by a pattern matching approach. A second part takes the output of the former step and applies a clustering approach to explore possible relations between such entities. A third part is based on WEs and aims to give cues about the type of the relations.

Interestingly, the approach does not require annotated data, but only easily available data such as Wikipedia and off-the-shelf tools, in addition to the documents to process. Naturally, the available tools have been trained on annotated data, but without any adaptation to the specific domain. It would therefore be interesting to port the approach to a new language, possibly different from English, which is the most widely studied of all languages.

===Acknowledgments===
The research presented in this paper was partially supported by the national projects CHIS - Cultural Heritage Information System (PON), and BIG4H - Big Data Analytics for E-Health Applications (POR).

===References===
Anita Alicante, Anna Corazza, Francesco Isgrò, and Stefano Silvestri. 2016a. Semantic cluster labeling for medical relations. In Proceeding of Innovation in Medicine and Healthcare 2016, pages 183–193, Puerto de la Cruz, Tenerife, Spain. Springer.

Anita Alicante, Anna Corazza, Francesco Isgrò, and Stefano Silvestri. 2016b. Unsupervised entity and relation extraction from clinical records in Italian. Computers in Biology and Medicine, 72:263–275.

Piero A. Bonatti and Luigi Sauro. 2013. A confidentiality model for ontologies. In Harith Alani, Lalana Kagal, Achille Fokoue, Paul T. Groth, Chris Biemann, Josiane Xavier Parreira, Lora Aroyo, Natasha F. Noy, Chris Welty, and Krzysztof Janowicz, editors, International Semantic Web Conference (1), volume 8218 of Lecture Notes in Computer Science, pages 17–32. Springer.

Michiel J.L. de Hoon, Seiya Imoto, John Nolan, and Satoru Miyano. 2004. Open source clustering software. Bioinformatics, 20(9):1453–1454.

C.D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Proc. of the International Conference on Learning Representations (ICLR 2013), pages 1–12.

Denis Paperno and Marco Baroni. 2016. When the Whole is Less than the Sum of its Parts: How Composition Affects PMI Values in Distributional Semantic Vectors. Computational Linguistics, 42(2):345–350.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro Tool Suite. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 28–30, Marrakech, Morocco. European Language Resources Association (ELRA).

Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA.
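
As an illustration of the feature construction, the three-part cosine similarity and the cluster labeling described in the Relation Clustering section, the following is a minimal sketch in plain Python. It is not the authors' implementation: the paper uses word2vec vectors of length 500 and the C Clustering library, whereas here the embeddings are invented 3-dimensional toy values and all function names (entity_vector, pair_feature_vector, cosine_by_halves, label_pairs) are hypothetical, chosen only for this example.

```python
import math

# Toy 3-dimensional embeddings; the values are invented for illustration
# and do not come from any trained word2vec model.
TOY_WE = {
    "conati": [0.9, 0.1, 0.0],
    "di": [0.1, 0.1, 0.1],
    "vomito": [0.8, 0.3, 0.1],
    "gastrite": [0.7, 0.2, 0.2],
    "aspirina": [0.0, 0.9, 0.3],
    "cefalea": [0.2, 0.1, 0.9],
}


def entity_vector(tokens, we):
    """Average (centroid) of the embeddings of the tokens of one entity."""
    dim = len(next(iter(we.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(we[tok]):
            total[i] += x
    return [x / len(tokens) for x in total]


def pair_feature_vector(first, second, we):
    """Concatenate the two averages: first half = first entity, second half = second."""
    return entity_vector(first, we) + entity_vector(second, we)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def cosine_by_halves(u, v):
    """Cosine similarity computed as the three parts named in the text."""
    h = len(u) // 2
    first_part = dot(u[:h], v[:h])   # dot product over the first-entity halves
    second_part = dot(u[h:], v[h:])  # dot product over the second-entity halves
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))  # normalization on whole FVs
    return (first_part + second_part) / norm


def label_pairs(fvs, centroid, n=4):
    """Indices of the n feature vectors most similar to the cluster centroid."""
    order = sorted(range(len(fvs)),
                   key=lambda i: cosine(fvs[i], centroid),
                   reverse=True)
    return order[:n]


fv_a = pair_feature_vector(["conati", "di", "vomito"], ["gastrite"], TOY_WE)
fv_b = pair_feature_vector(["aspirina"], ["cefalea"], TOY_WE)
```

With embeddings of length 500, pair_feature_vector would return the 1,000-entry FV mentioned in the text, and label_pairs with n=4 corresponds to choosing the first four pairs to characterize a cluster. Because the normalization uses the whole FV, a pair whose first entity matches the centroid closely can still score low if its second entity does not, which is the trade-off noted in the text.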