Automatic extraction of semantic relations between medical entities: Application to the treatment relation

Asma Ben Abacha
LIMSI-CNRS
BP 133 - F-91403 Orsay Cedex
asma.benabacha@limsi.fr

Pierre Zweigenbaum
LIMSI-CNRS
BP 133 - F-91403 Orsay Cedex
pz@limsi.fr

Abstract

Information extraction is a complex task which is necessary to develop high-precision information retrieval tools. In this paper, we present MeTAE, a platform to extract medical entities and the medical relations linking them. The proposed approach relies on linguistic patterns and domain knowledge and consists of two steps: (i) recognition of medical entities and (ii) identification of the correct semantic relation between each pair of entities. The first step is achieved by an enhanced use of MetaMap which improves the precision obtained by MetaMap by 19.59 percentage points in our evaluation. The second step relies on linguistic patterns which are built semi-automatically from a corpus selected according to semantic criteria. We evaluate our system's ability to identify medical entities of 16 types. We also evaluate the extraction of treatment relations between a treatment (e.g., medication) and a problem (e.g., disease): we obtain 75.72% precision and 60.46% recall. These results are encouraging compared with similar research in the literature.

1 Introduction

Medical knowledge is growing significantly every year. According to some studies, the volume of this knowledge doubles every five years (Engelbrecht, 1997), or even every two years (Hotvedt, 1996). With large-scale digitisation, several medical search engines have become available, such as PubMed (http://www.pubmed.com) for searching biomedical literature, CISMeF (http://www.chu-rouen.fr/cismef), a catalog and index of French medical Web sites, or Health On the Net (http://www.healthonnet.org), a public medical search engine.

But, while these search engines contribute greatly to making large volumes of medical knowledge accessible, their users often have to deal with the burden of browsing and filtering the numerous results of their queries in order to find the precise information they were looking for. This point is more crucial for practitioners who may need an immediate answer to their queries during their work.

In this context, we need systems able to respond to users' queries with precise answers. Such tools need deep analysis of biomedical documents in order to extract relevant information. At the first level of this information come the medical entities (e.g. diseases, drugs, symptoms). At the second, more complicated level comes the extraction of semantic relationships between these entities.

In this paper, we present our method to extract semantic relations between medical entities, with an empirical study on the "treatment" relation. We first propose an enhanced use of MetaMap (Aronson, 2001) to extract medical entities and compare it with the simple application of MetaMap on the same test corpora. To extract occurrences of the target relations, we then design linguistic patterns based on selected sentences from PubMed Central articles. We present a method to obtain such sentences by leveraging UMLS Metathesaurus knowledge and MeSH indexing of PubMed Central. We evaluate entity and relation extraction on a distinct corpus of 580 sentences and obtain promising results. We also present MeTAE, a platform for automatic semantic annotation and exploration of medical texts which incorporates these information extraction components and lets a user query the obtained information. We finally discuss our results and conclude on further work.

2 Background

The reference tool for medical entity recognition is MetaMap (Aronson, 2001), a system which maps medical text to UMLS concepts. Using MetaMap therefore provides a strong baseline to start with. MetaMap is able to identify most concepts in the titles of articles from MEDLINE (Pratt and Yetisgen-Yildiz, 2003). Meystre and Haug (2005) obtained good precision and recall (0.753 and 0.892, respectively) with an approach based on MetaMap for extracting "medical problems".

However, the use of MetaMap leads to some residual problems at two levels: (i) in the segmentation and the extraction of medical entities: MetaMap considers some general words and some verbs as medical entities (e.g. best, normal, take, reduce) and (ii) in the categorization of medical entities: MetaMap may propose several concepts for the same term as well as several semantic types for the same concept. We address these two issues in our system by performing independent segmentation of the text given to MetaMap, then imposing constraints on the semantic types of the concepts it detects.

Domain-independent relation extraction has been studied through a wide range of approaches which can be classified into four categories: statistical approaches based on term frequency and co-occurrence of specific terms (Hindle, 1990), machine learning techniques (Zhu et al., 2009), linguistic approaches (Hearst, 1992) (e.g. using manually written extraction rules) and hybrid approaches which combine two or more of the preceding methods (Suchanek et al., 2006).

In the medical domain, the same strategies can be found but the specificities of the domain led to specialised methods. Cimino and Barnett (1993) used linguistic patterns to extract relations from titles of Medline articles. The authors used MeSH headings and co-occurrence of target terms in the title field of a given article to construct relation extraction rules. Khoo et al. (2000) focused on extracting causal relations from abstracts of biomedical articles by aligning manually-constructed graph patterns with syntactic dependency trees. Lee et al. (2003) used UMLS to identify semantic relations between medical entities. Their first method could extract 68% of the semantic relations in their test corpus, but when several relations were possible between the relation arguments no disambiguation was performed. Their second method (Lee et al., 2004) targeted the precise extraction of "treatment" relations between drugs and diseases. Manually written linguistic patterns were constructed from medical abstracts about cancer. Their system reached 84% recall but an overall 48.14% precision. Embarek and Ferret (2008) proposed an approach to extract four kinds of relations (Detect, Treat, Sign and Cure) between five kinds of medical entities. The patterns used were constructed automatically using an alignment algorithm which maps sentence parts using an edit distance (defined between two sentences) and different word-level clues.

SemRep (Rindflesch et al., 2000), a natural language processing application, targeted the extraction of semantic relationships in biomedical text through a rule-based approach. SemRep (Fiszman et al., 2007) obtained 53% recall and 67% precision in identifying risk factors and biomarkers for diseases asserted in MEDLINE citations. An enhanced version of SemRep (Ahlers et al., 2007) was proposed to identify core assertions on pharmacogenomics and obtained an overall 55% recall and 73% precision.

Domain-independent relation extraction methods are not directly applicable to the medical domain due to the lack of domain-independent markers that may help to recognise medical entities (e.g. capital letters, regular grammatical structure) and to the variety in the expression of domain concepts (e.g. Amoxicillin = amoxycillin = AMOX). To bypass these problems, medical relation extraction approaches often rely on domain knowledge such as the UMLS Metathesaurus and Semantic Network. But the downstream use of the extracted relations is not always taken into account in the extraction procedure. For instance, if the extracted relations are to be used in keyword querying systems, we should either give priority to recall or give the same priority to recall and precision, whereas if the final application is a question answering system for practitioners, priority should be given to the precision of extraction. Some medical relation extraction approaches also do not extract the arguments of a relation (e.g. (Lee et al., 2004)), or evaluate their approaches by counting relations extracted with only one argument as correct (e.g. (Pustejovsky et al., 2002)), considering that recall is the most important measure. In our context we are interested in medical question answering systems as back-end and give priority to precision, considering the correct extraction of arguments as mandatory to validate the identified relations.

Most relation extraction methods rely on a corpus where example occurrences of the target relations can be found. For instance, given pairs of seed terms which are known to be linked by the target relation, semi-supervised methods such as that introduced in (Hearst, 1992) collect occurrences of these term pairs in the corpus and use them to build relation patterns.

The selection of a relevant corpus is a key point here: for such a method to work, the corpus must contain mentions of the target relationship between these pairs of terms. We propose a method to increase the chances that such mentions are actually found in the selected texts.

3 Annotation Method

Our method is twofold. In a first step, we extract medical entities from sentences and determine their categories. In a second step, we extract semantic relations between the extracted entities.

3.1 Medical Entity Recognition

By "medical entity", we refer to an instance of a medical concept such as Disease or Drug. Medical entity recognition consists in: (i) identifying medical entities in the text and (ii) determining their categories. For instance, in the sentence "ACE inhibitors reduce major cardiovascular disease outcomes in patients with diabetes.", the medical entity ACE inhibitors should be identified as a treatment and the medical entity cardiovascular disease outcomes should be identified as a problem.

One of the most important obstacles to identifying medical entities is the high terminological variation in the medical domain (e.g. Swine influenza = swine flu = pig flu). MetaMap (Aronson, 2001) deals with this variation by using morphological knowledge found in the UMLS Specialist Lexicon and term variants present in the UMLS Metathesaurus. However, as mentioned in the Background section, some issues must still be addressed. According to empirical observations, the sentence and noun phrase segmentations provided by MetaMap are not as accurate as those provided by other, non-specialized Natural Language Processing tools. Besides, a disambiguation step is required on the obtained concepts.

To solve these problems, we propose a three-step approach:

1. Split the biomedical texts into sentences and extract noun phrases with non-specialized tools. We use LingPipe (http://alias-i.com/lingpipe/) and treetagger-chunker, which offer a better segmentation according to empirical observations.

2. Determine medical entities as well as UMLS concepts and semantic types with MetaMap.

3. Filter the obtained medical entities with (i) a list of the most frequent/noticeable errors and (ii) a restriction on the semantic types used by MetaMap in order to keep only semantic types which are sources or targets for the targeted relations (cf. Table 1).

Category     Example Semantic Types
Problem      Anatomical Abnormality, Injury or Poisoning, Disease or Syndrome
Treatment    Pharmacologic Substance, Therapeutic or Preventive Procedure
Test         Diagnostic Procedure, Laboratory Procedure

Table 1: Examples of categories and corresponding UMLS semantic types

3.2 Relation Extraction

Our approach is based on the use of linguistic patterns. For every pair of medical entities, we collect the possible relations between their semantic types in the UMLS Semantic Network (e.g. between the semantic types Therapeutic or Preventive Procedure and Disease or Syndrome there are five relations: treats, prevents, complicates, etc.). We construct patterns for each relation type (cf. Section 3.3) and match them with the sentences in order to identify the correct relation. The relation extraction process relies on two criteria: (i) a degree of specialization associated with each pattern and (ii) an empirically fixed order associated with each relation type, which determines the order in which the patterns are matched. We target six relation types, described in Figure 1.

[Figure 1: Excerpt of the Relations Ontology]

3.3 Pattern Construction

Semantic relations are not always expressed with explicit words such as treat or prevent. They are also frequently expressed with combined and complex expressions. Therefore, it is difficult to build patterns which can cover all relevant expressions. However, the use of patterns is one of the most effective methods for automatic information extraction from textual corpora if they are efficiently designed (Cimino and Barnett, 1993; Lee et al., 2004; Embarek and Ferret, 2008).

To build patterns for a target relation R, we used a corpus-based strategy akin to that of (Hearst, 1992) and followers. We illustrate it with the treats relation. To apply this strategy we first need seed terms corresponding to pairs of concepts known to be linked by the target relation R. To obtain such pairs, we extracted from the UMLS Metathesaurus all the pairs of concepts connected by the relation R. For instance, for the treats Semantic Network relation, the Metathesaurus contains 45,145 treatment-problem pairs linked with the "may treat" Metathesaurus relation (e.g. Diazoxide may treat Hypoglycemia).

We then need a corpus of texts where occurrences of both terms of each seed pair will be looked for. We build this corpus by querying the PubMed Central database (PMC, http://www.ncbi.nlm.nih.gov/pmc/) of biomedical articles with focused queries. These queries try to identify articles that have high chances of containing the target relation between the two seed concepts. We aimed to optimize precision, therefore we applied the following principles.

• Since PMC, like PubMed, is indexed with MeSH headings, we restrict our set of seed concepts to those which can be expressed by a MeSH term.

• We impose a MeSH-based search mode on PMC by adding the /MH qualifier to the concepts.

• We also want these concepts to play an important role in the article. One way to specify this is to ask for them to be 'major topics' of the paper they index ([MAJR] field in PubMed or PMC; note that this implies /MH).

• Finally, the target relation should be present between the two concepts. MeSH and PMC provide a way to approximate a relation: some of the MeSH subheadings (e.g., therapy or prevention and control) can be taken as representing underspecified relations, where only one of the concepts is provided. For instance, Rhinitis, Vasomotor/TH can be seen as describing a treats relation (/TH) between some unspecified treatment and a rhinitis. Unfortunately, MeSH indexing does not allow the expression of full binary relations (i.e., linking two concepts), so we had to keep this approximation.

Queries are thus designed according to the following model: problem/TH[MAJR] AND treatment/MH, where problem and treatment stand for the two seed concepts.

They are submitted to PMC to obtain full-text articles on the required topics. This method should increase the chances of obtaining sentences where one of the reference relations occurs, and provides a large variety of expressions of the target relation.

The resulting corpus contains a set of medical articles in XML format. From each article we construct a text file by extracting relevant fields such as the title, the summary and the body (if they are available). Then, we split every text into sentences using the segmentation model of the LingPipe project. We apply MetaMap on each sentence and keep the sentences which contain at least one pair of concepts (c1, c2) connected by the target relation R according to the Metathesaurus.

This semantic pre-analysis reduces the manual effort required for subsequent pattern construction, which allows us to enrich the patterns and to increase their number. The patterns constructed from these sentences consist of regular expressions taking into account the occurrence of medical entities at precise positions. Table 2 presents the number of patterns constructed for each relation type and some simplified examples of regular expressions. A similar process was performed to extract a different set of articles for our evaluation.

Relation     Pattern number   Simplified examples
causes       28               . . . E1 may trigger E2 . . .
diagnoses    12               E1 is the best test for (the diagnoses of)? E2
treats       46               . . . E1 was found to reduce E2 . . .
prevents     13               . . . E1 for prophylaxis against E2 . . .

Table 2: Examples of relation patterns

4 Evaluation

In this section, we present our evaluation method and the obtained results for medical entity recognition and the extraction of treatment relations.

4.1 Evaluation Method

To build an evaluation corpus, we queried PubMed Central with MeSH queries (e.g. Rhinitis, Vasomotor/th[MAJR] AND (Phenylephrine OR Scopolamine OR tetrahydrozoline OR Ipratropium Bromide)). Then we chose a subset of 20 varied articles (e.g. reviews, comparative studies). We verified that no article of the evaluation corpus is used in the pattern construction process. The last stage of preparation was the manual annotation of medical entities and treatment relations in these 20 articles (total = 580 sentences). Figure 2 shows an example of an annotated sentence.

[Figure 2: Example of manual annotations. Annotated sentence: "A subsequent study of patients with cSSSI also found that daptomycin resulted in faster clinical improvement". Annotation values shown: treat, established-known, daptomycin, cSSSI.]

We use the standard measures of recall, precision and F-measure. The precision of named entity recognition depends both on the textual boundaries of the extracted entity and on the correctness of its associated category (semantic type). In our evaluation, boundary-only errors cost half a point and the precision is calculated according to the following formula:

\[ \mathrm{Precision} = \frac{C + 0.5 \times B + 0 \times T}{N} \tag{1} \]

• C: number of correct entities.

• B: number of entities with correct semantic type but incorrect boundaries.

• T: number of entities with wrong semantic types.

• N: total number of retrieved entities (C + B + T = N).

The recall of named entity recognition was not measured due to the difficulty of manually annotating all the medical entities in our corpus. For the relation extraction evaluation, recall is the number of correct treatment relations found divided by the total number of treatment relations. Precision is the number of correct treatment relations found divided by the number of treatment relations found.

4.2 Results

Table 3 shows the precision of medical entity recognition obtained by our entity extraction approach (text to sentence segmentation with LingPipe, sentence to noun phrase segmentation with treetagger-chunker and stoplist filtering), denoted LTS+MetaMap, compared to the simple use of MetaMap. Entity type errors are denoted by T, boundary-only errors are denoted by B and precision is denoted by P.

The LTS+MetaMap method led to a significant increase in the precision of medical entities recognized by MetaMap. In particular, LingPipe outperformed MetaMap in sentence segmentation on our test corpus. LingPipe found 580 correct sentences whereas MetaMap produced 743 sentences, which contained boundary errors; some sentences were even cut in the middle of medical entities (most often because of abbreviations). A qualitative study of the noun phrases extracted by MetaMap and treetagger-chunker also shows that the latter produces fewer boundary errors.

                                 LTS + MetaMap            MetaMap
                                 Tr      Br      P        Tr      Br      P
Disease Or Syndrome              9.81    26.48   76.94    9.09    52.27   64.77
Injury or poisoning              26.19   35.71   55.95    33.33   34.84   49.24
Neoplastic Process               37.5    12.50   56.25    29.03   6.45    67.74
Anatomical Abnormality           40.00   0.00    60.00    85.71   0.00    14.28
Cell or Molecular Dysfunction    44.44   44.44   27.79    66.66   25.00   20.83
Total                            12.23   27.10   74.21    30.08   30.52   54.62

Table 3: Medical entity extraction according to semantic types. Tr = T/N, type error rate; Br = B/N, boundary error rate; P = precision. All results are percentages.

[Figure 3: MeTAE - Annotation Interface]

For the extraction of treatment relations, we obtained 60.46% recall, 75.72% precision and 67.23% F-measure. Among other approaches relevant to our work, (Lee et al., 2004) obtained 84% recall, 48.14% precision and 61.20% F-measure for the extraction of treatment relations. SemRep (Ahlers et al., 2007) obtained 54% recall, 84% precision and 68.21% F-measure on a set of predications including the treatment relationship (i.e. administered to, manifestation of, treats). However, given the differences in corpora and in the nature of relations, these comparisons must be considered with caution.
5 Annotation and exploration platform: MeTAE

We implemented our approach in the MeTAE platform (an enhanced version of the platform MeTAE will be available online very shortly at http://www.limsi.fr/Individu/abacha/metae.html), which allows users to annotate medical texts or files and writes the annotations of medical entities and relationships in RDF format to external storage (cf. Figure 3). MeTAE also allows users to explore the available annotations semantically through a form-based interface. User queries are reformulated in SPARQL according to a domain ontology which defines the semantic types associated with the medical entities and the semantic relationships with their possible domains and ranges. Answers consist of the sentences whose annotations conform to the user query, together with their corresponding documents (cf. Figure 4).

[Figure 4: MeTAE - Exploration Interface]

6 Discussion

Several semantic relation extraction approaches only address relation detection (e.g. finding that a sentence contains the searched relation (Lee et al., 2004)). In the context of medical question-answering systems, we are not only interested in relation detection but also in the linked medical entities. We focus on searching triples (source, relation, target) such that the source and the target have known categories (semantic types) and such that the relation is valid with respect to domain knowledge and to linguistic considerations (i.e. the sentence really says that the source treats the target). In this context, the same sentence may contain several triples.

A first analysis of the false positives shows that the main error causes are: (i) errors in the extraction of medical entities, (ii) patterns of the treatment relation that also cover forms of expression of other relations and (iii) sentences that contain possible source and target entities which are not connected by the treatment relation.

We obtained good results in precision and F-measure compared to other semantic relation extraction approaches. This meets our initial objective, which is to have a high precision in relation extraction in order to build efficient question-answering systems.

7 Conclusion

In this paper, we presented a knowledge-based and linguistic approach for the extraction of medical entities and the semantic relations linking them. This approach is based on two main steps: (i) the recognition of medical entities with an enhanced use of MetaMap and (ii) the exploitation of linguistic patterns taking into account the semantic types of medical entities. The results obtained on a real test corpus show the effectiveness of our approach and its advantages for question-answering systems.

In the short term, we intend to study the false negatives in order to improve our patterns. We also intend to design a method which automatically extracts contextual information such as the status of the relation (e.g. hypothetical, established-known) and information about patients (e.g. gender, age).

References

Caroline B. Ahlers, Marcelo Fiszman, Dina Demner-Fushman, François-Michel Lang and Thomas C. Rindflesch. 2007. Extracting Semantic Predications From Medline Citations for Pharmacogenomics. Pacific Symposium on Biocomputing, 2007.

Alan R. Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. AMIA Annu Symp Proc, 17-21.

J.J. Cimino and G.O. Barnett. 1993. Automatic knowledge acquisition from MEDLINE. Methods of Information in Medicine;32(2):120-130.

Mehdi Embarek and Olivier Ferret. 2008. Learning Patterns for Building Resources about Semantic Relations in the Medical Domain. Proceedings of the Sixth International Language Resources and Evaluation (LREC'08).

Rolf Engelbrecht. 1997. Expert systems for medicine functions and developments. Zentralbl Gynakol;119(9):428-34.

Marcelo Fiszman, Graciela Rosemblat, Caroline B. Ahlers and Thomas C. Rindflesch. 2007. Identifying risk factors for metabolic syndrome in biomedical text. AMIA Annu Symp Proc, 249-253.

Marti A. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the 14th conference on Computational linguistics, 539-545.

Donald Hindle. 1990. Noun classification from predicate argument structures. In Proceedings of the 28th annual meeting on Association for Computational Linguistics, 268-275.

Martyn O. Hotvedt. 1996. Continuing medical education: actually learning rather than simply listening. JAMA 1996, 275:1638.

Christopher S. G. Khoo, Syin Chan, and Yun Niu. 2000. Extracting Causal Knowledge from a Medical Database Using Graphical Patterns. In Proceedings of 38th Annual Meeting of the ACL, Hong Kong.

Chew-Hung Lee, Jin-Cheon Na, and Christopher Khoo. 2003. Ontology Learning for Medical Digital Libraries. Proceedings of the 6th International Conference of Asian Digital Library, 302-305.

Chew-Hung Lee, Christopher Khoo and Jin-Cheon Na. 2004. Automatic identification of treatment relations for medical ontology learning: An exploratory study. Proceedings of the Eighth International ISKO Conference, 245-250.

Stéphane M. Meystre and Peter J. Haug. 2005. Comparing natural language processing tools to extract medical problems from narrative text. AMIA Annu Symp Proc, 525-9.

Wanda Pratt and Meliha Yetisgen-Yildiz. 2003. A Study of Biomedical Concept Identification: MetaMap vs. People. AMIA Annu Symp Proc, 529-533.

Denys Proux, François Rechenmann, and Laurent Julliard. 2000. A Pragmatic Information Extraction Strategy for Gathering Data on Genetic Interactions. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, p.279-285, August 19-23, 2000.

James Pustejovsky, José M. Castaño, Jason Zhang, M. Kotecki, and B. Cochran. 2002. Robust Relational Parsing over Biomedical Literature: Extracting Inhibits Relations. Pacific Symposium on Biocomputing, 362-373.

Thomas C. Rindflesch, Carol A. Bean and Charles A. Sneiderman. 2000. Argument Identification for Arterial Branching Predications Asserted in Cardiac Catheterization Reports. AMIA Annu Symp Proc, 704-708.

Thomas C. Rindflesch, Jayant V. Rajan, and Lawrence Hunter. 2000. Extracting Molecular Binding Relationships from Biomedical Text. Proceedings of the sixth conference on Applied natural language processing, p.188-195, April 29-May 04, 2000, Seattle, Washington.

Gunther Schadow and Clement J. McDonald. 2003. Extracting Structured information from free text pathology reports. AMIA Annu Symp Proc, 584-588.

Fabian M. Suchanek, Georgiana Ifrim and Gerhard Weikum. 2006. Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 712-717.

Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: a statistical approach to extracting entity relationships. Proceedings of the 18th international conference on World wide web, April 20-24, 2009, Madrid, Spain.