A Light-weight & Robust System for Clinical Concept Disambiguation

Dirk Weissenborn, Roland Roller, Feiyu Xu and Hans Uszkoreit
Language Technology Lab, DFKI
Alt-Moabit 91c, Berlin, Germany
{dirk.weissenborn, roland.roller, feiyu, uszkoreit}@dfki.de

Enrique Garcia Perez
SAP Innovation Center
Konrad-Zuse-Ring 10, Potsdam, Germany
enrique.garcia.perez@sap.com

Abstract

This paper presents a system for the normalization of concept mentions in clinical narratives. We evaluate and compare it against a popular, open-source solution that is frequently used for natural language processing of clinical text. The evaluation is based on a manually annotated dataset of 72 discharge summaries taken from the i2b2 corpus. Besides the demonstration and evaluation of our system, we provide an in-depth corpus analysis that guided the development of the system. Our focus lies on the task of concept disambiguation, for which we combine two unsupervised approaches that are easy to implement and computationally inexpensive. We show that some ambiguities can only be resolved by adapting to annotation guidelines and preferences, which we address via the introduction of heuristics. Finally, we present an online demo that gives insights into the individual parts of the normalization pipeline.

1 Introduction

Recognizing and disambiguating clinical concepts plays a central role in many information extraction tasks within the clinical domain. It requires the identification of concept mentions in clinical narratives and the disambiguation of their respective surface strings (normalization). In recent years, many tasks have focused on the normalization of clinical concepts, such as the i2b2 challenge (Uzuner et al., 2011), ShARe/CLEF (Pradhan et al., 2013) and SemEval (Elhadad et al., 2015).

Traditionally, disambiguation systems rely on supervised (Martinez and Baldwin, 2011), semi-supervised (Preiss and Stevenson, 2013) or unsupervised (Agirre et al., 2010) methods. Each of those techniques has its advantages. However, as seen in different disambiguation tasks, simple methods (and their combination) can achieve very good results, such as the generation of rules and heuristics from the training data (Afzal et al., 2015), the usage of similarity measures (Pathak et al., 2015) or the inclusion of Information Content (Leal et al., 2015).

In this work we develop a light-weight solution to the problem of clinical concept normalization that is easy to implement, does not require expensive computations and is therefore particularly suited for industrial application. The approach is mainly unsupervised and does not require large amounts of training data. In particular, the disambiguation is based on a densest-subgraph algorithm, which ensures contextual compatibility among the normalized concepts, and on the string similarity between the surface string and the preferred labels of a respective concept. We achieve very good performance with this setup on a manually annotated dataset. A web application (http://clinical-ta.dfki.de) was developed for demonstration purposes and to debug the normalization pipeline.

2 Clinical Concept Normalization

The concept normalization task requires a well-defined target vocabulary. A useful resource is the Unified Medical Language System (UMLS), which defines biomedical concepts with various names, spellings and abbreviations. Concepts within UMLS are defined by so-called concept unique identifiers (CUI) that represent concepts across different biomedical vocabularies, such as NCI, NDF-RT or RxNorm. However, natural language is highly variable and surface strings can have different meanings depending on the context.

The task of normalizing surface strings to unique concepts of a given vocabulary such as UMLS can be subdivided into three partial tasks: Mention Recognition, Candidate Search and Disambiguation. Given an input text, the mention recognition subtask identifies text spans that are potential mentions of a medical concept. Subsequently, the candidate search is responsible for finding candidate concepts for the surface strings of each mention. Finally, the disambiguation step selects the candidate that fits best into the mention's context, i.e., it resolves the ambiguity among its candidates. This work focuses on the disambiguation task.

2.1 Data

In our experiments we used a part of the i2b2 corpus (https://www.i2b2.org/; Uzuner et al., 2011) that was manually re-annotated. (Note that the re-annotation took place within an industrial use case and was not carried out by one of the authors. The data and the dictionaries we used were already given.) It consists of 72 discharge summaries. Overall, the dataset contains 6336 annotations. Table 1 lists annotation types and their corresponding number of annotations. The corpus was split into 2 distinct subsets, each covering half of the documents. The first set was used for system development and the second half for testing.

Concept-Type (Source)          #Annotations
Symptoms (NCI)                 1434
Disease (NCI)                  1370
Medication (RxNorm)            1190
Diagnostic Procedure (NCI)      647
Therapeutic procedure (NCI)     644
Anatomy (NCI)                   593
Laboratory Tests (NCI)          458

Table 1: Concept annotations by type in our dataset.

We also analyzed the ambiguities within the corpus based on our candidate search (§3.2). Table 2 lists different ambiguity classes and their fraction in the dataset. It shows that ambiguity arises in only 18% of mentions. Candidate search fails in about a third of all cases (33%), i.e., the correct candidate is not found. For most of those cases (28% of all mentions) no candidate is found at all. This shows that the currently employed dictionary lookup has to be refined. However, this work addresses the problem of disambiguation. Thus, only 18% of all cases are non-trivial and are useful for evaluating disambiguation.

Class                          % of mentions
ambiguous                      18
ambiguous given type           12
not ambiguous                  49
no candidates                  28
correct candidate not found    33

Table 2: Ambiguity classes and their relative frequency in the dataset. ambiguous - mentions with more than one candidate including the correct one; ambiguous given type - subset of ambiguous that remains ambiguous after removing candidates of the wrong type; not ambiguous - only one, correct candidate.

3 System Architecture

3.1 Mention Recognition

Because of the focus on disambiguation, our demo system employs a simple approach to mention recognition. Given a tokenized input document, all word n-grams up to a predefined n are extracted. This guarantees high recall. In the subsequent candidate search step we eliminate all extracted mentions for which no candidates are found.

3.2 Candidate Search

We find concept candidates for each recognized mention via a string lookup in a given dictionary. The dictionary maps surface strings to concepts. Those were extracted from a predefined subset of vocabularies in the UMLS, namely RxNorm for medications and NCI for anatomical concepts, diseases, therapeutic procedures, diagnostic procedures, laboratory tests and symptoms. The surface strings of the dictionary were expanded by including additional lexical variations.

3.3 Disambiguation

The most crucial part of the concept normalization pipeline is the concept disambiguation. Given a set of candidates for each recognized mention, it selects the concept which best fits the mention of interest. The disambiguation is guided by two algorithms, which are explained in the following.

String-Edit-Distance: Each concept in UMLS may include a set of synonyms containing a range of variations and spellings. Not all of those string variations are likely to represent a concept in free text. However, a small subset of strings is indicated as preferred labels for a concept. In a corpus analysis, we found that many ambiguities can be resolved by selecting the candidate concept whose preferred labels contain a close match with the mention string. We further found that preferred labels of distinct UMLS concepts are usually mutually exclusive. Thus, we employ a string-edit-distance (ED) algorithm, namely the Levenshtein distance, between the preferred labels $L_{c_i^m}$ of each candidate $c_i^m$ and the mention string $x_m$. We use the minimum of those distances to define the ED-score of a candidate concept:

\[ s_{ed}(c_i^m) = \max_{l \in L_{c_i^m}} \frac{1}{\mathrm{distance}(x_m, l) + 1} \]

Densest-Subgraph: We employ a densest-subgraph algorithm similar to Moro et al. (2014) or Weissenborn et al. (2015) to account for the context of a mention. First we construct a graph whose vertices are all candidates $c_i^m$ for all mentions $m$ of a document. We connect candidate concepts from different mentions with each other whenever they co-occurred at least once in MEDLINE, a repository of abstracts from biomedical publications. This information is summarized annually by the National Institutes of Health (NIH) (https://mbr.nlm.nih.gov/MRCOC.shtml). Given the concept graph $G = (V, E)$ of a document, we iteratively select a mention with the most remaining candidates and remove its least connected candidate until each mention has at most a predefined number of candidates left (we use 5 in our system, which performs slightly better than or equal to other configurations). Given the pruned graph $G^* = (V^*, E^*)$, we score each remaining candidate by the product of two counts: the number of its connections to candidates of other mentions, and the number of other mentions that have at least one connected candidate concept. The score is then normalized over all candidates of the same mention:

\[ s^{u}_{ds}(c_i^m) = \left|\{ c_j^{m'} \mid (c_i^m, c_j^{m'}) \in E^* \}\right| \cdot \left|\{ m' \mid \exists j : (c_i^m, c_j^{m'}) \in E^* \}\right| \]

\[ s_{ds}(c_i^m) = \frac{s^{u}_{ds}(c_i^m)}{\sum_j s^{u}_{ds}(c_j^m)} \]

We tried different combinations of both scores and found the disambiguation via $s_{ds}$ with a fallback to $s_{ed}$ to work best. That is, for each mention we always select the candidate with the highest $s_{ds}$ and apply $s_{ed}$ in case more than one candidate has the same score.

3.4 Rule-based disambiguation

A problem of unsupervised disambiguation is the inability to learn corpus-specific patterns which depend on annotation guidelines and the personal perspective of the annotators themselves. Based on our observations, the following set of simple rules is defined and used to support both disambiguation techniques:

Active Substance: If the given mention is a tradename (e.g., Tylenol), in most cases its active substance (e.g., Acetaminophen) is annotated. Therefore we map all concepts that refer to a tradename to their active substance. This information is taken from the UMLS Metathesaurus relation has-tradename.

Structure of: If a mention 'M' (e.g., 'left foot') has two candidate concepts, one containing the preferred label 'structure of M' and the other one 'entire M', the second concept is removed from the list of candidates.

Abbreviation validation: Abbreviations tend to be highly ambiguous (Kim et al., 2011) and are difficult to disambiguate. However, in many cases the annotated candidate is one whose preferred labels fit the mentioned abbreviation. To exploit this, abbreviations are first identified using the UMLS Lexical Tools. Next, candidates whose preferred labels are not valid long forms of a mentioned abbreviation are removed during pre-processing. Valid long forms of abbreviations have to fulfill the following criterion: the first letter of the abbreviation must match the first letter of the text, and the remainder of the abbreviation, i.e., the abbreviation without its first letter, must be an abbreviation for either the remaining text or the remaining words, excluding the first.

4 Online Demo

The web interface of the online demo (http://clinical-ta.dfki.de) is based on the BRAT NLP tool (http://brat.nlplab.org/) to visualize the implemented candidate search and disambiguation. Figure 1 presents the output of our demo after processing a clinical narrative. The upper part, 'Candidate Search', displays the text including mentions with their respective concept candidates. Different colors indicate different types of concepts. In the given example, red refers to anatomy, green to symptom, pink to disease and turquoise to laboratory test. When the mouse cursor is moved over a candidate mention, the GUI shows the vocabulary origin and its concept unique identifier.

Figure 1: Annotations comprising candidate and disambiguated view.

5 Experiments

5.1 Setup

We evaluated our system on the test part of the dataset with different configurations. More specifically, we compare the performance of the individual disambiguation algorithms, namely string-edit-distance (ED) and densest-subgraph (DS), and their combination, as well as a widely used reference system called cTAKES (https://ctakes.apache.org/; Savova et al., 2010) in combination with the disambiguation component YTEX (Garla et al., 2011). We make use of a gold-standard mention recognizer that extracts only annotated mentions in the experiments. When comparing to cTAKES, we make use of its internal mention extraction and candidate search in combination with our disambiguation to guarantee a fair comparison. Additionally, our post-processing heuristics were applied to the output of both our system and cTAKES.

5.2 Results

Table 3 shows the results on the entire testset. With gold-standard mentions we achieve a high precision of over 85%, which we attribute to the performance of disambiguation. Our system also performs better than cTAKES (in the standard configuration for YTEX disambiguation) with the same pre-processing (mention recognition and candidate search). The main problem in general lies in the low recall, which is mainly due to failing candidate search. This is also a major concern for future work.

System    Pre-processing    P       R       F1
ED        Gold-standard     0.850   0.592   0.698
DS        Gold-standard     0.850   0.592   0.698
DS+ED     Gold-standard     0.857   0.597   0.703
ED        cTAKES            0.777   0.522   0.624
DS        cTAKES            0.766   0.514   0.615
DS+ED     cTAKES            0.780   0.524   0.627
cTAKES    cTAKES            0.743   0.499   0.597

Table 3: Normalization results in Precision (P), Recall (R) and F1-score (F1) for all mentions in testset.

As mentioned in §2.1, only a fraction of mentions can be considered non-trivial with respect to the disambiguation. Table 4 shows the performance of our system and cTAKES for all non-trivial mentions. The observations are similar to the previous results. The precision of our system is quite robust and much better than that of cTAKES: with cTAKES pre-processing, DS+ED reaches 0.767 compared to 0.481 for cTAKES.

System    Pre-processing    #Mentions    P
ED        Gold-standard     502          0.751
DS        Gold-standard     502          0.751
DS+ED     Gold-standard     502          0.781
ED        cTAKES            270          0.730
DS        cTAKES            270          0.659
DS+ED     cTAKES            270          0.767
cTAKES    cTAKES            270          0.481

Table 4: Precision (P) for all non-trivial mentions in testset, i.e., mentions with at least 2 candidates containing the correct one.

6 Conclusion

We presented a light-weight disambiguation system for the normalization of clinical concept mentions. The system is mainly unsupervised and utilizes string similarity metrics as well as information from concept co-occurrences. We demonstrated its robustness with respect to disambiguation and compared it to cTAKES, a popular open-source system for clinical NLP. In addition, we gave examples where our unsupervised approach fails because of annotation guidelines and preferences. This problem is addressed by the introduction of simple heuristics. Finally, our system can be accessed via a web application.

Acknowledgements

This research was partially supported by SAP, the German Federal Ministry of Economics and Energy (BMWi) through the project MACSS (01MD16011F), and by the German Federal Ministry of Education and Research (BMBF) through the project BBDC (01IS14013E).

References

Zubair Afzal, Saber A. Akhondi, Herman van Haagen, Erik M. van Mulligen, and Jan A. Kors. 2015. Biomedical Concept Recognition in French Text Using Automatic Translation of English Terms. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015.

Eneko Agirre, Aitor Soroa, and Mark Stevenson. 2010. Graph-based Word Sense Disambiguation of biomedical documents. Bioinformatics, 26(22):2889–2896.

Noémie Elhadad, Sameer Pradhan, Sharon Gorman, Suresh Manandhar, Wendy Chapman, and Guergana Savova. 2015. SemEval-2015 Task 14: Analysis of Clinical Text. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 303–310, Denver, Colorado, June. Association for Computational Linguistics.

Vijay Garla, Vincent Lo Re, Zachariah Dorey-Stein, Farah Kidwai, Matthew Scotch, Julie Womack, Amy Justice, and Cynthia Brandt. 2011. The Yale cTAKES extensions for document classification: architecture and application. Journal of the American Medical Informatics Association, 18(5):614–620.

Youngjun Kim, John Hurdle, and Stéphane M. Meystre. 2011. Using UMLS lexical resources to disambiguate abbreviations in clinical text. AMIA Symposium, 2011:715–722.

André Leal, Bruno Martins, and Francisco Couto. 2015. ULisboa: Recognition and Normalization of Medical Concepts. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 406–411. Association for Computational Linguistics.

David Martinez and Timothy Baldwin. 2011. Word sense disambiguation for event trigger word detection in biomedicine. BMC Bioinformatics, 12(2):1–8.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.

Parth Pathak, Pinal Patel, Vishal Panchal, Sagar Soni, Kinjal Dani, Amrish Patel, and Narayan Choudhary. 2015. ezDI: A Supervised NLP System for Clinical Narrative Analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 412–416. Association for Computational Linguistics.

Sameer Pradhan, Noémie Elhadad, Brett R. South, David Martínez, Lee M. Christensen, Amy Vogel, Hanna Suominen, Wendy W. Chapman, and Guergana K. Savova. 2013. Task 1: ShARe/CLEF eHealth Evaluation Lab 2013. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013.

Judita Preiss and Mark Stevenson. 2013. DALE: A Word Sense Disambiguation System for Biomedical Documents Trained using Automatically Labeled Examples. In Proceedings of the 2013 NAACL HLT Demonstration Session, pages 1–4, Atlanta, Georgia, June. Association for Computational Linguistics.

Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C. Kipper-Schuler, and Christopher G. Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Dirk Weissenborn, Leonhard Hennig, Feiyu Xu, and Hans Uszkoreit. 2015. Multi-Objective Optimization for the Joint Disambiguation of Nouns and Named Entities. Proc. of ACL-IJCNLP, Beijing, China, pages 596–605.
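
Appendix: Illustrative Sketch of the Disambiguation Scores

The two scores from Section 3.3 and their combination can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the system's actual implementation: the function names, the data layout (dictionaries keyed by mention index), the lower-casing before the edit distance, and the concept identifiers such as C_COMMON_COLD are all hypothetical, and the co-occurrence set stands in for the MEDLINE co-occurrence data. The rule-based heuristics of Section 3.4 are not covered.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ed_score(mention, preferred_labels):
    """s_ed: 1 / (smallest distance to a preferred label + 1).
    Lower-casing both strings is an assumption of this sketch."""
    return max((1.0 / (levenshtein(mention.lower(), l.lower()) + 1)
                for l in preferred_labels), default=0.0)


def neighbours(m, c, cands, cooccur):
    """Candidates of *other* mentions whose concept co-occurs with c."""
    return [(m2, c2) for m2, cs in cands.items() if m2 != m
            for c2 in cs if frozenset((c, c2)) in cooccur]


def prune(candidates, cooccur, k=5):
    """Repeatedly take a mention with the most candidates and drop its
    least connected one, until every mention has at most k candidates."""
    cands = {m: list(cs) for m, cs in candidates.items()}
    while cands:
        m = max(cands, key=lambda x: len(cands[x]))
        if len(cands[m]) <= k:
            break
        worst = min(cands[m],
                    key=lambda c: len(neighbours(m, c, cands, cooccur)))
        cands[m].remove(worst)
    return cands


def ds_scores(cands, cooccur):
    """s_ds: (#connected candidates) * (#connected mentions),
    normalized over the candidates of the same mention."""
    scores = {}
    for m, cs in cands.items():
        raw = {}
        for c in cs:
            nb = neighbours(m, c, cands, cooccur)
            raw[c] = len(nb) * len({m2 for m2, _ in nb})
        total = sum(raw.values())
        scores[m] = {c: (raw[c] / total if total else 0.0) for c in cs}
    return scores


def disambiguate(mentions, candidates, labels, cooccur, k=5):
    """Pick the candidate with the highest s_ds; ties fall back to s_ed."""
    ds = ds_scores(prune(candidates, cooccur, k), cooccur)
    return {m: max(ds[m], key=lambda c: (ds[m][c],
                                         ed_score(mentions[m], labels[c])))
            for m in ds if ds[m]}


# Toy input with hypothetical concept identifiers (not real CUIs).
mentions = {0: "cold", 1: "cough"}
candidates = {0: ["C_COLD_TEMP", "C_COMMON_COLD"], 1: ["C_COUGH"]}
labels = {"C_COLD_TEMP": ["Cold Temperature"],
          "C_COMMON_COLD": ["Common Cold"],
          "C_COUGH": ["Cough"]}
cooccur = {frozenset(("C_COMMON_COLD", "C_COUGH"))}

print(disambiguate(mentions, candidates, labels, cooccur))
# {0: 'C_COMMON_COLD', 1: 'C_COUGH'}
```

In the toy input, the co-occurrence between the hypothetical concepts C_COMMON_COLD and C_COUGH is what resolves "cold"; when no candidate of a mention has any connection, all normalized scores are zero and the choice is made by the edit-distance score alone, which mirrors the fallback described in Section 3.3.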