A Light-weight & Robust System for Clinical Concept Disambiguation
            Dirk Weissenborn, Roland Roller, Feiyu Xu and Hans Uszkoreit
                           Language Technology Lab, DFKI
                           Alt-Moabit 91c, Berlin, Germany
     {dirk.weissenborn, roland.roller, feiyu, uszkoreit}@dfki.de

                                     Enrique Garcia Perez
                                     SAP Innovation Center
                             Konrad-Zuse-Ring 10, Potsdam, Germany
                             enrique.garcia.perez@sap.com

Abstract

This paper presents a system for the normalization of concept mentions in clinical narratives. We evaluate and compare it against a popular, open-source solution that is frequently used for natural language processing of clinical text. The evaluation is based on a manually annotated dataset of 72 discharge summaries taken from the i2b2 corpus. Besides the demonstration and evaluation of our system, we provide an in-depth corpus analysis that guided the development of the system. Our focus lies on the task of concept disambiguation, for which we combine two unsupervised approaches that are easy to implement and computationally inexpensive. We show that some ambiguities can only be resolved by adapting to annotation guidelines and preferences, which we solve via the introduction of heuristics. Finally, we present an online demo that gives insights into the individual parts of the normalization pipeline.

1    Introduction

Recognizing and disambiguating clinical concepts plays a central role in many information extraction tasks within the clinical domain. It requires the identification of concept mentions in clinical narratives and the disambiguation of their respective surface strings (normalization). In recent years, many tasks have focused on the normalization of clinical concepts, such as the i2b2 challenge (Uzuner et al., 2011), ShARe/CLEF (Pradhan et al., 2013) and SemEval (Elhadad et al., 2015).

Traditionally, disambiguation systems rely on supervised (Martinez and Baldwin, 2011), semi-supervised (Preiss and Stevenson, 2013) or unsupervised (Agirre et al., 2010) methods. Each of those techniques has its advantages. However, as seen in different disambiguation tasks, simple methods (and their combination) can achieve very good results, such as the generation of rules and heuristics from the training data (Afzal et al., 2015), the usage of similarity measures (Pathak et al., 2015) or the inclusion of Information Content (Leal et al., 2015).

In this work we develop a light-weight solution to the problem of clinical concept normalization that is easy to implement, does not require expensive computations and is therefore particularly suited for industrial application. The approach is mainly unsupervised and does not require large amounts of training data. In particular, the disambiguation is based on a densest-subgraph algorithm, which ensures contextual compatibility among the normalized concepts, and on the string similarity between the surface string and the preferred labels of a respective concept. We achieve very good performance with this setup on a manually annotated dataset. A web application was developed for demonstration purposes and to debug the normalization pipeline¹.

2    Clinical Concept Normalization

The concept normalization task requires a well-defined target vocabulary. A useful resource is the Unified Medical Language System (UMLS), which defines biomedical concepts with various names, spellings and abbreviations. Concepts within UMLS are defined by so-called concept unique identifiers (CUI) that represent concepts across different biomedical vocabularies, such as NCI, NDF-RT or RxNorm. However, natural language is highly variable and surface strings can have different meanings depending on the context.

¹ http://clinical-ta.dfki.de
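
To make the ambiguity problem concrete, the following minimal sketch shows a lookup table in which one surface string maps to several concepts. The identifiers, labels and the `lookup` helper are hypothetical placeholders chosen for illustration only; they are neither actual UMLS entries nor part of the system described in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    cui: str        # placeholder identifier, NOT a real UMLS CUI
    preferred: str  # preferred label of the concept
    source: str     # source vocabulary, e.g. "NCI" or "RxNorm"


# Hypothetical toy dictionary: lower-cased surface string -> candidate concepts.
DICTIONARY = {
    "cold": [
        Concept("CUI-A", "Common Cold", "NCI"),
        Concept("CUI-B", "Cold Temperature", "NCI"),
    ],
    "aspirin": [Concept("CUI-C", "Aspirin", "RxNorm")],
}


def lookup(surface):
    """Return all candidate concepts for a surface string (case-insensitive)."""
    return DICTIONARY.get(surface.strip().lower(), [])


candidates = lookup("Cold")
# More than one candidate means the context has to decide which concept is meant.
is_ambiguous = len(candidates) > 1
```

In this toy setting, "Cold" yields two candidates and therefore needs disambiguation, whereas "aspirin" maps to a single concept.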
The task of normalizing surface strings to unique concepts of a given vocabulary such as UMLS can be subdivided into three partial tasks: Mention Recognition, Candidate Search and Disambiguation. Given an input text, the mention recognition subtask identifies text spans that are potential mentions of a medical concept. Subsequently, the candidate search is responsible for finding candidate concepts for the surface strings of each mention. Finally, the disambiguation step selects the candidate that fits best into the mention's context, i.e., it resolves the ambiguity among its candidates. This work focuses on the disambiguation task.

2.1    Data

In our experiments we used a part of the i2b2² corpus (Uzuner et al., 2011) that was manually re-annotated³. It consists of 72 discharge summaries. Overall, the dataset contains 6336 annotations. Table 1 lists the annotation types and their corresponding number of annotations. The corpus was split into two distinct subsets, each covering half of the documents. The first set was used for system development and the second half for testing.

² https://www.i2b2.org/
³ Note that the re-annotation took place within an industrial use case and was not carried out by any of the authors. The data and the dictionaries we used were already given.

       Concept Type (Source)          #Annotations
       Symptoms (NCI)                         1434
       Disease (NCI)                          1370
       Medication (RxNorm)                    1190
       Diagnostic Procedure (NCI)              647
       Therapeutic Procedure (NCI)             644
       Anatomy (NCI)                           593
       Laboratory Tests (NCI)                  458

Table 1: Concept annotations by type in our dataset.

We also analyzed the ambiguities within the corpus based on our candidate search (§3.2). Table 2 lists different ambiguity classes and their fraction in the dataset. It shows that ambiguity arises in only 18% of mentions. Candidate search fails in about a third of all cases (33%), i.e., the correct candidate is not found. For most of those cases (28% of all mentions) no candidate is found at all. This shows that the currently employed dictionary lookup has to be refined. However, this work addresses the problem of disambiguation. Thus, only 18% of all cases are non-trivial and are useful for evaluating disambiguation.

       Class                          % of mentions
       ambiguous                                 18
       ambiguous given type                      12
       not ambiguous                             49
       no candidates                             28
       correct candidate not found               33

Table 2: Ambiguity classes and their relative frequency in the dataset. ambiguous - mentions with more than one candidate including the correct one; ambiguous given type - subset of ambiguous that remains ambiguous after removing candidates of the wrong type; not ambiguous - only one, correct candidate.

3    System Architecture

3.1    Mention Recognition

Because of the focus on disambiguation, our demo system employs a simple approach to mention recognition. Given a tokenized input document, all word n-grams up to a predefined n are extracted. This guarantees high recall. In the subsequent candidate search step we eliminate all extracted mentions for which no candidates are found.

3.2    Candidate Search

We find concept candidates for each recognized mention via a string lookup in a given dictionary. The dictionary maps surface strings to concepts. Its entries were extracted from a predefined subset of vocabularies in the UMLS, namely RxNorm for medications and NCI for anatomical concepts, diseases, therapeutic procedures, diagnostic procedures, laboratory tests and symptoms. The surface strings of the dictionary were expanded by including additional lexical variations.

3.3    Disambiguation

The most crucial part of the concept normalization pipeline is the concept disambiguation. Given a set of candidates for each recognized mention, it selects the concept which best fits the mention of interest. The disambiguation is guided by two algorithms, which are explained in the following.

String-Edit-Distance: Each concept in UMLS may include a set of synonyms containing a range of variations and spellings. Not all of those string variations are likely to represent a concept in free text. However, a small subset of strings is indicated as preferred labels for a concept. In a corpus analysis, we found that many ambiguities can be
resolved by selecting the candidate concept whose preferred labels contain a close match with the mention string. We further found that preferred labels of distinct UMLS concepts are usually mutually exclusive. Thus, we employ a string-edit-distance (ED) algorithm, namely the Levenshtein distance, between the preferred labels $L_{c_i^m}$ of all candidates $c_i^m$ and the mention string $x_m$. We use the minimum of those distances to define the ED-score of a candidate concept:

\[
s_{ed}(c_i^m) = \max_{l \in L_{c_i^m}} \frac{1}{\mathrm{distance}(x_m, l) + 1}
\]

Densest-Subgraph: We employ a densest-subgraph algorithm similar to Moro et al. (2014) or Weissenborn et al. (2015) to account for the context of a mention. First we construct a graph that consists of all candidates $c_i^m$ for all mentions $m$ of a document. These are the vertices of the graph. We connect candidate concepts from different mentions with each other whenever they co-occurred at least once in MEDLINE, a repository of abstracts from biomedical publications. This information is annually summarized by the National Institutes of Health (NIH)⁴. Given the concept graph $G = (V, E)$ of a document, we iteratively select a mention with the most remaining candidates and remove its least connected candidate until each mention has at most a predefined number of candidates left⁵. Given the pruned graph $G^* = (V^*, E^*)$, we score each remaining candidate by the product of the number of its connections to candidates of other mentions and the number of connected other mentions, i.e., the number of mentions that have at least one candidate concept connected to it. The score is then normalized over all candidates of the same mention:

\[
s^{u}_{ds}(c_i^m) = \left|\{ c_j^{m'} \mid (c_i^m, c_j^{m'}) \in E^* \}\right| \cdot \left|\{ m' \mid \exists j : (c_i^m, c_j^{m'}) \in E^* \}\right|
\]

\[
s_{ds}(c_i^m) = \frac{s^{u}_{ds}(c_i^m)}{\sum_j s^{u}_{ds}(c_j^m)}
\]

We tried different combinations of both scores and found that disambiguation via $s_{ds}$ with a fallback to $s_{ed}$ works best. That is, for each mention we always select the candidate with the highest $s_{ds}$ and apply $s_{ed}$ in case there is more than one candidate with the same score.

⁴ https://mbr.nlm.nih.gov/MRCOC.shtml
⁵ We use 5 in our system, which performs slightly better than or equal to other configurations.

3.4    Rule-based disambiguation

A problem of unsupervised disambiguation is the inability to learn corpus-specific patterns which depend on annotation guidelines and the personal perspective of the annotators themselves. Based on our observations, the following set of simple rules is defined and used to support both disambiguation techniques:

Active Substance: If the given mention is a tradename (e.g., Tylenol), in most cases its active substance (e.g., Acetaminophen) is annotated. Therefore we map all concepts that refer to a tradename to their active substance. This information is taken from the UMLS Metathesaurus relation has-tradename.

Structure of: If a mention ‘M’ (e.g., ‘left foot’) has two candidate concepts, one containing the preferred label ‘structure of M’ and the other one ‘entire M’, the second concept is removed from the list of candidates.

Abbreviation validation: Abbreviations tend to be highly ambiguous (Kim et al., 2011) and are difficult to disambiguate. However, in many cases the candidates to be selected are those whose preferred labels fit the mentioned abbreviation. To address this issue, abbreviations are first identified using the UMLS Lexical Tools. Next, candidates whose preferred labels are not valid long forms of a mentioned abbreviation are removed during pre-processing. Valid long forms of abbreviations have to fulfill the following criterion: The first letter of the abbreviation must match the first letter of the text, and the remainder of the abbreviation, i.e., the abbreviation without its first letter, must be an abbreviation for either the remaining text or the remaining words, excluding the first.

⁶ http://clinical-ta.dfki.de
⁷ http://brat.nlplab.org/

4    Online Demo

The web interface of the online demo⁶ is based on the BRAT NLP tool⁷ to visualize the implemented candidate search and disambiguation. Figure 1 presents the output of our demo after processing a clinical narrative. The upper part ‘Candidate Search’ displays the text including mentions with their respective concept candidates. Different colors indicate different types of concepts. In the given example, red refers to anatomy, green to
symptom, pink to disease and turquoise to laboratory test. When the mouse cursor is moved over a candidate mention, the GUI shows the vocabulary origin and its concept unique identifier.

Figure 1: Annotations comprising candidate and disambiguated view.

5    Experiments

5.1    Setup

We evaluated our system on the test part of the dataset with different configurations. More specifically, we compare the performance of the individual disambiguation algorithms, namely string-edit-distance (ED) and densest-subgraph (DS), and their combination, as well as a widely used reference system called cTAKES⁸ (Savova et al., 2010) in combination with the disambiguation component YTEX (Garla et al., 2011). We make use of a gold-standard mention recognizer that extracts only annotated mentions in the experiments. When comparing to cTAKES, we make use of its internal mention extraction and candidate search in combination with our disambiguation to guarantee a fair comparison. Additionally, our post-processing heuristics were applied to the output of both our system and cTAKES.

⁸ https://ctakes.apache.org/

5.2    Results

Table 3 shows the results on the entire test set. With gold-standard mentions we achieve a high precision of about 85% (0.857 for the combined system), which we attribute to the performance of disambiguation. Our system also performs better than cTAKES⁹ with the same pre-processing (mention recognition and candidate search). The main problem in general lies in the low recall, which is mainly due to failing candidate search. This is also a major concern for future work.

⁹ Standard configuration for YTEX disambiguation.

       System     Pre-processing       P        R        F1
       ED         Gold-standard      0.850    0.592    0.698
       DS         Gold-standard      0.850    0.592    0.698
       DS+ED      Gold-standard      0.857    0.597    0.703
       ED         cTAKES             0.777    0.522    0.624
       DS         cTAKES             0.766    0.514    0.615
       DS+ED      cTAKES             0.780    0.524    0.627
       cTAKES     cTAKES             0.743    0.499    0.597

Table 3: Normalization results in Precision (P), Recall (R) and F1-score (F1) for all mentions in the test set.

As mentioned in §2.1, only a fraction of mentions can be considered non-trivial with respect to the disambiguation. Table 4 shows the performance of our system and cTAKES for all non-trivial mentions. The observations are similar to the previous results. We can see that the precision of our system is quite robust and much better than the performance of cTAKES (0.767 vs. 0.481 for the combined system with cTAKES pre-processing).

       System     Pre-processing     #Mentions      P
       ED         Gold-standard         502       0.751
       DS         Gold-standard         502       0.751
       DS+ED      Gold-standard         502       0.781
       ED         cTAKES                270       0.730
       DS         cTAKES                270       0.659
       DS+ED      cTAKES                270       0.767
       cTAKES     cTAKES                270       0.481

Table 4: Precision (P) for all non-trivial mentions in the test set, i.e., mentions with at least 2 candidates containing the correct one.

6    Conclusion

We presented a light-weight disambiguation system for the normalization of clinical concept mentions. The system is mainly unsupervised and utilizes string similarity metrics as well as information from concept co-occurrences. We demonstrated its robustness with respect to disambiguation and compared it to cTAKES, a popular open-source system for clinical NLP. In addition, we gave examples where our unsupervised approach fails because of annotation guidelines and preferences. This problem is solved by the introduction
of simple heuristics. Finally, our system can be accessed via a web application.

Acknowledgements

This research was partially supported by SAP, the German Federal Ministry of Economics and Energy (BMWi) through the project MACSS (01MD16011F), and by the German Federal Ministry of Education and Research (BMBF) through the project BBDC (01IS14013E).

References

Zubair Afzal, Saber A. Akhondi, Herman van Haagen, Erik M. van Mulligen, and Jan A. Kors. 2015. Biomedical Concept Recognition in French Text Using Automatic Translation of English Terms. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015.

Eneko Agirre, Aitor Soroa, and Mark Stevenson. 2010. Graph-based Word Sense Disambiguation of biomedical documents. Bioinformatics, 26(22):2889–2896.

Noémie Elhadad, Sameer Pradhan, Sharon Gorman, Suresh Manandhar, Wendy Chapman, and Guergana Savova. 2015. SemEval-2015 Task 14: Analysis of Clinical Text. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 303–310, Denver, Colorado, June. Association for Computational Linguistics.

Vijay Garla, Vincent Lo Re, Zachariah Dorey-Stein, Farah Kidwai, Matthew Scotch, Julie Womack, Amy Justice, and Cynthia Brandt. 2011. The Yale cTAKES extensions for document classification: architecture and application. Journal of the American Medical Informatics Association, 18(5):614–620.

Youngjun Kim, John Hurdle, and Stéphane M Meystre. 2011. Using UMLS lexical resources to disambiguate abbreviations in clinical text. AMIA Symposium, 2011:715–722.

André Leal, Bruno Martins, and Francisco Couto. 2015. ULisboa: Recognition and Normalization of Medical Concepts. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 406–411. Association for Computational Linguistics.

David Martinez and Timothy Baldwin. 2011. Word sense disambiguation for event trigger word detection in biomedicine. BMC Bioinformatics, 12(2):1–8.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.

Parth Pathak, Pinal Patel, Vishal Panchal, Sagar Soni, Kinjal Dani, Amrish Patel, and Narayan Choudhary. 2015. ezDI: A Supervised NLP System for Clinical Narrative Analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 412–416. Association for Computational Linguistics.

Sameer Pradhan, Noémie Elhadad, Brett R. South, David Martínez, Lee M. Christensen, Amy Vogel, Hanna Suominen, Wendy W. Chapman, and Guergana K. Savova. 2013. Task 1: ShARe/CLEF eHealth Evaluation Lab 2013. In Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013.

Judita Preiss and Mark Stevenson. 2013. DALE: A Word Sense Disambiguation System for Biomedical Documents Trained using Automatically Labeled Examples. In Proceedings of the 2013 NAACL HLT Demonstration Session, pages 1–4, Atlanta, Georgia, June. Association for Computational Linguistics.

Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.

Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Dirk Weissenborn, Leonhard Hennig, Feiyu Xu, and Hans Uszkoreit. 2015. Multi-Objective Optimization for the Joint Disambiguation of Nouns and Named Entities. Proc. of ACL-IJCNLP, Beijing, China, pages 596–605.