<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches for Annotating Medical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Christen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anika Groß</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erhard Rahm</string-name>
          <email>rahmg@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Leipzig</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Annotations are useful to semantically enrich documents and other datasets with concepts of ontologies. In the medical domain, many documents are not annotated at all, and manual annotation is a difficult process, making automatic annotation methods highly desirable to support human annotators. We propose a linguistic-based and a reuse-based approach for annotating medical documents with concepts from an ontology. The reuse-based approach utilizes previous annotations to annotate similar medical documents. It clusters items in documents such as medical forms according to previous ontology-based annotations and uses these clusters to determine candidate annotations for new items.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The annotation of data with concepts of standardized vocabularies and
ontologies has gained increasing significance due to the huge number and size of
available datasets as well as the need to deal with the resulting data heterogeneity.
Annotations of medical documents such as Electronic Health Records (EHRs),
which are used to document the history of patients, can also support advanced
analyses and searches. For instance, they can be used to identify significant
co-occurrences between the use of certain drugs and negative side effects in terms of
occurring diseases [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, case report forms (CRFs) are used in
clinical trials, e.g., to ask for the medical history of probands. To enable an efficient
search for medical documents, annotations can be used to semantically look for
a certain set of forms, e.g., in the MDM repository of medical data models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
and to design new forms with a similar topic.
      </p>
      <p>
        To improve the value of medical documents for analysis, reuse and data
integration, it is thus crucial to annotate them with concepts of ontologies. Since the
number, size and complexity of medical documents and ontologies can be very
large, a manual annotation process is time-consuming or even infeasible. Hence,
automatic annotation methods become necessary to support human annotators
with recommendations for manual verification. The goal of an annotation method
is the identification of annotations for a collection of medical documents. An
annotation is an association between a document and a concept from an ontology,
where the concept covers the semantics of the document. A document
might be annotated with more than one concept to precisely describe its
content. The use of annotations enables a standardized representation,
since an ontology is a unified set of concepts together with a set of relationships
interrelating the ontology concepts by certain relationship types, e.g.,
<italic>is a</italic>, <italic>part of</italic> or
domain-specific relationships such as <italic>is located in</italic>. The annotation of
documents with concepts of an ontology is related to the entity-linking problem,
which is a well-studied field [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Moreover, there exist different annotation methods
such as MetaMap [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which annotates medical documents with concepts of UMLS
by applying a linguistic-based approach.
      </p>
      <p>
        In our recent work, we realized different annotation methods to identify
annotations for medical forms based on concepts of UMLS. We initially started with a
linguistic-based annotation approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A crucial part of an annotation method
is the identification of annotation candidates in terms of effectiveness and
efficiency. In general, a medical document or a collection of medical documents
topically covers only a subset of an ontology. Moreover, the quality of annotation
candidates depends on the quality of the synonyms and labels of a concept. We overcome
such issues by creating a reuse repository that utilizes verified annotated
documents [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Based on the verified documents, we are able to build representatives for
a concept that are more compact and more precise than the synonyms and labels of the
concept. Moreover, reusing the generated representatives to annotate a set
of medical documents is more efficient than using the whole ontology.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Linguistic-Based Annotation Approach</title>
      <p>The workflow consists of a preprocessing, a candidate identification and a
selection step (see Fig. 1). The input of the workflow is a set of forms F, an
ontology O, and a similarity threshold. Each form consists of a
set of questions that we want to annotate with a set of concepts. In our case,
we use concepts from the Unified Medical Language System (UMLS), which is
an integrated knowledge system including several biomedical ontologies. First,
we normalize the labels and synonyms of ontology concepts by removing stop
words, transforming all string values to lower case and removing delimiters. The
same preprocessing steps are applied to each form F<sub>i</sub>. We identify an
intermediate annotation mapping M'<sub>F<sub>i</sub>,O</sub> by lexicographically comparing each question
with the labels and synonyms of ontology concepts. For this purpose, we apply
three string similarity measures, namely trigram, TF/IDF as well as a longest
common sequence (LCS) string similarity approach. We keep an annotation (q, c, sim)
for a question q and a concept c if the maximal similarity sim of the three
string similarity approaches exceeds the threshold. Finally, we select
annotations from the intermediate result by not only choosing the concepts with the
highest similarity but also by considering the similarity among the concepts. For
this purpose, we group the concepts associated with a question based on their
mutual similarity and only choose the concept with the highest similarity per
group in order to avoid the redundant selection of highly similar concepts. This
group-based selection proved to be quite effective in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], although it only
considers the string-based (linguistic) similarity between questions and concepts, and
among concepts.</p>
      <p>Fig. 1. Workflow of the linguistic-based annotation approach. Input: set of forms and UMLS. Preprocessing: normalization (POS tagging, tokenization, encoding, ...). Candidate identification: matching (TF-IDF, trigram, LCS, ...). Postprocessing: group-based selection. Output: set of annotation mappings.</p>
    </sec>
    <sec id="sec-3">
      <title>Reuse-based Annotation Approach</title>
      <p>The workflow for the reuse-based annotation approach is shown in Figure 2. Its
input includes a set of verified annotation mappings containing the annotations
for reuse. The result is a set of annotation mappings M<sub>F,O</sub> for the unannotated
input forms F w.r.t. ontology O. In the first step, we use the verified
annotations to determine a set of annotation clusters AC = {ac<sub>c1</sub>, ac<sub>c2</sub>, ..., ac<sub>cm</sub>}. For
each concept c<sub>i</sub> used in the verified annotations, we have an annotation cluster
ac<sub>ci</sub> containing all questions that are associated with this concept. To calculate
the similarity between an unannotated question and the questions of an
annotation cluster, we determine for each cluster a representative (feature set) ac<sub>ci</sub><sup>fs</sup>
consisting of relevant term groups in this cluster. A relevant term group is
either a frequently co-occurring term group in the questions of the cluster or the
maximized overlap between the terms of a question and the synonyms or the
label of a concept, i.e., we do not use term groups that form a subset of another
frequently occurring term group. As an example, Figure 3 shows the resulting
annotation cluster ac<sub>C0023467</sub> for UMLS concept C0023467 about the disease
Acute Myeloid Leukaemia. In the UMLS ontology, this concept is described by
a set of 32 synonyms (Figure 3 left). The annotation cluster also contains 25
questions associated with this concept in the verified annotation mappings. Most
questions only relate to some of the synonym terms of the concept, while other
synonyms remain unused. For instance, the abbreviation 'AML', which is part of some
synonyms, is often used, whereas the abbreviation 'ANLL' does not occur in the medical
forms used to build the annotation clusters. For this example, we generate only
9 relevant term groups, i.e., the representative feature set of the cluster is much
more compact than the free-text questions and the large synonym set.</p>
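      <p>The construction of annotation clusters and their representatives can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: it covers only the first kind of relevant term group (frequently co-occurring terms), enumerates term groups by brute force, and the function names and the parameters <monospace>min_support</monospace> and <monospace>max_size</monospace> are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict
from itertools import combinations


def build_clusters(verified_annotations):
    """Group verified (question, concept id) pairs into one cluster per concept."""
    clusters = defaultdict(list)
    for question, concept_id in verified_annotations:
        clusters[concept_id].append(question)
    return dict(clusters)


def frequent_term_groups(questions, min_support=2, max_size=3):
    """Return the maximal term groups (sets of up to max_size terms) that
    occur together in at least min_support questions of a cluster."""
    token_sets = [frozenset(q.lower().split()) for q in questions]
    counts = defaultdict(int)
    for tokens in token_sets:
        for size in range(1, max_size + 1):
            for group in combinations(sorted(tokens), size):
                counts[frozenset(group)] += 1
    frequent = {g for g, n in counts.items() if n >= min_support}
    # Drop every group that is a proper subset of another frequent group.
    return {
        g for g in frequent
        if not any(g != h and g.issubset(h) for h in frequent)
    }
```
      </preformat>
      <p>Applied to the questions of one cluster, the second function yields a small set of term groups that can serve as the feature set of the cluster. The brute-force enumeration is only practical for short questions; a frequent itemset mining algorithm would be the natural replacement for larger inputs.</p>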
      <p>Fig. 2. Workflow of the reuse-based annotation approach. Annotation cluster generation: a set of annotated forms is used to generate annotation clusters AC with cluster feature sets. Annotation mapping generation: the set of input forms is preprocessed, mappings are generated by using the annotation clusters, and mappings for still unannotated questions are generated by using UMLS. Postprocessing: semantic-based annotation selection. Output: set of annotation mappings.</p>
      <p>Fig. 3. Annotation cluster for UMLS concept C0023467. Synonyms of the concept (32 synonyms), e.g.: ANLL; AML; Acute myelocytic leukaemia; AML - Acute myeloid leukaemia; acute myelogenous leukemia (AML). Questions of the cluster (25 questions), e.g.: (1) Previous induction-type chemotherapy for MDS or AML; (2) Relapsed or treatment refractory AML; (3) Patients with relapsed AML; (4) Patients older than 60 years with acute myeloid leukemia according to FAB (&gt;30 % bone marrow blasts) not qualifying for, or not consenting to, standard induction chemotherapy or immediate allografting. Term groups of the representative (9 term groups), e.g.: AML; acute myeloid leukemia; acute promyelocytic leukemia; acute myelodysplastic leukaemia.</p>
      <p>After these initial steps, we determine the annotation mapping for each
unannotated input form F<sub>i</sub>. We first preprocess the form and the ontology as in the base
approach (see Fig. 1). Then we determine an annotation mapping M<sup>Reuse</sup><sub>F<sub>i</sub>,O</sub> for
the form based on the annotation clusters. Depending on the degree of reusable
annotations, the determined mapping is likely to be incomplete. We thus identify
all questions that are not yet covered by the first mapping. For these questions,
we apply the base algorithm to match them against the whole ontology and obtain a
second annotation mapping. We then take the union of the two partial mappings
to obtain the intermediate mapping M'<sub>F<sub>i</sub>,O</sub>. Finally, we apply a context-based
selection strategy to determine the annotations for the final mapping M<sub>F,O</sub>. The
input for the selection of annotations is a set of grouped candidate concepts for
each question in the medical forms F. To determine the final annotations per
question, we rank the candidate concepts within each group based on a
combination of both linguistic and context-based similarity among the candidate
concepts. For this purpose, we consider two criteria for a set of candidate
concepts of a certain question: first, the degree to which concepts co-occurred in the
annotations for the same question within the verified annotation mapping, and
second, the degree of semantic (contextual) relatedness of the concepts w.r.t. the
ontological structure. The goal is to give a high contextual similarity (and thus
a high chance of being selected) to frequently co-occurring concepts and to
semantically close concepts. To determine a context-based similarity, we construct
a context graph G<sub>q</sub> = (V<sub>q</sub>, E<sub>q</sub>) for each question q. The vertices V<sub>q</sub> represent
candidate concepts that are interconnected by two kinds of edges in E<sub>q</sub> to
express that concepts have co-occurred in previous annotations or that concepts
are semantically related within the ontology. In both cases, we assign distance
scores to the edges that will be used to calculate the context similarity between
concepts.</p>
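      <p>The union of the two partial mappings and a context-aware ranking can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the approach itself: it models only the co-occurrence edges of the context graph (edges for semantic relatedness in the ontology are omitted), it replaces the distance scores by a simple share of co-occurring candidates, and the function names as well as the parameter <monospace>weight</monospace> are hypothetical.</p>
      <preformat>
```python
from collections import Counter
from itertools import combinations


def union_mappings(questions, reuse_mapping, base_matcher):
    """Union of the reuse-based mapping and a base mapping that is computed
    only for the questions the reuse step left uncovered."""
    covered = {q for q, _cid, _sim in reuse_mapping}
    uncovered = [q for q in questions if q not in covered]
    return list(reuse_mapping) + list(base_matcher(uncovered))


def cooccurrence_counts(verified_annotations):
    """Count how often two concepts annotate the same question in the
    verified mappings; the input maps each question to its set of concept ids."""
    counts = Counter()
    for concepts in verified_annotations.values():
        for a, b in combinations(sorted(concepts), 2):
            counts[(a, b)] += 1
    return counts


def context_similarity(concept, candidates, counts):
    """Assumed score in [0, 1]: share of the other candidates of the
    question that co-occurred with the concept at least once."""
    others = [c for c in candidates if c != concept]
    if not others:
        return 0.0
    linked = sum(1 for c in others if counts[tuple(sorted((concept, c)))] > 0)
    return linked / len(others)


def rank_candidates(candidates, counts, weight=0.5):
    """Rank the (concept id, linguistic sim) pairs of one question by a
    weighted combination of linguistic and context similarity."""
    ids = [cid for cid, _sim in candidates]
    scored = [
        (cid, weight * sim + (1 - weight) * context_similarity(cid, ids, counts))
        for cid, sim in candidates
    ]
    return sorted(scored, key=lambda pair: -pair[1])
```
      </preformat>
      <p>Here <monospace>base_matcher</monospace> stands for any function that maps a list of questions to (question, concept id, similarity) triples, for example a linguistic matcher against the whole ontology. With such a ranking, a candidate with a slightly lower linguistic similarity can still be preferred if it frequently co-occurred with the other candidates of the same question.</p>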
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        We evaluate the proposed annotation approaches for medical forms and compare
them with the MetaMap tool. Our evaluation uses medical forms about eligibility
criteria (EC) and about quality assurance (QA) w.r.t. cardiovascular procedures
from the MDM platform [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To evaluate the quality of automatically generated
annotations, we use manually created reference mappings from the MDM portal.
These reference mappings might not be perfect ("a silver standard") since the
huge size of UMLS makes it hard to manually identify the most suitable concepts
for each item. To analyze the quality of the resulting annotation mappings, we
compute precision, recall and F-measure using the union of all annotated form
items in the evaluation dataset. Table 4 shows the number of forms, items and
verified annotations for the reuse and evaluation datasets.
      </p>
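      <p>As a minimal sketch of how such quality measures can be computed, the following Python function compares a generated mapping with a reference mapping. It assumes that both mappings are given as sets of (item, concept) pairs pooled over all forms of the evaluation dataset; the function name and this representation are illustrative choices.</p>
      <preformat>
```python
def evaluate(predicted, reference):
    """Precision, recall and F-measure of predicted (item, concept) pairs
    against a reference mapping, pooled over all form items."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted.intersection(reference))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```
      </preformat>
      <p>Because the reference mappings are only a silver standard, a pair counted as a false positive by this function may still be a correct annotation that the human annotators did not select.</p>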
      <table-wrap>
        <table>
          <thead>
            <tr><th>dataset</th><th>EC<sub>RD1</sub></th><th>EC<sub>RD2</sub></th><th>EC<sub>eval</sub></th><th>QA<sub>RD1</sub></th><th>QA<sub>RD2</sub></th><th>QA<sub>eval</sub></th></tr>
          </thead>
          <tbody>
            <tr><td>#forms</td><td>200</td><td>100</td><td>25</td><td>16</td><td>32</td><td>23</td></tr>
            <tr><td>#items</td><td>3125</td><td>1638</td><td>310</td><td>453</td><td>795</td><td>609</td></tr>
            <tr><td>#annotations</td><td>13027</td><td>6911</td><td>578</td><td>694</td><td>1054</td><td>668</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We proposed a linguistic-based and a reuse-based approach to semantically
annotate medical documents such as EHRs with concepts of an ontology. The
linguistic-based approach identifies an annotation mapping between a form and
an ontology by comparing each question of the form with the synonyms or labels
of each concept of the ontology. The reuse-based approach avoids the
comparison with each concept by utilizing already found and verified annotations for similar
CRFs. It builds so-called annotation clusters combining all previously annotated
questions related to the same medical concept. New questions are matched with
the identified cluster representatives to find candidate concepts for annotation.</p>
      <p>To identify the most promising annotations, we proposed a context-based
selection strategy based on the semantic relatedness of concept candidates as well as
known co-occurrences from previous annotations. We compared our approaches
with MetaMap and showed that the reuse-based approach outperforms the
annotation method of MetaMap in terms of quality. However, its efficiency is lower
than that of MetaMap, since MetaMap uses an indexed database.</p>
      <p>For future work, we plan to use different annotation frameworks to
generate more candidates and to obtain more evidence for correctness. We also plan
to build a reuse repository covering annotation clusters and their feature sets
for different medical subdomains. Such a repository can be used to identify
annotations for new medical documents. It further enables a semantic search for
existing medical document annotations. This can be useful to define new medical
forms by finding and reusing suitable annotated items instead of creating new
forms from scratch.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          .
          <article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
          .
          In
          <source>Proc. AMIA Symposium</source>
          , page 17. American Medical Informatics Association,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>B.</given-names>
            <surname>Breil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kenneweg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fritz</surname>
          </string-name>
          , et al.
          <article-title>Multilingual medical data models in ODM format – a novel form-based approach to semantic interoperability between routine health-care and clinical research</article-title>
          .
          <source>Appl Clin Inf</source>
          ,
          <volume>3</volume>
          :
          <fpage>276</fpage>
          –
          <lpage>289</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>V.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Groß</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>A reuse-based annotation approach for medical documents</article-title>
          . Submitted to:
          <source>International Semantic Web Conference (ISWC)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Groß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Varghese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dugas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Annotating medical forms using UMLS</article-title>
          .
          In
          <source>Data Integration in the Life Sciences (DILS)</source>
          , volume
          <volume>9162</volume>
          of
          <series>LNCS</series>
          , pages
          <fpage>55</fpage>
          –
          <lpage>69</lpage>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. P. LePendu, S. Iyer,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fairon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shah</surname>
          </string-name>
          , et al.
          <article-title>Annotation Analysis for Testing Drug Safety Signals using Unstructured Clinical Notes</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          ,
          <volume>3</volume>(<issue>S-1</issue>
          ):
          <fpage>S5</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          .
          <article-title>Entity linking with a knowledge base: Issues, techniques, and solutions</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>443</fpage>
          –
          <lpage>460</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>