<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Relational Annotation of Scientific Medical Corpora</article-title>
        <subtitle>A Case Study</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ann-Marie Eklund</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Swedish, University of Gothenburg</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <fpage>27</fpage>
      <lpage>34</lpage>
      <abstract>
        <p>In life science and biomedicine, much knowledge resides as unstructured information, for instance in bibliographic databases. To facilitate searching and categorisation of this information, the database entries are annotated with terms or keywords describing, for instance, diseases, treatments and anatomy. These annotations are limited to the concept level and do not describe relations between terms, for example that a given treatment may be used for a given disease, even if this information is available in both the text and the terminologies. In this work we present a possible approach to extending term annotations with relational information, adding another dimension to concept-focused annotation schemas. This approach could also be used to highlight implicit information and to structure knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In life science and biomedicine, much knowledge resides as unstructured
information, for instance in bibliographic repositories or electronic health record
databases. To facilitate searching and categorisation of this information, the
database entries are annotated with terms or keywords describing, for instance,
diseases, treatments and anatomy. These annotations are limited to the concept level
and do not describe relations between terms, for example that a given treatment
may be used for a given disease, even if this information is available in the paper.
This limitation remains even though relational information is accessible
in ontologies and terminologies.</p>
      <p>
        In this work we study the possibility of annotating a term-annotated scientific
medical corpus with relations between the terms, for instance relating diseases
to treatments and organ sites. We use a Swedish text corpus of scientific medical
documents [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which has been annotated with information from the medical
terminology MeSH (Medical Subject Headings).
      </p>
      <p>
        Current related work has focused on, for instance, methods for the extraction of
specific types of relations [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ] and extraction and characterisation of semantic
relations [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">1,2,4</xref>
        ] in biomedical text. For instance, Abacha and Zweigenbaum [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
Frunza and Inkpen [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Yao et al [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] studied cure, prevent and side effect
relations in medical papers and Segura-Bedmar et al [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] studied anaphora resolution
for the extraction of drug-drug interactions in pharmacological documents.
      </p>
      <p>However, our work focuses not on establishing methods for the extraction of
relations between terms, but on tying existing term annotations together, reflecting
relations in the text. These new relational annotations would allow both medical
scientists and search engines to take advantage of highlighted implicit
information on, for instance, diseases, treatments and anatomy. For example, a paper
regarding prevention of myocardial infarction may be annotated with terms like
Aspirin and Myocardial Infarction, and our proposal is to also annotate it with
the relation may_prevent, relating these terms. Not only would this add new
useful annotations, but it would also structure existing annotations.</p>
    </sec>
    <sec id="sec-2">
      <title>Materials and Methods</title>
      <sec id="sec-2-1">
        <title>Materials</title>
        <p>
          The main resources in this study are an annotated Swedish scientific medical
corpus [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and the vocabularies and terminologies of the Unified Medical Language
System<sup>2</sup>.
        </p>
        <p>
          Medical Text Corpus. As a part of the Swedish strategy for e-health, the
clinical terminology SNOMED CT<sup>3</sup> has been translated into Swedish. For
validation and quality assessment of the translation, a Swedish medical text corpus
was created from the electronic archives of the Journal of the Swedish Medical
Association 1996-2009 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The corpus comprises 29110 documents (28 million
tokens) and has been part-of-speech tagged and annotated with Swedish and
English MeSH (release 2006) and with the Swedish SNOMED CT.
        </p>
        <p>Our study focuses on the MeSH annotated sentences in a part of the corpus
containing 2021 articles from the domain “Klinik och Vetenskap” (Medical
Practice and Science). This part contains 140458 sentences, each with a unique id.
For copyright reasons the order of the sentences has been randomised, thereby
limiting our study to relational annotations at sentence level. Of this sentence
set we used only the 102821 sentences with at least two MeSH annotations.</p>
        <p>UMLS. The Unified Medical Language System (UMLS)<sup>4</sup> connects
vocabularies from different biomedical and health-related sources in different languages. It
provides, among other things, databases called Knowledge Sources. One of the
databases, the Metathesaurus, contains information about more than one million
biomedical or health-related concepts. This database is divided into a number
of relational tables. One of the major tables, MRCONSO, contains the
structure for each concept, e.g. names, identifiers, languages and source vocabularies.
This table is complemented by the MRSAT table, which connects MeSH identifiers (or
identifiers from other source vocabularies) to concept identifiers (CUI).</p>
        <p><sup>2</sup> www.nlm.nih.gov/research/umls</p>
        <p><sup>3</sup> www.ihtsdo.org</p>
        <p><sup>4</sup> We have used UMLS version 2010AB, including all source vocabularies of level 0-3.</p>
        <p>Metathesaurus also contains relations between concepts, where the table
MRREL contains basic relations (REL), e.g. Parent/Child, relating different
concepts. Around 25% of the relations have a label (RELA - Relationship Attribute)
which comes from the source vocabulary and specifies the relationship, e.g. isa,
treated_by, finding_site_of or has_component.</p>
        <p>Considering the MeSH part of the UMLS, a term can belong to more than
one category and thereby appear in several places in the MeSH hierarchy with
different MeSH identifiers. Moreover, with the annotation procedure used in our
study corpus, a term like Blood Pressure will be annotated as Blood Pressure, but
also as Blood and Pressure. Hence, the MeSH annotations can be nested, which
can result in more UMLS concepts than found terms in a sentence.</p>
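        <p>The nesting described above can be illustrated with a small sketch. It assumes that each annotation is available as a character span (start, end, term); this representation and the function name nested_annotations are assumptions made for illustration and are not taken from the corpus format.</p>
        <preformat>
```python
def nested_annotations(spans):
    """Return (inner term, outer term) pairs where inner lies inside outer.

    Each span is a (start, end, term) tuple of character offsets.
    """
    pairs = []
    for inner in spans:
        for outer in spans:
            if inner is outer:
                continue
            # inner is nested when it starts no earlier and ends no later.
            if inner[0] >= outer[0] and outer[1] >= inner[1]:
                pairs.append((inner[2], outer[2]))
    return pairs


# Hypothetical spans for the phrase "Blood Pressure".
spans = [
    (0, 14, "Blood Pressure"),
    (0, 5, "Blood"),
    (6, 14, "Pressure"),
]

for inner_term, outer_term in nested_annotations(spans):
    print(inner_term, "is nested in", outer_term)
```
</preformat>
        <p>In this toy case one textual term yields three annotations, which is why a sentence can map to more UMLS concepts than it has distinct terms.</p>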
      </sec>
      <sec id="sec-2-2">
        <title>Methods</title>
        <p>Since this work can be seen as a feasibility study of extending term annotation
schemas with relational information, the methods have been kept simple and
analysis and validation of the approach were done by manual inspection.</p>
        <p>For each of the sentences in our corpus containing at least two MeSH
annotations, we extracted the sentence identifier and MeSH annotations, not taking into
account any nesting of the annotations. The resulting information was stored in a
MySQL database, complementing the UMLS one, thereby allowing easy mapping
of MeSH identifiers to UMLS concepts via the MRSAT table. These concepts
could then be used to derive relations from MRREL, giving relations between
the annotated terms in each sentence.</p>
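        <p>The mapping described above can be sketched as follows. This is a minimal illustration and not the implementation used in the study: it uses Python's built-in sqlite3 module instead of MySQL, reduces MRSAT and MRREL to the few columns needed here, and the table sentence_mesh, the functions and all identifiers except sentence id 50380 are hypothetical.</p>
        <preformat>
```python
import sqlite3


def build_toy_database():
    """Create an in-memory database with simplified stand-in tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentence_mesh (sentence_id INTEGER, mesh_id TEXT);
        CREATE TABLE mrsat (cui TEXT, code TEXT);
        CREATE TABLE mrrel (cui1 TEXT, rel TEXT, cui2 TEXT, rela TEXT);
        """
    )
    # Hypothetical annotations: two sentences with two MeSH terms each.
    conn.executemany(
        "INSERT INTO sentence_mesh VALUES (?, ?)",
        [
            (50380, "MESH_ASPIRIN"),
            (50380, "MESH_MI"),
            (1, "MESH_ASPIRIN"),
            (1, "MESH_PAIN"),
        ],
    )
    # Simplified MRSAT: source vocabulary code to concept identifier.
    conn.executemany(
        "INSERT INTO mrsat VALUES (?, ?)",
        [
            ("CUI_ASPIRIN", "MESH_ASPIRIN"),
            ("CUI_MI", "MESH_MI"),
            ("CUI_PAIN", "MESH_PAIN"),
        ],
    )
    # Simplified MRREL row, read here as "cui1 rela cui2"; this reading
    # direction is an assumption of the sketch.
    conn.execute(
        "INSERT INTO mrrel VALUES (?, ?, ?, ?)",
        ("CUI_ASPIRIN", "RO", "CUI_MI", "may_prevent"),
    )
    return conn


def derive_relations(conn):
    """Relations between concepts annotated in the same sentence."""
    query = """
        SELECT a.sentence_id, r.cui1, r.rel, r.rela, r.cui2
        FROM sentence_mesh AS a
        JOIN sentence_mesh AS b
          ON a.sentence_id = b.sentence_id AND a.mesh_id != b.mesh_id
        JOIN mrsat AS ma ON ma.code = a.mesh_id
        JOIN mrsat AS mb ON mb.code = b.mesh_id
        JOIN mrrel AS r ON r.cui1 = ma.cui AND r.cui2 = mb.cui
        ORDER BY a.sentence_id
    """
    return conn.execute(query).fetchall()


for row in derive_relations(build_toy_database()):
    print(row)
```
</preformat>
        <p>Only the sentence whose two concepts are linked by a row in the relation table yields a result. The other sentence has two annotations but no derived relation, which mirrors the situation reported for most sentences in the corpus.</p>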
        <p>Since our main interest is in the ability to provide relational annotations
among annotated terms, the analysis focused on the derived relations with
information in the RELA field of MRREL. One such RELA relation is may_prevent,
e.g. “Aspirin may_prevent Myocardial Infarction”. These relations were divided
into five categories, reflecting disease-treatment, disease-organ,
cause-effect, hierarchical and other relations among the terms, chosen to reflect relations often
found in medical papers. The division of the RELA relations into these classes
was based on our subjective interpretation of relations such as may_treat,
has_finding_site and cause_of. For each of the relation categories, we randomly
picked sentences and manually compared the derived relations to the ones
expressed in the sentences to see if and how they were related. Since this work is
to be viewed as a feasibility study for future research, we limited our analysis to
a handful of sentences per relational category.</p>
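        <p>The categorisation step can be sketched as a simple lookup. The mapping below lists only RELA labels named in this paper and is illustrative; it is not the complete mapping used in the study, and the names RELA_CATEGORY, categorise and count_by_category are hypothetical.</p>
        <preformat>
```python
from collections import Counter

# Illustrative subset of RELA labels mapped to the five categories.
RELA_CATEGORY = {
    "may_treat": "disease-treatment",
    "may_prevent": "disease-treatment",
    "has_finding_site": "disease-organ",
    "cause_of": "cause-effect",
    "isa": "hierarchical",
    "part_of": "hierarchical",
}


def categorise(rela):
    """Return the category of a RELA label, defaulting to 'other'."""
    return RELA_CATEGORY.get(rela, "other")


def count_by_category(relas):
    """Count how many of the given RELA labels fall in each category."""
    return Counter(categorise(rela) for rela in relas)


derived = ["may_treat", "isa", "occurs_before", "may_prevent"]
print(dict(count_by_category(derived)))
```
</preformat>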
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Our study corpus comprised 140458 sentences, of which 102821 (73%) contained at
least two MeSH annotations. In 26251 (25%) of these we were able to identify
relations between the terms. In these sentences we found 150024 relations, of
which 44754 (30%) had a RELA specification, with 188 different RELA labels in total.
Table 1 shows the percentage of sentences per RELA relation.</p>
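      <p>The reported shares can be recomputed from the counts given in the text. The short sketch below does only this arithmetic; the helper percent and the constant names are introduced here for illustration.</p>
      <preformat>
```python
def percent(part, whole):
    """Share of part in whole, in percent."""
    return 100.0 * part / whole


# Counts as reported in the text.
TOTAL_SENTENCES = 140458
SENTENCES_TWO_OR_MORE_MESH = 102821
SENTENCES_WITH_RELATIONS = 26251
RELATIONS_FOUND = 150024
RELATIONS_WITH_RELA = 44754

shares = {
    "sentences with at least two MeSH annotations": percent(
        SENTENCES_TWO_OR_MORE_MESH, TOTAL_SENTENCES
    ),
    "of these, sentences with derived relations": percent(
        SENTENCES_WITH_RELATIONS, SENTENCES_TWO_OR_MORE_MESH
    ),
    "relations with a RELA label": percent(
        RELATIONS_WITH_RELA, RELATIONS_FOUND
    ),
}

for label, value in shares.items():
    print(f"{label}: {value:.1f}%")
```
</preformat>
      <p>The second share evaluates to roughly 25.5%, which the text reports as 25%.</p>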
      <p>Figure 1 shows the 122483 sentences containing MeSH annotated terms,
distributed by number of MeSH terms per sentence (left), and the number of sentences per number of
relations (right)<sup>5</sup>.</p>
      <p>In the rest of this section we analyse each of the defined relation categories
and exemplify relations and their corresponding sentences (Table 2). In the
Discussion we briefly elaborate on these results.</p>
      <p>Disease - Treatment – The relations between diseases and treatments
found in the studied part of the corpus are e.g. may_treat and may_prevent.
Examples of disease-treatment relations are the relations derived from sentences
77591, 50380 and 105057. From sentence 105057 it is not possible to infer what
the relation is between the terms Soft Tissue Infections and Methicillin.</p>
      <p>Disease - Organ – There are a number of different relations between
diseases and organs, e.g. is_associated_anatomic_site_of and has_finding_site.
The relation location_of is often a relation between diseases and treatments, but
sometimes refers to organ-organ relations. Disease-organ relations were found in,
for example, sentences 75743, 8039 and 97183. In 75743 and 8039 the relation
was not explicitly expressed.</p>
      <p>Cause - Effect – Cause and effect relations can be for example cause_of,
induces and causative_agent_of. It can be for instance viruses or bacteria which
cause diseases, or diseases that cause other diseases. Examples of these relations
were found in sentences 133130 and 125988.</p>
      <p><sup>5</sup> In the UMLS, for many relations there is also an inverse, e.g. finding_site_of and
has_finding_site.</p>
      <p>[Figure 1. Left panel: number of sentences by number of MeSH terms. Right panel: number of sentences by number of MeSH relations.]</p>
      <p>Hierarchical Relations – The hierarchical relations we have studied are
synonymy, hyponymy and sibling relations. Synonyms are for instance concepts
which have the relation same_as or has_tradename. The relations mapped_from
and primary_mapped_from can also be synonym relations (sentence 77314).
Hyponymy can be expressed by relations like isa, may_be_a and part_of (sentence 43382).
Sibling relations found in the sentences are for instance sib_in_isa. In the example
of the sibling relation, where Ethanols, Methanols and Ethylene Glycols are all
Alcohols (sentence 14668), the relations have no specification in RELA, only the
REL abbreviation SIB.</p>
      <p>Other Relations – Other relations that were found in the sentences in the
corpus are for instance co-occurs_with, associated_with, may_diagnose and
occurs_before.</p>
      <p>Sentences 15403 and 12826 are examples of the relation co-occurs_with,
while sentences 52963, 51136 and 125709 are examples of the relations
associated_with, may_diagnose and occurs_before respectively.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>The examples in this work show that it is possible to derive relations from the
annotated terms in a sentence utilising only UMLS. The majority of the examples
for the different relation types are existing relations, but sometimes the derived
relations are implicit instead of directly expressed in the sentences. However, as
exemplified in sentence 75743, the derived relations may not be the ones intended
in the sentence.</p>
      <p>Since we have studied only concepts and not taken into account the syntax
of the sentences, the resulting relations between the terms can be relations not
derivable from a sentence. For instance, in sentence 105057 there is no indication
of any relation between the terms Soft Tissue Infections and Methicillin, but from
the UMLS we get a may_be_treated_by relation. Many of the derived UMLS
relations are not explicitly expressed in the sentences, but can be part of the
context of a sentence, for instance the relation between Diabetes Mellitus and
Pancreas in sentence 75743.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption>
          <p>Derived relations and their corresponding sentences.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Relation</th><th>Sentence ID</th><th>Sentence</th></tr>
          </thead>
          <tbody>
            <tr><td>Opium may_treat Pain</td><td>77591</td><td>[...] opium had a soothing effect on both anxiety and pain.</td></tr>
            <tr><td>Aspirin may_prevent Myocardial Infarction</td><td>50380</td><td>[...] observed that patients with a regular intake of Aspirin had fewer heart attacks than expected.</td></tr>
            <tr><td>Soft Tissue Infections may_be_treated_by Methicillin</td><td>105057</td><td>[...] soft tissue infections [...] and infections caused by methicillin resistant staphylococcus.</td></tr>
            <tr><td>Pancreas is_associated_anatomic_site_of Diabetes Mellitus</td><td>75743</td><td>An alternative for a few patients with diabetes mellitus has been transplantation of pancreas [...]</td></tr>
            <tr><td>Celiac Disease has_finding_site Intestines</td><td>8039</td><td>[...] patients with already known celiac disease but who in spite of a strict diet have had gastro-intestinal symptoms.</td></tr>
            <tr><td>Central Nervous System finding_site_of Rabies</td><td>97183</td><td>[...] the symptoms of a rabies infection begin when the virus reaches CNS [...]</td></tr>
            <tr><td>Bacillus anthracis causative_agent_of Anthrax</td><td>133130</td><td>Bacillus anthracis causes the disease Anthrax [...]</td></tr>
            <tr><td>Nitrous Oxide induces Nausea</td><td>125988</td><td>[...] nitrous oxide contributes to post-operative nausea.</td></tr>
            <tr><td>Arthritis primary_mapped_from Joint Diseases</td><td>77314</td><td>[...] patients with [...] and arthritis symptoms who were treated with [...] improvement of their joint disease [...]</td></tr>
            <tr><td>Estrogens isa Hormones</td><td>43382</td><td>Even though it is well documented that estrogen [...] is an important hormone in [...]</td></tr>
            <tr><td>SIB: Ethanols/Methanols/Ethylene Glycols</td><td>14668</td><td>[...] have been introduced as an alternative to ethanol in cases of ethylene glycol and methanol poisoning.</td></tr>
            <tr><td>Obesity co-occurs_with Diabetes Mellitus</td><td>15403</td><td>[...] obesity is associated with a highly increased risk of developing [...] diabetes [...]</td></tr>
            <tr><td>Diabetes Mellitus co-occurs_with Hypertension</td><td>12826</td><td>Other risk factors for stroke are [...] hypertension, diabetes, [...]</td></tr>
            <tr><td>Vaccination associated_with Immunization</td><td>52963</td><td>Participation in the vaccination programs has been very high and sufficient immunization was reached [...]</td></tr>
            <tr><td>Triiodothyronine may_diagnose Thyroid Disease</td><td>51136</td><td>Antibodies targeting Triiodothyronine [...] in up to 10 percent of patients with Thyroid diseases.</td></tr>
            <tr><td>Chickenpox occurs_before Herpes Zoster (shingles)</td><td>125709</td><td>[...] a connection between chickenpox and herpes zoster.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Expressing hierarchical relations like isa and part_of could lead to increased
understanding of the context of the sentence. For example, in sentence 14668 the
sibling relation tells us that the terms Ethanols, Methanols and Ethylene
Glycols have something in common, i.e. they are all Alcohols, thereby framing
the concepts in the sentence.</p>
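      <p>How a shared parent frames a set of terms can be illustrated with a small sketch. Note that in our data the sibling relation was given directly by the REL value SIB; deriving sibling groups from isa pairs, as done below, is a different technique used here only for illustration, and the function sibling_groups and the isa pairs are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict


def sibling_groups(isa_pairs):
    """Group children by shared parent, keeping parents with 2+ children.

    Each pair is (child, parent), read as "child isa parent".
    """
    children = defaultdict(set)
    for child, parent in isa_pairs:
        children[parent].add(child)
    return {
        parent: sorted(kids)
        for parent, kids in children.items()
        if len(kids) >= 2
    }


# Hypothetical isa pairs for the terms discussed in the text.
isa_pairs = [
    ("Ethanols", "Alcohols"),
    ("Methanols", "Alcohols"),
    ("Ethylene Glycols", "Alcohols"),
    ("Estrogens", "Hormones"),
]

print(sibling_groups(isa_pairs))
```
</preformat>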
      <p>The terms which have the relation co-occurs_with can have slightly different
relations to each other in the sentences. For example in 15403 one problem leads
to another, but in 12826 the two diseases can both be the cause of a third one.</p>
      <p>One of the major reasons for being able to identify relations in only 25%
of the sentences with at least two annotations may be that the annotated terms
are not at the same level of the ontological hierarchy as the concepts for which
relations are defined in the UMLS. A fundamental challenge with the proposed approach is its
dependence on the quality and source of the original term annotations, as in our
case, where only MeSH was used in the process of identifying relational annotations.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this work we have studied the feasibility of utilising the term annotations of
medical text in connection with a collection of terminologies and vocabularies
to extend the annotations with relational information. As the examples show,
this approach can be used not only for annotation, but also to highlight implicit
information and to structure knowledge.</p>
      <p>Our work complements existing research on extraction of (semantic) relations
in biomedical text, by focusing on identifying and validating relations between
conceptual annotations of a text. Hence, instead of the complex process of
extracting relations in medical papers, we utilise existing annotations to propose
potential relations covered by a paper.</p>
      <p>This work is based on only one of the source vocabularies and a limited
corpus. Hence, future work will address the ability to make use of combinations
of several source vocabularies and more elaborate use of the hierarchical
ontological relations to increase the ability to identify relations among annotated
terms. However, utilising for instance hyponymy raises challenges, such as which degree of
hyponymy/hypernymy to allow when establishing relations among terms, and
how to resolve problems with the complex relational structure of the UMLS, with its
many different types of relations and even cycles. For instance, some
vocabularies in the UMLS may treat relations like isa and part_of as synonymous and
some as distinct types of relations. Ongoing work also considers other corpora,
like parts of MEDLINE<sup>6</sup>, for relational annotation.</p>
      <p><sup>6</sup> www.nlm.nih.gov/databases/databases_medline.html</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A hybrid approach for the extraction of semantic relations from MEDLINE abstracts</article-title>
          .
          <source>In: Proceedings of 12th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2011)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Frunza</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inkpen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Extraction of disease-treatment semantic relations from biomedical sentences</article-title>
          .
          <source>In: BioNLP Workshop (ACL 2010)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kokkinakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerdin</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>A Swedish scientific medical corpus for terminology management and linguistic exploration</article-title>
          .
          <source>In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>C.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimino</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>Using semantic and structural properties of the unified medical language system to discover potential terminological relationships</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          <volume>16</volume>
          (
          <issue>3</issue>
          ),
          <fpage>346</fpage>
          -
          <lpage>353</lpage>
          (
          <year>2009</year>
          ), http://dx.doi.org/10.1197/jamia.M2931
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Segura-Bedmar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crespo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Pablo-Sanchez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>11</volume>
          (
          <issue>Suppl 2</issue>
          ),
          <elocation-id>S1</elocation-id>
          (
          <year>2010</year>
          ), http://dx.doi.org/10.1186/1471-2105-11-S2-S1
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Relationship extraction from biomedical literature using maximum entropy based on rich features</article-title>
          .
          <source>In: Proceedings International Conference on Machine Learning and Cybernetics (ICMLC 2010)</source>
          , pp.
          <fpage>3358</fpage>
          -
          <lpage>3361</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>