=Paper= {{Paper |id=None |storemode=property |title=Relational Annotation of Scientific Medical Corpora - A Case Study |pdfUrl=https://ceur-ws.org/Vol-744/paper4.pdf |volume=Vol-744 }} ==Relational Annotation of Scientific Medical Corpora - A Case Study== https://ceur-ws.org/Vol-744/paper4.pdf
     Relational Annotation of Scientific Medical
                     Corpora
                                A Case Study

                                Ann-Marie Eklund

            Department of Swedish, University of Gothenburg, Sweden*



      Abstract. In life science and biomedicine, much knowledge resides as
      unstructured information in, for instance, bibliographic databases. To
      facilitate searching and categorisation of this information, the database
      entries are annotated with terms or keywords describing, for instance,
      diseases, treatments and anatomy. These annotations are limited to the
      concept level and do not describe relations between terms, for example that
      a given treatment may be used for a given disease, even if this information
      is available in both the text and the terminologies.
      In this work we present a possible approach to extending term annotations
      with relational information, adding another dimension to concept-focused
      annotation schemas. This approach could also be used to highlight implicit
      information and to structure knowledge.


1   Introduction

In life science and biomedicine, much knowledge resides as unstructured
information in, for instance, bibliographic repositories or electronic health
record databases. To facilitate searching and categorisation of this
information, the database entries are annotated with terms or keywords
describing, for instance, diseases, treatments and anatomy. These annotations
are limited to the concept level and do not describe relations between terms,
for example that a given treatment may be used for a given disease, even if this
information is available in the paper. This annotation limitation remains in
spite of accessible relational information in ontologies and terminologies.
    In this work we study the possibility of annotating a term-annotated
scientific medical corpus with relations between the terms, for instance
relating diseases to treatments and organ sites. We use a Swedish text corpus of
scientific medical documents [3], which has been annotated with information from
the medical terminology MeSH (Medical Subject Headings)1 .
    Current related work has focused on for instance methods for extraction of
specific types of relations [5,6] and extraction and characterisation of semantic
*
  The author would like to thank Centre for Language Technology, Gothenburg
  (clt.gu.se) for financial support.
1
  www.nlm.nih.gov/mesh




  relations [1,2,4] in biomedical text. For instance, Abacha and Zweigenbaum [1],
  Frunza and Inkpen [2] and Yao et al. [6] studied cure, prevent and side effect
  relations in medical papers, and Segura-Bedmar et al. [5] studied anaphora
  resolution for the extraction of drug-drug interactions in pharmacological
  documents.
      However, our work focuses not on establishing methods for extraction of
  relations between terms, but on tying existing term annotations together to
  reflect relations in the text. These new relational annotations would allow
  both medical scientists and search engines to take advantage of highlighted
  implicit information on, for instance, diseases, treatments and anatomy. For
  example, a paper regarding prevention of myocardial infarction may be annotated
  with terms like Aspirin and Myocardial Infarction, and our proposal is to also
  annotate it with the relation may_prevent relating these terms. Not only would
  this add new useful annotations, but it would also structure existing
  annotations.


  2     Materials and Methods

  2.1   Materials

  The main resources in this study are an annotated Swedish scientific medical cor-
  pus [3] and the vocabularies and terminologies of the Unified Medical Language
  System2 .


  Medical Text Corpus. As a part of the Swedish strategy for e-health, the
  clinical terminology SNOMED CT3 has been translated into Swedish. For vali-
  dation and quality assessment of the translation, a Swedish medical text corpus
  was created from the electronic archives of the Journal of the Swedish Medical
  Association 1996-2009 [3]. The corpus comprises 29110 documents (28 million
  tokens) and has been part-of-speech tagged and annotated with Swedish and
  English MeSH (release 2006) and with the Swedish SNOMED CT.
      Our study focuses on the MeSH annotated sentences in a part of the corpus
  containing 2021 articles from the domain “Klinik och Vetenskap” (Medical
  Practice and Science). This part contains 140458 sentences, each with a unique
  id. For copyright reasons the order of the sentences has been randomised,
  thereby limiting our study to relational annotations at sentence level. Of this
  sentence set we used only the 102821 sentences with at least two MeSH
  annotations.


  UMLS. The Unified Medical Language System (UMLS)4 connects vocabular-
  ies from different biomedical and health-related sources in different languages. It
  provides, among other things, databases, called Knowledge Sources. One of the
  databases, Metathesaurus, contains information about more than one million
  biomedical or health-related concepts. This database is divided into a number
   2
     www.nlm.nih.gov/research/umls
   3
     www.ihtsdo.org
   4
     We have used UMLS version 2010AB, including all source vocabularies of level 0-3.




  of relational tables. One of the major tables, MRCONSO, contains the
  structure for each concept, e.g. names, identifiers, languages and source
  vocabularies. This table is complemented by the MRSAT table, which connects
  MeSH identifiers (or identifiers from other source vocabularies) to concept
  identifiers (CUI).
      Metathesaurus also contains relations between concepts, where the table
  MRREL contains basic relations (REL), e.g. Parent/Child, relating different
  concepts. Around 25% of the relations have a label (RELA - Relationship
  Attribute) which comes from the source vocabulary and specifies the
  relationship, e.g. isa, treated_by, finding_site_of or has_component.
      Considering the MeSH part of the UMLS, a term can belong to more than
  one category and thereby appear in several places in the MeSH hierarchy with
  different MeSH identifiers. Moreover, with the annotation procedure used for
  our study corpus, a term like Blood Pressure will be annotated as Blood
  Pressure, but also as Blood and Pressure. Hence, the MeSH annotations can be
  nested, which can result in more UMLS concepts than found terms in a sentence.
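
  The nesting described above can be illustrated with a minimal sketch. It
  assumes that each annotation carries character offsets (the fields start and
  end are hypothetical; the corpus format is not specified here). Note that the
  study itself did not take nesting into account, so this only shows how nested
  annotations could be detected, not a step of the reported method.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    term: str
    start: int  # assumed field: character offset where the term begins
    end: int    # assumed field: offset one past the last character


def nested_in(inner: Annotation, outer: Annotation) -> bool:
    """True if `inner` lies inside the span of `outer` without being identical to it."""
    same_span = (inner.start, inner.end) == (outer.start, outer.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same_span


def top_level(annotations: list[Annotation]) -> list[Annotation]:
    """Keep only the annotations that are not nested inside another annotation."""
    return [a for a in annotations
            if not any(nested_in(a, b) for b in annotations)]


# Toy sentence: "Blood Pressure was measured"
annotations = [
    Annotation("Blood Pressure", 0, 14),
    Annotation("Blood", 0, 5),
    Annotation("Pressure", 6, 14),
]
outermost = top_level(annotations)
```

  In this toy case three annotations (and thus up to three concepts) arise from
  a single found term, which is the effect described in the paragraph above.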

  2.2   Methods
  Since this work can be seen as a feasibility study of extending term annotation
  schemas with relational information, the methods have been kept simple and
  analysis and validation of the approach were done by manual inspection.
      For each of the sentences in our corpus containing at least two MeSH annota-
  tions, we extracted the sentence identifier and MeSH annotations, not taking into
  account any nesting of the annotations. The resulting information was stored in a
  MySQL database, complementing the UMLS one, thereby allowing easy mapping
  of MeSH identifiers to UMLS concepts via the MRSAT table. These concepts
  could then be used to derive relations from MRREL, giving relations between
  the annotated terms in each sentence.
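
  As a rough illustration of the lookup described above, the following sketch
  uses Python's sqlite3 module in place of MySQL. The table annotation, the
  reduced column sets of mrsat and mrrel, and all identifiers ("MESH:A",
  "CUI:1", ...) are placeholders invented for the example; it is assumed that
  the MeSH identifier is stored as an attribute value (atv) in MRSAT. The toy
  mrrel table is read as "cui1 rela cui2", which is a simplification; the
  direction convention of the real MRREL table should be checked against the
  UMLS documentation.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with miniature, made-up table contents."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE annotation (sentence_id INTEGER, mesh_id TEXT, term TEXT);
        CREATE TABLE mrsat (cui TEXT, sab TEXT, atv TEXT);
        CREATE TABLE mrrel (cui1 TEXT, rel TEXT, cui2 TEXT, rela TEXT);
    """)
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", [
        (50380, "MESH:A", "Aspirin"),
        (50380, "MESH:B", "Myocardial Infarction"),
    ])
    conn.executemany("INSERT INTO mrsat VALUES (?, ?, ?)", [
        ("CUI:1", "MSH", "MESH:A"),
        ("CUI:2", "MSH", "MESH:B"),
    ])
    conn.executemany("INSERT INTO mrrel VALUES (?, ?, ?, ?)", [
        ("CUI:1", "RO", "CUI:2", "may_prevent"),
        ("CUI:1", "RO", "CUI:2", None),  # a relation without a RELA label
    ])
    return conn


def derive_relations(conn: sqlite3.Connection, sentence_id: int):
    """Return (term, rela, term) triples for annotation pairs in one sentence."""
    query = """
        SELECT DISTINCT a1.term, r.rela, a2.term
        FROM annotation a1
        JOIN annotation a2
          ON a1.sentence_id = a2.sentence_id AND a1.mesh_id <> a2.mesh_id
        JOIN mrsat s1 ON s1.atv = a1.mesh_id   -- MeSH identifier -> CUI
        JOIN mrsat s2 ON s2.atv = a2.mesh_id
        JOIN mrrel r ON r.cui1 = s1.cui AND r.cui2 = s2.cui
        WHERE a1.sentence_id = ? AND r.rela IS NOT NULL
        ORDER BY a1.term, r.rela, a2.term
    """
    return conn.execute(query, (sentence_id,)).fetchall()


conn = build_toy_db()
triples = derive_relations(conn, 50380)
```

  Restricting the join to rows with a non-empty rela mirrors the focus on
  labelled relations in the analysis; dropping that condition would also return
  the unlabelled REL rows.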
      Since our main interest is in the ability to provide relational annotations
  among annotated terms, the analysis focused on the derived relations with infor-
  mation in the RELA field of MRREL. One such RELA relation is may_prevent,
  e.g. “Aspirin may_prevent Myocardial Infarction”. These relations were divided
  into five different categories reflecting disease-treatment, disease-organ, cause-
  effect, hierarchical and other relations among the terms to reflect relations often
  found in medical papers. The division of the RELA relations into these classes
  was based on our subjective interpretation of relations such as may_treat,
  has_finding_site and cause_of. For each of the relation categories, we randomly
  picked sentences and manually compared the derived relations to the ones ex-
  pressed in the sentences to see if and how they were related. Since this work is
  to be viewed as a feasibility study for future research, we limited our analysis to
  a handful of sentences per relational category.
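
  The categorisation and sampling steps could be sketched as follows. The
  mapping is partial and only uses relation names mentioned in this paper; the
  complete assignment of all RELA values is not given here, so the dictionary,
  the function names and the fallback to "other" are assumptions made for
  illustration.

```python
import random
from collections import defaultdict

# Partial, illustrative mapping from RELA labels to the five categories.
CATEGORY_OF_RELA = {
    "may_treat": "disease-treatment",
    "may_prevent": "disease-treatment",
    "may_be_treated_by": "disease-treatment",
    "has_finding_site": "disease-organ",
    "finding_site_of": "disease-organ",
    "is_associated_anatomic_site_of": "disease-organ",
    "cause_of": "cause-effect",
    "induces": "cause-effect",
    "causative_agent_of": "cause-effect",
    "isa": "hierarchical",
    "may_be_a": "hierarchical",
    "part_of": "hierarchical",
    "sib_in_isa": "hierarchical",
}


def categorise(rela: str) -> str:
    """Map a RELA label to a category; unlisted labels fall into 'other'."""
    return CATEGORY_OF_RELA.get(rela, "other")


def sample_per_category(derived, k, seed=0):
    """Pick up to k sentence ids per category from (sentence_id, rela) pairs."""
    by_category = defaultdict(set)
    for sentence_id, rela in derived:
        by_category[categorise(rela)].add(sentence_id)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return {category: sorted(rng.sample(sorted(ids), min(k, len(ids))))
            for category, ids in by_category.items()}


derived = [
    (77591, "may_treat"),
    (50380, "may_prevent"),
    (75743, "is_associated_anatomic_site_of"),
    (15403, "co-occurs_with"),
]
picked = sample_per_category(derived, k=5)
```

  The sampled sentences would then be inspected manually, as described above;
  the sketch does not automate that comparison.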

  3     Results
  Our study corpus comprised 140458 sentences, with 102821 (73%) containing at
  least two MeSH annotations and in 26251 (25%) of these we were able to identify




  relations between the terms. In these sentences we found 150024 relations of
  which 44754 (30%) had the specification RELA. There were 188 different RELA.
  Table 1 shows the percentage of sentences per RELA relation.
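
  The proportions reported in this section can be recomputed from the raw
  counts with a few lines of Python. The variable names are ours; the counts are
  those given in the text. The ratios agree with the reported whole-number
  percentages up to rounding (the second ratio comes out at roughly 25.5%).

```python
def percentage(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100 * part / whole, 1)


total_sentences = 140458       # sentences in the studied part of the corpus
with_two_annotations = 102821  # sentences with at least two MeSH annotations
with_relations = 26251         # of these, sentences with derived relations
relations = 150024             # derived relations in total
relations_with_rela = 44754    # derived relations carrying a RELA label

share_two_annotations = percentage(with_two_annotations, total_sentences)
share_with_relations = percentage(with_relations, with_two_annotations)
share_rela = percentage(relations_with_rela, relations)
```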


  Table 1. Percentages of sentences (of 10854 sentences with RELA) per RELA relation.

                        Relation                          Percentage
                        isa                               37
                        co-occurs_with                    10
                        associated_with                   10
                        has_finding_site                  6.8
                        sib_in_isa                        6.8
                        mapped_from                       6.0
                        location_of                       5.3
                        is_associated_anatomic_site_of    4.0
                        may_treat                         2.8
                        part_of                           2.4
                        may_prevent                       1.4
                        may_be_a                          1.1
                        causative_agent_of                0.8



      Figure 1 shows the 122483 sentences containing MeSH annotated terms
  distributed over the number of MeSH terms per sentence (left) and the number of
  sentences per number of relations (right)5 .
      In the rest of this section we analyse each of the defined relation
  categories and exemplify relations and their corresponding sentences (Table 2).
  In the Discussion we briefly elaborate on these results.
      Disease - Treatment – The relations between diseases and treatments
  found in the studied part of the corpus are e.g. may_treat and may_prevent.
  Examples of disease-treatment relations are the relations derived from sentences
  77591, 50380 and 105057. From sentence 105057 it is not possible to infer what
  the relation is between the terms Soft Tissue Infections and Methicillin.
      Disease - Organ – There are a number of different relations between dis-
  eases and organs, e.g. is_associated_anatomic_site_of and has_finding_site.
  The relation location_of is often a relation between diseases and treatments, but
  sometimes refers to organ-organ relations. Disease-organ relations were found in
  for example sentences 75743, 8039 and 97183. In 75743 and 8039 the relation
  was not explicitly expressed.
      Cause - Effect – Cause and effect relations can be for example cause_of,
  induces and causative_agent_of. It can be for instance viruses or bacteria which
  cause diseases, or diseases that cause other diseases. Examples of these relations
  were found in sentences 133130 and 125988.
   5
       In the UMLS, for many relations there is also an inverse, e.g. finding_site_of and
       has_finding_site.



  [Figure 1: two panels. Left: number of sentences (y-axis, 0 to 20000) per number
  of MeSH terms (x-axis, 1 to 47). Right: number of sentences (y-axis, 0 to 14000)
  per number of MeSH relations (x-axis, 2 to 18).]

  Fig. 1. Number of sentences per number of MeSH terms (left) and number of sentences
  per number of relations (right).


      Hierarchical Relations – The hierarchical relations we have studied are
  synonymy, hyponymy and sibling relations. Synonyms are for instance concepts
  which have the relation same_as or has_tradename. The relations mapped_from
  and primary_mapped_from can also be synonym relations (sentence 77314).
  Hyponymy can be expressed by relations like isa, may_be_a and part_of
  (sentence 43382). Sibling relations found in the sentences are for instance
  sib_in_isa. In the example of the sibling relation, where Ethanols, Methanols
  and Ethylene Glycols are all Alcohols (sentence 14668), the relations have no
  specification in RELA, only the REL abbreviation SIB.
      Other Relations – Other relations that were found in the sentences in the
  corpus are for instance co-occurs_with, associated_with, may_diagnose and
  occurs_before.
      Sentences 15403 and 12826 are examples of the relation co-occurs_with,
  while sentences 52963, 51136 and 125709 are examples of the relations asso-
  ciated_with, may_diagnose and occurs_before respectively.


  4                   Discussion
  The examples in this work show that it is possible to derive relations from the
  annotated terms in a sentence utilising only UMLS. In the majority of the
  examples, the derived relation is also expressed in the sentence, but sometimes
  the derived relations are implicit rather than directly expressed. However, as
  exemplified by sentence 75743, the derived relations may not be the ones intended
  in the sentence.
      Since we have studied only concepts and not taken into account the syntax
  of the sentences, the resulting relations between the terms can be relations not




              Table 2. Derived relations and sentences (English translations).

       Sentence ID  Relation
       77591        Opium may_treat Pain
       50380        Aspirin may_prevent Myocardial Infarction
       105057       Soft Tissue Infections may_be_treated_by Methicillin
       75743        Pancreas is_associated_anatomic_site_of Diabetes Mellitus
       8039         Celiac Disease has_finding_site Intestines
       97183        Central Nervous System finding_site_of Rabies
       133130       Bacillus anthracis causative_agent_of Anthrax
       125988       Nitrous Oxide induces Nausea
       77314        Arthritis primary_mapped_from Joint Diseases
       43382        Estrogens isa Hormones
       14668        SIB: Ethanols/Methanols/Ethylene Glycols
       15403        Obesity co-occurs_with Diabetes Mellitus
       12826        Diabetes Mellitus co-occurs_with Hypertension
       52963        Vaccination associated_with Immunization
       51136        Triiodothyronine may_diagnose Thyroid Disease
       125709       Chickenpox occurs_before Herpes Zoster (shingles)

       Sentence ID  Sentence
       77591        [...] opium had a soothing effect on both anxiety and pain.
       50380        [...] observed that patients with a regular intake of Aspirin had fewer
                    heart attacks than expected.
       105057       [...] soft tissue infections [...] and infections caused by methicillin
                    resistant staphylococcus.
       75743        An alternative for a few patients with diabetes mellitus has been
                    transplantation of pancreas [...]
       8039         [...] patients with already known celiac disease but who in spite of a
                    strict diet have had gastro-intestinal symptoms.
       97183        [...] the symptoms of a rabies infection begin when the virus reaches
                    the CNS [...]
       133130       Bacillus anthracis causes the disease Anthrax [...]
       125988       [...] nitrous oxide contributes to post-operative nausea.
       77314        [...] patients with [...] and arthritis symptoms who were treated with
                    [...] improvement of their joint disease [...]
       43382        Even though it is well documented that estrogen [...] is an important
                    hormone in [...]
       14668        [...] have been introduced as an alternative to ethanol in cases of
                    ethylene glycol and methanol poisoning.
       15403        [...] obesity is associated with a highly increased risk of developing [...]
                    diabetes [...]
       12826        Other risk factors for stroke are [...] hypertension, diabetes, [...]
       52963        Participation in the vaccination programs has been very high and
                    sufficient immunization was reached [...]
       51136        Antibodies targeting Triiodothyronine [...] in up to 10 percent of
                    patients with Thyroid diseases.
       125709       [...] a connection between chickenpox and herpes zoster.





  derivable from a sentence. For instance, in sentence 105057 there is no indication
  of any relation between the terms Soft Tissue Infections and Methicillin, but from
  the UMLS we get a may_be_treated_by relation. Many of the derived UMLS
  relations are not explicitly expressed in the sentences, but can be part of the
  context of a sentence, for instance the relation between Diabetes Mellitus and
  Pancreas in sentence 75743.
      Expressing hierarchical relations like isa and part_of could lead to increased
  understanding of the context of the sentence. For example, in sentence 14668 we
  learn from the sibling relation that the terms Ethanols, Methanols and Ethylene
  Glycols have something in common, i.e. they are all Alcohols, which frames the
  concepts in the sentence.
      The terms which have the relation co-occurs_with can have slightly different
  relations to each other in the sentences. For example, in 15403 one problem leads
  to another, but in 12826 the two diseases can both be the cause of a third one.
      One of the major reasons for being able to identify relations in only 25%
  of the sentences with at least two annotations may be that the annotations
  are not at the same hierarchical ontological levels as the relations defined
  in UMLS. A fundamental challenge with the proposed approach is its
  dependence on the quality and source of the original term annotations, as in our
  case, where only MeSH was used in the process of identifying relational annotations.


  5      Conclusions and Future Work
  In this work we have studied the feasibility of utilising the term annotations of
  medical text in connection with a collection of terminologies and vocabularies
  to extend the annotations with relational information. As the examples show,
  this approach can be used not only for annotation, but also to highlight implicit
  information and to structure knowledge.
      Our work complements existing research on extraction of (semantic) relations
  in biomedical text, by focusing on identifying and validating relations between
  conceptual annotations of a text. Hence, instead of the complex process of ex-
  tracting relations in medical papers, we utilise existing annotations to propose
  potential relations covered by a paper.
      This work is based on only one of the source vocabularies and a limited
  corpus. Hence, future work will address the ability to make use of combinations
  of several source vocabularies and more elaborate use of the hierarchical
  ontological relations to increase the ability to identify relations among
  annotated terms. However, utilising for instance hyponymy introduces challenges,
  such as deciding what degree of hyponymy/hypernymy to allow when establishing
  relations among terms, and how to resolve problems with the complex relational
  structure of the UMLS, which has many different types of relations and even
  cycles. For instance, some vocabularies in the UMLS may treat relations like isa
  and part_of as synonymous and some as distinct types of relations. Ongoing work
  also considers other corpora, like parts of MEDLINE6 , for relational annotation.
   6
       www.nlm.nih.gov/databases/databases_medline.html
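
  One way the two challenges named above (how many hyponymy steps to allow, and
  cycles in the relational structure) might be handled is a depth-limited
  traversal that remembers visited concepts. The sketch below is only an
  illustration of that idea, not part of the reported study; the function name,
  the toy graph, the concept "Organic Chemicals" and the artificial cycle are
  all invented for the example.

```python
from collections import deque


def ancestors_within(parents, start, max_depth):
    """Return {ancestor: depth} reachable from `start` in at most max_depth isa steps."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not climb further than the allowed degree of hypernymy
        for parent in parents.get(node, ()):
            if parent not in seen:  # visited check keeps cycles from looping forever
                seen[parent] = depth + 1
                queue.append(parent)
    del seen[start]
    return seen


# Toy isa graph (child -> parents) with an artificial cycle at the top.
toy_isa = {
    "Ethanols": ["Alcohols"],
    "Methanols": ["Alcohols"],
    "Alcohols": ["Organic Chemicals"],
    "Organic Chemicals": ["Alcohols"],
}
one_step = ancestors_within(toy_isa, "Ethanols", 1)
many_steps = ancestors_within(toy_isa, "Ethanols", 5)
```

  Two annotated terms could then be considered related if their ancestor sets
  within the chosen depth overlap; which depth is appropriate is exactly the
  open question raised above.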




  References
  1. Abacha, A.B., Zweigenbaum, P.: A hybrid approach for the extraction of semantic
     relations from MEDLINE abstracts. In: Proceedings of 12th International
     Conference on Intelligent Text Processing and Computational Linguistics
     (CICLing 2011) (2011)
  2. Frunza, O., Inkpen, D.: Extraction of disease-treatment semantic relations from
     biomedical sentences. In: BioNLP Workshop (ACL 2010) (2010)
  3. Kokkinakis, D., Gerdin, U.: A Swedish scientific medical corpus for terminology
     management and linguistic exploration. In: Proceedings of the Seventh conference
     on International Language Resources and Evaluation (LREC 2010) (2010)
  4. Patel, C.O., Cimino, J.J.: Using semantic and structural properties of the unified
     medical language system to discover potential terminological relationships. J Am
     Med Inform Assoc 16(3), 346–353 (2009), http://dx.doi.org/10.1197/jamia.M2931
  5. Segura-Bedmar, I., Crespo, M., de Pablo-Sanchez, C., Martinez, P.: Resolving
     anaphoras for the extraction of drug-drug interactions in pharmacological
     documents. BMC Bioinformatics 11 Suppl 2, S1 (2010),
     http://dx.doi.org/10.1186/1471-2105-11-S2-S1
  6. Yao, L., Sun, C., Wang, X., Wang, X.: Relationship extraction from biomedical
     literature using maximum entropy based on rich features. In: Proceedings
     International Conference on Machine Learning and Cybernetics (ICMLC 2010),
     pp. 3358–3361 (2010)



