=Paper=
{{Paper
|id=None
|storemode=property
|title=Relational Annotation of Scientific Medical Corpora - A Case Study
|pdfUrl=https://ceur-ws.org/Vol-744/paper4.pdf
|volume=Vol-744
}}
==Relational Annotation of Scientific Medical Corpora - A Case Study==
Relational Annotation of Scientific Medical
Corpora
A Case Study
Ann-Marie Eklund
Department of Swedish, University of Gothenburg, Sweden�
Abstract. In life science and biomedicine much knowledge resides as
unstructured information in for instance bibliographic databases. To fa-
cilitate searching and categorisation of this information the database
entries are annotated with terms or keywords, describing for instance
diseases, treatments and anatomy. These annotations are limited to con-
cept level and do not describe relations between terms, for example that a
given treatment may be used for a given disease, even if this information
is available in both the text and terminologies.
In this work we will present a possible approach to extend term anno-
tations with relational information to add another dimension to concept
focused annotation schemas. This approach could also be used to high-
light implicit information and to structure knowledge.
1 Introduction
In life science and biomedicine much knowledge resides as unstructured in-
formation in for instance bibliographic repositories or electronic health record
databases. To facilitate searching and categorisation of this information the
database entries are annotated with terms or keywords, describing for instance
diseases, treatments and anatomy. These annotations are limited to concept level
and do not describe relations between terms, for example that a given treatment
may be used for a given disease, even if this information is available in the paper.
This annotation limitation remains in spite of accessible relational information
in ontologies and terminologies.
In this work we study the possibility to annotate a term annotated scientific
medical corpus with relations between the terms, for instance relating diseases
to treatments and organ sites. We use a Swedish text corpus of scientific medical
documents [3], which has been annotated with information from the medical
terminology MeSH (Medical Subject Headings)1 .
Current related work has focused on for instance methods for extraction of
specific types of relations [5,6] and extraction and characterisation of semantic
�
The author would like to thank Centre for Language Technology, Gothenburg
(clt.gu.se) for financial support.
1
www.nlm.nih.gov/mesh
27
Relational Annotation of Scientific Medical Corpora - A Case Study
relations [1,2,4] in biomedical text. For instance, Abacha and Zweigenbaum [1],
Frunza and Inkpen [2] and Yao et al [6] studied cure, prevent and side effect rela-
tions in medical papers and Segura-Bedmar et al [5] studied resolving anaphoras
for extraction of drug-drug interaction in pharmacological documents.
However, our work focuses not on establishing methods for extraction of rela-
tions between terms, but on tying existing term annotations together, reflecting
relations in the text. These new relational annotations would allow both medical
scientists and search engines to take advantage of highlighted implicit informa-
tion on for instance diseases, treatments and anatomy. For example a paper
regarding prevention of myocardial infarction may be annotated with terms like
Aspirin and Myocardial Infarction and our proposal is to also annotate it with
the relation may_prevent relating these terms. Not only would this add new
useful annotations, but it would also structure existing annotations.
2 Materials and Methods
2.1 Materials
The main resources in this study are an annotated Swedish scientific medical cor-
pus [3] and the vocabularies and terminologies of the Unified Medical Language
System2 .
Medical Text Corpus. As a part of the Swedish strategy for e-health, the
clinical terminology SNOMED CT3 has been translated into Swedish. For vali-
dation and quality assessment of the translation, a Swedish medical text corpus
was created from the electronic archives of the Journal of the Swedish Medical
Association 1996-2009 [3]. The corpus comprises 29110 documents (28 million
tokens) and has been part-of-speech tagged and annotated with Swedish and
English MeSH (release 2006) and with the Swedish SNOMED CT.
Our study focuses on the MeSH annotated sentences in a part of the corpus
containing 2021 articles from the domain “Klinik och Vetenskap” (Medical Prac-
tice and Science). This part contains 140458 sentences, each with a unique id.
For copyright reasons the order of the sentences have been randomised, thereby
limiting our study to relational annotations at sentence level. Of this sentence
set we used only the 102821 sentences with at least two MeSH annotations.
UMLS. The Unified Medical Language System (UMLS)4 connects vocabular-
ies from different biomedical and health-related sources in different languages. It
provides, among other things, databases, called Knowledge Sources. One of the
databases, Metathesaurus, contains information about more than one million
biomedical or health-related concepts. This database is divided into a number
2
www.nlm.nih.gov/research/umls
3
www.ihtsdo.org
4
We have used UMLS version 2010AB, including all source vocabularies of level 0-3.
28
Relational Annotation of Scientific Medical Corpora - A Case Study
of relational tables. One of the major tables, MRCONSO, contains the struc-
ture for each concept, e.g. names, identifiers, languages and source vocabularies.
This table is complemented with the MRSAT connecting MeSH identifiers (or
identifiers from other source vocabularies) to concept identifiers (CUI).
Metathesaurus also contains relations between concepts, where the table MR-
REL contains basic relations (REL), e.g. Parent/Child, relating different con-
cepts. Around 25% of the relations have a label (RELA - Relationship Attribute)
which comes from the source vocabulary and specifies the relationship, e.g. isa,
treated_by, finding_site_of or has_component.
Considering the MeSH part of the UMLS, a term can belong to more than
one category and thereby appear in several places in the MeSH hierarchy with
different MeSH identifiers. Moreover, by the annotation procedure used in our
study corpus, a term like Blood Pressure will be annoted as Blood Pressure, but
also as Blood and Pressure. Hence, the MeSH annotations can be nested, which
can result in more UMLS concepts than found terms in a sentence.
2.2 Methods
Since this work can be seen as a feasibility study of extending term annotation
schemas with relational information, the methods have been kept simple and
analysis and validation of the approach were done by manual inspection.
For each of the sentences in our corpus containing at least two MeSH annota-
tions, we extracted the sentence identifier and MeSH annotations, not taking into
account any nesting of the annotations. The resulting information was stored in a
MySQL database, complementing the UMLS one, thereby allowing easy mapping
of MeSH identifiers to UMLS concepts via the MRSAT table. These concepts
could then be used to derive relations from MRREL, giving relations between
the annotated terms in each sentence.
Since our main interest is in the ability to provide relational annotations
among annotated terms, the analysis focused on the derived relations with infor-
mation in the RELA field of MRREL. One such RELA relation is may_prevent,
e.g. “Aspirin may_prevent Myocardial Infarction”. These relations were divided
into five different categories reflecting disease-treatment, disease-organ, cause-
effect, hierarchical and other relations among the terms to reflect relations often
found in medical papers. The division of the RELA relations into these classes
was based on our subjective interpretation of relations like e.g. may_treat,
has_finding_site and cause_of. For each of the relation categories, we randomly
picked sentences and manually compared the derived relations to the ones ex-
pressed in the sentences to see if and how they were related. Since this work is
to be viewed as a feasibility study for future research, we limited our analysis to
a handful of sentences per relational category.
3 Results
Our study corpus comprised 140458 sentences, with 102821 (73%) containing at
least two MeSH annotations and in 26251 (25%) of these we were able to identify
29
Relational Annotation of Scientific Medical Corpora - A Case Study
relations between the terms. In these sentences we found 150024 relations of
which 44754 (30%) had the specification RELA. There were 188 different RELA.
Table 1 shows the percentage of sentences per RELA relation.
Table 1. Percentages of sentences (of 10854 sentences with RELA) per RELA relation.
Relation Percentage
isa 37
co-occurs_with 10
associated_with 10
has_finding_site 6.8
sib_in_isa 6.8
mapped_from 6.0
location_of 5.3
is_associated_anatomic_site_of 4.0
may_treat 2.8
part_of 2.4
may_prevent 1.4
may_be_a 1.1
causative_agent_of 0.8
Figure 1 shows the 122483 sentences containing MeSH annotated terms di-
vided into number of MeSH terms (left) and number of sentences per number of
relations (right)5 .
In the rest of this section we analyse each of the defined relation categories
and exemplify relations and their corresponding sentences, Table 2, and in Dis-
cussion we briefly elaborate on these results.
Disease - Treatment – The relations between diseases and treatments
found in the studied part of the corpus are e.g. may_treat and may_prevent.
Examples of disease-treatment relations are the relations derived from sentences
77591, 50380 and 105057. From sentence 105057 it is not possible to infer what
the relation is between the terms Soft Tissue Infections and Methicillin.
Disease - Organ – There are a number of different relations between dis-
eases and organs, e.g. is_associated_anatomic_site_of and has_finding_site.
The relation location_of is often a relation between diseases and treatments, but
sometimes refer to organ-organ relations. Disease-organ relations were found in
for example sentences 75743, 8039 and 97183. In 75743 and 8039 the relation
was not explicitly expressed.
Cause - Effect – Cause and effect relations can be for example cause_of,
induces and causative_agent_of. It can be for instance viruses or bacteria which
cause diseases, or diseases that cause other diseases. Examples of these relations
were found in sentences 133130 and 125988.
5
In the UMLS, for many relations there is also an inverse, e.g. finding_site_of and
has_finding_site.
30
Relational Annotation of Scientific Medical Corpora - A Case Study
20000
14000
15000
8000 10000
Sentences
Sentences
10000
6000
4000
5000
2000
0
0
1 4 7 10 13 16 19 22 25 28 31 34 37 47 2 4 6 8 10 12 14 16 18
MeSH terms MeSH relations
Fig. 1. Number of sentences per number of MeSH terms (left) and number of sentences
per number of relations (right).
Hierarchical Relations – The hierarchical relations we have studied are
synonymy, hyponymy and sibling relations. Synonyms are for instance concepts
which have the relation same_as or has_tradename. The relations mapped_from
and primary_mapped_from can be synonym relations, sentence 77314. Hy-
ponymy can be relations like isa, may_be_a and part_of, sentence 43382. Sib-
ling relations found in the sentences are for instance sib_in_isa. In the example
of the sibling relation, where Ethanols, Methanols and Ethylene Glycols are all
Alcohols (sentence 14668), the relations have no specification in RELA, only the
REL abbreviation SIB.
Other Relations – Other relations that were found in the sentences in the
corpus are for instance co-occurs_with, associated_with, may_diagnose and
occurs_before.
Sentences 15403 and 12826 are examples of the relation co-occurs_with,
while sentences 52963, 51136 and 125709 are examples of the relations asso-
ciated_with, may_diagnose and occurs_before respectively.
4 Discussion
The examples in this work show that it is possible to derive relations from the
annotated terms in a sentence utilising only UMLS. The majority of the examples
for the different relation types are existing relations, but sometimes the derived
relations are implicit instead of directly expressed in the sentences. However, as
exemplified in sentence 75743, the derived relations may not be the ones intended
in the sentence.
Since we have studied only concepts and not taken into account the syntax
of the sentences, the resulting relations between the terms can be relations not
31
Relational Annotation of Scientific Medical Corpora - A Case Study
Table 2. Derived relations and sentences (English translations).
Sentence ID Relation
77591 Opium may_treat Pain
50380 Aspirin may_prevent Myocardial Infarction
105057 Soft Tissue Infections may_be_treated_by Methicillin
75743 Pancreas is_associated_anatomic_site_of Diabetes Mellitus
8039 Celiak Disease has_finding_site Intestines
97183 Central Nervous System finding_site_of Rabies
133130 Bacillus antracis causative_agent_of Anthrax
125988 Nitrous Oxide induces Nausea
77314 Arthritis primary_mapped_from Joint Diseases
43382 Estrogenes isa Hormones
14668 SIB: Ethanols/Methanols/Ethylene Glycols
15403 Obesity co-occurs_with Diabetes Mellitus
12826 Diabetes Mellitus co-occurs_with Hypertension
52963 Vaccination associated_with Immunization
51136 Triiodothyronine may_diagnose Thyroid Disease
125709 Chickenpox occurs_before Herpes Zoster (shingles)
Sentence ID Sentence
77591 [...] opium had a soothing effect on both anxiety and pain.
50380 [...] observed that patients with a regular intake of Aspirin had fewer
heart attacks than expected.
105057 [...] soft tissue infections [...] and infections caused by methicillin
resistant staphylococcus.
75743 An alternative for a few patients with diabetes mellitus has been
transplantation of pancreas [...]
8039 [...] patients with already known celiak disease but who in spite of a
strict diet have had gastro-intestinal symptoms.
97183 [...] the symptoms of a rabies infection begins when the virus reaches
CNS [...]
133130 Bacillus antracis causes the disease Anthrax [...]
125988 [...] nitrous oxide contributes to post-operative nausea.
77314 [...] patients with [...] and arthritis symptoms who were treated with
[...] improvement of their joint disease [...]
43382 Even though it is well documented that estrogene [...] is an important
hormone in [...]
14668 [...] have been introduced as an alternative to ethanol in cases of
ethylene glycol and methanol poisoning.
15403 [...] obesity is associated with a highly increased risk of developing [...]
diabetes [...]
12826 Other risk factors for stroke are [...] hypertension, diabetes, [...]
52963 Participation in the vaccination programs has been very high and
sufficient immunization was reached [...]
51136 Antibodies targeting Triiodothyronine [...] in up to 10 percent of
patients with Thyroid diseases.
125709 [...] a connection between chickenpox and herpes zoster.
32
Relational Annotation of Scientific Medical Corpora - A Case Study
derivable from a sentence. For instance, in sentence 105057 there is no indication
of any relation between the terms Soft Tissue Infections and Methicillin, but from
the UMLS we get a may_be_treated_by relation. Many of the derived UMLS
relations are not explicitly expressed in the sentences, but can be part of the
context of a sentence, for instance the relation between Diabetes Mellitus and
Pancreas in sentence 75743.
Expressing hierarchical relations like isa and part_of could lead to increased
understanding of the context of the sentence. For example in 14668, where, by
the sibling relation, we learn that the terms Ethanols, Methanols and Ethylene
Glycols have something in common, i.e. they are all Alcohols, thereby framing
the concepts in the sentence.
The terms which have the relation co-occurs_with can have slightly different
relations to each other in the sentences. For example in 15403 one problem leads
to another, but in 12826 the two diseases can both be the cause of a third one.
One of the major reasons for only being able to identify relations in 25%
of the sentences with more than two annotations, may be that the annotations
are not at the same hierarchical ontological levels in comparison to the defined
relations in UMLS. A fundamental challenge with the proposed approach is its
dependence on the quality and source of the original term annotations, as in our
case using only MeSH in the process to identify relational annotations.
5 Conclusions and Future Work
In this work we have studied the feasibility of utilising the term annotations of
medical text in connection with a collection of terminologies and vocabularies
to extend the annotations with relational information. As the examples show,
this approach can be used not only for annotation, but also to highlight implicit
information and to structure knowledge.
Our work complements existing research on extraction of (semantic) relations
in biomedical text, by focusing on identifying and validating relations between
conceptual annotations of a text. Hence, instead of the complex process of ex-
tracting relations in medical papers, we utilise existing annotations to propose
potential relations covered by a paper.
This work is based on only one of the source vocabularies and a limited cor-
pus. Hence, future work will address the ability to make use of combinations
of several source vocabularies and more elaborated use of the hierarchical on-
tological relations to increase the ability to identify relations among annotated
terms. However, utilising for instance hyponymy induces challenges like degree of
hyponymy/hypernymy to allow in establishing relations among terms, and also
how to resolve problems with the complex relational structure of the UMLS with
many different types of relations and even cycles. For instance, some vocabular-
ies in the UMLS may treat relations like isa and part_of as synonymous and
some as distinct types of relations. Ongoing work also considers other corpora,
like parts of MEDLINE6 , for relational annotation.
6
www.nlm.nih.gov/databases/databases_medline.html
33
Relational Annotation of Scientific Medical Corpora - A Case Study
References
1. Abacha, A.B., Zweigenbaum, P.: A hybrid approach for the extraction of semantic
relations from MEDLINE abstracts. In: Proceedings of 12th International Confer-
ence on Intelligent Text Processing and Computational Linguistics (CICLing 2011)
(2011)
2. Frunza, O., Inkpen, D.: Extraction of disease-treatment semantic relations from
biomedical sentences. In: BioNLP Workshop (ACL 2010) (2010)
3. Kokkinakis, D., Gerdin, U.: A Swedish scientific medical corpus for terminology
management and linguistic exploration. In: Proceedings of the Seventh conference
on International Language Resources and Evaluation (LREC 2010) (2010)
4. Patel, C.O., Cimino, J.J.: Using semantic and structural properties of the unified
medical language system to discover potential terminological relationships. J Am
Med Inform Assoc 16(3), 346–353 (2009), http://dx.doi.org/10.1197/jamia.M2931
5. Segura-Bedmar, I., Crespo, M., de Pablo-Sanchez, C., Martinez, P.: Resolving
anaphoras for the extraction of drug-drug interactions in pharmacological docu-
ments. BMC Bioinformatics 11 Suppl 2, S1 (2010), http://dx.doi.org/10.1186/1471-
2105-11-S2-S1
6. Yao, L., Sun, C., Wang, X., Wang, X.: Relationship extraction from biomedical liter-
ature using maximum entropy based on rich features. In: Proceedings International
Conference on Machine Learning and Cybernetics, (ICMLC 2010). pp. 3358–3361
(2010)
34