=Paper=
{{Paper
|id=Vol-222/paper-9
|storemode=property
|title=The Development of a Schema for the Annotation of Terms in the Biocaster Disease Detecting/Tracking System
|pdfUrl=https://ceur-ws.org/Vol-222/krmed2006-p09.pdf
|volume=Vol-222
|dblpUrl=https://dblp.org/rec/conf/krmed/KawazoeJSBTC06
}}
==The Development of a Schema for the Annotation of Terms in the Biocaster Disease Detecting/Tracking System==
KR-MED 2006 "Biomedical Ontology in Action"
November 8, 2006, Baltimore, Maryland, USA
The development of a schema for the annotation of terms in the BioCaster
disease detecting/tracking system
Ai Kawazoe*1, Ph.D., Lihua Jin*1, Ph.D., Mika Shigematsu*3, M.D.,
Roberto Barrero*2, Ph.D., Kiyosu Taniguchi*3, M.D., Nigel Collier*1, Ph.D.
*1 National Institute of Informatics, Hitotsubashi 2-1-2 Chiyoda-ku Tokyo, JAPAN
*2 National Institute of Genetics, Yata 1111 Mishima Shizuoka, JAPAN
*3 National Institute of Infectious Diseases, Toyama 1-23-1 Shinjuku-ku Tokyo, JAPAN
*1 {zoeai,lihua-jin,collier}@nii.ac.jp, *2 rbarrero@genes.nig.ac.jp, *3 {mikas,tanigk}@nih.go.jp

Amid growing public concern about the spread of infectious diseases such as avian influenza and SARS, there is an increasing need for collecting timely and reliable information about disease outbreaks from natural language data such as online news articles. In this paper we introduce BioCaster, a text mining-based system for infectious disease detection and tracking currently being developed, and discuss the development of a domain ontology and schema for the annotation of terms. In particular we focus on the comparison between two approaches, 1) a traditional task-oriented approach with a simple schema that does not strictly follow ontological principles, and 2) a formal approach which is ontologically well-founded but adds extra requirements to the annotation schema. We report on several critical problems that were highlighted by an entity annotation experiment, attributable to the purely task-oriented ontology design. A second experiment based on a formally constructed ontology produced improved annotation results despite the apparent complexity of the annotation schema.

1. INTRODUCTION

As shown by the recent outbreak of Severe Acute Respiratory Syndrome (SARS) and emerging cases of avian influenza, infectious diseases have the potential to spread rapidly through person-to-person transmission within densely populated areas and across country borders through international air travel. The first line of defense against rapidly spreading diseases is surveillance, led by the World Health Organization (WHO) and national health authorities. Catching an outbreak earlier has clear implications for both morbidity and mortality as well as the feasibility of containment [1]. However, a lack of surveillance system infrastructure in Southeast Asia, which is currently the focus of an avian H5N1 epidemic, is seen as hindering control efforts. In addition to traditional surrogate methods such as reporting notifiable diseases and over-the-counter (OTC) sales monitoring, public health experts are increasingly considering news and other reports available on the World Wide Web (Web) as a cost-effective means of helping to find and track early cluster cases, enabling a timely and appropriate response. Such rumour-based information may be of particular value for assessing possible outbreaks in areas where formal reporting procedures are absent or not well established.

Several major challenges exist in locating Web-based information in a timely manner using traditional search methods: (1) the massively increasing volume of dynamically changing unstructured news data available on the Web makes it extremely difficult to obtain a clear picture of an outbreak in a timely manner, (2) the large-scale republication of reports from centralized news agencies requires redundancy to be identified and removed, (3) the initial reports of an outbreak are contained in only a few news articles which will usually be overlooked by traditional search engines which use keyword indexing, (4) the first reports of an infectious disease will often be reported in local news media which are only available in the local language. Experience has shown that this requires computer systems to have at least a partial understanding of the domain through ontologies, term lists and databases as well as specialized multilingual resources.

To address the information needs in the domain of infectious disease outbreaks, standard Information Extraction technology has been adapted for retrospective archive search [2], but only a few systems are currently actively deployed, with the most prominent being the Global Public Health Intelligence Network (GPHIN) [3], a successful but semi-closed system used by the WHO. We are now developing BioCaster, a text mining system based on an openly available multilingual ontology for proactive notification about priority disease outbreaks. A key component of the BioCaster system is the use of automated learning methods to identify novel entities and events using features derived from annotated examples in a multilingual collection of news articles. The initial target languages are English, Japanese, Vietnamese and Thai.

In our early development of BioCaster it became clear that we needed a rigorous schema for markable entities. Since the system relies on high quality human annotated training data for constructing
named entity recognizers (NERs), any inconsistency introduced into the annotation schema by ontological inconsistencies should be harmful for annotation performance, both human and machine. Surprisingly, while there have been several studies on the mapping problem between terms and coding systems such as the UMLS Metathesaurus [4] as well as biomedical annotation experiments [5] [6] [7], there have been, to the best of our knowledge, no studies conducted into the method by which new domain models suitable for biomedical text mining should be organized. We report here on our initial experience which showed that the task-oriented annotation schema based on a poorly-considered domain ontology can indeed be harmful to accuracy. Re-organizing this schema using well founded ontological principles produced better results, despite the added complexity.

2. USER NEEDS

Epidemiologists are concerned with the circumstances in which diseases occur in a population and the factors that influence their incidence, spread, recognition and control. Our initial discussions with domain experts at the National Institute of Infectious Diseases revealed several common scenarios for gathering information from Web news including cases involving the spread of a communicable disease across international borders and the contamination of blood products. From these initial discussions we collected examples of early outbreak news reports and compiled a list of significant entity classes which included DISEASE¹, CASE, LOCATION, SYMPTOM, TIME, DRUG, etc. Subsequent follow up discussions and examination of the literature revealed that we can categorize these concepts according to the information needs of the scientists as shown in Table 1.

Genetic epidemiology adds another dimension to the information needs as the genetic makeup of the host plays a key role in determining susceptibility or resistance to pathogens. We therefore chose to add in a further level of detail about the host which includes genes and their products, identified with a §. Finally we had 19 categories of concepts which we want to identify in news texts (Table 2).

¹ We will adopt here the notation of using all upper case for domain entity classes.

3. CONSIDERATION ON TWO APPROACHES

At this stage we were aware that some of the important concepts in Table 2 are contextually-dependent and intrinsically different from others. For example, CASE and TRANSMISSION represent roles (discussed in [8] [9] [10] [11] among others) which are dependent on the existence of events they participate in, while most others, such as PERSON, BACTERIA, and NON_HUMAN, represent types.

We had two options for constructing the ontology and annotation schema, according to how to deal with concepts of a different nature. The first approach is rather task-oriented. Here we do not make any distinction between context-dependent concepts and others. This results in a somewhat simpler ontology: all categories of concepts are represented as classes which follow a disjoint entity class principle that has been the underlying premise of NERs. The corresponding annotation schema will also be simpler, since instances of context-dependent classes are annotated in the same way as those of other classes, e.g.

Kofi Annan

a 12 year-old girl infected with H5N1

(The details of this schema will be given in the next section.) In this task-oriented approach, we can annotate exactly what the event frame needs to identify. For example, we can exclude from annotation non-named, non-case mentions, which we are not interested in. A defect of this approach is that it is not ontologically well-founded.

The alternative approach is a more formal one where we make a clear distinction between context-dependent concepts and others, based on well-founded ontological principles. The result is likely to be a more complex ontology in which context-dependent concepts have a different status from other concepts. The corresponding annotation schema will also be more complex, since roles are annotated in a different way from those of entity classes. In order to achieve ontological consistency we also need to annotate more mentions than the former approach, including those that will not instantiate event frames.

From the two approaches above, out of expediency we chose the former for the first annotation experiment. The reason was that it seemed easier for annotators and that we could find almost no precedent works in named entity annotation which dealt with formal analysis of entities and role concepts.

4. ANNOTATION EXPERIMENT 1

4.1 Method

Based on the list of categories of concepts in Table 2, we constructed the ontology shown in Figure 1. Note that CASE and TRANSMISSION, which represent
{| class="wikitable"
! Focus !! Description !! Example properties !! Concept types
|-
| Agent || Pathogens || Infectivity, pathogenicity, virulence, incubation period, communicability || VIRUS, BACTERIA, PARASITE*, FUNGI*
|-
| Transmission || The delivery or dispersal method || Dermal, oral, respiratory || TRANSMISSION
|-
| Host || Persons carrying a disease || Age, gender, occupation || CASE, SYMPTOM, DISEASE, ANATOMY, DNA§, RNA§, PROTEIN§
|-
| Environment || Location and climate || Large population centre, enclosed building, mass transport system, rural village || LOCATION, TIME
|}
(* Not included in the current schema. § Genetic level entities.)

Table 1 Categorization of concepts

{| class="wikitable"
! Classes !! Examples !! Description
|-
| ANATOMY || liver, pancreas, nervous system, HeLa cell || Body parts including tissues and cells
|-
| BACTERIA || Escherichia coli O157, tubercle bacillus || Eubacteria
|-
| CASE || a 35-year-old woman, the third case || Confirmed cases of diseases
|-
| NT_CHEMICAL || beryllium, organophosphate pesticide || Chemicals intended for non-therapeutic purposes*1
|-
| T_CHEMICAL || Relenza, immunosuppressive drug, oseltamivir || Chemicals intended for the treatment of diseases*1
|-
| CONTROL || stamping out, screening, vaccination || Control measures to lower the risk of transmission of a disease
|-
| DISEASE || H5N1 avian influenza, SARS, cholera || A deviation in the normal functioning of the host caused by a persistent agent (pathogen) or some environmental factor
|-
| DNA || Sp1 site, triple-A, c-jun gene || Includes the names of DNAs, groups, families, molecules, domains and regions*2
|-
| LOCATION || Viet Nam, Jakarta, Sumatra Island, Asia || A politically or geographically defined location*3
|-
| NON_HUMAN || civet cats, poultry, flies || Multi-cell organism other than humans, i.e. "animals"
|-
| ORGANIZATION || the Ministry of Health, WHO, Pasteur Institute || Corporate, governmental, or other organizational entity*3
|-
| PERSON || Jean Chretien, Murray McQuigge || A named person or family
|-
| PRODUCT || botulism antitoxin, Influenza vaccine || Biological product (e.g. vaccines, immune sera)
|-
| PROTEIN || STAT, RNA polymerase II alpha subunit || Includes the names of proteins, groups, families, molecules, complexes and substructures*2
|-
| RNA || IL-2R alpha transcripts, TNF mRNA || Includes the names of RNAs, groups, families, molecules, domains and regions*2
|-
| SYMPTOM || cough, fever, dehydration, convulsion || Alterations in the appearance of a case due to a disease
|-
| TIME || Tue Jan 3, winter, March, since October, 2003 || Temporal expressions that can be anchored on a timeline*4
|-
| TRANSMISSION || HIV-tainted blood products, BSE-infected cows || Source of infection
|-
| VIRUS || Ebola virus, HIV || Viruses such as HIV, HTLV, EBV*2
|}
Descriptions marked with *1, *2, *3, *4 are based on those in MeSH [12], GENIA ontology [13], MUC-7 [14], and HUB-4 [15], respectively.

Table 2 List of classes of markable concepts

roles, have the same status as other classes since we adopted the task-oriented approach as discussed in the last section. We developed annotation guidelines to annotate non-overlapping mentions related to the classes in news articles and hired two PhD informatics students as annotators. After one week of training consisting of guideline review, case study discussions and test cases, we started the annotation process with 200 news articles taken from domain sources, including WHO epidemic reports, IRIN, and Reuters news.

Figure 1 Initial domain ontology (simplified)

In order to restrict the markable mentions to exactly those that we aimed to identify with the text mining system, we defined CASE as the class of confirmed cases which are unnamed, and PERSON as the class of named persons who are not cases. We considered this would narrow down the number of markable mentions since unnamed mentions for non-cases need not be annotated. We also instructed annotators to mark up only the single most appropriate class, and prohibited multiple classes. An example of annotated text is shown below:

The Ministry of Health in Indonesia has today confirmed a fatal human case of H5N1 avian influenza. A 27-year-old woman from Jakarta developed symptoms on 17 September. She contracted the virus from close contact with infected birds.

In the annotation schema used in the example above, the attribute cl takes the entity class label as its value. For example, "Kofi Annan" annotated with cl="PERSON" means that the entity mentioned by "Kofi Annan" is related to the class PERSON. The reason for using this rather vague expression is to cover two relations between mentioned entities and the ontology we want to describe. The first is "is an instance of", and the other one is "is a subclass of". Some of the markable texts mention a particular and others mention a universal. For example, names of persons, locations and organizations are usually used to refer to a particular, whereas names of chemical substances, viruses and proteins are often used to refer to universals. This is one of the factors which makes ontology-based annotation a complicated process. It should be noted though that we intend to work towards a clear distinction between the two relations in future work.

4.2 Annotation results and problems

During the first annotation experiment, we had many problem reports from annotators, and found a significant number of inconsistencies in the annotation results. Most of the problems could be traced back to poor design of the domain ontology and the annotation schema. Follow up analysis on the corpus yielded the following symptoms of error:

• Gaps in the annotation schema, shown by the existence of mentions of entities which it is desirable to annotate but which the annotation schema does not cover.
• Ambiguity between context-dependent concepts and context-independent ones.
• Idiosyncratic annotations which are forced on annotators due to the disjoint entity class principle.

Gaps in the annotation schema

At the initial stage of our analysis we considered that distinguishing CASE (as confirmed cases of a disease which are unnamed humans) from PERSON (named persons who are not cases of a disease) was rather natural, since CASE entities are in general anonymous. However, in the news articles there were some examples where cases were mentioned by name as follows:

E1 Tests carried out in a UK laboratory confirmed

² In this example we only show initials of the victims' names.

In addition, we found that there were more frequent mentions of putative cases than we had expected.
These mentions were often annotated as CASE by annotators although we restricted the scope of this class only to confirmed cases.

E2 a Taiwanese is suspected to have died of SARS

Follow up discussions with public health experts revealed that mentions of putative cases are important, especially in the early stages of disease outbreaks, and we concluded that they should be identified by the system. However, the existing framework made them difficult to capture.

Ambiguity caused by context-dependent concepts

One of the classes which confused annotators most was TRANSMISSION (source of infection). Below are typical examples of problematic cases.

E3 Victims contract the virus from close contact with infected birds
E4 There is no known cure for Ebola, which is transmitted via infected body fluids
E5 An Irish woman infected with Hepatitis C by a contaminated blood product
E6 18 hospitalized after consuming chapattis

Annotators had a problem in annotating 'birds' in E3 since those can be classified as both TRANSMISSION and NON_HUMAN (animals). 'Body fluids' in E4 is also ambiguous between TRANSMISSION and ANATOMY (body parts), and 'blood product' in E5 is ambiguous between TRANSMISSION and PRODUCT (biological product). Most of the TRANSMISSION instances found in the text were those which could be categorized as NON_HUMAN, and the cases which belonged only to TRANSMISSION, such as 'chapattis' in E6, were very few.

Idiosyncratic annotations due to the disjoint entity class principle

E7 Hudd has written several books on music hall and Variety...
E8 Doctors later diagnosed Hudd with a chest infection...

In the example above, it is clearly undesirable that the same entity is related to PERSON in E7 and CASE in E8. Although the annotator was aware of the choices, the principle of disjoint classes forced a choice.

4.3 Empirical results from training an NER

We trained a support vector machine [16] (for details, see Takeuchi and Collier [17]) for named entity recognition based on the annotated corpus of 200 news articles. 10-fold cross validation experiments were performed using TinySVM³. A -2/+1 feature window was used, with features including surface word, orthography, biomedical prefixes/suffixes, lemma, head noun and previous class predictions. The F-score for all the classes in Table 2 was 76.96. The problematic classes were found to be PERSON, CASE and NON_HUMAN (many instances of which were ambiguous with TRANSMISSION), which had F-scores below our expectation: PERSON (54.95), CASE (53.17), NON_HUMAN (68.0).

5. ANNOTATION EXPERIMENT 2

5.1 Re-examination of the approach

Although we chose the task-oriented approach for its simplicity and ease of implementation, the results from automatic NER and subsequent corpus analysis revealed that problems arose because we made no clear distinction between context-dependent and context-independent classes. We decided to take an alternative, formal and linguistically-sound approach, and distinguish context-dependent concepts from others in both the ontology and the annotation schema.

5.2 Classification of concepts

The first step was to use the classification method proposed by Guarino and Welty ([9] and [10]), which is based on meta-properties (rigidity, identity, dependency), in order to classify categories of concepts in Table 2. Definitions of the meta-properties we used are as follows:

([10], p.4)
rigid property φ (+R): ∀x φ(x) → □φ(x)
anti-rigid property φ (~R): ∀x φ(x) → ¬□φ(x)

([10], p.5)
Identity Condition (IC): An identity condition is a formula Γ that satisfies either of the following⁴:

³ Available from http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM

⁴ In [9], further restrictions are added in order to avoid 1) the case where the necessary IC definition becomes trivially true regardless of the truth value of the formula x=y and 2) the case where Γ(x, y, t, t') is false and that makes the sufficient IC definition trivially true.
{| class="wikitable"
! Class !! rigidity !! identity (supplying) !! identity (carrying) !! dependency !! classification
|-
| ANATOMY || +R || +O || +I || -D || Type
|-
| BACTERIA || +R || +O || +I || -D || Type
|-
| CASE || ~R || -O || +I || +D || Material Role
|-
| NT_CHEMICAL || ~R || -O || +I || +D || Material Role
|-
| T_CHEMICAL || ~R || -O || +I || +D || Material Role
|-
| CONTROL || ~R*1 || -O*2 || +I || +D || Material Role
|-
| DISEASE || +R || +O*3 || +I || +D || Type
|-
| DNA || +R || +O || +I || -D || Type
|-
| LOCATION || +R || +O || +I || -D || Type
|-
| NON_HUMAN || +R || +O || +I || -D || Type
|-
| ORGANIZATION || +R || +O || +I || -D || Type
|-
| PERSON || +R || +O || +I || -D || Type
|-
| PRODUCT || +R || +O || +I || +D || Type
|-
| PROTEIN || +R || +O || +I || -D || Type
|-
| RNA || +R || +O || +I || -D || Type
|-
| SYMPTOM || +R || +O || +I || +D || Type
|-
| TIME || +R || +O || +I || -D || Type
|-
| VIRUS || +R || +O || +I || -D || Type
|-
| TRANSMISSION || ~R || -O || -I || +D || Formal Role
|}
*1 We consider that this class is anti-rigid, since it is possible that an action which is an instance of CONTROL in the current world is not an instance of CONTROL in some other accessible world. The same action may be conducted for different purposes in different worlds.

*2 This class includes events. In DOLCE top level categories (Gangemi et al. [19]), Events are under the class of Perdurant/Occurrence. It seems to be controversial what the identity condition for events should be. Davidson [20] proposes a condition such that "events are identical if and only if they have exactly the same causes and effects". In any case it should be reasonable to assume that this class itself does not supply ICs but inherits them from the upper level classes.

*3 What we consider ICs for this class is as follows: Two instances of diseases are identical iff the two are experienced by the same host at the same time, are caused by the same agent (e.g. H5N1 virus for "H5N1 avian influenza") and have the same set of characteristic alterations/symptoms (e.g. inflammation of the lung for "pneumonia").

Table 3: Classification of concepts
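
The pattern of marks in Table 3 can be read as a small decision rule. The following Python sketch is only an illustration of that reading, not the procedure the authors used: the function name `classify`, the boolean encoding of the +O/+I/+D marks, and the "Unclassified" fallback are assumptions made for the example, and the rule covers only the combinations that actually occur in Table 3.

```python
def classify(rigidity, supplies_ic, carries_ic, dependent):
    """Return the Table 3 label for one combination of meta-property marks.

    rigidity    -- "+R" (rigid) or "~R" (anti-rigid)
    supplies_ic -- True for +O, False for -O
    carries_ic  -- True for +I, False for -I
    dependent   -- True for +D, False for -D
    """
    # Rigid properties that supply and carry an identity condition are Types,
    # whether or not they are externally dependent (compare PERSON and DISEASE).
    if rigidity == "+R" and supplies_ic and carries_ic:
        return "Type"
    # Anti-rigid, dependent properties that supply no IC are Roles; carrying
    # an inherited IC separates Material Roles from Formal Roles.
    if rigidity == "~R" and not supplies_ic and dependent:
        return "Material Role" if carries_ic else "Formal Role"
    # Combinations that do not appear in Table 3 (hypothetical fallback).
    return "Unclassified"


# A few rows copied from Table 3: (rigidity, +O?, +I?, +D?)
ROWS = {
    "PERSON": ("+R", True, True, False),
    "DISEASE": ("+R", True, True, True),
    "CASE": ("~R", False, True, True),
    "TRANSMISSION": ("~R", False, False, True),
}

for name, marks in ROWS.items():
    print(name, "->", classify(*marks))
```

Running the loop reproduces the classification column for the four sample rows: PERSON and DISEASE come out as Type, CASE as Material Role, and TRANSMISSION as Formal Role.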
necessary IC: E(x, t) ∧ φ(x, t) ∧ E(y, t') ∧ φ(y, t') ∧ x=y → Γ(x, y, t, t')
sufficient IC: E(x, t) ∧ φ(x, t) ∧ E(y, t') ∧ φ(y, t') ∧ Γ(x, y, t, t') → x=y
(E: "actually exist at time t")

Any property φ carries an IC (+I) iff it is subsumed by a property supplying that IC.

A property φ supplies an IC (+O) iff i) it is rigid; ii) there is a necessary or sufficient IC for it; and iii) the same IC is not carried by all the properties subsuming φ.

([10], p.7)
externally dependent property φ (+D): ∀x □(φ(x) → ∃y ω(y) ∧ ¬P(y, x) ∧ ¬C(y, x))
(P: "is a part of")
(C: "is a constituent of")

Classification results are shown in Table 3. Most concepts such as ANATOMY, NON_HUMAN, and PERSON are classified as Type, whereas the concepts which were problematic in the first experiment were classified as Role: TRANSMISSION (Formal Role) and CASE (Material Role). According to the further classification of non-rigid concepts by Kaneiwa and Mizoguchi [18], these cases are classified as time-dependent concepts.

5.3 Modification of the schema

For some of the roles in Table 3, we modified their status in the annotation schema.

CASE

CASE and PERSON were problematic since we distinguished them according to the form of expression (unnamed/named), in addition to the case/non-case distinction. In order to cover the mentions which could not be annotated in the first experiment, we extended the scope of the PERSON class to include person instances in general, and eliminated the unnamed/named and case/non-case distinctions. We modified the annotation schema so that CASE is not a value of the cl attribute, but is the case attribute which applies to the referred instance of PERSON. This attribute takes the value true when the mentioned instance is a confirmed case of disease,
false when the instance is not a case, and putative when the instance is a suspected case. Named case mentions and suspected case mentions are annotated as follows:

E9 Tests carried out in a UK laboratory confirmed that M.A ...
E10 a Taiwanese is suspected to have died of SARS

The meaning of case attribute-value pairs can be described in logical description and natural language as follows:

<...cl="PERSON" case="true">John: case(j)
"It is true that the person j mentioned by "John" is an instance of the CASE class"

<...cl="PERSON" case="false">John: ¬case(j)
"It is false that the person j mentioned by "John" is an instance of the CASE class"

<...cl="PERSON" case="putative">John: ◇case(j)
"It is possible that the person j mentioned by "John" is an instance of the CASE class"

As shown above, the values of the case attribute correspond to logical operators such as ¬ and ◇. The values of case attributes specify the modes of linkage between the referred concept and the CASE class. The formal basis we had in mind when formulating the case attribute is as follows: 1) every instance of a non-rigid class must be an instance of some rigid class, 2) the relations between a non-rigid class and its instance are often modified by modal/temporal operators. The first point drove us to create the case attribute, which applies to instances of some rigid class, here PERSON. The second point is the motivation for us to set values to include negative and modal operators. This schema can be extended if we allow a wider value range for the case attribute to include other modal/temporal operators, although currently we restrict the values to the three above.

It is worth noting that there is a trade-off between this revised schema and the former schema, which is that we have increased the number of markable entities, since we need to annotate unnamed, non-case mentions which are not directly related to the purpose of the system.

TRANSMISSION

We defined the transmission attribute, which applies to mentions of ANATOMY, PRODUCT, PERSON and NON_HUMAN classes. As shown in the following examples, 'birds' are always related to NON_HUMAN, and take a 'true' value only when ... also take a 'putative' value to cover mentions of possible sources of infection.

E11 Victims contract the virus from close contact with infected birds

T_CHEMICAL/NT_CHEMICAL

Concept classification revealed that T_CHEMICAL and NT_CHEMICAL have "the situation dependency obtained from extending types" discussed in [18] and have the same status as 'weapon' and 'table'. T_CHEMICAL includes chemicals mentioned as drugs in any context and those regarded as drugs in some context. Here we removed the two classes and made the parent node CHEMICAL a class for annotation.

We then defined the therapeutic attribute, which applies to mentions of CHEMICAL and takes the value true when the entity is intended for therapeutic use and false otherwise.

As a result of the modifications above, our revised ontology is shown in Figure 2. We also added new classes CONDITION (status of patients: 'hospitalized', 'died', 'in critical condition', etc.) and OUTBREAK (collective disease incident: 'outbreak', 'pandemic', etc.). Information about CONDITION is important for experts to know the rate of hospitalization and death and determine the alert level. Mentions of OUTBREAK include expressions which are specific to disease outbreak news, increasing the specificity of our detection system. We located PERSON and NON_HUMAN under metazoa, and added a number attribute (which takes one or many as its value) to be applied to PERSON instances.

With insights from the revised ontology we also changed the annotation method by dividing the process into two distinct stages as shown in Figure 3: 1) annotation of mentions of non-role (rigid) concepts and 2) annotation of role (non-rigid) concepts.
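
The revised schema of Section 5.3, in which role concepts become attributes on mentions of rigid classes, can be sketched as a small data structure. This is a minimal Python illustration under stated assumptions, not the BioCaster implementation: the names `Mention`, `validate` and `logical_form` are hypothetical, the attribute-to-class table is transcribed from the text, and the 'false' value for the transmission attribute is assumed by analogy with the case attribute.

```python
from dataclasses import dataclass, field

# Attribute -> classes it may be applied to, as described in Section 5.3.
APPLICABLE = {
    "case": {"PERSON"},
    "transmission": {"ANATOMY", "PRODUCT", "PERSON", "NON_HUMAN"},
    "therapeutic": {"CHEMICAL"},
    "number": {"PERSON"},
}

# Attribute -> allowed values ("false" for transmission is an assumption).
ALLOWED = {
    "case": {"true", "false", "putative"},
    "transmission": {"true", "false", "putative"},
    "therapeutic": {"true", "false"},
    "number": {"one", "many"},
}

# Logical operator that each value of a role attribute corresponds to.
OPERATOR = {"true": "", "false": "¬", "putative": "◇"}


@dataclass
class Mention:
    text: str                      # the marked-up span
    cl: str                        # rigid class label (value of the cl attribute)
    attrs: dict = field(default_factory=dict)  # role and number attributes

    def validate(self):
        """Return a list of schema violations (empty if the mention is valid)."""
        problems = []
        for name, value in self.attrs.items():
            if name not in APPLICABLE:
                problems.append(f"unknown attribute: {name}")
            elif self.cl not in APPLICABLE[name]:
                problems.append(f"{name} does not apply to {self.cl}")
            elif value not in ALLOWED[name]:
                problems.append(f"bad value for {name}: {value}")
        return problems


def logical_form(role, value, constant):
    """Render a role attribute value as the logical description used in the text."""
    return f"{OPERATOR[value]}{role}({constant})"


suspect = Mention("a Taiwanese", "PERSON", {"case": "putative", "number": "one"})
print(suspect.validate())                       # []
print(logical_form("case", "putative", "j"))    # ◇case(j)
```

Keeping cl restricted to rigid classes and expressing roles as attributes means a mention such as 'birds' always has the class NON_HUMAN, while its status as a source of infection varies only through the transmission attribute.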

Figure 2 Current ontology (simplified; the diagram marks the case, number, transmission and therapeutic attributes)

Figure 3 Annotation schedule (stages in order: 1. Annotation of Type (rigid) concepts; 2. Annotation of Role (non-rigid) concepts; 3. Coreference annotation; 4. Event annotation)

5.4 Results of annotation and NE recognizer training

We asked three PhD students to annotate a further 300 news articles. This time we used the revised annotation stages 1 and 2 shown in Figure 3.

As a result of distinguishing Role concepts (case, transmission, therapeutic) from others in the annotation schema, problem reports on these classes were reduced, and the annotation results were also improved. Contrary to our expectations, the complexity of the new annotation schema and the increased number of markable mentions seemed to have no negative influence on the annotators' speed.

The improvement can be seen empirically in the NER results. We re-annotated the corpus used in the first experiment using the revised annotation schema. This time the F-score for all classes rose to 79.96 (+3 compared to the previous result). In particular, significant increases of the F-score were observed in the classes for PERSON (66.28; +11.33 compared to the previous result), case mentions among PERSON (65.63; +12.46), and NON_HUMAN (73.21; +5.21).

5.5 Remaining issues

Some of the problems reported in this second experiment were related to context dependency (anti-rigidity, situation dependency) discussed in Section 5.2.

The most difficult class seemed to be CONTROL (control measures to lower the risk of diseases). As shown in Table 3, we consider this class is also non-rigid, and it includes mentions which refer to subclasses of the CONTROL class regardless of situation ("quarantine", "vaccination"), and others which can be a control measure depending on the situation ("warning", "blockade"). This characteristic seems to cause the difficulty.

So far we have resolved the complexity of non-rigid concepts by defining attributes which apply to instances of rigid classes (e.g. the case attribute for the class PERSON). This strategy, however, does not seem to be effective for CONTROL since it is not easy to identify a rigid superclass for CONTROL which can be realistically annotated in the text. For example, EVENT can be considered as a rigid class subsuming CONTROL, but currently it is not realistic to manually annotate every mention of an event. Currently we are seeking a way to deal with this problem.

6. CONCLUSION

The study in this paper was motivated by our need for a high quality annotation schema to support detection of novel entities in the infectious disease outbreak domain. We discussed two experiments based on alternative approaches for constructing an ontology-based annotation schema. The amount of data in our study is relatively small but empirical results indicate support for our view that there is a positive effect in adopting well founded ontological principles over an ad-hoc task-based approach.

Although this study is not a formal evaluation of ontologies, it is still an evaluation from the viewpoint of ontology application to the task of natural language annotation. The classification method of Guarino and Welty ([9], [10]), which was originally proposed to achieve consistency in the configurational structure of ontologies, was adapted and found to be useful for improving annotation performance.

An alternative possibility exists which we have not addressed in this paper, which is to reformulate the traditional NER task to allow for overlapping (nested) and multi-class entities. This however introduces
significant additional complications in both the recognizer models and in the annotation schema so we have adopted a less radical formulation in this work.

As the next step in this study, we are now extending our simple taxonomy to a multi-lingual ontology, enriching the current taxonomic structure with domain-sensitive relations. The resulting ontology will be freely available for re-use. At the initial stage we are focusing on English, Japanese, Vietnamese, Thai, Chinese (standard) and Korean. We hope to add other Asia-Pacific languages in the future.

Acknowledgements

We gratefully acknowledge partial funding support from the Japan Society for the Promotion of Science (grant no. 18049071). We also thank the anonymous reviewers for helpful comments.

References

1. Ferguson NM, Cummings DA, Cauchemez S, Fraser C, Riley S, et al. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature 437: 209–214. 2005.

2. Grishman R, Huttunen S, and Yangarber R. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, Vol. 35, No. 4, 236-246, 2002.

3. Public Health Agency of Canada. GPHIN system. http://www.phac-aspc.gc.ca/media/nr-rp/2004/2004_gphin-rmispbk_e.html

4. Aronson A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of AMIA Symposium, 17–21, 2001.

5. Rindflesch T.C., Tanabe L, Weinstein J.N. and Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Proceedings of Pacific Symposium on Biocomputing 5:514-525, 2000.

6. Kim J.D., Ohta T, Tsuruoka Y, Tateishi Y, Collier N. Introduction to the Bio-entity Recognition Task of the JNLPBA workshop. Proceedings of the JNLPBA, 70-76, 2004.

7. Yeh A, Morgan A, Colosimo M, Hirschman L. BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2.

8. Sowa J.F. Conceptual structures: Information processing in mind and machine. Addison-Wesley, New York; 1984.

9. Guarino N, Welty C. A formal ontology of properties. Dieng R, Corby O (eds.) Proceedings of EKAW-2000: The 12th International Conference on Knowledge Engineering and Knowledge Management, volume 1937: 97-112.

10. Guarino N, Welty C. Ontological analysis of taxonomic relations. Lander A, Storey V (eds.) Proceedings of ER-2000: The International Conference on Conceptual Modeling, vol. 1920, 210-224, Springer Verlag LNCS, Berlin, Germany.

11. Steimann F. On the representation of roles in object-oriented and conceptual modelling. Data and Knowledge Engineering 35, 1: 83-106. 2000.

12. U.S. National Library of Medicine. Medical Subject Headings (MeSH), 2006.

13. Kim J.D., Ohta T, Tateishi Y, Tsujii J. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl. 1), pp. i180-i182, Oxford University Press, 2003.

14. Hirschman L, Chinchor N. MUC-7 named entity task definition. Proceedings of the 7th Message Understanding Conference (MUC-7).

15. Hirschman L, Chinchor N, Grishman R, Sundheim B. Hub-4 Event Guidelines Version 2.6. http://www-nlpir.nist.gov/related_projects/muc/proceedings/hub4/guidelines.html

16. Vapnik, V. N. The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.

17. Takeuchi, K and Collier, N. "Bio-medical entity extraction using support vector machines", in vol. 33, no.2, Artificial Intelligence in Medicine, Elsevier, pp. 125-137, 2005.

18. Kaneiwa K, Mizoguchi, R. An order-sorted quantified modal logic for meta-ontology. Proc. of the International Conference on Automated Reasoning with Analytic Tableaux and Related Methods (TABLEAUX 2005), Koblenz, Germany: 169-184, 2005.

19. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L. Sweetening ontologies with DOLCE. Benjamins et al. (eds.), Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management (EKAW2002), 166-181, Sigüenza, Spain, 2002.

20. Davidson D. The Individuation of events. Rescher N (ed) Essays in Honor of Carl G. Hempel: 216-234, 1969, D. Reidel.