=Paper= {{Paper |id=Vol-222/paper-9 |storemode=property |title=The Development of a Schema for the Annotation of Terms in the Biocaster Disease Detecting/Tracking System |pdfUrl=https://ceur-ws.org/Vol-222/krmed2006-p09.pdf |volume=Vol-222 |dblpUrl=https://dblp.org/rec/conf/krmed/KawazoeJSBTC06 }} ==The Development of a Schema for the Annotation of Terms in the Biocaster Disease Detecting/Tracking System== https://ceur-ws.org/Vol-222/krmed2006-p09.pdf
KR-MED 2006 "Biomedical Ontology in Action"
November 8, 2006, Baltimore, Maryland, USA


The development of a schema for the annotation of terms in the BioCaster disease detecting/tracking system

Ai Kawazoe*1, Ph.D., Lihua Jin*1, Ph.D., Mika Shigematsu*3, M.D.,
Roberto Barrero*2, Ph.D., Kiyosu Taniguchi*3, M.D., Nigel Collier*1, Ph.D.

*1 National Institute of Informatics, Hitotsubashi 2-1-2 Chiyoda-ku Tokyo, JAPAN
*2 National Institute of Genetics, Yata 1111 Mishima Shizuoka, JAPAN
*3 National Institute of Infectious Diseases, Toyama 1-23-1 Shinjuku-ku Tokyo, JAPAN

*1 {zoeai,lihua-jin,collier}@nii.ac.jp, *2 rbarrero@genes.nig.ac.jp, *3 {mikas,tanigk}@nih.go.jp

Amid growing public concern about the spread of infectious diseases such as avian influenza and SARS, there is an increasing need for collecting timely and reliable information about disease outbreaks from natural language data such as online news articles. In this paper we introduce BioCaster, a text mining-based system for infectious disease detection and tracking currently being developed, and discuss the development of a domain ontology and schema for the annotation of terms. In particular we focus on the comparison between two approaches, 1) a traditional task-oriented approach with a simple schema that does not strictly follow ontological principles, and 2) a formal approach which is ontologically well-founded but adds extra requirements to the annotation schema. We report on several critical problems that were highlighted by an entity annotation experiment, attributable to the purely task-oriented ontology design. A second experiment based on a formally constructed ontology produced improved annotation results despite the apparent complexity of the annotation schema.

1. INTRODUCTION

As shown by the recent outbreak of Severe Acute Respiratory Syndrome (SARS) and emerging cases of avian influenza, infectious diseases have the potential to spread rapidly through person-to-person transmission within densely populated areas and across country borders through international air travel. The first line of defense against rapidly spreading diseases is surveillance, led by the World Health Organization (WHO) and national health authorities. Catching an outbreak earlier has clear implications for both morbidity and mortality as well as the feasibility of containment [1]. However, a lack of surveillance system infrastructure in Southeast Asia, which is currently the focus of an avian H5N1 epidemic, is seen as hindering control efforts. In addition to traditional surrogate methods such as reporting notifiable diseases and over-the-counter (OTC) sales monitoring, public health experts are increasingly considering news and other reports available on the World Wide Web (Web) as a cost-effective means of helping to find and track early cluster cases, enabling a timely and appropriate response. Such rumour-based information may be of particular value for assessing possible outbreaks in areas where formal reporting procedures are absent or not well established.

Several major challenges exist in locating Web-based information in a timely manner using traditional search methods: (1) the massively increasing volume of dynamically changing unstructured news data available on the Web makes it extremely difficult to obtain a clear picture of an outbreak in a timely manner, (2) the large-scale republication of reports from centralized news agencies requires redundancy to be identified and removed, (3) the initial reports of an outbreak are contained in only a few news articles which will usually be overlooked by traditional search engines which use keyword indexing, (4) the first reports of an infectious disease will often be reported in local news media which are only available in the local language. Experience has shown that this requires computer systems to have at least a partial understanding of the domain through ontologies, term lists and databases as well as specialized multilingual resources.

To address the information needs in the domain of infectious disease outbreaks, standard Information Extraction technology has been adapted for retrospective archive search [2], but only a few systems are currently actively deployed, with the most prominent being the Global Public Health Intelligence Network (GPHIN) [3], a successful but semi-closed system used by the WHO. We are now developing BioCaster, a text mining system based on an openly available multilingual ontology for proactive notification about priority disease outbreaks. A key component of the BioCaster system is the use of automated learning methods to identify novel entities and events using features derived from annotated examples in a multilingual collection of news articles. The initial target languages are English, Japanese, Vietnamese and Thai.

In our early development of BioCaster it became clear that we needed a rigorous schema for markable entities. Since the system relies on high quality human annotated training data for constructing


named entity recognizers (NERs), any inconsistency introduced into the annotation schema by ontological inconsistencies should be harmful for annotation performance, both human and machine. Surprisingly, while there have been several studies on the mapping problem between terms and coding systems such as the UMLS Metathesaurus [4] as well as biomedical annotation experiments [5] [6] [7], there have been, to the best of our knowledge, no studies conducted into the method by which new domain models suitable for biomedical text mining should be organized. We report here on our initial experience, which showed that a task-oriented annotation schema based on a poorly-considered domain ontology can indeed be harmful to accuracy. Re-organizing this schema using well-founded ontological principles produced better results, despite the added complexity.

2. USER NEEDS

Epidemiologists are concerned with the circumstances in which diseases occur in a population and the factors that influence their incidence, spread, recognition and control. Our initial discussions with domain experts at the National Institute of Infectious Diseases revealed several common scenarios for gathering information from Web news, including cases involving the spread of a communicable disease across international borders and the contamination of blood products. From these initial discussions we collected examples of early outbreak news reports and compiled a list of significant entity classes which included DISEASE (footnote 1), CASE, LOCATION, SYMPTOM, TIME, DRUG, etc.

1 We will adopt here the notation of using all upper case for domain entity classes.

Subsequent follow-up discussions and examination of the literature revealed that we can categorize these concepts according to the information needs of the scientists as shown in Table 1.

Genetic epidemiology adds another dimension to the information needs as the genetic makeup of the host plays a key role in determining susceptibility or resistance to pathogens. We therefore chose to add in a further level of detail about the host which includes genes and their products, identified with a §. Finally we had 19 categories of concepts which we want to identify in news texts (Table 2).

3. CONSIDERATION ON TWO APPROACHES

At this stage we were aware that some of the important concepts in Table 2 are contextually-dependent and intrinsically different from others. For example, CASE and TRANSMISSION represent roles (discussed in [8] [9] [10] [11] among others) which are dependent on the existence of events they participate in, while most others, such as PERSON, BACTERIA, and NON_HUMAN, represent types.

We had two options for constructing the ontology and annotation schema, according to how to deal with concepts of a different nature. The first approach is rather task-oriented. Here we do not make any distinction between context-dependent concepts and others. This results in a somewhat simpler ontology: all categories of concepts are represented as classes which follow a disjoint entity class principle that has been the underlying premise of NERs. The corresponding annotation schema will also be simpler, since instances of context-dependent classes are annotated in the same way as those of other classes, e.g.

     Kofi Annan
     a 12 year-old girl infected with H5N1

(The details of this schema will be given in the next section.) In this task-oriented approach, we can annotate exactly what the event frame needs to identify. For example, we can exclude from annotation non-named, non-case mentions, which we are not interested in. A defect of this approach is that it is not ontologically well-founded.

The alternative approach is a more formal one where we make a clear distinction between context-dependent concepts and others, based on well-founded ontological principles. The result is likely to be a more complex ontology in which context-dependent concepts have a different status from other concepts. The corresponding annotation schema will also be more complex, since roles are annotated in a different way from entity classes. In order to achieve ontological consistency we also need to annotate more mentions than in the former approach, including those that will not instantiate event frames.

Of the two approaches above, out of expediency we chose the former for the first annotation experiment. The reason was that it seemed easier for annotators and that we could find almost no precedent works in named entity annotation which dealt with formal analysis of entities and role concepts.

4. ANNOTATION EXPERIMENT 1

4.1 Method
Based on the list of categories of concepts in Table 2, we constructed the ontology shown in Figure 1. Note that CASE and TRANSMISSION, which represent



    Focus          Description                        Example properties                               Concept types
    Agent          Pathogens                          Infectivity, pathogenicity, virulence,           VIRUS, BACTERIA, PARASITE*, FUNGI*
                                                      incubation period, communicability
    Transmission   The delivery or dispersal method   Dermal, oral, respiratory                        TRANSMISSION
    Host           Persons carrying a disease         Age, gender, occupation                          CASE, SYMPTOM, DISEASE, ANATOMY,
                                                                                                       DNA§, RNA§, PROTEIN§
    Environment    Location and climate               Large population centre, enclosed building,      LOCATION, TIME
                                                      mass transport system, rural village
    * Not included in the current schema
    § Genetic level entities

                                                   Table 1 Categorization of concepts




Classes                     Examples                                               Description
ANATOMY                     liver, pancreas, nervous system, HeLa cell             Body parts including tissues and cells
BACTERIA                    Escherichia coli O157, tubercle bacillus               Eubacteria
CASE                        a 35-year-old woman, the third case                    Confirmed cases of diseases
NT_CHEMICAL                 beryllium, organophosphate pesticide                   Chemicals intended for non-therapeutic purposes *1
T_CHEMICAL                  Relenza, immunosuppressive drug, oseltamivir           Chemicals intended for the treatment of diseases*1
CONTROL                     stamping out, screening, vaccination                   Control measures to lower the risk of transmission of a
                                                                                   disease
DISEASE                     H5N1 avian influenza, SARS, cholera                    A deviation in the normal functioning of the host caused
                                                                                   by a persistent agent (pathogen) or some environmental
                                                                                   factor
DNA                         Sp1 site, triple-A, c-jun gene                         Includes the names of DNAs, groups, families, molecules,
                                                                                   domains and regions*2
LOCATION                    Viet Nam, Jakarta, Sumatra Island, Asia                A politically or geographically defined location*3
NON_HUMAN                   civet cats, poultry, flies                             Multi-cell organism other than humans, i.e. "animals"
ORGANIZATION                the Ministry of Health, WHO, Pasteur Institute         Corporate, governmental, or other organizational entity*3
PERSON                      Jean Chretien, Murray McQuigge                         A named person or family
PRODUCT                     botulism antitoxin, Influenza vaccine                  Biological product (e.g. vaccines, immune sera)
PROTEIN                     STAT, RNA polymerase II alpha subunit                  Includes the names of proteins, groups, families,
                                                                                   molecules, complexes and substructures*2
RNA                         IL-2R alpha transcripts, TNF mRNA                      Includes the names of RNAs, groups, families, molecules,
                                                                                   domains and regions*2
SYMPTOM                     cough, fever, dehydration, convulsion                  Alterations in the appearance of a case due to a disease
TIME                        Tue Jan 3, winter, March, since October, 2003          Temporal expressions that can be anchored on a
                                                                                   timeline*4
TRANSMISSION                HIV-tainted blood products, BSE-infected cows          Source of infection
VIRUS                       Ebola virus, HIV                                       Viruses such as HIV, HTLV, EBV *2
Descriptions marked with *1 , *2, *3, *4 are based on those in MeSH [12], GENIA ontology [13], MUC-7 [14], and HUB-4 [15],
respectively.

                                           Table 2 List of classes of markable concepts
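
As an illustrative sketch only, the label set of Table 2 and two constraints stated for the first experiment in Section 4.1 (mentions do not overlap, and each mention carries a single class) could be checked as follows. The `Mention` type, the character-offset representation of spans, and the function name `find_violations` are assumptions made for this example and are not taken from the BioCaster tooling.

```python
from dataclasses import dataclass

# The 19 markable classes listed in Table 2.
MARKABLE_CLASSES = frozenset({
    "ANATOMY", "BACTERIA", "CASE", "NT_CHEMICAL", "T_CHEMICAL", "CONTROL",
    "DISEASE", "DNA", "LOCATION", "NON_HUMAN", "ORGANIZATION", "PERSON",
    "PRODUCT", "PROTEIN", "RNA", "SYMPTOM", "TIME", "TRANSMISSION", "VIRUS",
})


@dataclass(frozen=True)
class Mention:
    """Hypothetical annotated span: [start, end) character offsets plus one class."""
    start: int
    end: int
    cl: str  # mirrors the schema's `cl` attribute (a single class label)


def find_violations(mentions):
    """Return readable problems: unknown class labels and overlapping spans."""
    problems = []
    for m in mentions:
        if m.cl not in MARKABLE_CLASSES:
            problems.append(f"unknown class {m.cl!r} at {m.start}-{m.end}")
    # After sorting by start offset, any overlap shows up between neighbours.
    ordered = sorted(mentions, key=lambda m: (m.start, m.end))
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            problems.append(
                f"overlap between {prev.start}-{prev.end} and {cur.start}-{cur.end}"
            )
    return problems
```

Because each `Mention` holds exactly one `cl` value, the single-class rule is enforced by the data structure itself, while the overlap and label checks are done at validation time.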




roles, have the same status as other classes since we adopted the task-oriented approach as discussed in the last section. We developed annotation guidelines to annotate non-overlapping mentions related to the classes in news articles and hired two PhD informatics students as annotators. After one week of training consisting of guideline review, case study discussions and test cases, we started the annotation process with 200 news articles taken from domain sources, including WHO epidemic reports, IRIN, and Reuters news.

   Figure 1 Initial domain ontology (simplified)

In order to restrict the markable mentions to exactly those that we aimed to identify with the text mining system, we defined CASE as the class of confirmed cases which are unnamed, and PERSON as the class of named persons who are not cases. We considered this would narrow down the number of markable mentions since unnamed mentions for non-cases need not be annotated. We also instructed annotators to mark up only the single most appropriate class, and prohibited multiple classes. An example of annotated text is shown below:

     The Ministry of Health in Indonesia has today confirmed a fatal human case of
     H5N1 avian influenza. A 27-year-old woman from Jakarta developed
     symptoms on 17 September. She contracted the virus from
     close contact with infected birds.

In the annotation schema used in the example above, the attribute cl takes the entity class label as its value. For example, "Kofi Annan" annotated with cl="PERSON" means that the entity mentioned by "Kofi Annan" is related to the class PERSON. The reason for using this rather vague expression is to cover two relations between mentioned entities and the ontology we want to describe. The first is "is an instance of", and the other one is "is a subclass of". Some of the markable texts mention a particular and others mention a universal. For example, names of persons, locations and organizations are usually used to refer to a particular, whereas names of chemical substances, viruses and proteins are often used to refer to universals. This is one of the factors which makes ontology-based annotation a complicated process. It should be noted though that we intend to work towards a clear distinction between the two relations in future work.

4.2 Annotation results and problems
During the first annotation experiment, we had many problem reports from annotators, and found a significant number of inconsistencies in the annotation results. Most of the problems could be traced back to poor design of the domain ontology and the annotation schema. Follow-up analysis on the corpus yielded the following symptoms of error:

•        Gaps in the annotation schema, shown by the existence of mentions of entities which it is desirable to annotate but which the annotation schema does not cover.
•        Ambiguity between context-dependent concepts and context-independent ones.
•        Idiosyncratic annotations which are forced on annotators due to the disjoint entity class principle.

Gaps in the annotation schema
At the initial stage of our analysis we considered that distinguishing CASE (as confirmed cases of a disease which are unnamed humans) from PERSON (named persons who are not cases of a disease) was rather natural, since CASE entities are in general anonymous. However, in the news articles there were some examples where cases were mentioned by name as follows:

E1 Tests carried out in a UK laboratory confirmed

2 In this example we only show initials of the victims' names.

In addition, we found that there were more frequent mentions of putative cases than we had expected.



These mentions were often annotated as CASE by annotators although we restricted the scope of this class only to confirmed cases.

E2 a Taiwanese is suspected to have died of SARS

Follow-up discussions with public health experts revealed that mentions of putative cases are important, especially in the early stages of disease outbreaks, and we concluded that they should be identified by the system. However, the existing framework made them difficult to capture.

Ambiguity caused by context-dependent concepts
One of the classes which confused annotators most was TRANSMISSION (source of infection). Below are typical examples of problematic cases.

E3 Victims contract the virus from close contact with infected birds
E4 There is no known cure for Ebola, which is transmitted via infected body fluids
E5 An Irish woman infected with Hepatitis C by a contaminated blood product
E6 18 hospitalized after consuming chapattis

Annotators had a problem in annotating ‘birds’ in E3 since those can be classified as both TRANSMISSION and NON_HUMAN (animals). ‘Body fluid’ in E4 is also ambiguous between TRANSMISSION and ANATOMY (body parts), and ‘blood product’ in E5 is ambiguous between TRANSMISSION and PRODUCT (biological product). Most of the TRANSMISSION instances found in the text were those which could be categorized as NON_HUMAN, and the cases which belonged only to TRANSMISSION, such as ‘chapattis’ in E6, were very few.

Idiosyncratic annotations due to the disjoint entity class principle

E7 Hudd has written several books on music hall and Variety...
E8 Doctors later diagnosed Hudd with a chest infection...

In the example above, it is clearly undesirable that the same entity is related to PERSON in E7 and CASE in E8. Although the annotator was aware of the choices, the principle of disjoint classes forced a choice.

4.3 Empirical results from training an NER
We trained a support vector machine [13] (for details, see Takeuchi and Collier [14]) for named entity recognition based on the annotated corpus of 200 news articles. 10-fold cross validation experiments were performed using TinySVM (footnote 3). A -2/+1 feature window was used that included surface word, orthography, biomedical prefixes/suffixes, lemma, head noun and previous class predictions. The F-score for all the classes in Table 2 was 76.96. The problematic classes were found to be PERSON, CASE and NON_HUMAN (many instances of which had ambiguity with TRANSMISSION), which had F-scores below our expectation: PERSON (54.95), CASE (53.17), NON_HUMAN (68.0).

3 Available from http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM

5. ANNOTATION EXPERIMENT 2

5.1 Re-examination of the approach
Although we chose the task-oriented approach for its simplicity and ease of implementation, the results from automatic NER and subsequent corpus analysis revealed that problems arose because we made no clear distinction between context-dependent and context-independent classes. We decided to take an alternative, formal and linguistically-sound approach, and distinguish context-dependent concepts from others in both the ontology and the annotation schema.

5.2 Classification of concepts
The first step was to use the classification method proposed by Guarino and Welty ([9] and [10]), which is based on meta-properties (rigidity, identity, dependency), in order to classify the categories of concepts in Table 2. Definitions of the meta-properties we used are as follows:

([10], p.4)
rigid property φ(+R): ∀x φ(x) → □φ(x)
anti-rigid property φ(~R): ∀x φ(x) → ¬□φ(x)
([10], p.5)
Identity Condition (IC): An identity condition is a formula Γ that satisfies either of the following (footnote 4):

4 In [9], further restrictions are added in order to avoid 1) the case where the necessary IC definition becomes trivially true regardless of the truth value of the formula x=y and 2) the case where Γ(x, y, t, t') is false and that makes the sufficient IC definition trivially true.



                          rigidity                 identity (supplying)    identity (carrying)     dependency               classification
ANATOMY                    +R                        +O                         +I                      -D                  Type
BACTERIA                   +R                        +O                         +I                      -D                  Type
CASE                       ~R                        -O                         +I                      +D                  Material Role
NT_CHEMICAL                ~R                        -O                         +I                      +D                  Material Role
T_CHEMICAL                 ~R                        -O                         +I                      +D                  Material Role
CONTROL                    ~R *1                     -O *2                      +I                      +D                  Material Role
DISEASE                    +R                        +O*3                       +I                      +D                  Type
DNA                        +R                        +O                         +I                      -D                  Type
LOCATION                   +R                        +O                         +I                      -D                  Type
NON_HUMAN                  +R                        +O                         +I                      -D                  Type
ORGANIZATION               +R                        +O                         +I                      -D                  Type
PERSON                     +R                        +O                         +I                      -D                  Type
PRODUCT                    +R                        +O                         +I                      +D                  Type
PROTEIN                    +R                        +O                         +I                      -D                  Type
RNA                        +R                        +O                         +I                      -D                  Type
SYMPTOM                    +R                        +O                         +I                      +D                  Type
TIME                       +R                        +O                         +I                      -D                  Type
VIRUS                      +R                        +O                         +I                      -D                  Type
TRANSMISSION               ~R                        -O                         -I                      +D                  Formal Role
*1 We consider that this class is anti-rigid, since it is possible that an action which is an instance of CONTROL in the current world is not an
instance of CONTROL in some other accessible world. The same action may be conducted for different purposes in different worlds.
*2 This class includes events. In the DOLCE top-level categories (Gangemi et al. [19]), events fall under the class Perdurant/Occurrence. What
the identity condition for events should be is controversial; Davidson [20] proposes that "events are identical if and only if they have
exactly the same causes and effects". In any case, it is reasonable to assume that this class does not itself supply ICs but inherits them
from its upper-level classes.
*3 We take the IC for this class to be the following: two instances of a disease are identical iff they are experienced by the same host at
the same time, are caused by the same agent (e.g. H5N1 virus for "H5N1 avian influenza"), and have the same set of characteristic
alterations/symptoms (e.g. inflammation of the lung for "pneumonia").

                                                    Table 3: Classification of concepts
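
The +/- columns of Table 3 can be read as a small decision rule. The following Python sketch is only an illustration and not part of the BioCaster system: the names MetaProps and classify are hypothetical, and the meta-property combination for Material Role (~R, -O, +I, +D) is assumed from the property taxonomy of Guarino and Welty [10].

```python
from typing import NamedTuple


class MetaProps(NamedTuple):
    """Meta-properties of a concept, in the column order of Table 3 (hypothetical type)."""
    rigidity: str      # "+R" rigid, "-R" non-rigid, "~R" anti-rigid
    supplies_ic: bool  # +O / -O
    carries_ic: bool   # +I / -I
    dependent: bool    # +D / -D


def classify(p: MetaProps) -> str:
    """Map a meta-property tuple to the category names used in Table 3."""
    # Types are rigid and supply (hence carry) an identity condition.
    if p.rigidity == "+R" and p.supplies_ic and p.carries_ic:
        return "Type"
    # Roles are anti-rigid, dependent, and do not supply their own IC.
    # Assumption: a role that carries an IC is Material, otherwise Formal.
    if p.rigidity == "~R" and p.dependent and not p.supplies_ic:
        return "Material Role" if p.carries_ic else "Formal Role"
    return "Unclassified"  # combinations not covered by this sketch


VIRUS = MetaProps("+R", True, True, False)          # row VIRUS in Table 3
TRANSMISSION = MetaProps("~R", False, False, True)  # row TRANSMISSION in Table 3
print(classify(VIRUS), "|", classify(TRANSMISSION))  # prints "Type | Formal Role"
```

Combinations outside these three patterns are returned as Unclassified rather than guessed.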

necessary IC: E(x, t)∧φ(x, t)∧E(y, t')∧φ(y, t')∧x=y →Γ(x, y, t, t')
sufficient IC: E(x, t)∧φ(x, t)∧E(y, t')∧φ(y, t')∧Γ(x, y, t, t') →x=y
     (E : "actually exists at time t")
Any property φ carries an IC (+I) iff it is subsumed by a property supplying that IC.
A property φ supplies an IC (+O) iff i) it is rigid; ii) there is a necessary or sufficient IC for it; and iii)
the same IC is not carried by all the properties subsuming φ.
 ([10], p.7)
externally dependent property φ (+D):
∀x□(φ(x) →∃y ω(y) ∧¬P(y, x) ∧¬C(y, x))
     (P: "is a part of")
     (C: "is a constituent of")

Classification results are shown in Table 3. Most concepts such as ANATOMY, NON_HUMAN, and
PERSON are classified as Type, whereas the concepts which were problematic in the first
experiment were classified as Role: TRANSMISSION (Formal Role) and CASE (Material Role).
According to the further classification of non-rigid concepts by Kaneiwa and Mizoguchi [18],
these cases are classified as time-dependent concepts.

5.3 Modification of the schema
For some of the roles in Table 3, we modified their status in the annotation schema.

CASE
CASE and PERSON were problematic since we distinguished them according to the form of
expression (unnamed/named), in addition to the case/non-case distinction. In order to cover the
mentions which could not be annotated in the first experiment, we extended the scope of the PERSON
class to include person instances in general, and eliminated the unnamed/named and case/non-case
distinctions. We modified the annotation schema so that CASE is no longer a value of the cl attribute
but a separate case attribute which applies to the referred instance of PERSON. This attribute takes
the value true when the mentioned instance is a confirmed case of disease,
false when the instance is not a case, and putative when the instance is a suspected case. Named case
mentions and suspected case mentions are annotated as follows:

E9 Tests carried out in a UK laboratory confirmed that M.A...

E10 a Taiwanese is suspected to have died of SARS

The meaning of case attribute-value pairs can be described in logical notation and natural language
as follows:

<...cl="PERSON" case="true">John: case(j)
"It is true that the person j mentioned by "John" is an instance of the CASE class"

<...cl="PERSON" case="false">John: ¬case(j)
"It is false that the person j mentioned by "John" is an instance of the CASE class"

<...cl="PERSON" case="putative">John: ◇case(j)
"It is possible that the person j mentioned by "John" is an instance of the CASE class"

As shown above, the values of the case attribute correspond to logical operators such as ¬ and ◇.
The values of the case attribute specify the modes of linkage between the referred concept and the CASE
class. The formal basis we had in mind when formulating the case attribute is as follows: 1) every
instance of a non-rigid class must be an instance of some rigid class, and 2) the relations between a
non-rigid class and its instances are often modified by modal/temporal operators. The first point drove
us to create the case attribute, which applies to instances of some rigid class, here PERSON. The second
point motivated us to include negative and modal operators among its values. This schema can be
extended by allowing a wider value range for the case attribute to include other modal/temporal
operators, although currently we restrict the values to the three above.
  It is worth noting that the revised schema involves a trade-off relative to the former schema: the
number of markable entities has increased, since we now need to annotate unnamed, non-case mentions
which are not directly related to the purpose of the system.

TRANSMISSION
We defined the transmission attribute which applies to mentions of ANATOMY, PRODUCT, PERSON
and NON_HUMAN classes. As shown in the following examples, 'birds' are always related to
NON_HUMAN, and take a 'true' value only when ... also take a 'putative' value to cover mentions of
possible sources of infection.

E11 Victims contract the virus from close contact with infected birds

T_CHEMICAL / NT_CHEMICAL
Concept classification revealed that T_CHEMICAL and NT_CHEMICAL have "the situation dependency
obtained from extending types" discussed in [18] and have the same status as 'weapon' and 'table'.
T_CHEMICAL includes chemicals mentioned as drugs in any context and those regarded as drugs in
some context. Here we removed the two classes and made their parent node CHEMICAL the class used
for annotation.
  We then defined the therapeutic attribute, which applies to mentions of CHEMICAL and takes the
value true when the entity is intended for therapeutic use and false otherwise.
  The modifications above result in the revised ontology shown in Figure 2. We also added the new
classes CONDITION (status of patients: 'hospitalized', 'died', 'in critical condition', etc.) and
OUTBREAK (collective disease incident: 'outbreak', 'pandemic', etc.). Information about CONDITION is
important for experts to know the rate of hospitalization and death and determine the alert
level. Mentions of OUTBREAK include expressions which are specific to disease outbreak news,
increasing the specificity of our detection system. We located PERSON and NON_HUMAN under metazoa,
and added a number attribute (which takes one or many as its value) to be applied to PERSON
instances.
  With insights from the revised ontology we also changed the annotation method by dividing the
process into two distinct stages as shown in Figure 3: 1) annotation of mentions of non-role (rigid)
concepts and 2) annotation of role (non-rigid) concepts.
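
The reading of the three case values given in Section 5.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the annotation tool used in the study: the element name NAME is hypothetical (the source elides it as "<..."), and the function case_reading and the table CASE_READINGS are introduced here only for the example.

```python
import xml.etree.ElementTree as ET

# Logical reading of each permitted value of the case attribute (Section 5.3).
CASE_READINGS = {
    "true": "case({v})",
    "false": "¬case({v})",
    "putative": "◇case({v})",
}


def case_reading(markup: str, var: str = "j") -> str:
    """Return the logical reading of one annotated PERSON mention."""
    elem = ET.fromstring(markup)
    # The case attribute is defined only for instances of the rigid class PERSON.
    if elem.get("cl") != "PERSON":
        raise ValueError("the case attribute applies only to PERSON mentions")
    value = elem.get("case")
    if value not in CASE_READINGS:
        raise ValueError(f"unsupported case value: {value!r}")
    return CASE_READINGS[value].format(v=var)


# NAME is a hypothetical element name; only the cl and case attributes come from the text.
print(case_reading('<NAME cl="PERSON" case="putative">John</NAME>'))  # prints "◇case(j)"
```

Keeping the readings in a single table mirrors the remark that the schema could later admit further modal or temporal operators: a new value would be one more entry rather than a new class.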




       Figure 2 Current ontology (simplified)
       (attribute labels in the figure: therapeutic, case, number, transmission)

       Figure 3 Annotation schedule
       (stages, in order: 1. Annotation of Type (rigid) concepts; 2. Annotation of Role (non-rigid)
       concepts; 3. Coreference annotation; 4. Event annotation)

5.4 Results of annotation and NE recognizer training
We asked three PhD students to annotate a further 300 news articles. This time we used stages 1 and 2
of the revised annotation method shown in Figure 3.
  As a result of distinguishing Role concepts (case, transmission, therapeutic) from others in the
annotation schema, problem reports on these classes were reduced, and the annotation results also
improved. Contrary to our expectations, the complexity of the new annotation schema and the
increased number of markable mentions seemed to have no negative influence on the annotators' speed.
  The improvement can be seen empirically in the NER results. We re-annotated the corpus used in the
first experiment using the revised annotation schema. This time the F-score for all classes rose to
79.96 (+3 compared to the previous result). In particular, significant increases in the F-score were
observed in the classes for PERSON (66.28; +11.33 compared to the previous result), case mentions
among PERSON (65.63; +12.46), and NON_HUMAN (73.21; +5.21).

5.5 Remaining issues
Some of the problems reported in this second experiment were related to context dependency
(anti-rigidity, situation dependency) discussed in Section 6.2.
  The most difficult class seemed to be CONTROL (control measures to lower the risk of diseases). As
shown in Table 3, we consider this class to be non-rigid as well; it includes mentions which refer to
subclasses of the CONTROL class regardless of situation ("quarantine", "vaccination"), and others
which can be a control measure depending on the situation ("warning", "blockade"). This characteristic
seems to cause the difficulty.
  So far we have resolved the complexity of non-rigid concepts by defining attributes which apply to
instances of rigid classes (e.g. the case attribute for the class PERSON). This strategy, however, does
not seem to be effective for CONTROL since it is not easy to identify a rigid superclass for CONTROL
which can be realistically annotated in the text. For example, EVENT can be considered as a rigid class
subsuming CONTROL, but currently it is not realistic to manually annotate every mention of an
event. We are currently seeking a way to deal with this problem.

6. CONCLUSION
The study in this paper was motivated by our need for a high quality annotation schema to support
detection of novel entities in the infectious disease outbreak domain. We discussed two experiments
based on alternative approaches for constructing an ontology-based annotation schema. The amount of
data in our study is relatively small, but the empirical results support our view that there is a
positive effect in adopting well-founded ontological principles over an ad-hoc task-based approach.
Although this study is not a formal evaluation of ontologies, it is still an evaluation from the viewpoint
of ontology application to the task of natural language annotation. The classification method of
Guarino and Welty ([9], [10]), which was originally proposed to achieve consistency in the
configurational structure of ontologies, was adapted and found to be useful for improving annotation
performance.
  An alternative possibility which we have not addressed in this paper is to reformulate the
traditional NER task to allow for overlapping (nested) and multi-class entities. This however introduces
significant additional complications in both the recognizer models and in the annotation schema, so
we have adopted a less radical formulation in this work.
  As the next step in this study, we are now extending our simple taxonomy to a multi-lingual
ontology, enriching the current taxonomic structure with domain-sensitive relations. The resulting
ontology will be freely available for re-use. At the initial stage we are focusing on English, Japanese,
Vietnamese, Thai, Chinese (standard) and Korean. We hope to add other Asia-Pacific languages in the
future.

Acknowledgements
We gratefully acknowledge partial funding support from the Japan Society for the Promotion of Science
(grant no. 18049071). We also thank the anonymous reviewers for helpful comments.

References
1.   Ferguson NM, Cummings DA, Cauchemez S, Fraser C, Riley S, et al. Strategies for containing
     an emerging influenza pandemic in Southeast Asia. Nature 437: 209–214. 2005.
2.   Grishman R, Huttunen S, and Yangarber R. Information extraction for enhanced access to
     disease outbreak reports. Journal of Biomedical Informatics, Vol. 35, No. 4, 236-246, 2002.
3.   Public Health Agency of Canada. GPHIN system.
     http://www.phac-aspc.gc.ca/media/nr-rp/2004/2004_gphin-rmispbk_e.html
4.   Aronson A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap
     program. Proceedings of AMIA Symposium, 17–21, 2001.
5.   Rindflesch T.C., Tanabe L, Weinstein J.N. and Hunter L. EDGAR: extraction of drugs, genes
     and relations from the biomedical literature. Proceedings of Pacific Symposium on
     Biocomputing 5:514-525, 2000.
6.   Kim J.D., Ohta T, Tsuruoka Y, Tateishi Y, Collier N. Introduction to the Bio-entity
     Recognition Task of the JNLPBA workshop. Proceedings of the JNLPBA, 70-76, 2004.
7.   Yeh A, Morgan A, Colosimo M, Hirschman L. BioCreAtIvE task 1A: gene mention finding
     evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2.
8.   Sowa J.F. Conceptual structures: Information processing in mind and machine.
     Addison-Wesley, New York; 1984.
9.   Guarino N, Welty C. A formal ontology of properties. Dieng R, Corby O (eds.) Proceedings
     of EKAW-2000: The 12th International Conference on Knowledge Engineering and
     Knowledge Management, volume 1937: 97-112.
10.  Guarino N, Welty C. Ontological analysis of taxonomic relations. Lander A, Storey V (eds.)
     Proceedings of ER-2000: The International Conference on Conceptual Modeling, vol. 1920,
     210-224, Springer Verlag LNCS, Berlin, Germany.
11.  Steimann F. On the representation of roles in object-oriented and conceptual modelling. Data
     and Knowledge Engineering 35, 1: 83-106. 2000.
12.  U.S. National Library of Medicine. Medical Subject Headings (MeSH), 2006.
13.  Kim J.D., Ohta T, Tateishi Y, Tsujii J. GENIA corpus - a semantically annotated corpus for
     bio-textmining. Bioinformatics 19(suppl. 1), pp. i180-i182, Oxford University Press, 2003.
14.  Hirschman L, Chinchor N. MUC-7 named entity task definition. Proceedings of the 7th Message
     Understanding Conference (MUC-7).
15.  Hirschman L, Chinchor N, Grishman R, Sundheim B. Hub-4 Event Guidelines Version 2.6.
     http://www-nlpir.nist.gov/related_projects/muc/proceedings/hub4/guidelines.html
16.  Vapnik, V. N. The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
17.  Takeuchi, K and Collier, N. "Bio-medical entity extraction using support vector machines", in vol.
     33, no.2, Artificial Intelligence in Medicine, Elsevier, pp. 125-137, 2005.
18.  Kaneiwa K, Mizoguchi, R. An order-sorted quantified modal logic for meta-ontology. Proc.
     of the International Conference on Automated Reasoning with Analytic Tableaux and Related
     Methods (TABLEAUX 2005), Koblenz, Germany: 169-184, 2005.
19.  Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L. Sweetening ontologies with
     DOLCE. Benjamins et al. (eds.), Proceedings of the 13th European Conference on Knowledge
     Engineering and Knowledge Management (EKAW2002), 166-181, Sigüenza, Spain, 2002.
20.  Davidson D. The Individuation of events. Rescher N (ed) Essays in Honor of Carl G.
     Hempel: 216-234, 1969, D. Reidel.