KR-MED 2006 "Biomedical Ontology in Action"
November 8, 2006, Baltimore, Maryland, USA

The development of a schema for the annotation of terms in the BioCaster disease detecting/tracking system

Ai Kawazoe*1, Ph.D., Lihua Jin*1, Ph.D., Mika Shigematsu*3, M.D., Roberto Barrero*2, Ph.D., Kiyosu Taniguchi*3, M.D., Nigel Collier*1, Ph.D.

*1 National Institute of Informatics, Hitotsubashi 2-1-2 Chiyoda-ku Tokyo, JAPAN
*2 National Institute of Genetics, Yata 1111 Mishima Shizuoka, JAPAN
*3 National Institute of Infectious Diseases, Toyama 1-23-1 Shinjuku-ku Tokyo, JAPAN
*1 {zoeai,lihua-jin,collier}@nii.ac.jp, *2 rbarrero@genes.nig.ac.jp, *3 {mikas,tanigk}@nih.go.jp

Amid growing public concern about the spread of infectious diseases such as avian influenza and SARS, there is an increasing need for collecting timely and reliable information about disease outbreaks from natural language data such as online news articles. In this paper we introduce BioCaster, a text mining-based system for infectious disease detection and tracking currently being developed, and discuss the development of a domain ontology and schema for the annotation of terms. In particular we focus on the comparison between two approaches: 1) a traditional task-oriented approach with a simple schema that does not strictly follow ontological principles, and 2) a formal approach which is ontologically well-founded but adds extra requirements to the annotation schema. We report on several critical problems that were highlighted by an entity annotation experiment and that are attributable to the purely task-oriented ontology design. A second experiment based on a formally constructed ontology produced improved annotation results despite the apparent complexity of the annotation schema.

1. INTRODUCTION

As shown by the recent outbreak of Severe Acute Respiratory Syndrome (SARS) and emerging cases of avian influenza, infectious diseases have the potential to spread rapidly through person-to-person transmission within densely populated areas and across country borders through international air travel. The first line of defense against rapidly spreading diseases is surveillance, led by the World Health Organization (WHO) and national health authorities. Catching an outbreak earlier has clear implications for both morbidity and mortality as well as the feasibility of containment [1]. However, a lack of surveillance system infrastructure in Southeast Asia, which is currently the focus of an avian H5N1 epidemic, is seen as hindering control efforts. In addition to traditional surrogate methods such as reporting notifiable diseases and over-the-counter (OTC) sales monitoring, public health experts are increasingly considering news and other reports available on the World Wide Web (Web) as a cost-effective means of helping to find and track early cluster cases, enabling a timely and appropriate response. Such rumour-based information may be of particular value for assessing possible outbreaks in areas where formal reporting procedures are absent or not well established.

Several major challenges exist in locating Web-based information in a timely manner using traditional search methods: (1) the massively increasing volume of dynamically changing unstructured news data available on the Web makes it extremely difficult to obtain a clear picture of an outbreak in a timely manner, (2) the large-scale republication of reports from centralized news agencies requires redundancy to be identified and removed, (3) the initial reports of an outbreak are contained in only a few news articles which will usually be overlooked by traditional search engines which use keyword indexing, (4) the first reports of an infectious disease will often be reported in local news media which are only available in the local language. Experience has shown that this requires computer systems to have at least a partial understanding of the domain through ontologies, term lists and databases as well as specialized multilingual resources.

To address the information needs in the domain of infectious disease outbreaks, standard Information Extraction technology has been adapted for retrospective archive search [2], but only a few systems are currently actively deployed, the most prominent being the Global Public Health Intelligence Network (GPHIN) [3], a successful but semi-closed system used by the WHO. We are now developing BioCaster, a text mining system based on an openly available multilingual ontology for proactive notification about priority disease outbreaks. A key component of the BioCaster system is the use of automated learning methods to identify novel entities and events using features derived from annotated examples in a multilingual collection of news articles. The initial target languages are English, Japanese, Vietnamese and Thai.

In our early development of BioCaster it became clear that we needed a rigorous schema for markable entities. Since the system relies on high quality human annotated training data for constructing named entity recognizers (NERs), any inconsistency introduced into the annotation schema by ontological inconsistencies should be harmful for annotation performance, both human and machine. Surprisingly, while there have been several studies on the mapping problem between terms and coding systems such as the UMLS Metathesaurus [4] as well as biomedical annotation experiments [5] [6] [7], there have been to the best of our knowledge no studies conducted into the method by which new domain models suitable for biomedical text mining should be organized. We report here on our initial experience, which showed that a task-oriented annotation schema based on a poorly-considered domain ontology can indeed be harmful to accuracy. Re-organizing this schema using well founded ontological principles produced better results, despite the added complexity.

2. USER NEEDS

Epidemiologists are concerned with the circumstances in which diseases occur in a population and the factors that influence their incidence, spread, recognition and control. Our initial discussions with domain experts at the National Institute of Infectious Diseases revealed several common scenarios for gathering information from Web news, including cases involving the spread of a communicable disease across international borders and the contamination of blood products. From these initial discussions we collected examples of early outbreak news reports and compiled a list of significant entity classes which included DISEASE^1, CASE, LOCATION, SYMPTOM, TIME, DRUG, etc. Subsequent follow up discussions and examination of the literature revealed that we can categorize these concepts according to the information needs of the scientists as shown in Table 1.

^1 We will adopt here the notation of using all upper case for domain entity classes.

Genetic epidemiology adds another dimension to the information needs, as the genetic makeup of the host plays a key role in determining susceptibility or resistance to pathogens. We therefore chose to add a further level of detail about the host which includes genes and their products, identified with a §. Finally we had 19 categories of concepts which we want to identify in news texts (Table 2).

| Focus | Description | Example properties | Concept types |
|---|---|---|---|
| Agent | Pathogens | Infectivity, pathogenicity, virulence, incubation period, communicability | VIRUS, BACTERIA, PARASITE*, FUNGI* |
| Transmission | The delivery or dispersal method | Dermal, oral, respiratory | TRANSMISSION |
| Host | Persons carrying a disease | Age, gender, occupation | CASE, SYMPTOM, DISEASE, ANATOMY, DNA§, RNA§, PROTEIN§ |
| Environment | Location and climate | Large population centre, enclosed building, mass transport system, rural village | LOCATION, TIME |

* Not included in the current schema
§ Genetic level entities

Table 1 Categorization of concepts

| Classes | Examples | Description |
|---|---|---|
| ANATOMY | liver, pancreas, nervous system, HeLa cell | Body parts including tissues and cells |
| BACTERIA | Escherichia coli O157, tubercle bacillus | Eubacteria |
| CASE | a 35-year-old woman, the third case | Confirmed cases of diseases |
| NT_CHEMICAL | beryllium, organophosphate pesticide | Chemicals intended for non-therapeutic purposes*1 |
| T_CHEMICAL | Relenza, immunosuppressive drug, oseltamivir | Chemicals intended for the treatment of diseases*1 |
| CONTROL | stamping out, screening, vaccination | Control measures to lower the risk of transmission of a disease |
| DISEASE | H5N1 avian influenza, SARS, cholera | A deviation in the normal functioning of the host caused by a persistent agent (pathogen) or some environmental factor |
| DNA | Sp1 site, triple-A, c-jun gene | Includes the names of DNAs, groups, families, molecules, domains and regions*2 |
| LOCATION | Viet Nam, Jakarta, Sumatra Island, Asia | A politically or geographically defined location*3 |
| NON_HUMAN | civet cats, poultry, flies | Multi-cell organism other than humans, i.e. "animals" |
| ORGANIZATION | the Ministry of Health, WHO, Pasteur Institute | Corporate, governmental, or other organizational entity*3 |
| PERSON | Jean Chretien, Murray McQuigge | A named person or family |
| PRODUCT | botulism antitoxin, Influenza vaccine | Biological product (e.g. vaccines, immune sera) |
| PROTEIN | STAT, RNA polymerase II alpha subunit | Includes the names of proteins, groups, families, molecules, complexes and substructures*2 |
| RNA | IL-2R alpha transcripts, TNF mRNA | Includes the names of RNAs, groups, families, molecules, domains and regions*2 |
| SYMPTOM | cough, fever, dehydration, convulsion | Alterations in the appearance of a case due to a disease |
| TIME | Tue Jan 3, winter, March, since October, 2003 | Temporal expressions that can be anchored on a timeline*4 |
| TRANSMISSION | HIV-tainted blood products, BSE-infected cows | Source of infection |
| VIRUS | Ebola virus, HIV | Viruses such as HIV, HTLV, EBV*2 |

Descriptions marked with *1, *2, *3, *4 are based on those in MeSH [12], GENIA ontology [13], MUC-7 [14], and HUB-4 [15], respectively.

Table 2 List of classes of markable concepts

3. CONSIDERATION ON TWO APPROACHES

At this stage we were aware that some of the important concepts in Table 2 are contextually dependent and intrinsically different from others. For example, CASE and TRANSMISSION represent roles (discussed in [8] [9] [10] [11] among others) which are dependent on the existence of events they participate in, while most others, such as PERSON, BACTERIA, and NON_HUMAN, represent types.

We had two options for constructing the ontology and annotation schema, according to how to deal with concepts of a different nature. The first approach is rather task-oriented. Here we do not make any distinction between context-dependent concepts and others. This results in a somewhat simpler ontology: all categories of concepts are represented as classes which follow the disjoint entity class principle that has been the underlying premise of NERs. The corresponding annotation schema will also be simpler, since instances of context-dependent classes are annotated in the same way as those of other classes, e.g.

Kofi Annan
a 12 year-old girl infected with H5N1

(The details of this schema will be given in the next section.) In this task-oriented approach, we can annotate exactly what the event frame needs to identify. For example, we can exclude from annotation non-named, non-case mentions, which we are not interested in. A defect of this approach is that it is not ontologically well-founded.

The alternative approach is a more formal one where we make a clear distinction between context-dependent concepts and others, based on well-founded ontological principles. The result is likely to be a more complex ontology in which context-dependent concepts have a different status from other concepts. The corresponding annotation schema will also be more complex, since roles are annotated in a different way from entity classes. In order to achieve ontological consistency we also need to annotate more mentions than in the former approach, including those that will not instantiate event frames.

Of the two approaches above, out of expediency we chose the former for the first annotation experiment. The reasons were that it seemed easier for annotators and that we could find almost no precedent works in named entity annotation which dealt with formal analysis of entities and role concepts.

4. ANNOTATION EXPERIMENT 1

4.1 Method

Based on the list of categories of concepts in Table 2, we constructed the ontology shown in Figure 1. Note that CASE and TRANSMISSION, which represent roles, have the same status as other classes since we adopted the task-oriented approach as discussed in the last section. We developed annotation guidelines to annotate non-overlapping mentions related to the classes in news articles and hired two PhD informatics students as annotators. After one week of training consisting of guideline review, case study discussions and test cases, we started the annotation process with 200 news articles taken from domain sources, including WHO epidemic reports, IRIN, and Reuters news.

Figure 1 Initial domain ontology (simplified)

In order to restrict the markable mentions to exactly those that we aimed to identify with the text mining system, we defined CASE as the class of confirmed cases which are unnamed, and PERSON as the class of named persons who are not cases. We considered this would narrow down the number of markable mentions since unnamed mentions for non-cases need not be annotated. We also instructed annotators to mark up only the single most appropriate class, prohibiting multiple classes. An example of annotated text is shown below:

The Ministry of Health in Indonesia has today confirmed a fatal human case of H5N1 avian influenza. A 27-year-old woman from Jakarta developed symptoms on 17 September. She contracted the virus from close contact with infected birds.

In the annotation schema used in the example above, the attribute cl takes the entity class label as its value. For example, annotating "Kofi Annan" with cl="PERSON" means that the entity mentioned by "Kofi Annan" is related to the class PERSON. The reason for using this rather vague expression is to cover two relations between mentioned entities and the ontology we want to describe. The first is "is an instance of", and the other one is "is a subclass of". Some of the markable texts mention a particular and others mention a universal. For example, names of persons, locations and organizations are usually used to refer to a particular, whereas names of chemical substances, viruses and proteins are often used to refer to universals. This is one of the factors which makes ontology-based annotation a complicated process. It should be noted though that we intend to work towards a clear distinction between the two relations in future work.

4.2 Annotation results and problems

During the first annotation experiment, we had many problem reports from annotators, and found a significant number of inconsistencies in the annotation results. Most of the problems could be traced back to poor design of the domain ontology and the annotation schema. Follow up analysis on the corpus yielded the following symptoms of error:

• Gaps in the annotation schema, shown by the existence of mentions of entities which it is desirable to annotate but which the annotation schema does not cover.
• Ambiguity between context-dependent concepts and context-independent ones.
• Idiosyncratic annotations which are forced on annotators due to the disjoint entity class principle.

Gaps in the annotation schema

At the initial stage of our analysis we considered that distinguishing CASE (as confirmed cases of a disease which are unnamed humans) from PERSON (named persons who are not cases of a disease) was rather natural, since CASE entities are in general anonymous. However, in the news articles there were some examples where cases were mentioned by name^2 as follows:

E1 Tests carried out in a UK laboratory confirmed ...

^2 In this example we only show initials of the victims' names.

In addition, we found that there were more frequent mentions of putative cases than we had expected. These mentions were often annotated as CASE by annotators although we restricted the scope of this class only to confirmed cases.

E2 a Taiwanese is suspected to have died of SARS

Follow up discussions with public health experts revealed that mentions of putative cases are important, especially in the early stages of disease outbreaks, and we concluded that they should be identified by the system. However, the existing framework made them difficult to capture.

Ambiguity caused by context-dependent concepts

One of the classes which confused annotators most was TRANSMISSION (source of infection). Below are typical examples of problematic cases.

E3 Victims contract the virus from close contact with infected birds
E4 There is no known cure for Ebola, which is transmitted via infected body fluids
E5 An Irish woman infected with Hepatitis C by a contaminated blood product
E6 18 hospitalized after consuming chapattis

Annotators had a problem in annotating 'birds' in E3 since those can be classified as both TRANSMISSION and NON_HUMAN (animals). 'Body fluids' in E4 is also ambiguous between TRANSMISSION and ANATOMY (body parts), and 'blood product' in E5 is ambiguous between TRANSMISSION and PRODUCT (biological product). Most of the TRANSMISSION instances found in the text were those which could be categorized as NON_HUMAN, and the cases which belonged only to TRANSMISSION, such as 'chapattis' in E6, were very few.

Idiosyncratic annotations due to the disjoint entity class principle

E7 Hudd has written several books on music hall and Variety...
E8 Doctors later diagnosed Hudd with a chest infection...

In the example above, it is clearly undesirable that the same entity is related to PERSON in E7 and CASE in E8. Although the annotator was aware of the choices, the principle of disjoint classes forced a choice.

4.3 Empirical results from training an NER

We trained a support vector machine [16] (for details, see Takeuchi and Collier [17]) for named entity recognition based on the annotated corpus of 200 news articles. 10-fold cross validation experiments were performed using TinySVM^3. A -2/+1 feature window was used that included surface word, orthography, biomedical prefixes/suffixes, lemma, head noun and previous class predictions. The F-score for all the classes in Table 2 was 76.96. Among the problematic classes were PERSON, CASE and NON_HUMAN (many instances of which had ambiguity with TRANSMISSION), which had F-scores below our expectation: PERSON (54.95), CASE (53.17), NON_HUMAN (68.0).

^3 Available from http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM

5. ANNOTATION EXPERIMENT 2

5.1 Re-examination of the approach

Although we chose the task-oriented approach for its simplicity and ease of implementation, the results from automatic NER and subsequent corpus analysis revealed that problems arose because we made no clear distinction between context-dependent and context-independent classes. We decided to take an alternative, formal and linguistically-sound approach, and distinguish context-dependent concepts from others in both the ontology and the annotation schema.

5.2 Classification of concepts

The first step was to use the classification method proposed by Guarino and Welty ([9] and [10]), which is based on meta-properties (rigidity, identity, dependency), in order to classify the categories of concepts in Table 2. Definitions of the meta-properties we used are as follows:

([10], p.4)
rigid property φ (+R): ∀x φ(x) → □φ(x)
anti-rigid property φ (~R): ∀x φ(x) → ¬□φ(x)

([10], p.5)
Identity Condition (IC): An identity condition is a formula Γ that satisfies either of the following^4:

^4 In [9], further restrictions are added in order to avoid 1) the case where the necessary IC definition becomes trivially true regardless of the truth value of the formula x=y and 2) the case where Γ(x, y, t, t') is false and that makes the sufficient IC definition trivially true.
necessary IC: E(x, t) ∧ φ(x, t) ∧ E(y, t') ∧ φ(y, t') ∧ x=y → Γ(x, y, t, t')
sufficient IC: E(x, t) ∧ φ(x, t) ∧ E(y, t') ∧ φ(y, t') ∧ Γ(x, y, t, t') → x=y
(E: "actually exist at time t")

Any property φ carries an IC (+I) iff it is subsumed by a property supplying that IC.

A property φ supplies an IC (+O) iff i) it is rigid; ii) there is a necessary or sufficient IC for it; and iii) the same IC is not carried by all the properties subsuming φ.

([10], p.7)
externally dependent property φ (+D): ∀x□(φ(x) → ∃y ω(y) ∧ ¬P(y, x) ∧ ¬C(y, x))
(P: "is a part of")
(C: "is a constituent of")

Classification results are shown in Table 3. Most concepts such as ANATOMY, NON_HUMAN, and PERSON are classified as Type, whereas the concepts which were problematic in the first experiment were classified as Role: TRANSMISSION (Formal Role) and CASE (Material Role). According to the further classification of non-rigid concepts by Kaneiwa and Mizoguchi [18], these cases are classified as time-dependent concepts.

| Class | rigidity | identity (supplying) | identity (carrying) | dependency | classification |
|---|---|---|---|---|---|
| ANATOMY | +R | +O | +I | -D | Type |
| BACTERIA | +R | +O | +I | -D | Type |
| CASE | ~R | -O | +I | +D | Material Role |
| NT_CHEMICAL | ~R | -O | +I | +D | Material Role |
| T_CHEMICAL | ~R | -O | +I | +D | Material Role |
| CONTROL | ~R*1 | -O*2 | +I | +D | Material Role |
| DISEASE | +R | +O*3 | +I | +D | Type |
| DNA | +R | +O | +I | -D | Type |
| LOCATION | +R | +O | +I | -D | Type |
| NON_HUMAN | +R | +O | +I | -D | Type |
| ORGANIZATION | +R | +O | +I | -D | Type |
| PERSON | +R | +O | +I | -D | Type |
| PRODUCT | +R | +O | +I | +D | Type |
| PROTEIN | +R | +O | +I | -D | Type |
| RNA | +R | +O | +I | -D | Type |
| SYMPTOM | +R | +O | +I | +D | Type |
| TIME | +R | +O | +I | -D | Type |
| VIRUS | +R | +O | +I | -D | Type |
| TRANSMISSION | ~R | -O | -I | +D | Formal Role |

*1 We consider that this class is anti-rigid, since it is possible that an action which is an instance of CONTROL in the current world is not an instance of CONTROL in some other accessible world. The same action may be conducted for different purposes in different worlds.
*2 This class includes events. In the DOLCE top level categories (Gangemi et al. [19]), Events are under the class of Perdurant/Occurrence. It seems to be controversial what the identity condition for events should be. Davidson [20] proposes a condition such that "events are identical if and only if they have exactly the same causes and effects". In any case it should be reasonable to assume that this class itself does not supply ICs but inherits them from the upper level classes.
*3 What we consider to be the ICs for this class is as follows: two instances of diseases are identical iff the two are experienced by the same host at the same time, are caused by the same agent (e.g. H5N1 virus for "H5N1 avian influenza") and have the same set of characteristic alterations/symptoms (e.g. inflammation of the lung for "pneumonia").

Table 3: Classification of concepts

5.3 Modification of the schema

For some of the roles in Table 3, we modified their status in the annotation schema.

CASE

CASE and PERSON were problematic since we distinguished them according to the form of expression (unnamed/named), in addition to the case/non-case distinction. In order to cover the mentions which could not be annotated in the first experiment, we extended the scope of the PERSON class to include person instances in general, and eliminated the unnamed/named and case/non-case distinctions. We modified the annotation schema so that CASE is not a value of the cl attribute, but is instead the case attribute which applies to the referred instance of PERSON. This attribute takes the value true when the mentioned instance is a confirmed case of disease, false when the instance is not a case, and putative when the instance is a suspected case. Named case mentions and suspected case mentions are annotated as follows:

E9 Tests carried out in a UK laboratory confirmed that M.A...
E10 a Taiwanese is suspected to have died of SARS

The meaning of case attribute-value pairs can be described in logical description and natural language as follows:

<...cl="PERSON" case="true">John: case(j)
"It is true that the person j mentioned by "John" is an instance of the CASE class"

<...cl="PERSON" case="false">John: ¬case(j)
"It is false that the person j mentioned by "John" is an instance of the CASE class"

<...cl="PERSON" case="putative">John: ◇case(j)
"It is possible that the person j mentioned by "John" is an instance of the CASE class"

As shown above, the values of the case attribute correspond to logical operators such as ¬ and ◇. The values of the case attribute specify the modes of linkage between the referred concept and the CASE class. The formal basis we had in mind when formulating the case attribute is as follows: 1) every instance of a non-rigid class must be an instance of some rigid class, 2) the relations between a non-rigid class and its instances are often modified by modal/temporal operators. The first point drove us to create the case attribute which applies to instances of some rigid class, here PERSON. The second point is the motivation for us to set values to include negative and modal operators. This schema can be extended if we allow a wider value range for the case attribute to include other modal/temporal operators, although currently we restrict the values to the three above.

It is worth noting that there is a trade-off between this revised schema and the former schema, which is that we have increased the number of markable entities, since we need to annotate unnamed, non-case mentions which are not directly related to the purpose of the system.

TRANSMISSION

We defined the transmission attribute which applies to mentions of the ANATOMY, PRODUCT, PERSON and NON_HUMAN classes. As shown in the following examples, 'birds' are always related to NON_HUMAN, and take a 'true' value only when [...] also take a 'putative' value to cover mentions of possible sources of infection.

E11 Victims contract the virus from close contact with infected birds

T_CHEMICAL / NT_CHEMICAL

Concept classification revealed that T_CHEMICAL and NT_CHEMICAL have "the situation dependency obtained from extending types" discussed in [18] and have the same status as 'weapon' and 'table'. T_CHEMICAL includes chemicals mentioned as drugs in any context and those regarded as drugs in some context. Here we removed the two classes and made the parent node CHEMICAL a class for annotation. We then defined the therapeutic attribute, which applies to mentions of CHEMICAL and takes the value true when the entity is intended for therapeutic use and false otherwise.

As a result of the modifications above, our revised ontology is shown in Figure 2. We also added new classes CONDITION (status of patients: 'hospitalized', 'died', 'in critical condition', etc.) and OUTBREAK (collective disease incident: 'outbreak', 'pandemic', etc.). Information about CONDITION is important for experts to know the rate of hospitalization and death and determine the alert level. Mentions of OUTBREAK include expressions which are specific to disease outbreak news, increasing the specificity of our detection system. We located PERSON and NON_HUMAN under metazoa, and added a number attribute (which takes one or many as its value) to be applied to PERSON instances.

Figure 2 Current ontology (simplified). Attributes shown in the figure: case, number, transmission, therapeutic.

With insights from the revised ontology we also changed the annotation method by dividing the process into two distinct stages as shown in Figure 3: 1) annotation of mentions of non-role (rigid) concepts and 2) annotation of role (non-rigid) concepts.

Figure 3 Annotation schedule. Stages shown in the figure: 1. Annotation of Type (rigid) concepts; 2. Annotation of Role (non-rigid) concepts; 3. Coreference annotation; 4. Event annotation.

5.4 Results of annotation and NE recognizer

We asked three PhD students to annotate a further 300 news articles. This time we used the revised annotation method, stages 1 and 2 shown in Figure 3. As a result of distinguishing Role concepts (case, transmission, therapeutic) from others in the annotation schema, problem reports on these classes were reduced, and the annotation results were also improved. Contrary to our expectations, the complexity of the new annotation schema and the increased number of markable mentions seemed to have no negative influence on the annotators' speed.

The improvement can be seen empirically in the NER results. We re-annotated the corpus used in the first experiment using the revised annotation schema. This time the F-score for all classes rose to 79.96 (+3.00 compared to the previous result). In particular, significant increases of the F-score were observed for PERSON (66.28; +11.33 compared to the previous result), case mentions among PERSON (65.63; +12.46), and NON_HUMAN (73.21; +5.21).

5.5 Remaining issues

Some of the problems reported in this second experiment were related to the context dependency (anti-rigidity, situation dependency) discussed in Sections 5.2 and 5.3. The most difficult class seemed to be CONTROL (control measures to lower the risk of diseases). As shown in Table 3, we consider this class to be non-rigid as well, and it includes mentions which refer to subclasses of the CONTROL class regardless of situation ("quarantine", "vaccination"), and others which can be a control measure depending on the situation ("warning", "blockade"). This characteristic seems to cause the difficulty.

So far we have resolved the complexity of non-rigid concepts by defining attributes which apply to instances of rigid classes (e.g. the case attribute for the class PERSON). This strategy, however, does not seem to be effective for CONTROL, since it is not easy to identify a rigid superclass for CONTROL which can be realistically annotated in the text. For example, EVENT can be considered as a rigid class subsuming CONTROL, but currently it is not realistic to manually annotate every mention of an event. We are currently seeking a way to deal with this problem.

6. CONCLUSION

The study in this paper was motivated by our need for a high quality annotation schema to support detection of novel entities in the infectious disease outbreak domain. We discussed two experiments based on alternative approaches for constructing an ontology-based annotation schema. The amount of training data in our study is relatively small, but the empirical results support our view that there is a positive effect in adopting well founded ontological principles over an ad-hoc task-based approach.

Although this study is not a formal evaluation of ontologies, it is still an evaluation from the viewpoint of ontology application to the task of natural language annotation. The classification method of Guarino and Welty ([9], [10]), which was originally proposed to achieve consistency in the configurational structure of ontologies, was adapted and found to be useful for improving annotation performance.

An alternative possibility which we have not addressed in this paper is to reformulate the traditional NER task to allow for overlapping (nested) and multi-class entities. This however introduces significant additional complications in both the recognizer models and the annotation schema, so we have adopted a less radical formulation in this work.

As the next step in this study, we are now extending our simple taxonomy to a multi-lingual ontology, enriching the current taxonomic structure with domain-sensitive relations. The resulting ontology will be freely available for re-use. At the initial stage we are focusing on English, Japanese, Vietnamese, Thai, Chinese (standard) and Korean. We hope to add other Asia-Pacific languages in the future.

Acknowledgements

We gratefully acknowledge partial funding support from the Japan Society for the Promotion of Science (grant no. 18049071). We also thank the anonymous reviewers for helpful comments.

References

1. Ferguson NM, Cummings DA, Cauchemez S, Fraser C, Riley S, et al. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature 437: 209–214. 2005.
2. Grishman R, Huttunen S, Yangarber R. Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, Vol. 35, No. 4, 236-246, 2002.
3. Public Health Agency of Canada. GPHIN system. http://www.phac-aspc.gc.ca/media/nr-rp/2004/2004_gphin-rmispbk_e.html
4. Aronson A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of AMIA Symposium, 17–21, 2001.
5. Rindflesch T.C., Tanabe L, Weinstein J.N., Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Proceedings of Pacific Symposium on Biocomputing 5:514-525, 2000.
6. Kim J.D., Ohta T, Tsuruoka Y, Tateishi Y, Collier N. Introduction to the Bio-entity Recognition Task of the JNLPBA workshop. Proceedings of the JNLPBA, 70-76, 2004.
7. Yeh A, Morgan A, Colosimo M, Hirschman L. BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2.
8. Sowa J.F. Conceptual structures: Information processing in mind and machine. Addison-Wesley, New York; 1984.
9. Guarino N, Welty C. A formal ontology of properties. Dieng R, Corby O (eds.) Proceedings of EKAW-2000: The 12th International Conference on Knowledge Engineering and Knowledge Management, volume 1937: 97-112.
10. Guarino N, Welty C. Ontological analysis of taxonomic relations. Lander A, Storey V (eds.) Proceedings of ER-2000: The International Conference on Conceptual Modeling, vol. 1920, 210-224, Springer Verlag LNCS, Berlin, Germany.
11. Steimann F. On the representation of roles in object-oriented and conceptual modelling. Data and Knowledge Engineering 35, 1: 83-106. 2000.
12. U.S. National Library of Medicine. Medical Subject Headings (MeSH), 2006.
13. Kim J.D., Ohta T, Tateishi Y, Tsujii J. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl. 1), pp. i180-i182, Oxford University Press, 2003.
14. Hirschman L, Chinchor N. MUC-7 named entity task definition. Proceedings of the 7th Message Understanding Conference (MUC-7).
15. Hirschman L, Chinchor N, Grishman R, Sundheim B. Hub-4 Event Guidelines Version 2.6. http://www-nlpir.nist.gov/related_projects/muc/proceedings/hub4/guidelines.html
16. Vapnik V.N. The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
17. Takeuchi K, Collier N. Bio-medical entity extraction using support vector machines. Artificial Intelligence in Medicine, vol. 33, no. 2, pp. 125-137, Elsevier, 2005.
18. Kaneiwa K, Mizoguchi R. An order-sorted quantified modal logic for meta-ontology. Proc. of the International Conference on Automated Reasoning with Analytic Tableaux and Related Methods (TABLEAUX 2005), Koblenz, Germany: 169-184, 2005.
19. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L. Sweetening ontologies with DOLCE. Benjamins et al. (eds.), Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management (EKAW2002), 166-181, Sigüenza, Spain, 2002.
20. Davidson D. The Individuation of events. Rescher N (ed) Essays in Honor of Carl G. Hempel: 216-234, 1969, D. Reidel.
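
Illustrative sketch: classification by meta-properties

The mapping from meta-property combinations to the labels used in Table 3 (Type, Material Role, Formal Role) can be illustrated with a short Python sketch. This is not part of the BioCaster system. The names MetaProps, classify and TABLE3_SAMPLE are hypothetical, rigidity is simplified to a boolean (+R versus ~R, the only two values that occur in Table 3), and combinations that do not occur in Table 3 are left unclassified rather than guessed.

```python
from typing import NamedTuple, Optional


class MetaProps(NamedTuple):
    """Meta-properties of a concept class, as tabulated in Table 3 (hypothetical encoding)."""
    rigid: bool        # True for +R, False for ~R (anti-rigid)
    supplies_ic: bool  # True for +O, False for -O
    carries_ic: bool   # True for +I, False for -I
    dependent: bool    # True for +D, False for -D


def classify(p: MetaProps) -> Optional[str]:
    """Return the Table 3 label for a meta-property combination, or None if not covered."""
    # Every rigid class in Table 3 supplies and carries an IC and is labelled Type,
    # whether or not it is externally dependent.
    if p.rigid and p.supplies_ic and p.carries_ic:
        return "Type"
    # Anti-rigid, dependent classes that supply no IC are roles; carrying an IC
    # separates Material Role (+I) from Formal Role (-I).
    if not p.rigid and not p.supplies_ic and p.dependent:
        return "Material Role" if p.carries_ic else "Formal Role"
    # Combinations outside Table 3 are deliberately left unclassified.
    return None


# A few rows copied from Table 3.
TABLE3_SAMPLE = {
    "PERSON": MetaProps(rigid=True, supplies_ic=True, carries_ic=True, dependent=False),
    "DISEASE": MetaProps(rigid=True, supplies_ic=True, carries_ic=True, dependent=True),
    "CASE": MetaProps(rigid=False, supplies_ic=False, carries_ic=True, dependent=True),
    "TRANSMISSION": MetaProps(rigid=False, supplies_ic=False, carries_ic=False, dependent=True),
}

if __name__ == "__main__":
    for name, props in TABLE3_SAMPLE.items():
        print(name, classify(props))
```

Under these assumptions, the classes that the revised schema keeps as values of the cl attribute are those labelled Type, while the classes labelled as roles are the ones that Section 5.3 turns into attributes (case, transmission) on instances of rigid classes.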