
Machine Learning for Information Extraction from XML marked-up text on the Semantic Web

Nigel Collier

collier@nii.ac.jp
National Institute of Informatics (NII), National Center of Sciences, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

1999


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.

1. INTRODUCTION

The last few years have seen an explosion in the amount of text becoming available on the World Wide Web (Web). Encouraged by the growth of the Web community, textual database providers have been migrating their archives for online access, adding to the available information. Online communities of users have emerged to share documents and other digital resources in multiple and diverse domains. The issue of information access, i.e. how to find the information that meets users’ requirements and present it in an understandable form, has now become a major research issue. In order to accomplish this we need to empower computers with an ‘understanding’ of the users’ texts. One scenario that is being explored, e.g. [34], is to combine information retrieval (IR) with information extraction (IE). It is likely that a critical component in such a system will be an ability to learn to identify and classify terms based on examples of previously marked-up text.

At the lowest level of IE, a system should be able to identify object, i.e. term, boundaries and classify them according to the semantic classes and ontologies described in the training documents through mechanisms such as the usual document type declaration (DTD) and the XML Schemas that are now being proposed [18]. For this purpose extensible markup language (XML) [35] [14] seems to be most appropriate for semantic annotation, not only for terminology, but also for tasks that require the learning of relations between those terms. XML incorporates a number of powerful features for describing object semantics, such as inheritance of name spaces by child elements from their parents. Name spaces can be nested, with the youngest ancestor within a name space declaration determining the current scope. XML allows us to represent semantics through potentially unbounded hypertags but does not by itself attempt to interpret the meaning of the labels.

In our work we emphasize the need for IE tools to be adaptable to different domains and languages rather than to serve as general-purpose tools, because of the distinct semantic characteristics of each domain.

In the remainder of this paper we present a method for term identification and classification based on hidden Markov models (HMMs) [24] that learns from annotated texts. We describe its performance in two domains, news and molecular biology, and discuss some of the term markup issues that our analysis revealed.
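The name space scoping rule mentioned in the introduction (child elements inherit the name space of their parent, and the nearest enclosing declaration wins) can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree parser; the name space URIs and element names below are hypothetical and chosen only for illustration, they are not taken from the corpora described in this paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the outer default name space applies to all
# descendants until the nested <note> element re-declares it.
DOC = """
<abstract xmlns="http://example.org/bio">
  <protein>JAK1</protein>
  <note xmlns="http://example.org/news">
    <person>Newman</person>
  </note>
  <source>T cells</source>
</abstract>
"""


def qualified_tags(xml_text):
    """Return every element tag in document order, in '{uri}local' form."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]


tags = qualified_tags(DOC)
for tag in tags:
    print(tag)
```

In this sketch `protein` and `source` resolve to the outer name space, while `person` resolves to the one declared on its nearest ancestor, `note`.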

Semantic Web Workshop 2001, Hongkong, China. Copyright by the authors.

2. BACKGROUND

[...] built patterns and domain specific heuristic rules, e.g. [13], overcoming the problems associated with data sparseness with the help of sophisticated smoothing algorithms [3].

HMMs are one of the most widely used methods in ML for IE. They can be considered to be stochastic finite state machines and have enjoyed success in a number of fields including speech recognition and part-of-speech tagging [17]. It has been natural therefore that these models have been adapted for use in other word-class prediction tasks such as the named-entity task in IE. Such models are based on n-grams. Although the assumption that a word’s part-of-speech or name class can be predicted by the previous n-1 words and their classes is counter-intuitive to our understanding of linguistic structures and long distance dependencies, this simple method does seem to be highly effective in practice. Nymble [1], a system which uses HMMs, is one of the most successful such systems and trains on a corpus of marked-up text, using only character features in addition to word bigrams.

Although it is still early days for the use of HMMs for IE, we can see a number of trends in the research. Systems can be divided into those which use one state per class, such as Nymble (at the top level of their backoff model), and those which automatically learn about the model’s structure, such as [30]. Additionally, there is a distinction to be made in the source of the knowledge for estimating transition probabilities between models which are built by hand, such as [12], and those which learn from tagged corpora in the same domain (such as the model presented in this paper), from word lists, and from corpora in different domains (so-called distantly-labeled data [30]).

Despite their success, HMMs and other machine learning methods can only be as successful as the features with which they are trained. In this study we have focussed on developing a simple, yet powerful set of features based on orthographic knowledge, that can be portable between domains. We now present an overview of the training corpora we used. This is followed by the results for an HMM named-entity system for two diverse domains: news and molecular biology, using orthographic, lexical and class features to train the models. We also discuss some of the problems that our results revealed, particularly from local syntactic relations, due to the local contextual view the model took.

The sets of name classes for the two domains are given in Tables 1 and 2. [...] entities are quite different. Even on a superficial examination we can see that many are combinations of proper and common nouns with cross-over of vocabulary between name classes, e.g. the lemma ‘cell’ belongs to PROTEIN, SOURCE.ct and SOURCE.cl terms.

Description
proteins, protein groups, families, complexes and substructures
DNAs, DNA groups, regions and genes
RNAs, RNA groups, regions and genes
cell line
cell type
mono-organism
multi-organism
viruses
sublocation
tissue
background words

Table 1: [...] of 100 abstracts.

Example               Description
Harvard Law School    names of organisations
Washington            names of people
Houston               names of places, countries etc.
1970s                 date expressions
midnight              time expressions
$10 million           money expressions
2.5%                  percentage expressions
start-up costs        background words

Table 2: Named entity classes for the news domain. # indicates the number of tagged terms in the corpus [...]

# values for Tables 1 and 2 (row alignment not recoverable): 1783 108 3 423 390 542 838 30 358 77 90 64 417 21 93 37

TI - Activation of <PROTEIN> JAK kinases </PROTEIN> and <PROTEIN>STAT proteins </PROTEIN> by <PROTEIN> interleukin - 2 </PROTEIN> and <PROTEIN> interferon alpha </PROTEIN> , but not the <PROTEIN> T cell antigen receptor </PROTEIN> , in <SOURCE.ct> human T lymphocytes </SOURCE.ct> .
AB - The activation of <PROTEIN> Janus protein tyrosine kinases </PROTEIN> ( <PROTEIN> JAKs </PROTEIN> ) and <PROTEIN> signal transducer and activator of transcription </PROTEIN> ( <PROTEIN> STAT </PROTEIN> ) proteins by <PROTEIN> interleukin ( IL ) - 2 </PROTEIN> , the <PROTEIN> T cell antigen receptor </PROTEIN> ( <PROTEIN> TCR </PROTEIN> ) and <PROTEIN> interferon ( IFN ) alpha </PROTEIN> was explored in <SOURCE.ct> human peripheral blood - derived T cells </SOURCE.ct> and the <SOURCE.cl> leukemic T cell line Kit225 </SOURCE.cl> . An <PROTEIN>IL-2</PROTEIN>-induced increase in <PROTEIN>JAK1</PROTEIN> and <PROTEIN>JAK3</PROTEIN>, but not <PROTEIN>JAK2</PROTEIN> or <PROTEIN>Tyk2</PROTEIN>, tyrosine phosphorylation was observed. In contrast, no induction of tyrosine phosphorylation of <PROTEIN>JAKs</PROTEIN> was detected upon stimulation of the <PROTEIN>TCR</PROTEIN>. <PROTEIN>IFN alpha</PROTEIN> induced the tyrosine phosphorylation of <PROTEIN>JAK1</PROTEIN> and <PROTEIN>Tyk2</PROTEIN>, but not <PROTEIN>JAK2</PROTEIN> or <PROTEIN>JAK3</PROTEIN>. <PROTEIN>IFN alpha</PROTEIN> activated <PROTEIN>STAT1</PROTEIN>, <PROTEIN>STAT2</PROTEIN> and <PROTEIN>STAT3</PROTEIN> in <SOURCE.ct>T cells</SOURCE.ct>, but no detectable activation of these <PROTEIN>STATs</PROTEIN> was induced by <PROTEIN>IL-2</PROTEIN>.

A graduate of <ENAMEX TYPE="ORGANIZATION">Harvard Law School</ENAMEX>, Ms. <ENAMEX TYPE="PERSON">Washington</ENAMEX> worked as a lawyer for the corporate finance division of the <ENAMEX TYPE="ORGANIZATION">SEC</ENAMEX> in the late <TIMEX TYPE="DATE">1970s</TIMEX>. She has been a congressional staffer since <TIMEX TYPE="DATE">1979</TIMEX>. Separately, <ENAMEX TYPE="PERSON">Clinton</ENAMEX> transition officials said that <ENAMEX TYPE="PERSON">Frank Newman</ENAMEX>, 50, vice chairman and chief financial officer of <ENAMEX TYPE="ORGANIZATION">BankAmerica Corp.</ENAMEX>, is expected to be nominated as assistant <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX> secretary for domestic finance. Mr. <ENAMEX TYPE="PERSON">Newman</ENAMEX>, who would be giving up a job that pays <ENAMEX TYPE="MONEY">$1 million</ENAMEX> a year, would oversee the <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX>’s auctions of government securities as well as banking issues. He would report directly to <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX> Secretary-designate <ENAMEX TYPE="PERSON">Lloyd Bentsen</ENAMEX>. Mr. <ENAMEX TYPE="PERSON">Bentsen</ENAMEX>, who headed the <ENAMEX TYPE="ORGANIZATION">Senate Finance Committee</ENAMEX> for the past six years, also is expected to nominate <ENAMEX TYPE="PERSON"> Samuel Sessions</ENAMEX>, the committee’s chief tax counsel, to one of the top tax jobs at <ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX>. As early as today, the <ENAMEX TYPE="PERSON">Clinton</ENAMEX> camp is expected to name five undersecretaries of state and several assistant secretaries.

\[ \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]

The results are summarised for all classes in each domain in Table 4 and show performance with and without the Unity module. Despite the small number of training texts used, the system could achieve reasonably high performance for both domains. Although results indicate that performance for news texts is slightly better than for biology, this needs confirming with a larger test collection to obtain confidence in the conclusion. The result also highlights the need for soundly motivated metrics (e.g. see [22]) to compare the difficulties of the named entity task between marked-up corpora in different domains. In the following discussion we provide failure analysis of the results.
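As an illustration of how the F-score in Equation 4 might be computed from marked-up text, the following minimal sketch compares the tagged terms in a gold-standard string with those in a system output. It is not the evaluation software used for the reported results: the helper names `tagged_terms` and `f_score` are hypothetical, the system output is invented, and matching on (class, term) pairs while ignoring term positions is a simplifying assumption.

```python
import re
from collections import Counter

# Matches simple, non-nested tags such as <PROTEIN>...</PROTEIN>
# or <SOURCE.ct>...</SOURCE.ct> (no attributes).
TAG = re.compile(r"<([A-Za-z.]+)>\s*(.*?)\s*</\1>")


def tagged_terms(marked_up):
    """Count (class, term) pairs found in a marked-up string."""
    return Counter(
        (cls, " ".join(term.split())) for cls, term in TAG.findall(marked_up)
    )


def f_score(gold, predicted):
    """Return (precision, recall, F-score) following Equation 4."""
    correct = sum((gold & predicted).values())  # multiset intersection
    precision = correct / sum(predicted.values()) if predicted else 0.0
    recall = correct / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = tagged_terms(
    "<PROTEIN>IFN alpha</PROTEIN> activated <PROTEIN>STAT1</PROTEIN> "
    "in <SOURCE.ct>T cells</SOURCE.ct>"
)
# Invented system output: the last term is given the wrong class.
predicted = tagged_terms(
    "<PROTEIN>IFN alpha</PROTEIN> activated <PROTEIN>STAT1</PROTEIN> "
    "in <PROTEIN>T cells</PROTEIN>"
)
precision, recall, f = f_score(gold, predicted)
print(precision, recall, f)
```

Here two of the three predicted terms match the gold standard, so precision, recall and F-score all equal 2/3.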

[15] S. Greenbaum and R. Quirk. A Student’s Grammar of the English Language. Longman, Essex, England, 1990.

[16] K. Humphreys, G. Demetriou, and R. Gaizauskas. Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures. In Proceedings of the Pacific Symposium on Bio-informatics (PSB’2000), Hawai’i, USA, January 2000.

[17] J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225-242, 1992.

[18] A. Layman, E. Jung, E. Maler, H. S. Thompson, J. Paoli, J. Tigue, N. H. Mikura, and S. De Rose. XML-Data, W3C Note 05 Jan 1998. The XML-Data specifications are still developing. The latest description can be found at http://www.w3.org/TR/1998/NOTE-XML-data-0105/, January 1998.

[19] MEDLINE. The PubMed database can be found at: http://www.ncbi.nlm.nih.gov/PubMed/, 1999.

[20] New York University. Named Entity Task Definition, Version 2.0. This document can be found online at http://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html, May 31st 1995.

[21] C. Nobata, N. Collier, and J. Tsujii. Automatic term identification and classification in biology texts. In Proceedings of the Natural Language Pacific Rim Symposium (NLPRS’2000), November 1999.

[22] C. Nobata, N. Collier, and J. Tsujii. Comparison between tagged corpora for the named entity task. In Proceedings of the Association for Computational Linguistics (ACL’2000) Workshop on Comparing Corpora, Hong Kong, October 7th 2000.

[23] T. Ohta, Y. Tateishi, N. Collier, C. Nobata, K. Ibushi, and J. Tsujii. A semantically annotated corpus from MEDLINE abstracts. In Proceedings of the Tenth Workshop on Genome Informatics. Universal Academy Press, Inc., 14-15 December 1999.

[24] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, pages 4-16, January 1986.

[25] J. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the 5th Conference on Applications of Natural Language Processing ANLP, Washington DC, pages 16-19, 1997.

[26] T. Rindflesch, L. Hunter, and A. Aronson. Mining molecular binding terminology from biomedical text. In American Medical Informatics Association (AMIA)’99 annual symposium, Washington DC, USA.

[27] T. Rindflesch, L. Tanabe, N. Weinstein, and L. Hunter. EDGAR: Extraction of drugs, genes and relations from the biomedical literature. In Pacific Rim Symposium on Bio-Computing 2000 (PSB’2000), January 2000.

[28] S. Sekine, R. Grishman, and H. Shinnou. A Decision Tree Method for Finding and Classifying Names in Japanese Texts. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, August 1999.