<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title />
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Machine Learning for Information Extraction from XML marked-up text on the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nigel Collier</string-name>
          <email>collier@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics (NII) National Center of Sciences</institution>
          ,
          <addr-line>2-1-2 Hitotsubashi Chiyoda-ku, Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <volume>1</volume>
      <issue>0</issue>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The last few years have seen an explosion in the amount of text becoming available on the World Wide Web (Web). Encouraged by the growth of the Web community, textual database providers have been migrating their archives for online access, adding to the available information. Online communities of users have emerged to share documents and other digital resources in multiple and diverse domains. The question of information access, i.e. how to find the information that meets users’ requirements and present it in an understandable form, has now become a major research issue. In order to accomplish this we need to empower computers with an ‘understanding’ of the users’ texts. One scenario that is being explored, e.g. [34], is to combine information retrieval (IR) with information extraction (IE). It is likely that a critical component in such a system will be an ability to learn to identify and classify terms based on examples of previously marked up text.</p>
      <p>For this purpose extensible markup language (XML) [35] [14] seems to be most appropriate for semantic annotation, not only for terminology, but also for tasks that require the learning of relations between those terms. XML incorporates a number of powerful features for describing object semantics, such as inheritance of name spaces by child elements from their parents. Name spaces can be nested, with the youngest ancestor within a name space declaration determining the current scope. XML allows us to represent semantics through potentially unbounded hypertags but does not by itself attempt to interpret the meaning of the labels.</p>
      <p>At the lowest level of IE, a system should be able to identify object, i.e. term, boundaries and classify them according to the semantic classes and ontologies described in the training documents through mechanisms such as the usual document type declaration (DTD) and XML Schemas that are now being proposed [18].</p>
      <p>In our work we emphasize the need for IE tools to be adaptable to different domains and languages rather than as general-purpose tools, due to the distinct semantic characteristics of each domain.</p>
      <p>In the remainder of this paper we present a method for term identification and classification based on hidden Markov models (HMMs) [24] that learns from annotated texts. We describe its performance in two domains, news and molecular-biology, and discuss some of the term markup issues that our analysis revealed.</p>
      <p>Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission by the authors.</p>
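      <p>The name space scoping rule described in the introduction, in which the nearest enclosing declaration determines the name space of an element, can be illustrated with a minimal sketch using the xml.etree.ElementTree module from the Python standard library. The name space URIs and element names below are invented for illustration and are not taken from the annotation schemes used in this work.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical document: both name space URIs and all element names are
# assumptions made for this example only.
DOC = """<abstract xmlns="http://example.org/bio">
  <sentence>
    <term>JAK1</term>
    <note xmlns="http://example.org/news"><term>Treasury</term></note>
  </sentence>
</abstract>"""


def qualified_names(xml_text):
    """Return (name space URI, local name) for every element, in document order."""
    pairs = []
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree expands each tag to '{uri}local', resolving the default
        # name space against the nearest enclosing declaration.
        uri, _, local = elem.tag[1:].partition("}")
        pairs.append((uri, local))
    return pairs


if __name__ == "__main__":
    for uri, local in qualified_names(DOC):
        print(local, "->", uri)
```
]]></preformat>
      <p>In this sketch the outer term element inherits the name space declared on abstract, while the term nested inside note takes the name space redeclared on note, its youngest ancestor carrying a declaration.</p>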
      <p>Semantic Web Workshop 2001 Hongkong, China. Copyright by the authors.</p>
      <p><bold>2. BACKGROUND</bold></p>
      <p>... built patterns and domain specific heuristic rules, e.g. [13], overcoming the problems associated with data sparseness with the help of sophisticated smoothing algorithms [3].</p>
      <p>HMMs are one of the most widely used methods in ML for IE. They can be considered to be stochastic finite state machines and have enjoyed success in a number of fields including speech recognition and part-of-speech tagging [17]. It has been natural therefore that these models have been adapted for use in other word-class prediction tasks such as the named-entity task in IE. Such models are based on n-grams. Although the assumption that a word’s part-of-speech or name class can be predicted by the previous n-1 words and their classes is counter-intuitive to our understanding of linguistic structures and long distance dependencies, this simple method does seem to be highly effective in practice. Nymble [1], a system which uses HMMs, is one of the most successful such systems and trains on a corpus of marked-up text, using only character features in addition to word bigrams.</p>
      <p>Although it is still early days for the use of HMMs for IE, we can see a number of trends in the research. Systems can be divided into those which use one state per class, such as Nymble (at the top level of their backoff model), and those which automatically learn about the model’s structure, such as [30]. Additionally, there is a distinction to be made in the source of the knowledge for estimating transition probabilities between models which are built by hand, such as [12], and those which learn from tagged corpora in the same domain (such as the model presented in this paper), word lists and corpora in different domains (so-called distantly-labeled data [30]).</p>
      <p>Despite their success, HMMs and other machine learning methods can only be as successful as the features with which they are trained. In this study we have focussed on developing a simple, yet powerful set of features based on orthographic knowledge that can be portable between domains. We now present an overview of the training corpora we used. This is followed by the results for a HMM named-entity system for two diverse domains: news and molecular biology, using orthographic, lexical and class features to train the models. We also discuss some of the problems that our results revealed, particularly from local syntactic relations, due to the local contextual view the model took.</p>
      <p>... entities are quite different even on a superficial examination, we can see that many are combinations of proper and common nouns with cross-over of vocabulary between name classes, e.g. the lemma ‘cell’ belongs to PROTEIN, SOURCE.ct and SOURCE.cl terms. The sets of name classes for the two domains are given in Tables 1 and 2.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>... of 100 abstracts.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>proteins, protein groups, families, complexes and substructures</td></tr>
            <tr><td>DNAs, DNA groups, regions and genes</td></tr>
            <tr><td>RNAs, RNA groups, regions and genes</td></tr>
            <tr><td>cell line</td></tr>
            <tr><td>cell type</td></tr>
            <tr><td>mono-organism</td></tr>
            <tr><td>multi-organism</td></tr>
            <tr><td>viruses</td></tr>
            <tr><td>sublocation</td></tr>
            <tr><td>tissue</td></tr>
            <tr><td>background words</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Example</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>Harvard Law School</td><td>names of organisations</td></tr>
            <tr><td>Washington</td><td>names of people</td></tr>
            <tr><td>Houston</td><td>names of places, countries etc.</td></tr>
            <tr><td>1970s</td><td>date expressions</td></tr>
            <tr><td>midnight</td><td>time expressions</td></tr>
            <tr><td>$10 million</td><td>money expressions</td></tr>
            <tr><td>2.5%</td><td>percentage expressions</td></tr>
            <tr><td>start-up costs</td><td>background words</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>TI - Activation of &lt;PROTEIN&gt; JAK kinases &lt;/PROTEIN&gt; and &lt;PROTEIN&gt;STAT proteins &lt;/PROTEIN&gt; by &lt;PROTEIN&gt; interleukin - 2 &lt;/PROTEIN&gt; and &lt;PROTEIN&gt; interferon alpha &lt;/PROTEIN&gt; , but not the &lt;PROTEIN&gt; T cell antigen receptor &lt;/PROTEIN&gt; , in &lt;SOURCE.ct&gt; human T lymphocytes &lt;/SOURCE.ct&gt; .
AB - The activation of &lt;PROTEIN&gt; Janus protein tyrosine kinases &lt;/PROTEIN&gt; ( &lt;PROTEIN&gt; JAKs &lt;/PROTEIN&gt; ) and &lt;PROTEIN&gt; signal transducer and activator of transcription &lt;/PROTEIN&gt; ( &lt;PROTEIN&gt; STAT &lt;/PROTEIN&gt; ) proteins by &lt;PROTEIN&gt; interleukin ( IL ) - 2 &lt;/PROTEIN&gt; , the &lt;PROTEIN&gt; T cell antigen receptor &lt;/PROTEIN&gt; ( &lt;PROTEIN&gt; TCR &lt;/PROTEIN&gt; ) and &lt;PROTEIN&gt; interferon ( IFN ) alpha &lt;/PROTEIN&gt; was explored in &lt;SOURCE.ct&gt; human peripheral blood - derived T cells &lt;/SOURCE.ct&gt; and the &lt;SOURCE.cl&gt; leukemic T cell line Kit225 &lt;/SOURCE.cl&gt; . An &lt;PROTEIN&gt;IL-2&lt;/PROTEIN&gt;-induced increase in &lt;PROTEIN&gt;JAK1&lt;/PROTEIN&gt; and &lt;PROTEIN&gt;JAK3&lt;/PROTEIN&gt;, but not &lt;PROTEIN&gt;JAK2&lt;/PROTEIN&gt; or &lt;PROTEIN&gt;Tyk2&lt;/PROTEIN&gt;, tyrosine phosphorylation was observed. In contrast, no induction of tyrosine phosphorylation of &lt;PROTEIN&gt;JAKs&lt;/PROTEIN&gt; was detected upon stimulation of the &lt;PROTEIN&gt;TCR&lt;/PROTEIN&gt;. &lt;PROTEIN&gt;IFN alpha&lt;/PROTEIN&gt; induced the tyrosine phosphorylation of &lt;PROTEIN&gt;JAK1&lt;/PROTEIN&gt; and &lt;PROTEIN&gt;Tyk2&lt;/PROTEIN&gt;, but not &lt;PROTEIN&gt;JAK2&lt;/PROTEIN&gt; or &lt;PROTEIN&gt;JAK3&lt;/PROTEIN&gt;. &lt;PROTEIN&gt;IFN alpha&lt;/PROTEIN&gt; activated &lt;PROTEIN&gt;STAT1&lt;/PROTEIN&gt;, &lt;PROTEIN&gt;STAT2&lt;/PROTEIN&gt; and &lt;PROTEIN&gt;STAT3&lt;/PROTEIN&gt; in &lt;SOURCE.ct&gt;T cells&lt;/SOURCE.ct&gt;, but no detectable activation of these &lt;PROTEIN&gt;STATs&lt;/PROTEIN&gt; was induced by &lt;PROTEIN&gt;IL-2&lt;/PROTEIN&gt;.</p>
      <p>A graduate of &lt;ENAMEX TYPE="ORGANIZATION"&gt;Harvard Law School&lt;/ENAMEX&gt;, Ms. &lt;ENAMEX TYPE="PERSON"&gt;Washington&lt;/ENAMEX&gt; worked as a lawyer for the corporate finance division of the &lt;ENAMEX TYPE="ORGANIZATION"&gt;SEC&lt;/ENAMEX&gt; in the late &lt;TIMEX TYPE="DATE"&gt;1970s&lt;/TIMEX&gt;. She has been a congressional staffer since &lt;TIMEX TYPE="DATE"&gt;1979&lt;/TIMEX&gt;. Separately, &lt;ENAMEX TYPE="PERSON"&gt;Clinton&lt;/ENAMEX&gt; transition officials said that &lt;ENAMEX TYPE="PERSON"&gt;Frank Newman&lt;/ENAMEX&gt;, 50, vice chairman and chief financial officer of &lt;ENAMEX TYPE="ORGANIZATION"&gt;BankAmerica Corp.&lt;/ENAMEX&gt;, is expected to be nominated as assistant &lt;ENAMEX TYPE="ORGANIZATION"&gt;Treasury&lt;/ENAMEX&gt; secretary for domestic finance. Mr. &lt;ENAMEX TYPE="PERSON"&gt;Newman&lt;/ENAMEX&gt;, who would be giving up a job that pays &lt;ENAMEX TYPE="MONEY"&gt;$1 million&lt;/ENAMEX&gt; a year, would oversee the &lt;ENAMEX TYPE="ORGANIZATION"&gt;Treasury&lt;/ENAMEX&gt;’s auctions of government securities as well as banking issues. He would report directly to &lt;ENAMEX TYPE="ORGANIZATION"&gt;Treasury&lt;/ENAMEX&gt; Secretary-designate &lt;ENAMEX TYPE="PERSON"&gt;Lloyd Bentsen&lt;/ENAMEX&gt;. Mr. &lt;ENAMEX TYPE="PERSON"&gt;Bentsen&lt;/ENAMEX&gt;, who headed the &lt;ENAMEX TYPE="ORGANIZATION"&gt;Senate Finance Committee&lt;/ENAMEX&gt; for the past six years, also is expected to nominate &lt;ENAMEX TYPE="PERSON"&gt;Samuel Sessions&lt;/ENAMEX&gt;, the committee’s chief tax counsel, to one of the top tax jobs at &lt;ENAMEX TYPE="ORGANIZATION"&gt;Treasury&lt;/ENAMEX&gt;. As early as today, the &lt;ENAMEX TYPE="PERSON"&gt;Clinton&lt;/ENAMEX&gt; camp is expected to name five undersecretaries of state and several assistant secretaries.</p>
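      <p>Marked-up text of the kind shown in the examples above has to be turned into word and class pairs before transition probabilities can be estimated from a tagged corpus. The following is a minimal sketch of that step, not the procedure used in this work: the whitespace tokenisation, the background label UNK and the function names are assumptions, and the estimate is an unsmoothed maximum-likelihood one.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter, defaultdict

# One tagged span, e.g. <PROTEIN>JAK1</PROTEIN> or
# <ENAMEX TYPE="PERSON">Newman</ENAMEX>.
TAGGED = re.compile(
    r'<([A-Za-z.]+)(?:\s+TYPE\s*=\s*"([^"]+)")?\s*>(.*?)</\1\s*>', re.S
)


def to_labeled_tokens(text, background="UNK"):
    """Split marked-up text into (word, class) pairs.

    Words outside any tag receive the background label (an assumed name).
    """
    pairs = []
    pos = 0
    for match in TAGGED.finditer(text):
        # Untagged words before this span are background words.
        pairs += [(w, background) for w in text[pos:match.start()].split()]
        # Prefer the TYPE attribute (news markup), else the tag name.
        label = match.group(2) or match.group(1)
        pairs += [(w, label) for w in match.group(3).split()]
        pos = match.end()
    pairs += [(w, background) for w in text[pos:].split()]
    return pairs


def transition_probabilities(pairs):
    """Unsmoothed estimate of P(class_t | class_{t-1}) from labelled tokens."""
    labels = [label for _, label in pairs]
    counts = defaultdict(Counter)
    for prev, cur in zip(labels, labels[1:]):
        counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(c.values()) for cur, n in c.items()}
        for prev, c in counts.items()
    }


if __name__ == "__main__":
    sample = 'Mr. <ENAMEX TYPE="PERSON">Newman</ENAMEX> joined <PROTEIN> TCR </PROTEIN> .'
    tokens = to_labeled_tokens(sample)
    print(tokens)
    print(transition_probabilities(tokens))
```
]]></preformat>
      <p>A practical system would replace the unsmoothed counts with a smoothed estimate, since many class and word combinations never occur in a small training corpus.</p>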
      <p>Table 2: Named entity classes for the news domain. # indicates the number of tagged terms in the corpus.
#: 1783, 108, 3, 423, 390, 542, 838, 30, 358, 77, 90, 64, 417, 21, 93, 37
</p>
      <p><disp-formula><label>(4)</label><tex-math><![CDATA[\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}]]></tex-math></disp-formula></p>
      <p>The results are summarised for all classes in each domain in Table 4 and show performance with and without the Unity module. Despite the small number of training texts used, the system could achieve reasonably high performance for both domains. Although results indicate that performance for news texts is slightly better than for biology, this needs confirming with a larger test collection to obtain confidence in the conclusion. The result also highlights the need for soundly motivated metrics (e.g. see [22]) to compare the difficulties of the named entity task between marked-up corpora in different domains. In the following discussion we provide failure analysis of the results.</p>
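      <p>The F-score of equation (4) combines precision and recall as their harmonic mean. The following sketch shows the arithmetic only; the exact-match criterion on (start, end, class) spans and the function names are assumptions, since the matching criterion used for scoring is not restated here.</p>
      <preformat><![CDATA[
```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (start, end, class) spans.

    A predicted span counts as correct only if it matches a gold span
    exactly (an assumed criterion for this illustration).
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall, as in equation (4)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = {(0, 1, "PERSON"), (3, 5, "ORGANIZATION")}
    predicted = {(0, 1, "PERSON"), (3, 4, "ORGANIZATION"), (7, 8, "DATE")}
    p, r = precision_recall(predicted, gold)
    print(p, r, f_score(p, r))
```
]]></preformat>
      <p>With one of three predicted spans correct and one of two gold spans found, precision is 1/3, recall is 1/2 and the F-score is 0.4.</p>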
      <p>[15] S. Greenbaum and R. Quirk. A Student’s Grammar of the English Language. Longman, Essex, England, 1990.</p>
      <p>[16] K. Humphreys, G. Demetriou, and R. Gaizauskas. Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures. In Proceedings of the Pacific Symposium on Bio-informatics (PSB’2000), Hawai’i, USA, January 2000.</p>
      <p>[17] J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242, 1992.</p>
      <p>[18] A. Layman, E. Jung, E. Maler, H. S. Thompson, J. Paoli, J. Tigue, N. H. Mikura, and S. De Rose. XML-Data, W3C Note 05 Jan 1998. The XML-Data specifications are still developing. The latest description can be found at http://www.w3.org/TR/1998/NOTE-XML-data-0105/, January 1998.</p>
      <p>[19] MEDLINE. The PubMed database can be found at: http://www.ncbi.nlm.nih.gov/PubMed/, 1999.</p>
      <p>[20] New York University. Named Entity Task Definition, Version 2.0. This document can be found online at http://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html, May 31st 1995.</p>
      <p>[21] C. Nobata, N. Collier, and J. Tsujii. Automatic term identification and classification in biology texts. In Proceedings of the Natural Language Pacific Rim Symposium (NLPRS’2000), November 1999.</p>
      <p>[22] C. Nobata, N. Collier, and J. Tsujii. Comparison between tagged corpora for the named entity task. In Proceedings of the Association for Computational Linguistics (ACL’2000) Workshop on Comparing Corpora, Hong Kong, October 7th 2000.</p>
      <p>[23] T. Ohta, Y. Tateishi, N. Collier, C. Nobata, K. Ibushi, and J. Tsujii. A semantically annotated corpus from MEDLINE abstracts. In Proceedings of the Tenth Workshop on Genome Informatics. Universal Academy Press, Inc., 14–15 December 1999.</p>
      <p>[24] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, pages 4–16, January 1986.</p>
      <p>[25] J. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the 5th Conference on Applications of Natural Language Processing ANLP, Washington DC, pages 16–19, 1997.</p>
      <p>[26] T. Rindflesch, L. Hunter, and A. Aronson. Mining molecular binding terminology from biomedical text. In American Medical Informatics Association (AMIA)’99 annual symposium, Washington DC, USA, 1999.</p>
      <p>[27] T. Rindflesch, L. Tanabe, N. Weinstein, and L. Hunter. EDGAR: Extraction of drugs, genes and relations from the biomedical literature. In Pacific Rim Symposium on Bio-Computing 2000 (PSB’2000), January 2000.</p>
      <p>[28] S. Sekine, R. Grishman, and H. Shinnou. A Decision Tree Method for Finding and Classifying Names in Japanese Texts. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, August 1998.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>