<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Event Extraction for DNA Methylation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Tokyo</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1800</year>
      </pub-date>
      <issue>200</issue>
      <abstract>
        <p>We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. We present an annotation scheme following the representation of the recent BioNLP'09 shared task on event extraction, select a set of 200 abstracts including a balanced sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction can be performed at 78% precision and 76% recall. The introduced resources are freely available for use in research from the GENIA project homepage.<sup>1</sup></p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        During the previous decade of concentrated study
of biomedical information extraction (IE), most
efforts have focused on the foundational task of
detecting mentions of entities of interest and the
extraction of simple relations between these
entities, typically represented as undifferentiated
binary associations
        <xref ref-type="bibr" rid="ref26">(Pyysalo et al., 2008)</xref>
        . However,
in recent years there has been increased interest
in biomolecular event extraction using
representations that capture typed, structured n-ary
associations of entities in specific roles, such as
regulation of the phosphorylation of a specific domain
of a particular protein
        <xref ref-type="bibr" rid="ref3">(Ananiadou et al., 2010)</xref>
        .
The state of the art in such extraction methods
was evaluated in the BioNLP’09 Shared Task on
Event Extraction (below, BioNLP ST)
        <xref ref-type="bibr" rid="ref13 ref21 ref36">(Kim et al.,
2009)</xref>
        , and event extraction following the BioNLP
ST model has continued to draw interest also
after the task, with recent work including advances
in extraction methods
        <xref ref-type="bibr" rid="ref17 ref17 ref18 ref22 ref25 ref7">(Miwa et al., 2010a; Poon
and Vanderwende, 2010)</xref>
        , the release of extraction
system software and large-scale automatically
annotated data
        <xref ref-type="bibr" rid="ref5">(Björne et al., 2010)</xref>
        and the
development of additional annotated resources following
the event representation
        <xref ref-type="bibr" rid="ref22">(Ohta et al., 2010)</xref>
        .
      </p>
      <p>
        Of the findings of the BioNLP ST evaluation,
it is of particular interest to us that the
highest-performing methods include many that are purely
machine-learning based
        <xref ref-type="bibr" rid="ref13 ref21 ref36">(Kim et al., 2009)</xref>
        ,
learning what to extract directly from a corpus
annotated with examples of the events of interest. This
implies that state-of-the-art extraction methods for
new types of events can be created by providing
annotated resources to an existing system,
without the need for direct development of natural
language processing or IE methods. Here, we apply
this approach to DNA methylation, a specific and
biologically highly relevant entity type not
considered in previous event extraction studies.
      </p>
      <p>
        In the following, we first outline the biological
significance of DNA methylation and discuss
existing resources. We then introduce the event
extraction approach applied, present the new
annotated corpus created in this study, and report event
extraction results using a method trained on the corpus.
      </p>
      <fig>
        <label>Figure 1</label>
        <caption>
          <p>Number of PubMed citations per year of publication tagged with the MeSH term DNA Methylation (x-axis: year of publication; y-axis: DNA methylation citations).</p>
        </caption>
      </fig>
      <p>
        The term epigenetics refers to a set of
molecular mechanisms “beyond genetics” – i.e. without
change in DNA sequence – that are today
understood to play an important role in several
biological processes, including the genetic program for
development, cell differentiation and tissue-specific
gene expression. DNA methylation was first
suggested as an epigenetic mechanism for the
control of gene activity during development in 1975
        <xref ref-type="bibr" rid="ref28 ref8">(Riggs, 1975; Holliday and Pugh, 1975)</xref>
        , and the
role of DNA methylation in cancer was first
reported in 1987
        <xref ref-type="bibr" rid="ref9">(Holliday, 1987)</xref>
        . DNA
methylation of CpG islands in promoter regions is now
understood to be one of the most consistent
genetic alterations in cancer, and DNA methylation
is a prominent area of study.
      </p>
      <p>Chemically, DNA methylation is a simple
reaction adding a methyl group to a specific
position of the cytosine pyrimidine ring or the adenine purine
ring. While a single nucleotide can only be
either methylated or unmethylated, in text the
overall degree of promoter methylation is often
reported as hypo- and hyper-methylation, with
hyper-methylation implying that the expression of
a gene is silenced. Because of the precise
definition of the phenomenon and the relatively specific
terms in which it is typically discussed in
publications, we expected it to provide a well-defined
target for annotation and automatic extraction.</p>
      <sec id="sec-1-1">
        <title>2.1 DNA Methylation in PubMed</title>
        <p>
          We follow common practice in biomedical IE in
drawing texts for our corpus from PubMed
abstracts. Currently containing more than 20 million
citations for biomedical literature (over 11M with
abstracts) and growing exponentially
          <xref ref-type="bibr" rid="ref10 ref33">(Hunter and
Cohen, 2006)</xref>
          , the literature database provides a
rich resource for IE and text mining.
        </p>
        <p>
          To facilitate access to documents relevant to
specific topics, each PubMed citation is manually
assigned terms that identify its primary topics
using MeSH, a controlled vocabulary of over 25,000
terms. MeSH also contains a DNA Methylation
term, allowing specific searches for citations on
the topic. Figure 1 shows the number of citations
per year of publication matching this term
contrasted with overall citations, illustrating explosive
growth of interest in DNA methylation,
outstripping the overall growth of the literature.
Particular increases can be seen after the introduction
of DNA microarrays for monitoring gene
expression
          <xref ref-type="bibr" rid="ref30">(Schena et al., 1995)</xref>
          and the introduction of
high-throughput screening methods
          <xref ref-type="bibr" rid="ref14 ref16">(Kononen et
al., 1998; MacBeath and Schreiber, 2000)</xref>
          . The
total number of PubMed citations tagged with DNA
Methylation at the time of this writing is 15456
(14350 of which have an abstract). The large
number of documents tagged for the DNA
methylation MeSH term and the human judgments
assuring their relevance make querying for this term
a natural choice for selecting text. However,
direct PubMed query as the only selection strategy
would ignore significant existing resources,
discussed in the following.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2 DNA Methylation Databases</title>
        <p>
          A growing number of databases collating
information on DNA methylation are becoming
available. The first such database, MethDB
          <xref ref-type="bibr" rid="ref2">(Amoreira et al., 2003)</xref>
          , was introduced in 2001 and
remains actively developed. MethDB contains
PubMed citation references as evidence for
contained entries, but no more specific
identification of the expressions stating DNA methylation
events. The methPrimerDB
          <xref ref-type="bibr" rid="ref24">(Pattyn et al., 2006)</xref>
          database provides additional information on PCR
primers on top of MethDB, but does not add
further specification of the methylated gene or
text-bound annotation. PubMeth
          <xref ref-type="bibr" rid="ref23">(Ongenaert et al.,
2008)</xref>
          is a database of DNA methylation in
cancer with evidence sentences from the literature.
        </p>
        <p>
          This database stores information on cancer types
and subtypes, methylated genes and the
experimental method used to identify methylation, as
well as evidence sentences. MeInfoText
          <xref ref-type="bibr" rid="ref6">(Fang et
al., 2008)</xref>
          is a database of DNA methylation and
cancer information automatically extracted from
PubMed documents matching the query terms
human, methylation and cancer using term
co-occurrence statistics. Of the DNA Methylation
resources, only PubMeth and MeInfoText contain
text-bound annotation identifying specific spans of
characters containing the gene mention and
expressing DNA methylation in evidence sentences
supporting database entries. In this study, we
consider specifically PubMeth as a source of reference
text-bound annotations due to availability and the
ability to redistribute derived data.
          <table-wrap>
            <label>Table 1</label>
            <caption>
              <p>Example evidence sentences from PubMeth; annotated spans are shown in brackets.</p>
            </caption>
            <table>
              <tbody>
                <tr><td>a)</td><td>MS-PCR revealed the [methylation] of the [p16] gene in 10 (34%) of 29 [NSCLCs]</td></tr>
                <tr><td>b)</td><td>30% (27 of 91) of [lung tumors] showed [hypermethylation] of the 5’CpG region of the [p14ARF gene]</td></tr>
                <tr><td>c)</td><td>[Promotor hypermethylations] were detected in [O6-methylguanine-DNA methyltransferase (MGMT), RB1, estrogen receptor, p73, p16INK4a, death-associated protein kinase, p15INK4b, and p14ARF]</td></tr>
                <tr><td>d)</td><td>The promoter region of the [p16INK4] gene was [hypermethylated] in the tumor samples of the primary or metastatic site</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p>Initial text-bound annotations in PubMeth were
generated using keyword lookup, but the database
annotations are manually reviewed. Table 1 shows
example evidence sentences from PubMeth and
their annotated spans. While the PubMeth
annotation differs from the BioNLP ST representation in
a number of ways, such as not separating
coordinated entities (Table 1c) and not annotating
methylation sites (Table 1d), it provides both a
reference identifying annotation targets from a
biologically motivated perspective and a potential starting
point for full event annotation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Annotation</title>
      <p>For annotation, we adapted the representation
applied in the BioNLP ST on event extraction with
minimal changes in order to allow systems
developed for the task to be applied also for the newly
annotated corpus. Documents were selected
following the basic motivation presented above, with
reference to the requirements specified by the
annotation scheme, and some automatic
preprocessing was applied as annotator support. This section
details the annotation approach.</p>
      <sec id="sec-2-1">
        <title>3.1 Entity and Event Representation</title>
        <p>
          For the core named entity annotation, we thus
primarily follow the gene/gene product (GGP)
annotation criteria applied for the shared task data
          <xref ref-type="bibr" rid="ref13 ref21">(Ohta et al., 2009)</xref>
          . In brief, the guidelines
specify annotation of minimal contiguous spans
containing mentions of specific gene or gene product
(RNA/protein) names, where specific name is
understood to be one allowing a biologist to identify
the corresponding entry in a gene/protein database
such as Uniprot or Entrez Gene. The annotation
thus excludes e.g. names of families and
complexes. A single annotation type, Gene or gene
product, is applied without distinction between
genes and their products. In addition to the
identification of the modified gene, it is important to
identify the site of the modification. We marked
mentions of sites relevant to the events as DNA
domain or region terms following the original
GENIA term corpus annotation guidelines
          <xref ref-type="bibr" rid="ref20">(Ohta et
al., 2002)</xref>
          .
        </p>
        <p>For representing DNA methylation events, the
annotation applied to capture protein
phosphorylation events in the BioNLP ST task 2 closely
matched the needs for DNA methylation
(Figure 2). While the Site arguments of the ST
Phosphorylation events are protein domains,
machine-learning based extraction methods should be able
to associate this role with DNA domains given
training data. We thus adopted a
representation where DNA methylation events are associated
with a gene/gene product as their Theme and a
DNA domain or region as Site. Each event is also
associated with a particular span of text expressing
it, termed the event trigger.<sup>2</sup> We further initially
marked catalysts using Positive regulation events
following the BioNLP ST model, but dropped this
class of annotation as a sufficient number of
examples was not found in the corpus.</p>
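        <p>As a minimal sketch of the representation described above, the following Python fragment models an event as a trigger with a Theme and an optional Site and prints it in a standoff-like line. The class names, the type labels (Gene_or_gene_product, DNA_domain_or_region, DNA_methylation), the identifier scheme and the line layout are assumptions made for illustration; they are modelled loosely on BioNLP ST standoff files and are not taken from the released corpus. The example sentence is the one shown as Table 1d.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextBound:
    """A typed span of text: an entity mention or an event trigger."""
    ident: str   # hypothetical identifier, e.g. "T1"
    type: str    # hypothetical type label
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    text: str


@dataclass(frozen=True)
class Event:
    """An event anchored to a trigger, with a Theme and an optional Site."""
    ident: str
    trigger: TextBound
    theme: TextBound
    site: Optional[TextBound] = None

    def to_standoff(self) -> str:
        # The event type is taken from the trigger annotation.
        args = [f"{self.trigger.type}:{self.trigger.ident}",
                f"Theme:{self.theme.ident}"]
        if self.site is not None:
            args.append(f"Site:{self.site.ident}")
        return f"{self.ident}\t" + " ".join(args)


sentence = "The promoter region of the p16INK4 gene was hypermethylated"
gene = TextBound("T1", "Gene_or_gene_product", 27, 34, "p16INK4")
site = TextBound("T2", "DNA_domain_or_region", 4, 19, "promoter region")
trigger = TextBound("T3", "DNA_methylation", 44, 59, "hypermethylated")
event = Event("E1", trigger, gene, site)

# Sanity check: the offsets must reproduce the annotated strings.
for tb in (gene, site, trigger):
    assert sentence[tb.start:tb.end] == tb.text

print(event.to_standoff())
```
]]></preformat>
        <p>Keeping the Site optional mirrors the annotation, in which many events identify only the Theme.</p>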
        <p>
          The event types of the BioNLP ST are drawn
from the GENIA Event ontology
          <xref ref-type="bibr" rid="ref12">(Kim et al.,
2008)</xref>
          , which in turn draws its type definitions
from the community-standard Gene Ontology
(GO)
          <xref ref-type="bibr" rid="ref34">(The Gene Ontology Consortium, 2000)</xref>
          . To
maintain compatibility with these resources, we
opted to follow the GO also for the definition of
the new event type considered here.
          <fn>
            <label>2</label>
            <p>Annotators were instructed to always mark some trigger expression. We note that while we do not here specifically distinguish hypo- and hyper-methylation, the trigger annotations are expected to facilitate adding these distinctions if necessary.</p>
          </fn>
          GO defines
DNA methylation as
        </p>
        <disp-quote>
          <p>The covalent transfer of a methyl group to either N-6 of adenine or C-5 or N-4
of cytosine.</p>
        </disp-quote>
        <p>We note that while the definition may appear
restrictive, methylation of adenine N-6 or cytosine
C-5/N-4 encompasses the entire set of ways in
which DNA can be methylated. This definition
could thus be adopted without limitation to the
scope of the annotation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Document Selection</title>
        <p>
          The selection of source documents for an
annotated corpus is critical for assuring that the
corpus provides relevant and representative material
for studying the phenomena of interest. Domain
corpora frequently consist of documents from a
particular subdomain of interest: for example, the
GENIA corpus focuses on documents concerning
transcription factors in human blood cells
          <xref ref-type="bibr" rid="ref20">(Ohta et
al., 2002)</xref>
          . Methods trained and evaluated on such
focused resources will not necessarily generalize
well to broader domains. However, there has been
little study of the effect of document selection on
event extraction performance. Here, we applied
two distinct strategies to get a representative
sample of the full scope of DNA methylation events in
the literature and to assure that our annotations are
relevant to the interests of biologists.
        </p>
        <p>In the first strategy, we aimed in particular to
select a representative sample of documents
relevant to the targeted event types. For this
purpose, we directly searched the PubMed literature
database. We further decided not to include any
text-based query in the search to avoid biasing
the selection toward particular entities or forms
of event expression. Instead, we only queried for
the single MeSH term DNA Methylation. While
this search is expected to provide high-precision
results for the full topic, not all such documents
necessarily discuss events where specific genes are
methylated. In initial efforts to annotate a random
sample of these documents, we found that many
did not mention specific gene names. To reduce
wasted effort in examining documents that contain
no markable events, we added a filter requiring a
minimum number of (likely) gene mentions. We
first tagged all 14350 citations tagged with DNA
Methylation that have an abstract in PubMed
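using the BANNER tagger (Leaman and Gonzalez,
2008).</p>
        <fig>
          <label>Figure 3</label>
          <caption>
            <p>Distribution of the number of tagged gene/protein mentions per abstract (x-axis: number of gene/protein mentions, 0 to 40).</p>
          </caption>
        </fig>
        <p>We found that while the overwhelmingly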
most frequent number of tagged mentions per
document is zero, a substantial mass of abstracts have
large mention counts (Figure 3).<sup>3</sup> We decided
after brief preliminary experiments to filter the
initial selection of documents to include only those
in which at least 5 gene/protein mentions were
marked by an automatic tagger. This excludes
most documents without markable events without
introducing obvious other biases.</p>
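        <p>The filtering step can be illustrated with a short sketch. It assumes that the tagger output is already available as a mapping from document identifier to a list of tagged mention strings; the function names, document identifiers and mentions below are invented for illustration, and only the threshold of 5 mentions comes from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Dict, List

MIN_MENTIONS = 5  # threshold stated in the text


def select_documents(mentions: Dict[str, List[str]],
                     min_mentions: int = MIN_MENTIONS) -> List[str]:
    """Return IDs of documents with at least `min_mentions` tagged mentions."""
    return sorted(doc_id for doc_id, found in mentions.items()
                  if len(found) >= min_mentions)


def mention_histogram(mentions: Dict[str, List[str]]) -> Counter:
    """Count how many documents have each number of tagged mentions."""
    return Counter(len(found) for found in mentions.values())


# Toy tagger output (invented document IDs and mentions).
tagged = {
    "doc-a": [],
    "doc-b": ["p16", "MGMT"],
    "doc-c": ["p16", "p14ARF", "RB1", "p73", "MGMT"],
    "doc-d": ["p16"] * 7,
}
selected = select_documents(tagged)
histogram = mention_histogram(tagged)
print(selected)  # ['doc-c', 'doc-d']
```
]]></preformat>
        <p>The histogram helper corresponds to the kind of distribution plotted in Figure 3 and can be used to inspect how many documents a given threshold would discard.</p>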
        <p>In the second strategy, we extended and
completed the annotation of a random selection of
PubMeth evidence sentences, aiming to leverage
existing resources and to select documents that
had been previously judged relevant to the
interests of biologists studying the topic. This provides
an external definition of document relevance and
allows us to estimate to what extent the applied
annotation strategy can capture biologically relevant
statements. This strategy is also expected to select
a concentrated, event-rich set of documents.
However, the selection may also necessarily carry over
biases toward particular subsets of relevant
documents from the original selection and will not be a
representative sample of the overall distribution of
such documents in the literature.</p>
        <p>For producing the largest number of event
annotations with the least effort, the most efficient
way to use the PubMeth data would have been to
simply extract the evidence sentences and
complete the annotation for these. However,
viewing the context in which event statements occur
as centrally important, we opted to annotate
complete abstracts, with initial annotations from
PubMeth evidence sentences automatically transferred
into the abstracts. We note that not all PubMeth
evidence spans were drawn from abstracts, and
not all of those that were matched a contiguous span of
text. We could align PubMeth evidence
annotations to 667 PubMed abstracts (approximately
57% of the PMIDs referenced in PubMeth)
and completed event annotation for a random
sample of these.
          <fn>
            <label>3</label>
            <p>The tagger has been evaluated at 86% F-score on a broad-coverage corpus, suggesting this is unlikely to severely misestimate the true distribution.</p>
          </fn>
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Document Preprocessing</title>
        <p>
          To reduce annotation effort, we applied
automatic systems to produce initial candidate
sentence boundaries and GGP annotations for the
corpus. For sentence splitting, we applied the
GENIA sentence splitter<sup>4</sup>, and for gene/protein
tagging, we applied the BANNER NER system
          <xref ref-type="bibr" rid="ref12 ref15 ref19 ref26 ref31">(Leaman and Gonzalez, 2008)</xref>
          trained on the
GENETAG corpus
          <xref ref-type="bibr" rid="ref32">(Tanabe et al., 2005)</xref>
          . The GENETAG
guidelines and gene/protein entity annotation
coverage are known to differ from those applied for
GGP annotation here
          <xref ref-type="bibr" rid="ref36">(Wang et al., 2009)</xref>
          .
However, the broad coverage of PubMed provided by
GENETAG suggests that taggers trained on the
corpus are likely to generalize to new subdomains
such as that considered here. By contrast, all
annotations following GGP guidelines that we are
aware of are subdomain-specific.
        </p>
        <p>We note that all annotations in the produced
corpus are at a minimum confirmed by a human
annotator and that events are annotated without
performing initial automatic tagging to assure that no
bias toward particular extraction methods or
approaches is introduced.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Results</title>
      <sec id="sec-3-1">
        <title>4.1 Corpus Statistics</title>
        <p>
          Corpus statistics are given in Table 2. There
are some notable differences between the
subcorpora created using the different selection
strategies. While the subcorpora are similar in size,
the PubMeth GGP count is 1.4 times that of the
PubMed subcorpus<sup>5</sup>, yet roughly equal numbers
of methylation sites are annotated in the two. This
difference is even more pronounced in the
statistics for event arguments, where two thirds of
PubMeth subcorpus events contain only a Theme
argument identifying the GGP, while events where
both Theme and Site are identified are more
frequent in the other subcorpus.<sup>6</sup>
          <fn>
            <label>4</label>
            <p>http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/</p>
          </fn>
          <fn>
            <label>5</label>
            <p>The differences in the number of GGP annotations may be affected by the PubMeth entity annotation criteria.</p>
          </fn>
          As the extraction of
events specifying also sites is known to be
particularly challenging
          <xref ref-type="bibr" rid="ref13 ref21 ref36">(Kim et al., 2009)</xref>
          , these
statistics suggest the PubMed subcorpus may
represent a more difficult extraction task. Only very
few DNA demethylation events are found in
either subcorpus. Overall, the PubMeth subcorpus
contains nearly twice as many event annotations as
the PubMed one, indicating that the focused
document selection strategy was successful in
identifying particularly event-rich abstracts.
To measure the consistency of the produced
annotation, we performed independent double
annotation for a sample of 40% of the abstracts selected
from the PubMed subcorpus (20% of all abstracts).
As the PubMed subcorpus event annotation is
created without initial human annotation as reference
(unlike the PubMeth subcorpus), agreement is
expected to be lower on this subcorpus. This
experiment should thus provide a lower bound on the
overall consistency of the corpus.
        </p>
        <p>We first measured agreement on the gene/gene
product (GGP) entity annotation, and found very
high agreement among 935 entities marked in
total by the two annotators: 91% F-score using exact
match criteria and 97% F-score using the relaxed
“overlap” criterion where any two overlapping
annotations are considered to match.<sup>7</sup> We then
separately measured agreement on event annotations
for those events that involved GGPs on which the
annotators agreed, using the standard evaluation
criteria described in Section 4.4. Agreement on
event annotations was also high: 84% F-score
overall (85% for DNA methylation and 75% for
DNA demethylation) over a total of 442 annotated
events.
          <fn>
            <label>6</label>
            <p>The number of annotated sites is less than the number of events with a Site argument as the annotation criteria only call for annotating a site entity when it is referred to from an event, and multiple events can refer to the same site entity.</p>
          </fn>
          <fn>
            <label>7</label>
            <p>The high agreement is not due to annotators simply agreeing with the automatic initial annotation: the F-score of the automatic tagger against the two sets of human annotations was 65%/66% for exact and 85%/86% for overlap match.</p>
          </fn>
        </p>
        <p>The overall consistency of the annotation
depends on joint annotator agreement on the GGP
and event annotations. However, in experimental
settings such as that of the BioNLP ST where gold
GGP annotation is assumed as the starting point
for event extraction, measured performance is not
affected by agreement on GGPs and thus arguably
only the latter factor applies. As this setting is
adopted also in the present study, annotation
consistency suggests a human upper bound no lower
than 84% F-score on extraction performance.</p>
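        <p>For readers who wish to reproduce this kind of agreement figure on their own data, the following sketch computes an F-score between two annotators' entity spans under exact and relaxed (overlap) matching. It is an illustration under stated assumptions, not the script used in this study: spans are represented as character-offset pairs, one annotator is treated as the reference, and the toy spans are invented.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def agreement_f_score(first: List[Span], second: List[Span],
                      relaxed: bool = False) -> float:
    """F-score of `first` against `second`, treating `second` as reference."""
    if not first or not second:
        return 0.0
    match = overlaps if relaxed else (lambda a, b: a == b)
    # Precision: share of first's spans matched by some span of second.
    hit_first = sum(any(match(a, b) for b in second) for a in first)
    # Recall: share of second's spans matched by some span of first.
    hit_second = sum(any(match(b, a) for a in first) for b in second)
    precision = hit_first / len(first)
    recall = hit_second / len(second)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented toy annotations from two annotators.
annotator_1 = [(0, 3), (10, 17), (25, 30)]
annotator_2 = [(0, 3), (10, 15), (40, 44)]
exact = agreement_f_score(annotator_1, annotator_2)
relaxed = agreement_f_score(annotator_1, annotator_2, relaxed=True)
print(round(exact, 3), round(relaxed, 3))
```
]]></preformat>
        <p>In the toy data one span pair differs only in its right boundary, so it counts as a match under the overlap criterion but not under exact match, which is the kind of difference that separates the two reported GGP agreement figures.</p>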
        <p>
          Estimates of the annotation consistency of
biomedical domain corpora are regrettably seldom
provided, and to the best of our knowledge ours is
the first estimate of inter-annotator agreement for
a corpus following the event representation of the
BioNLP ST. Given the complexity of the
annotation – typed associations of event trigger, theme
and site – the agreement compares favorably to
e.g. the 67% inter-annotator F-score
reported for protein-protein interactions on the ITI
TXM corpora
          <xref ref-type="bibr" rid="ref1">(Alex et al., 2008)</xref>
          and the full event
agreement on the GREC corpus
          <xref ref-type="bibr" rid="ref35">(Thompson et al.,
2009)</xref>
          .
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>4.3 Event Extraction Method</title>
        <p>
          To estimate the feasibility of automatic
extraction of DNA methylation events and the
suitability of presently available event extraction
methods to this task, we performed experiments using
the EventMine event extraction system of
          <xref ref-type="bibr" rid="ref17 ref18 ref22">(Miwa
et al., 2010b)</xref>
          . On task 2 of the BioNLP
ST dataset, the benchmark most relevant to our
task setting, the applied version of EventMine was
recently evaluated at 55% F-score
          <xref ref-type="bibr" rid="ref17 ref18 ref22">(Miwa et al.,
2010a)</xref>
          , outperforming the best task 2 system in
the original shared task
          <xref ref-type="bibr" rid="ref27">(Riedel et al., 2009)</xref>
          by
more than 10 percentage points. To the best of our
knowledge, this system represents the state of the art for
this event extraction task.
        </p>
        <p>
          EventMine is an SVM-based machine learning
system following the pipeline design of the best
system in the BioNLP ST
          <xref ref-type="bibr" rid="ref4">(Björne et al., 2009)</xref>
          ,
extending it with refinements to the feature set,
the use of a machine learning module for
complex event construction, and the use of two parsers
for syntactic analysis
          <xref ref-type="bibr" rid="ref17 ref18 ref22">(Miwa et al., 2010b)</xref>
          . We
follow Miwa et al. in applying the HPSG-based
deep parser Enju
          <xref ref-type="bibr" rid="ref12 ref15 ref19 ref26 ref31">(Miyao and Tsujii, 2008)</xref>
          using
the high-speed parsing setting (“mogura”) and the
GDep
          <xref ref-type="bibr" rid="ref29">(Sagae and Tsujii, 2007)</xref>
          native dependency
parser, both with biomedical domain models based
on the GENIA treebank data
          <xref ref-type="bibr" rid="ref33">(Tateisi et al., 2006)</xref>
          .
        </p>
        <p>For evaluation, we applied a version of the
BioNLP’09 ST evaluation tools<sup>8</sup> modified to
recognize the novel DNA methylation event type.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.4 Evaluation Criteria</title>
        <p>We followed the basic task setup and primary
evaluation criteria of the BioNLP’09 ST. Specifically,
we followed task 2 (“event enrichment”) criteria,
requiring for correct extraction of a DNA
methylation event both the identification of the
modified gene (GGP entity) and the identification of
the modification site (DNA domain or region
entity) when stated. As in the shared task, human
annotation for GGP entities was provided as part
of the system input but other entities were not, so
that the system was required to identify the spans
of the mentioned modification sites.</p>
        <p>The performance of the system was
evaluated using the standard precision, recall and
F-score metrics for the recovery of events, with
event equality defined following the
“Approximate span” matching criterion applied in the
primary evaluation for the BioNLP’09 ST. This
criterion relaxes strict matching requirements so that
a detected event trigger or entity is considered to
match a gold trigger/entity if its span is entirely
contained within the span of the gold trigger/entity,
extended by one word both to the left and to the right.</p>
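        <p>The approximate span criterion can be sketched as follows. This is a simplified illustration rather than the code of the shared task evaluation tools: it assumes whitespace tokenization when extending the gold span by one word, and all function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def word_spans(text: str) -> List[Span]:
    """Whitespace tokenization (an assumption; the ST tools may differ)."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def extend_by_one_word(text: str, gold: Span) -> Span:
    """Widen `gold` to include one whole word on each side, if present."""
    start, end = gold
    words = word_spans(text)
    before = [s for s, e in words if e <= start]
    after = [e for s, e in words if s >= end]
    return (before[-1] if before else start, after[0] if after else end)


def approximate_match(text: str, predicted: Span, gold: Span) -> bool:
    """True if `predicted` lies entirely inside the extended gold span."""
    lo, hi = extend_by_one_word(text, gold)
    return lo <= predicted[0] and predicted[1] <= hi


def span_of(text: str, phrase: str) -> Span:
    """Offsets of the first occurrence of `phrase` (helper for the demo)."""
    start = text.index(phrase)
    return (start, start + len(phrase))


text = "hypermethylation of the p16 promoter was observed"
gold = span_of(text, "p16")

print(approximate_match(text, span_of(text, "the p16"), gold))
print(approximate_match(text, span_of(text, "of the p16"), gold))
```
]]></preformat>
        <p>In this example a predicted span covering "the p16" or "p16 promoter" is accepted for the gold span "p16", while "of the p16" reaches two words to the left and is rejected.</p>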
      </sec>
      <sec id="sec-3-4">
        <title>4.5 Experimental Setup</title>
        <p>We divided the corpus into three parts, first setting
one third of the abstracts aside as a held-out test
set and then splitting the remaining two thirds in a
roughly 3:1 ratio into a training set and a
development test set, giving 100 abstracts for training, 34
for development, and 66 for final test. The splits
were performed randomly, but sampling so that
each set has an equal number of abstracts drawn
from the PubMeth and PubMed subcorpora.</p>
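        <p>A random split with equal representation of the two subcorpora can be sketched as below. The document identifiers are placeholders, the function name and the fixed seed are illustrative choices, and the figure of 100 abstracts per subcorpus is an inference from the 200-abstract total and the similar subcorpus sizes rather than a number stated in this section.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import Dict, List


def stratified_split(subcorpora: Dict[str, List[str]],
                     per_set: Dict[str, int],
                     seed: int = 0) -> Dict[str, List[str]]:
    """Randomly split documents so that each set takes the same number
    of documents from every subcorpus."""
    rng = random.Random(seed)
    split: Dict[str, List[str]] = {name: [] for name in per_set}
    for docs in subcorpora.values():
        shuffled = docs[:]
        rng.shuffle(shuffled)
        offset = 0
        for name, size in per_set.items():
            split[name].extend(shuffled[offset:offset + size])
            offset += size
    return split


# Placeholder document IDs; 100 abstracts per subcorpus is an assumption.
corpus = {
    "PubMed": [f"pubmed-{i}" for i in range(100)],
    "PubMeth": [f"pubmeth-{i}" for i in range(100)],
}
sizes = {"train": 50, "dev": 17, "test": 33}  # documents per subcorpus
parts = stratified_split(corpus, sizes)
print({name: len(docs) for name, docs in parts.items()})
# {'train': 100, 'dev': 34, 'test': 66}
```
]]></preformat>
        <p>Shuffling each subcorpus separately and slicing fixed-size blocks guarantees the 100/34/66 totals while keeping the two sources balanced within every set.</p>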
        <p>The EventMine system has a single tunable
threshold parameter that controls the tradeoff
between system precision and recall. We first set
the tradeoff using a sparse search of the
parameter space [0, 1], evaluating the performance of the
system by training on the training set and
evaluating on the development set. As these experiments
did not indicate any other parameter setting could
provide significantly better performance, we chose
the default threshold setting of 0.5. To study the
effect of training data size on performance, we
performed extraction experiments randomly
downsampling the training data on the document level
with testing on the development set. In final
experiments EventMine was trained on the combined
training and development data and performance
evaluated on the held-out test data.
          <fn>
            <label>8</label>
            <p>http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/downloads.shtml</p>
          </fn>
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>4.6 Extraction performance</title>
        <p>Table 3 shows extraction results on the held-out
test data. While DNA methylation events could
be extracted quite reliably, the system performed
poorly for DNA demethylation events. The latter
result is perhaps not surprising given their small
number – only 38 in total in the corpus – and
indicates that a separate selection strategy is necessary
to provide resources for learning the reverse
reaction. Overall performance shows a small
preference for precision over recall at 77% F-score. We
view this level of performance as very good for a first
result.</p>
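        <p>The overall figures are mutually consistent, as a quick check of the arithmetic shows. The sketch below uses the balanced F-score (harmonic mean of precision and recall) with the 78% precision and 76% recall stated in the abstract; the function name is illustrative.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
overall = f_score(0.78, 0.76)
print(f"{overall:.1%}")  # 77.0%
```
]]></preformat>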
        <p>To evaluate the relative difficulty of the
extraction tasks that the two subcorpora represent and
their merits as training material, we performed
tests separating the two (Table 4). As predicted
from corpus statistics (Section 4.1), the PubMed
subcorpus represents the more challenging
extraction task. When testing on a single subcorpus,
results are, unsurprisingly, better when training data
is drawn from the same subcorpus; however,
training on the combined data gives the best
performance for all three test sets, indicating that the
subcorpora are compatible.</p>
        <fig>
          <label>Figure 4</label>
          <caption>
            <p>Learning curve: F-score (y-axis) against the fraction of training data in % (x-axis) for the test sets Both, PubMed and PubMeth.</p>
          </caption>
        </fig>
        <p>The learning curve (Figure 4) shows
relatively high performance and rapid improvement
for modest amounts of data, but performance
improvement with additional data levels out
relatively fast, nearly flattening as use of the training
data approaches 100%. This suggests that
extraction performance for this task is not primarily
limited by training data size and that additional
annotation following the same protocol is unlikely
to yield notable improvement in F-score without
a substantial investment of resources. As
performance for the PubMed subcorpus (for which
inter-annotator agreement was measured) is not yet
approaching the limit implied by the corpus
annotation consistency (Section 4.2), the results suggest
further need for the development of event
extraction methods to improve DNA methylation event
extraction.</p>
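        <p>The procedure behind a learning curve of this kind can be sketched as
follows. This is a minimal illustration and not necessarily the exact setup of
our experiments: it assumes nested random subsets of the training documents,
and train_and_score is a hypothetical caller-supplied function that trains a
system on the given documents and returns its F-score on a fixed test set. The
plateau check and its tolerance are likewise assumptions made for the
example.</p>
        <preformat>
```python
import random


def learning_curve(train_docs, fractions, train_and_score, seed=0):
    """Score a model trained on growing, nested subsets of train_docs.

    train_and_score is a hypothetical callable: it takes a list of
    documents and returns an F-score measured on a fixed test set.
    """
    docs = list(train_docs)
    # Shuffle once with a fixed seed so that each subset contains the previous one.
    random.Random(seed).shuffle(docs)
    curve = []
    for frac in fractions:
        n = max(1, round(len(docs) * frac))
        curve.append((frac, train_and_score(docs[:n])))
    return curve


def has_plateaued(curve, tolerance=0.01):
    """True if the last step of the curve gained less than tolerance."""
    if len(curve) in (0, 1):
        return False
    gain = curve[-1][1] - curve[-2][1]
    return tolerance > gain
```
        </preformat>
        <p>A curve whose final step gains less than the chosen tolerance
corresponds to the flattening described above, where further annotation under
the same protocol is unlikely to pay off.</p>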
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        DNA methylation and related epigenetic
mechanisms of gene expression control have been
a focus of considerable recent research in
biomedicine. There are many excellent reviews of
this broad field; we refer the interested reader to
        <xref ref-type="bibr" rid="ref11 ref12 ref15 ref19 ref2 ref26 ref31">(Jaenisch and Bird, 2003; Suzuki and Bird, 2008)</xref>
        .
      </p>
      <p>
        There is also a wealth of recent related work
on event extraction. In the BioNLP’09 shared
task, 24 teams participated in the primary task and
six teams in Task 2, which most closely resembles our
setup in that it also required the detection of
the modified gene/protein and the modification site. The
top-performing system in Task 2
        <xref ref-type="bibr" rid="ref27">(Riedel et al., 2009)</xref>
        achieved 44% F-score, and the highest
performance reported since that we are aware of is 55%
F-score for EventMine
        <xref ref-type="bibr" rid="ref17 ref18 ref22">(Miwa et al., 2010b)</xref>
        . The
performance we achieved for DNA methylation is
considerably better than this overall result,
essentially matching the best reported performance for
Phosphorylation events, which we previously
argued to be the closest shared task analogue to the
new event category studied here. Nevertheless,
direct comparison of these results may not be
meaningful due to confounding factors. The only text
mining effort specifically targeting DNA
methylation that we are aware of is that performed for
the initial annotation of the PubMeth and
MeInfoText databases
        <xref ref-type="bibr" rid="ref23 ref6">(Ongenaert et al., 2008; Fang et
al., 2008)</xref>
        , both applying approaches based on
keyword matching. However, neither of these
studies reports results for instance-level extraction of
methylation statements.
      </p>
      <p>
        The present study is in many respects
similar to our previous work targeting protein
post-translational modification events
        <xref ref-type="bibr" rid="ref22">(Ohta et al.,
2010)</xref>
        . In that work, we annotated 422 events
of 7 different types and showed that retraining
an existing event extraction system allowed these
to be extracted at 42% F-score. Our approach
here clearly differs from this previous work in its
larger scale and concentrated focus on a
particular event type of high interest, which is also reflected in the
results: while extraction performance in our
previous work was limited by training data size, in
the present study notably higher extraction
performance was achieved and a plateau in performance
with increasing data was reached.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Future Work</title>
      <p>We have presented a study of the automatic
extraction of DNA methylation events from the
literature, following the BioNLP’09 shared task event
representation and using a state-of-the-art event
extraction system. We created a corpus of 200
publication abstracts selected to include a representative
sample of DNA methylation statements from all of
PubMed and manually annotated it for nearly 3000
mentions of genes and gene products, 500 DNA
domain or region mentions and 1500 DNA
methylation and demethylation events. Evaluation using
the EventMine system showed that DNA
methylation events can be extracted at 78%
precision and 76% recall simply by retraining an
off-the-shelf event extraction system. The learning curve
suggested that the corpus size is sufficient and that
future efforts in DNA methylation event extraction
should focus on extraction method development.</p>
      <p>
        One natural direction for future work is to
apply event extraction systems trained on the newly
introduced data to abstracts available in PubMed
and full texts available at PMC to create a detailed,
up-to-date repository of DNA methylation events
at full literature scale. Such an effort would
require gene name normalization and event
extraction at PubMed scale, both of which have recently
been shown to be technically feasible
        <xref ref-type="bibr" rid="ref5 ref7">(Gerner et
al., 2010; Björne et al., 2010)</xref>
        . Further combining
the extracted events with cancer mention detection
could provide a valuable resource for epigenetics
research.
      </p>
      <p>The newly annotated corpus, the first
resource annotated for DNA methylation using
the event representation, is freely available
for use in research from the GENIA
project homepage, http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank Maté Ongenaert and other
creators of PubMeth for their generosity in
allowing the release of resources building on their
work and the anonymous reviewers for their many
insightful comments. This work was supported
by Grant-in-Aid for Specially Promoted Research
(MEXT, Japan).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Bea</given-names>
            <surname>Alex</surname>
          </string-name>
          , Claire Grover, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Stuart Roebuck, Richard Tobin, and
          <string-name>
            <given-names>Xinglong</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The ITI TXM corpora: Tissue expressions and protein-protein interactions</article-title>
          .
          <source>In Proceedings of LREC'08.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Celine</given-names>
            <surname>Amoreira</surname>
          </string-name>
          , Winfried Hindermann, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Grunau</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>An improved version of the DNA methylation database (MethDB)</article-title>
          .
          <source>Nucl. Acids Res</source>
          .,
          <volume>31</volume>
          (
          <issue>1</issue>
          ):
          <fpage>75</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Sophia</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          , Sampo Pyysalo,
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          , and
          <string-name>
            <surname>Douglas</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Kell</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Event extraction for systems biology by text mining the literature</article-title>
          .
          <source>Trends in Biotechnology</source>
          ,
          <volume>28</volume>
          (
          <issue>7</issue>
          ):
          <fpage>381</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jari</given-names>
            <surname>Björne</surname>
          </string-name>
          , Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, and
          <string-name>
            <given-names>Tapio</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Extracting complex biological events with rich graph-based feature sets</article-title>
          .
          <source>In Proceedings of BioNLP'09 Shared Task</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jari</given-names>
            <surname>Björne</surname>
          </string-name>
          , Filip Ginter, Sampo Pyysalo,
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
            , and
            <given-names>Tapio</given-names>
          </string-name>
          <string-name>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Scaling up biomedical event extraction to the entire PubMed</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Yu-Ching</surname>
            <given-names>Fang</given-names>
          </string-name>
          , Hsuan-Cheng Huang, and
          <string-name>
            <surname>Hsueh-Fen Juan</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>MeInfoText: associated gene methylation and cancer information from text mining</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Gerner</surname>
          </string-name>
          , Goran Nenadic, and
          <string-name>
            <surname>Casey</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bergman</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>An exploration of mining gene expression mentions and their anatomical locations from biomedical text</article-title>
          .
          <source>In Proceedings of BioNLP 2010</source>
          , pages
          <fpage>72</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Robin</given-names>
            <surname>Holliday</surname>
          </string-name>
          and
          <string-name>
            <given-names>JE</given-names>
            <surname>Pugh</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>DNA modification mechanisms and gene activity during development</article-title>
          .
          <source>Science</source>
          ,
          <volume>187</volume>
          :
          <fpage>226</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Robin</given-names>
            <surname>Holliday</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>The inheritance of epigenetic defects</article-title>
          .
          <source>Science</source>
          ,
          <volume>238</volume>
          :
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Hunter</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. Bretonnel</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Biomedical language processing: What's beyond PubMed?</article-title>
          <source>Molecular Cell</source>
          ,
          <volume>21</volume>
          (
          <issue>5</issue>
          ):
          <fpage>589</fpage>
          -
          <lpage>594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Jaenisch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adrian</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals</article-title>
          .
          <source>Nature Genetics</source>
          ,
          <volume>33</volume>
          :
          <fpage>245</fpage>
          -
          <lpage>254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Jin-Dong</surname>
            <given-names>Kim</given-names>
          </string-name>
          , Tomoko Ohta, and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Corpus annotation for mining biomedical events from literature</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>10</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Jin-Dong</surname>
            <given-names>Kim</given-names>
          </string-name>
          , Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Overview of BioNLP'09 shared task on event extraction</article-title>
          .
          <source>In Proceedings of BioNLP'09.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Juha</given-names>
            <surname>Kononen</surname>
          </string-name>
          , Lukas Bubendorf, Anne Kallioniemi, Maarit Bärlund,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Schraml</surname>
          </string-name>
          , Stephen Leighton, Joachim Torhorst, Michael J Mihatsch, Guido Sauter, and Olli-P. Kallioniemi.
          <year>1998</year>
          .
          <article-title>Tissue microarrays for high-throughput molecular profiling of tumor specimens</article-title>
          .
          <source>Nat Med</source>
          ,
          <volume>4</volume>
          (
          <issue>7</issue>
          ):
          <fpage>844</fpage>
          -
          <lpage>847</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Leaman</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>BANNER: An executable survey of advances in biomedical named entity recognition</article-title>
          .
          <source>In Proceedings of PSB'08</source>
          , pages
          <fpage>652</fpage>
          -
          <lpage>663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Gavin MacBeath and Stuart L. Schreiber</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Printing Proteins as Microarrays for High-Throughput Function Determination</article-title>
          .
          <source>Science</source>
          ,
          <volume>289</volume>
          (
          <issue>5485</issue>
          ):
          <fpage>1760</fpage>
          -
          <lpage>1763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Miwa</surname>
          </string-name>
          , Sampo Pyysalo, Tadayoshi Hara, and Jun'ichi Tsujii.
          <year>2010a</year>
          .
          <article-title>A comparative study of syntactic parsers for event extraction</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Miwa</surname>
          </string-name>
          , Rune Saetre,
          <string-name>
            <surname>Jin-Dong Kim</surname>
          </string-name>
          , and Jun'ichi Tsujii.
          <year>2010b</year>
          .
          <article-title>Event extraction with complex event classification using rich features</article-title>
          .
          <source>Journal of Bioinformatics and Computational Biology (JBCB)</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>131</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Yusuke</given-names>
            <surname>Miyao</surname>
          </string-name>
          and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Feature forest models for probabilistic HPSG parsing</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):
          <fpage>35</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Tomoko</given-names>
            <surname>Ohta</surname>
          </string-name>
          , Yuka Tateisi, Hideki Mima, and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>GENIA corpus: An annotated research abstract corpus in molecular biology domain</article-title>
          .
          <source>In Proceedings of HLT'02</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Tomoko</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jin-Dong</surname>
            <given-names>Kim</given-names>
          </string-name>
          , Sampo Pyysalo,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Incorporating GENETAG-style annotation to GENIA corpus</article-title>
          .
          <source>In Proceedings of BioNLP'09</source>
          , pages
          <fpage>106</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Tomoko</given-names>
            <surname>Ohta</surname>
          </string-name>
          , Sampo Pyysalo, Makoto Miwa,
          <string-name>
            <surname>Jin-Dong Kim</surname>
          </string-name>
          , and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Event extraction for post-translational modifications</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Maté</given-names>
            <surname>Ongenaert</surname>
          </string-name>
          , Leander Van Neste, Tim De Meyer, Gerben Menschaert, Sofie Bekaert, and Wim Van Criekinge.
          <year>2008</year>
          .
          <article-title>PubMeth: a cancer methylation database combining text-mining and expert annotation</article-title>
          .
          <source>Nucl. Acids Res</source>
          .,
          <volume>36</volume>
          (
          <issue>suppl 1</issue>
          ):
          <fpage>D842</fpage>
          -
          <lpage>846</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Filip</given-names>
            <surname>Pattyn</surname>
          </string-name>
          , Jasmien Hoebeeck, Piet Robbrecht, Evi Michels, Anne De Paepe, Guy Bottu, David Coornaert, Robert Herzog, Frank Speleman, and
          <string-name>
            <given-names>Jo</given-names>
            <surname>Vandesompele</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>methBLAST and methPrimerDB: web-tools for PCR based methylation analysis</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>496</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Hoifung</given-names>
            <surname>Poon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lucy</given-names>
            <surname>Vanderwende</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Joint inference for knowledge extraction from biomedical literature</article-title>
          .
          <source>In Proceedings of NAACL/HLT'10</source>
          , pages
          <fpage>813</fpage>
          -
          <lpage>821</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Sampo</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , Antti Airola, Juho Heimonen, and Jari Björne.
          <year>2008</year>
          .
          <article-title>Comparative analysis of five protein-protein interaction corpora</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl</issue>
          . 3):
          <fpage>S6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hong-Woo</surname>
            <given-names>Chun</given-names>
          </string-name>
          , Toshihisa Takagi, and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>A Markov logic approach to biomolecular event extraction</article-title>
          .
          <source>In Proceedings of BioNLP'09 Shared Task</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>A.D.</given-names>
            <surname>Riggs</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>X inactivation, differentiation, and DNA methylation</article-title>
          .
          <source>Cytogenetic and Genome Research</source>
          ,
          <volume>14</volume>
          :
          <fpage>9</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Kenji</given-names>
            <surname>Sagae</surname>
          </string-name>
          and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Dependency parsing and domain adaptation with LR models and parser ensembles</article-title>
          .
          <source>In Proceedings of EMNLP-CoNLL</source>
          <year>2007</year>
          , pages
          <fpage>1044</fpage>
          -
          <lpage>1050</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Mark</given-names>
            <surname>Schena</surname>
          </string-name>
          , Dari Shalon, Ronald W. Davis, and
          <string-name>
            <surname>Patrick O. Brown</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray</article-title>
          .
          <source>Science</source>
          ,
          <volume>270</volume>
          (
          <issue>5235</issue>
          ):
          <fpage>467</fpage>
          -
          <lpage>470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Miho M. Suzuki</surname>
            and
            <given-names>Adrian</given-names>
          </string-name>
          <string-name>
            <surname>Bird</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>DNA methylation landscapes: provocative insights from epigenomics</article-title>
          .
          <source>Nature Reviews Genetics</source>
          ,
          <volume>9</volume>
          :
          <fpage>465</fpage>
          -
          <lpage>476</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Tanabe</surname>
          </string-name>
          , Natalie Xie, Lynne H Thom, Wayne Matten, and
          <string-name>
            <given-names>W John</given-names>
            <surname>Wilbur</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>GENETAG: A tagged corpus for gene/protein named entity recognition</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>6</volume>
          (
          <issue>Suppl</issue>
          . 1):
          <fpage>S3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Yuka</given-names>
            <surname>Tateisi</surname>
          </string-name>
          , Yoshimasa Tsuruoka, and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Subdomain adaptation of a POS tagger with a small corpus</article-title>
          .
          <source>In Proceedings of BioNLP'06</source>
          , pages
          <fpage>136</fpage>
          -
          <lpage>137</lpage>
          , New York, USA, June.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>The</given-names>
            <surname>Gene Ontology Consortium</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          ,
          <volume>25</volume>
          :
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Thompson</surname>
          </string-name>
          , Syed Iqbal,
          <string-name>
            <surname>John McNaught</surname>
            , and
            <given-names>Sophia</given-names>
          </string-name>
          <string-name>
            <surname>Ananiadou</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Construction of an annotated corpus to support biomedical information extraction</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>10</volume>
          (
          <issue>1</issue>
          ):
          <fpage>349</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Yue</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jin-Dong</surname>
            <given-names>Kim</given-names>
          </string-name>
          , Rune Saetre, Sampo Pyysalo, and
          <string-name>
            <surname>Jun'ichi Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Investigating heterogeneous protein annotations toward cross-corpora utilization</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>10</volume>
          (
          <issue>403</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>