=Paper=
{{Paper
|id=None
|storemode=property
|title=Event extraction for DNA methylation
|pdfUrl=https://ceur-ws.org/Vol-714/Paper06_Ohta.pdf
|volume=Vol-714
|dblpUrl=https://dblp.org/rec/conf/smbm/OhtaPMT10
}}
==Event extraction for DNA methylation==
Tomoko Ohta∗, Sampo Pyysalo∗, Makoto Miwa∗, Jun’ichi Tsujii∗†‡

∗ Department of Computer Science, University of Tokyo, Tokyo, Japan

† School of Computer Science, University of Manchester, Manchester, UK

‡ National Centre for Text Mining, University of Manchester, Manchester, UK

{okap,smp,mmiwa,tsujii}@is.s.u-tokyo.ac.jp
===Abstract===
We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and is implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. We present an annotation scheme following the representation of the recent BioNLP’09 shared task on event extraction, select a set of 200 abstracts including a balanced sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction can be performed at 78% precision and 76% recall. The introduced resources are freely available for use in research from the GENIA project homepage.<ref>http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA</ref>

===1 Introduction===
During the previous decade of concentrated study of biomedical information extraction (IE), most efforts have focused on the foundational task of detecting mentions of entities of interest and the extraction of simple relations between these entities, typically represented as undifferentiated binary associations (Pyysalo et al., 2008). However, in recent years there has been increased interest in biomolecular event extraction using representations that capture typed, structured n-ary associations of entities in specific roles, such as regulation of the phosphorylation of a specific domain of a particular protein (Ananiadou et al., 2010). The state of the art in such extraction methods was evaluated in the BioNLP’09 Shared Task on Event Extraction (below, BioNLP ST) (Kim et al., 2009), and event extraction following the BioNLP ST model has continued to draw interest after the task, with recent work including advances in extraction methods (Miwa et al., 2010a; Poon and Vanderwende, 2010), the release of extraction system software and large-scale automatically annotated data (Björne et al., 2010) and the development of additional annotated resources following the event representation (Ohta et al., 2010).

Of the findings of the BioNLP ST evaluation, it is of particular interest to us that the highest-performing methods include many that are purely machine-learning based (Kim et al., 2009), learning what to extract directly from a corpus annotated with examples of the events of interest. This implies that state-of-the-art extraction methods for new types of events can be created by providing annotated resources to an existing system, without the need for direct development of natural language processing or IE methods. Here, we apply this approach to DNA methylation, a specific and biologically highly relevant event type not considered in previous event extraction studies.

In the following, we first outline the biological significance of DNA methylation and discuss existing resources. We then introduce the event extraction approach applied, present the new annotated corpus created in this study, and report event extraction results using a method trained on the corpus.

===2 DNA Methylation===
The term epigenetics refers to a set of molecular mechanisms “beyond genetics” (i.e. without change in DNA sequence) that are today understood to play an important role in several biological processes, including the genetic program for development, cell differentiation and tissue-specific gene expression. DNA methylation was first suggested as an epigenetic mechanism for the control of gene activity during development in 1975 (Riggs, 1975; Holliday and Pugh, 1975), and the role of DNA methylation in cancer was first reported in 1987 (Holliday, 1987). DNA methylation of CpG islands in promoter regions is now understood to be one of the most consistent genetic alterations in cancer, and DNA methylation is a prominent area of study.
Chemically, DNA methylation is a simple reaction adding a methyl group to a specific position of the cytosine pyrimidine ring or the adenine purine ring. While a single nucleotide can only be either methylated or unmethylated, in text the overall degree of promoter methylation is often reported as hypo- and hyper-methylation, with hyper-methylation implying that the expression of a gene is silenced. Because of the precise definition of the phenomenon and the relatively specific terms in which it is typically discussed in publications, we expected it to provide a well-defined target for annotation and automatic extraction.

====2.1 DNA Methylation in PubMed====
We follow common practice in biomedical IE in drawing texts for our corpus from PubMed abstracts. Currently containing more than 20 million citations for biomedical literature (over 11 million with abstracts) and growing exponentially (Hunter and Cohen, 2006), the literature database provides a rich resource for IE and text mining.

To facilitate access to documents relevant to specific topics, each PubMed citation is manually assigned terms that identify its primary topics using MeSH, a controlled vocabulary of over 25,000 terms. MeSH also contains a DNA Methylation term, allowing specific searches for citations on the topic. Figure 1 shows the number of citations per year of publication matching this term contrasted with overall citations, illustrating explosive growth of interest in DNA methylation, outstripping the overall growth of the literature. Particular increases can be seen after the introduction of DNA microarrays for monitoring gene expression (Schena et al., 1995) and the introduction of high-throughput screening methods (Kononen et al., 1998; MacBeath and Schreiber, 2000). The total number of PubMed citations tagged with DNA Methylation at the time of this writing is 15456 (14350 of which have an abstract). The large number of documents tagged for the DNA Methylation MeSH term and the human judgments assuring their relevance make querying for this term a natural choice for selecting text. However, direct PubMed query as the only selection strategy would ignore significant existing resources, discussed in the following.

Figure 1: Citations tagged with the MeSH term DNA Methylation compared to all citations in PubMed by publication year. Note different scales. (Axes: publication year, 1985 to 2005; all PubMed citations, 300,000 to 700,000; DNA Methylation citations, 0 to 1,800.)

====2.2 DNA Methylation Databases====
A growing number of databases collating information on DNA methylation are becoming available. The first such database, MethDB (Amoreira et al., 2003), was introduced in 2001 and remains actively developed. MethDB contains PubMed citation references as evidence for contained entries, but no more specific identification of the expressions stating DNA methylation events. The methPrimerDB (Pattyn et al., 2006) database provides additional information on PCR primers on top of MethDB, but does not add further specification of the methylated gene or text-bound annotation. PubMeth (Ongenaert et al., 2008) is a database of DNA methylation in cancer with evidence sentences from the literature. This database stores information on cancer types and subtypes, methylated genes and the experimental method used to identify methylation, as well as evidence sentences. MeInfoText (Fang et al., 2008) is a database of DNA methylation and cancer information automatically extracted from PubMed documents matching the query terms human, methylation and cancer using term co-occurrence statistics. Of the DNA methylation resources, only PubMeth and MeInfoText contain text-bound annotation identifying specific spans of characters containing the gene mention and expressing DNA methylation in evidence sentences supporting database entries. In this study, we consider specifically PubMeth as a source of reference text-bound annotations due to availability and the ability to redistribute derived data.

{| class="wikitable"
|+ Table 1: Examples of PubMeth evidence sentence annotation. Annotated spans are delimited by brackets; statements expressing methylation are underlined, gene mentions are shown in italics, and cancer mentions in bold.
|-
| a) || MS-PCR revealed the <u>[methylation]</u> of the ''[p16]'' gene in 10 (34%) of 29 '''[NSCLCs]'''
|-
| b) || 30% (27 of 91) of '''[lung tumors]''' showed <u>[hypermethylation]</u> of the 5’CpG region of the ''[p14ARF gene]''
|-
| c) || <u>[Promotor hypermethylations]</u> were detected in ''[O6-methylguanine-DNA methyltransferase (MGMT), RB1, estrogen receptor, p73, p16INK4a, death-associated protein kinase, p15INK4b, and p14ARF]''
|-
| d) || The promoter region of the ''[p16INK4]'' gene was <u>[hypermethylated]</u> in the tumor samples of the primary or metastatic site
|}

Figure 2: Event annotation for phosphorylation.
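The bracket-delimited spans shown in Table 1 can be turned into character offsets with a few lines of code. The following is a minimal sketch, assuming evidence sentences are available as plain strings in which annotated spans are marked with square brackets as in Table 1; the function name `extract_spans` is hypothetical and is not part of PubMeth or of the tools used in this study.

```python
import re
from typing import List, Tuple

# One bracketed span such as "[p16]"; nested brackets are assumed not to occur.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")

def extract_spans(marked: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return the sentence without brackets and (start, end, text) offsets into it."""
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0   # read position in the bracketed input
    length = 0   # number of characters already written to the plain output
    for match in BRACKETED.finditer(marked):
        before = marked[cursor:match.start()]
        inner = match.group(1)
        parts.extend([before, inner])
        start = length + len(before)
        spans.append((start, start + len(inner), inner))
        length = start + len(inner)
        cursor = match.end()
    parts.append(marked[cursor:])
    return "".join(parts), spans

# Sentence (a) of Table 1, with its three annotated spans.
marked = "MS-PCR revealed the [methylation] of the [p16] gene in 10 (34%) of 29 [NSCLCs]"
plain, spans = extract_spans(marked)
for start, end, text in spans:
    # Every offset pair points back at the annotated text in the plain sentence.
    assert plain[start:end] == text
```

Offsets into the bracket-free sentence are what a later alignment step would need when locating an evidence span inside a full abstract.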
Initial text-bound annotations in PubMeth were generated using keyword lookup, but the database annotations are manually reviewed. Table 1 shows example evidence sentences from PubMeth and their annotated spans. While the PubMeth annotation differs from the BioNLP ST representation in a number of ways, such as not separating coordinated entities (Table 1c) and not annotating methylation sites (Table 1d), it provides both a reference identifying annotation targets from a biologically motivated perspective and a potential starting point for full event annotation.

===3 Annotation===
For annotation, we adapted the representation applied in the BioNLP ST on event extraction with minimal changes in order to allow systems developed for the task to be applied also to the newly annotated corpus. Documents were selected following the basic motivation presented above, with reference to the requirements specified by the annotation scheme, and some automatic preprocessing was applied as annotator support. This section details the annotation approach.

====3.1 Entity and Event Representation====
For the core named entity annotation, we primarily follow the gene/gene product (GGP) annotation criteria applied for the shared task data (Ohta et al., 2009). In brief, the guidelines specify annotation of minimal contiguous spans containing mentions of specific gene or gene product (RNA/protein) names, where a specific name is understood to be one allowing a biologist to identify the corresponding entry in a gene/protein database such as UniProt or Entrez Gene. The annotation thus excludes e.g. names of families and complexes. A single annotation type, Gene or gene product, is applied without distinction between genes and their products. In addition to the identification of the modified gene, it is important to identify the site of the modification. We marked mentions of sites relevant to the events as DNA domain or region terms following the original GENIA term corpus annotation guidelines (Ohta et al., 2002).

For representing DNA methylation events, the annotation applied to capture protein phosphorylation events in the BioNLP ST task 2 closely matched the needs for DNA methylation (Figure 2). While the Site arguments of the ST Phosphorylation events are protein domains, machine-learning based extraction methods should be able to associate this role with DNA domains given training data. We thus adopted a representation where DNA methylation events are associated with a gene/gene product as their Theme and a DNA domain or region as Site. Each event is also associated with a particular span of text expressing it, termed the event trigger.<ref>Annotators were instructed to always mark some trigger expression. We note that while we do not here specifically distinguish hypo- and hyper-methylation, the trigger annotations are expected to facilitate adding these distinctions if necessary.</ref> We further initially marked catalysts using Positive regulation events following the BioNLP ST model, but dropped this class of annotation as a sufficient number of examples was not found in the corpus.

The event types of the BioNLP ST are drawn from the GENIA Event ontology (Kim et al., 2008), which in turn draws its type definitions from the community-standard Gene Ontology (GO) (The Gene Ontology Consortium, 2000). To maintain compatibility with these resources, we opted to follow the GO also for the definition of the new event type considered here. GO defines DNA methylation as “The covalent transfer of a methyl group to either N-6 of adenine or C-5 or N-4 of cytosine.”

We note that while the definition may appear restrictive, methylation of adenine N-6 or cytosine C-5/N-4 encompasses the entire set of ways in which DNA can be methylated. This definition could thus be adopted without limitation to the scope of the annotation.

Figure 3: Number of citations with given number of automatically tagged gene/protein mentions. (Axes: number of gene/protein mentions, 0 to 40; number of documents, 0 to 1,800.)
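The event representation of Section 3.1 (a trigger, a required Theme and an optional Site) can be made concrete with a small data structure. The following is a minimal sketch under stated assumptions: the class names, the helper `mark`, the type labels `GGP`, `DNA_domain_or_region` and `DNA_methylation`, and the tab-separated output line are all hypothetical illustrations modelled loosely on BioNLP ST style standoff annotation, not the corpus file format itself.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TextBound:
    """A typed character span: a GGP, a DNA domain or region, or an event trigger."""
    id: str
    type: str
    start: int
    end: int
    text: str

@dataclass(frozen=True)
class Event:
    """A DNA methylation event: trigger plus Theme (required) and Site (optional)."""
    id: str
    type: str
    trigger: TextBound
    theme: TextBound
    site: Optional[TextBound] = None

def mark(id_: str, type_: str, text: str, sentence: str) -> TextBound:
    """Locate the first occurrence of text in sentence and wrap it as a TextBound."""
    start = sentence.index(text)
    return TextBound(id_, type_, start, start + len(text), text)

def event_line(event: Event) -> str:
    """Serialize as one tab-separated line (illustrative format, see lead-in)."""
    args = [f"{event.type}:{event.trigger.id}", f"Theme:{event.theme.id}"]
    if event.site is not None:
        args.append(f"Site:{event.site.id}")
    return f"{event.id}\t{' '.join(args)}"

# Sentence (d) of Table 1, shortened; type labels are assumptions for illustration.
sentence = "The promoter region of the p16INK4 gene was hypermethylated in the tumor samples"
theme = mark("T1", "GGP", "p16INK4", sentence)
site = mark("T2", "DNA_domain_or_region", "promoter region", sentence)
trigger = mark("T3", "DNA_methylation", "hypermethylated", sentence)
event = Event("E1", "DNA_methylation", trigger, theme, site)
print(event_line(event))
```

Keeping Site optional mirrors the corpus, where many events specify only a Theme.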
====3.2 Document Selection====
The selection of source documents for an annotated corpus is critical for assuring that the corpus provides relevant and representative material for studying the phenomena of interest. Domain corpora frequently consist of documents from a particular subdomain of interest: for example, the GENIA corpus focuses on documents concerning transcription factors in human blood cells (Ohta et al., 2002). Methods trained and evaluated on such focused resources will not necessarily generalize well to broader domains. However, there has been little study of the effect of document selection on event extraction performance. Here, we applied two distinct strategies to get a representative sample of the full scope of DNA methylation events in the literature and to assure that our annotations are relevant to the interests of biologists.

In the first strategy, we aimed in particular to select a representative sample of documents relevant to the targeted event types. For this purpose, we directly searched the PubMed literature database. We further decided not to include any text-based query in the search to avoid biasing the selection toward particular entities or forms of event expression. Instead, we only queried for the single MeSH term DNA Methylation. While this search is expected to provide high-precision results for the full topic, not all such documents necessarily discuss events where specific genes are methylated. In initial efforts to annotate a random sample of these documents, we found that many did not mention specific gene names. To reduce wasted effort in examining documents that contain no markable events, we added a filter requiring a minimum number of (likely) gene mentions. We first ran the BANNER tagger (Leaman and Gonzalez, 2008) on all 14350 citations tagged with DNA Methylation that have an abstract in PubMed. We found that while the overwhelmingly most frequent number of tagged mentions per document is zero, a substantial mass of abstracts have large mention counts (Figure 3).<ref>The tagger has been evaluated at 86% F-score on a broad-coverage corpus, suggesting this is unlikely to severely misestimate the true distribution.</ref> We decided after brief preliminary experiments to filter the initial selection of documents to include only those in which at least 5 gene/protein mentions were marked by the automatic tagger. This excludes most documents without markable events without introducing obvious other biases.

In the second strategy, we extended and completed the annotation of a random selection of PubMeth evidence sentences, aiming to leverage existing resources and to select documents that had been previously judged relevant to the interests of biologists studying the topic. This provides an external definition of document relevance and allows us to estimate to what extent the applied annotation strategy can capture biologically relevant statements. This strategy is also expected to select a concentrated, event-rich set of documents. However, the selection will also necessarily carry over biases toward particular subsets of relevant documents from the original selection and will not be a representative sample of the overall distribution of such documents in the literature.

For producing the largest number of event annotations with the least effort, the most efficient way to use the PubMeth data would have been to simply extract the evidence sentences and complete the annotation for these. However, viewing the context in which event statements occur as centrally important, we opted to annotate complete abstracts, with initial annotations from PubMeth evidence sentences automatically transferred into the abstracts. We note that not all PubMeth
evidence spans were drawn from abstracts, and not all that were matched a contiguous span of text. We could align PubMeth evidence annotations into 667 PubMed abstracts (approximately 57% of the PMIDs referenced in PubMeth) and completed event annotation for a random sample of these.

====3.3 Document Preprocessing====
To reduce annotation effort, we applied automatic systems to produce initial candidate sentence boundaries and GGP annotations for the corpus. For sentence splitting, we applied the GENIA sentence splitter,<ref>http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/</ref> and for gene/protein tagging, we applied the BANNER NER system (Leaman and Gonzalez, 2008) trained on the GENETAG corpus (Tanabe et al., 2005). The GENETAG guidelines and gene/protein entity annotation coverage are known to differ from those applied for GGP annotation here (Wang et al., 2009). However, the broad coverage of PubMed provided by GENETAG suggests taggers trained on the corpus are likely to generalize to new subdomains such as that considered here. By contrast, all annotations following GGP guidelines that we are aware of are subdomain-specific.

We note that all annotations in the produced corpus are at a minimum confirmed by a human annotator and that events are annotated without performing initial automatic tagging to assure that no bias toward particular extraction methods or approaches is introduced.

===4 Results===

====4.1 Corpus Statistics====
Corpus statistics are given in Table 2. There are some notable differences between the subcorpora created using the different selection strategies. While the subcorpora are similar in size, the PubMeth GGP count is 1.4 times that of the PubMed subcorpus,<ref>The differences in the number of GGP annotations may be affected by the PubMeth entity annotation criteria.</ref> yet roughly equal numbers of methylation sites are annotated in the two. This difference is even more pronounced in the statistics for event arguments, where two thirds of PubMeth subcorpus events contain only a Theme argument identifying the GGP, while events where both Theme and Site are identified are more frequent in the other subcorpus.<ref>The number of annotated sites is less than the number of events with a Site argument as the annotation criteria only call for annotating a site entity when it is referred to from an event, and multiple events can refer to the same site entity.</ref> As the extraction of events that also specify sites is known to be particularly challenging (Kim et al., 2009), these statistics suggest the PubMed subcorpus may represent a more difficult extraction task. Only very few DNA demethylation events are found in either subcorpus. Overall, the PubMeth subcorpus contains nearly twice as many event annotations as the PubMed one, indicating that the focused document selection strategy was successful in identifying particularly event-rich abstracts.

{| class="wikitable"
|+ Table 2: Corpus statistics.
|-
!  !! PubMeth !! PubMed !! Total
|-
| Abstracts || 100 || 100 || 200
|-
| Sentences || 1118 || 1009 || 2127
|-
! colspan="4" | Entities
|-
| GGP || 1695 || 1195 || 2890
|-
| Site || 240 || 234 || 474
|-
| Total || 1935 || 1429 || 3364
|-
! colspan="4" | Events
|-
| Theme only || 660 || 214 || 874
|-
| Theme and Site || 323 || 297 || 620
|-
| DNA methylation || 977 || 485 || 1462
|-
| DNA demethylation || 6 || 26 || 38
|-
| Total || 983 || 511 || 1494
|}

====4.2 Annotation Quality====
To measure the consistency of the produced annotation, we performed independent double annotation for a sample of 40% of the abstracts selected from the PubMed subcorpus (20% of all abstracts). As the PubMed subcorpus event annotation was created without initial human annotation as reference (unlike the PubMeth subcorpus), agreement is expected to be lower on this subcorpus. This experiment should thus provide a lower bound on the overall consistency of the corpus.

We first measured agreement on the gene/gene product (GGP) entity annotation, and found very high agreement among 935 entities marked in total by the two annotators: 91% F-score using exact match criteria and 97% F-score using the relaxed “overlap” criterion where any two overlapping annotations are considered to match.<ref>The high agreement is not due to annotators simply agreeing with the automatic initial annotation: the F-score of the automatic tagger against the two sets of human annotations was 65%/66% for exact and 85%/86% for overlap match.</ref> We then separately measured agreement on event annotations
for those events that involved GGPs on which the annotators agreed, using the standard evaluation criteria described in Section 4.4. Agreement on event annotations was also high: 84% F-score overall (85% for DNA methylation and 75% for DNA demethylation) over a total of 442 annotated events.

The overall consistency of the annotation depends on joint annotator agreement on the GGP and event annotations. However, in experimental settings such as that of the BioNLP ST where gold GGP annotation is assumed as the starting point for event extraction, measured performance is not affected by agreement on GGPs and thus arguably only the latter factor applies. As this setting is adopted also in the present study, annotation consistency suggests a human upper bound no lower than 84% F-score on extraction performance.

Estimates of the annotation consistency of biomedical domain corpora are regrettably seldom provided, and to the best of our knowledge ours is the first estimate of inter-annotator agreement for a corpus following the event representation of the BioNLP ST. Given the complexity of the annotation (typed associations of event trigger, theme and site), the agreement compares favorably to e.g. the 67% inter-annotator F-score reported for protein-protein interactions on the ITI TXM corpora (Alex et al., 2008) and the full event agreement on the GREC corpus (Thompson et al., 2009).

====4.3 Event Extraction Method====
To estimate the feasibility of automatic extraction of DNA methylation events and the suitability of presently available event extraction methods to this task, we performed experiments using the EventMine event extraction system of Miwa et al. (2010b). On task 2 of the BioNLP ST dataset, the benchmark most relevant to our task setting, the applied version of EventMine was recently evaluated at 55% F-score (Miwa et al., 2010a), outperforming the best task 2 system in the original shared task (Riedel et al., 2009) by more than 10 percentage points. To the best of our knowledge, this system represents the state of the art for this event extraction task.

EventMine is an SVM-based machine learning system following the pipeline design of the best system in the BioNLP ST (Björne et al., 2009), extending it with refinements to the feature set, the use of a machine learning module for complex event construction, and the use of two parsers for syntactic analysis (Miwa et al., 2010b). We follow Miwa et al. in applying the HPSG-based deep parser Enju (Miyao and Tsujii, 2008) using the high-speed parsing setting (“mogura”) and the GDep (Sagae and Tsujii, 2007) native dependency parser, both with biomedical domain models based on the GENIA treebank data (Tateisi et al., 2006). For evaluation, we applied a version of the BioNLP’09 ST evaluation tools<ref>http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/downloads.shtml</ref> modified to recognize the novel DNA methylation event type.

====4.4 Evaluation Criteria====
We followed the basic task setup and primary evaluation criteria of the BioNLP’09 ST. Specifically, we followed the task 2 (“event enrichment”) criteria, requiring for correct extraction of a DNA methylation event both the identification of the modified gene (GGP entity) and the identification of the modification site (DNA domain or region entity) when stated. As in the shared task, human annotation for GGP entities was provided as part of the system input but other entities were not, so that the system was required to identify the spans of the mentioned modification sites.

The performance of the system was evaluated using the standard precision, recall and F-score metrics for the recovery of events, with event equality defined following the “Approximate span” matching criterion applied in the primary evaluation for the BioNLP’09 ST. This criterion relaxes strict matching requirements so that a detected event trigger or entity is considered to match a gold trigger/entity if its span is entirely contained within the span of the gold trigger/entity extended by one word both to the left and to the right.

====4.5 Experimental Setup====
We divided the corpus into three parts, first setting one third of the abstracts aside as a held-out test set and then splitting the remaining two thirds in a roughly 3:1 ratio into a training set and a development test set, giving 100 abstracts for training, 34 for development, and 66 for final test. The splits were performed randomly, but sampling so that each set has an equal number of abstracts drawn from the PubMeth and PubMed subcorpora.

The EventMine system has a single tunable threshold parameter that controls the tradeoff between system precision and recall. We first set the tradeoff using a sparse search of the parameter space [0, 1], evaluating the performance of the system by training on the training set and evaluating on the development set. As these experiments did not indicate any other parameter setting could provide significantly better performance, we chose the default threshold setting of 0.5. To study the effect of training data size on performance, we performed extraction experiments randomly downsampling the training data on the document level with testing on the development set. In final experiments EventMine was trained on the combined training and development data and performance evaluated on the held-out test data.

====4.6 Extraction Performance====
Table 3 shows extraction results on the held-out test data. While DNA methylation events could be extracted quite reliably, the system performed poorly for DNA demethylation events. The latter result is perhaps not surprising given their small number (only 38 in total in the corpus) and indicates that a separate selection strategy is necessary to provide resources for learning the reverse reaction. Overall performance shows a small preference for precision over recall at 77% F-score. We view this level of performance as very good for a first result.

{| class="wikitable"
|+ Table 3: Overall extraction performance.
|-
! Event type !! Precision !! Recall !! F-score
|-
| DNA methylation || 77.6% || 77.2% || 77.4%
|-
| DNA demethylation || 100.0% || 11.1% || 20.0%
|-
| Total || 77.7% || 76.0% || 76.8%
|}

To evaluate the relative difficulty of the extraction tasks that the two subcorpora represent and their merits as training material, we performed tests separating the two (Table 4). As predicted from corpus statistics (Section 4.1), the PubMed subcorpus represents the more challenging extraction task. When testing on a single subcorpus, results are, unsurprisingly, better when training data is drawn from the same subcorpus; however, training on the combined data gives the best performance for all three test sets, indicating that the subcorpora are compatible.

{| class="wikitable"
|+ Table 4: F-score by subcorpus.
|-
! rowspan="2" | Training set !! colspan="3" | Test set
|-
! PubMed !! PubMeth !! Both
|-
| PubMed || 64.9% || 71.2% || 71.6%
|-
| PubMeth || 62.9% || 80.0% || 74.0%
|-
| Both || 66.2% || 82.5% || 76.8%
|}

The learning curve (Figure 4) shows relatively high performance and rapid improvement for modest amounts of data, but performance improvement with additional data levels out relatively fast, nearly flattening as use of the training data approaches 100%. This suggests that extraction performance for this task is not primarily limited by training data size and that additional annotation following the same protocol is unlikely to yield notable improvement in F-score without a substantial investment of resources. As performance for the PubMed subcorpus (for which inter-annotator agreement was measured) is not yet approaching the limit implied by the corpus annotation consistency (Section 4.2), the results suggest further need for the development of event extraction methods to improve DNA methylation event extraction.

Figure 4: Learning curve for the two subcorpora and their combination. Both subcorpora used for training. Average and error bars calculated by 10 repetitions of random subsampling of training data, testing on the development set. (Axes: fraction of training data, 0 to 100%; F-score, 30 to 90; one curve per test set: Both, PubMed, PubMeth.)

===5 Related Work===
DNA methylation and related epigenetic mechanisms of gene expression control have been a focus of considerable recent research in biomedicine. There are many excellent reviews of this broad field; we refer the interested reader to Jaenisch and Bird (2003) and Suzuki and Bird (2008).

There is a wealth of recent related work also on event extraction. In the BioNLP’09 shared task, 24 teams participated in the primary task and six teams in Task 2, which most closely resembles our setup in that it also required the detection of the modified gene/protein and the modification site. The top-performing system in Task 2 (Riedel et al., 2009) achieved 44% F-score, and the highest performance reported since that we are aware of is 55% F-score for EventMine (Miwa et al., 2010b). The performance we achieved for DNA methylation is considerably better than this overall result, essentially matching the best reported performance for Phosphorylation events, which we previously argued to be the closest shared task analogue to the new event category studied here. Nevertheless, direct comparison of these results may not be meaningful due to confounding factors. The only text mining effort specifically targeting DNA methylation that we are aware of is that performed for the initial annotation of the PubMeth and MeInfoText databases (Ongenaert et al., 2008; Fang et al., 2008), both applying approaches based on keyword matching. However, neither of these studies reports results for instance-level extraction of methylation statements.

The present study is in many aspects similar to our previous work targeting protein post-translational modification events (Ohta et al., 2010). In that work, we annotated 422 events of 7 different types and showed that retraining an existing event extraction system allowed these to be extracted at 42% F-score. Our approach here clearly differs from this previous work in its larger scale and concentrated focus on a particular event type of high interest, reflected also in results: while extraction performance in our previous work was limited by training data size, in the present study notably higher extraction performance was achieved and a plateau in performance with increasing data was reached.

===6 Discussion and Future Work===
We have presented a study of the automatic extraction of DNA methylation events from literature following the BioNLP’09 shared task event [...] precision and 76% recall. The learning curve suggested that the corpus size is sufficient and that future efforts in DNA methylation event extraction should focus on extraction method development.

One natural direction for future work is to apply event extraction systems trained on the newly introduced data to abstracts available in PubMed and full texts available at PMC to create a detailed, up-to-date repository of DNA methylation events at full literature scale. Such an effort would require gene name normalization and event extraction at PubMed scale, both of which have recently been shown to be technically feasible (Gerner et al., 2010; Björne et al., 2010). Further combining the extracted events with cancer mention detection could provide a valuable resource for epigenetics research.

The newly annotated corpus, the first resource annotated for DNA methylation using the event representation, is freely available for use in research from the GENIA project homepage http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.

===Acknowledgments===
We would like to thank Maté Ongenaert and other creators of PubMeth for their generosity in allowing the release of resources building on their work and the anonymous reviewers for their many insightful comments. This work was supported by Grant-in-Aid for Specially Promoted Research (MEXT, Japan).

===References===
Bea Alex, Claire Grover, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Stuart Roebuck, Richard Tobin, and Xinglong Wang. 2008. The ITI TXM corpora: Tissue expressions and protein-protein interactions. In Proceedings of LREC’08.

Celine Amoreira, Winfried Hindermann, and Christoph Grunau. 2003. An improved version of the DNA methylation database (MethDB). Nucl. Acids Res., 31(1):75–77.
representation and a state-of-the-art event extrac-
tion system. We created an corpus of 200 publica- Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Dou-
glas B. Kell. 2010. Event extraction for systems biology
tion abstracts selected to include a representative by text mining the literature. Trends in Biotechnology,
sample of DNA methylation statements from all of 28(7):381–390.
PubMed and manually annotated for nearly 3000 Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio
mentions of genes and gene products, 500 DNA Pahikkala, and Tapio Salakoski. 2009. Extracting com-
domain or region mentions and 1500 DNA methy- plex biological events with rich graph-based feature sets.
In Proceedings of BioNLP’09 Shared Task, pages 10–18.
lation and demethylation events. Evaluation using
the EventMine system showed that DNA methy- Jari Björne, Filip Ginter, Sampo Pyysalo, Jun’ichi Tsujii,
and Tapio Salakoski. 2010. Scaling up biomedical
lation events can be extracted simply by retrain- event extraction to the entire pubmed. In Proceedings of
ing an off-the-shelf event extraction system at 78% BioNLP’10, pages 28–36.
Yu-Ching Fang, Hsuan-Cheng Huang, and Hsueh-Fen Juan. 2008. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics, 9(1):22.

Martin Gerner, Goran Nenadic, and Casey M. Bergman. 2010. An exploration of mining gene expression mentions and their anatomical locations from biomedical text. In Proceedings of BioNLP 2010, pages 72–80.

Robin Holliday and JE Pugh. 1975. DNA modification mechanisms and gene activity during development. Science, 187:226–232.

Robin Holliday. 1987. The inheritance of epigenetic defects. Science, 238:163–170.

Lawrence Hunter and K. Bretonnel Cohen. 2006. Biomedical language processing: What’s beyond PubMed? Molecular Cell, 21(5):589–594.

Rudolf Jaenisch and Adrian Bird. 2003. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature Genetics, 33:245–254.

Jin-Dong Kim, Tomoko Ohta, and Jun’ichi Tsujii. 2008. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9(10).

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. In Proceedings of BioNLP’09.

Juha Kononen, Lukas Bubendorf, Anne Kallioniemi, Maarit Barlund, Peter Schraml, Stephen Leighton, Joachim Torhorst, Michael J Mihatsch, Guido Sauter, and Olli-P. Kallioniemi. 1998. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med, 4(7):844–847.

R. Leaman and G. Gonzalez. 2008. BANNER: An executable survey of advances in biomedical named entity recognition. In Proceedings of PSB’08, pages 652–663.

Gavin MacBeath and Stuart L. Schreiber. 2000. Printing Proteins as Microarrays for High-Throughput Function Determination. Science, 289(5485):1760–1763.

Makoto Miwa, Sampo Pyysalo, Tadayoshi Hara, and Jun’ichi Tsujii. 2010a. A comparative study of syntactic parsers for event extraction. In Proceedings of BioNLP’10, pages 37–45.

Makoto Miwa, Rune Sætre, Jin-Dong Kim, and Jun’ichi Tsujii. 2010b. Event extraction with complex event classification using rich features. Journal of Bioinformatics and Computational Biology (JBCB), 8(1):131–146.

Yusuke Miyao and Jun’ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34(1):35–80.

Tomoko Ohta, Yuka Tateisi, Hideki Mima, and Jun’ichi Tsujii. 2002. GENIA corpus: An annotated research abstract corpus in molecular biology domain. In Proceedings of HLT’02, pages 73–77.

Tomoko Ohta, Jin-Dong Kim, Sampo Pyysalo, Yue Wang, and Jun’ichi Tsujii. 2009. Incorporating GENETAG-style annotation to GENIA corpus. In Proceedings of BioNLP’09, pages 106–107.

Tomoko Ohta, Sampo Pyysalo, Makoto Miwa, Jin-Dong Kim, and Jun’ichi Tsujii. 2010. Event extraction for post-translational modifications. In Proceedings of BioNLP’10, pages 19–27.

Maté Ongenaert, Leander Van Neste, Tim De Meyer, Gerben Menschaert, Sofie Bekaert, and Wim Van Criekinge. 2008. PubMeth: a cancer methylation database combining text-mining and expert annotation. Nucl. Acids Res., 36(suppl 1):D842–846.

Filip Pattyn, Jasmien Hoebeeck, Piet Robbrecht, Evi Michels, Anne De Paepe, Guy Bottu, David Coornaert, Robert Herzog, Frank Speleman, and Jo Vandesompele. 2006. methBLAST and methPrimerDB: web-tools for PCR based methylation analysis. BMC Bioinformatics, 7(1):496.

Hoifung Poon and Lucy Vanderwende. 2010. Joint inference for knowledge extraction from biomedical literature. In Proceedings of NAACL/HLT’10, pages 813–821.

Sampo Pyysalo, Antti Airola, Juho Heimonen, and Jari Björne. 2008. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9(Suppl. 3):S6.

Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi, and Jun’ichi Tsujii. 2009. A Markov logic approach to bio-molecular event extraction. In Proceedings of BioNLP’09 Shared Task, pages 41–49.

A.D. Riggs. 1975. X inactivation, differentiation, and DNA methylation. Cytogenetic and Genome Research, 14:9–25.

Kenji Sagae and Jun’ichi Tsujii. 2007. Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of EMNLP-CoNLL 2007, pages 1044–1050.

Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. 1995. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science, 270(5235):467–470.

Miho M. Suzuki and Adrian Bird. 2008. DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics, 9:465–476.

Lorraine Tanabe, Natalie Xie, Lynne H Thom, Wayne Matten, and W John Wilbur. 2005. GENETAG: A tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6(Suppl. 1):S3.

Yuka Tateisi, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2006. Subdomain adaptation of a POS tagger with a small corpus. In Proceedings of BioNLP’06, pages 136–137, New York, USA, June.

The Gene Ontology Consortium. 2000. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29.

Paul Thompson, Syed Iqbal, John McNaught, and Sophia Ananiadou. 2009. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics, 10(1):349.

Yue Wang, Jin-Dong Kim, Rune Sætre, Sampo Pyysalo, and Jun’ichi Tsujii. 2009. Investigating heterogeneous protein annotations toward cross-corpora utilization. BMC Bioinformatics, 10(403).