Event Extraction for DNA Methylation

Tomoko Ohta∗, Sampo Pyysalo∗, Makoto Miwa∗, Jun’ichi Tsujii∗†‡
∗ Department of Computer Science, University of Tokyo, Tokyo, Japan
† School of Computer Science, University of Manchester, Manchester, UK
‡ National Centre for Text Mining, University of Manchester, Manchester, UK
{okap,smp,mmiwa,tsujii}@is.s.u-tokyo.ac.jp


Abstract

We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and is implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. We present an annotation scheme following the representation of the recent BioNLP’09 shared task on event extraction, select a set of 200 abstracts including a balanced sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction can be performed at 78% precision and 76% recall. The introduced resources are freely available for use in research from the GENIA project homepage.1

  1 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA

1   Introduction

During the previous decade of concentrated study of biomedical information extraction (IE), most efforts have focused on the foundational task of detecting mentions of entities of interest and the extraction of simple relations between these entities, typically represented as undifferentiated binary associations (Pyysalo et al., 2008). However, in recent years there has been increased interest in biomolecular event extraction using representations that capture typed, structured n-ary associations of entities in specific roles, such as regulation of the phosphorylation of a specific domain of a particular protein (Ananiadou et al., 2010). The state of the art in such extraction methods was evaluated in the BioNLP’09 Shared Task on Event Extraction (below, BioNLP ST) (Kim et al., 2009), and event extraction following the BioNLP ST model has continued to draw interest also after the task, with recent work including advances in extraction methods (Miwa et al., 2010a; Poon and Vanderwende, 2010), the release of extraction system software and large-scale automatically annotated data (Björne et al., 2010) and the development of additional annotated resources following the event representation (Ohta et al., 2010).

Of the findings of the BioNLP ST evaluation, it is of particular interest to us that the highest-performing methods include many that are purely machine-learning based (Kim et al., 2009), learning what to extract directly from a corpus annotated with examples of the events of interest. This implies that state-of-the-art extraction methods for new types of events can be created by providing annotated resources to an existing system, without the need for direct development of natural language processing or IE methods. Here, we apply this approach to DNA methylation, a specific and biologically highly relevant entity type not considered in previous event extraction studies.

In the following, we first outline the biological significance of DNA methylation and discuss existing resources. We then introduce the event extraction approach applied, present the new annotated corpus created in this study, and report event extraction results using a method trained on the corpus.

2   DNA Methylation

The term epigenetics refers to a set of molecular mechanisms “beyond genetics” (i.e. without change in DNA sequence) that are today understood to play an important role in several biological processes, including genetic program for development, cell differentiation and tissue specific
gene expression. DNA methylation was first suggested as an epigenetic mechanism for the control of gene activity during development in 1975 (Riggs, 1975; Holliday and Pugh, 1975), and the role of DNA methylation in cancer was first reported in 1987 (Holliday, 1987). DNA methylation of CpG islands in promoter regions is now understood to be one of the most consistent genetic alterations in cancer, and DNA methylation is a prominent area of study.

[Figure 1 plot. Series: All, DNA Methylation. Left axis: All PubMed citations. Right axis: DNA Methylation citations. Horizontal axis: publication year, 1985 to 2005.]
Figure 1: Citations tagged with the MeSH term DNA Methylation compared to all citations in PubMed by publication year. Note different scales.

Chemically, DNA methylation is a simple reaction adding a methyl group to a specific position of the cytosine pyrimidine ring or the adenine purine ring. While a single nucleotide can only be either methylated or unmethylated, in text the overall degree of promoter methylation is often reported as hypo- and hyper-methylation, with hyper-methylation implying that the expression of a gene is silenced. Because of the precise definition of the phenomenon and the relatively specific terms in which it is typically discussed in publications, we expected it to provide a well-defined target for annotation and automatic extraction.

2.1   DNA Methylation in PubMed

We follow common practice in biomedical IE in drawing texts for our corpus from PubMed abstracts. Currently containing more than 20 million citations for biomedical literature (over 11M with abstracts) and growing exponentially (Hunter and Cohen, 2006), the literature database provides a rich resource for IE and text mining.

To facilitate access to documents relevant to specific topics, each PubMed citation is manually assigned terms that identify its primary topics using MeSH, a controlled vocabulary of over 25,000 terms. MeSH also contains a DNA Methylation term, allowing specific searches for citations on the topic. Figure 1 shows the number of citations per year of publication matching this term contrasted with overall citations, illustrating explosive growth of interest in DNA methylation, outstripping the overall growth of the literature. Particular increases can be seen after the introduction of DNA microarrays for monitoring gene expression (Schena et al., 1995) and the introduction of high-throughput screening methods (Kononen et al., 1998; MacBeath and Schreiber, 2000). The total number of PubMed citations tagged with DNA Methylation at the time of this writing is 15456 (14350 of which have an abstract). The large number of documents tagged for the DNA methylation MeSH term and the human judgments assuring their relevance make querying for this term a natural choice for selecting text. However, direct PubMed query as the only selection strategy would ignore significant existing resources, discussed in the following.

2.2   DNA Methylation Databases

A growing number of databases collating information on DNA methylation are becoming available. The first such database, MethDB (Amoreira et al., 2003), was introduced in 2001 and remains actively developed. MethDB contains PubMed citation references as evidence for contained entries, but no more specific identification of the expressions stating DNA methylation events. The methPrimerDB (Pattyn et al., 2006) database provides additional information on PCR primers on top of MethDB, but does not add further specification of the methylated gene or text-bound annotation. PubMeth (Ongenaert et al., 2008) is a database of DNA methylation in cancer with evidence sentences from the literature. This database stores information on cancer types and subtypes, methylated genes and the experimental method used to identify methylation, as well as evidence sentences. MeInfoText (Fang et al., 2008) is a database of DNA methylation and cancer information automatically extracted from PubMed documents matching the query terms human, methylation and cancer using term co-occurrence statistics. Of the DNA Methylation resources, only PubMeth and MeInfoText contain text-bound annotation identifying specific spans of characters containing the gene mention and expressing DNA methylation in evidence sentences supporting database entries. In this study, we consider specifically PubMeth as a source of reference text-bound annotations due to availability and the ability to redistribute derived data.

    a)    MS-PCR revealed the [methylation] of the [p16] gene in 10(34%)of 29 [NSCLCs]
    b)    30% (27 of 91) of [lung tumors] showed [hypermethylation] of the 5’CpG region of the [p14ARF gene]
    c)    [Promotor hypermethylations] were detected in [O6-methylguanine-DNA methyltransferase (MGMT), RB1, estrogen receptor, p73, p16INK4a, death-associated protein kinase, p15INK4b, and p14ARF]
    d)    The promoter region of the [p16INK4] gene was [hypermethylated] in the tumor samples of the primary or metastatic site

Table 1: Examples of PubMeth evidence sentence annotation. Annotated spans delimited by brackets and statements expressing methylation underlined, gene mentions shown in italics, and cancer mentions in bold.

Figure 2: Event annotation for phosphorylation.
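Text-bound annotation of the kind discussed here ties each annotated item to a character span of the sentence. The following is a minimal sketch of that idea, assuming the bracket notation used for display in Table 1 as input. It is an illustration only, not PubMeth's storage format, and the function name `bracketed_to_spans` is hypothetical.

```python
def bracketed_to_spans(marked):
    """Strip [..] markers and return (plain_text, [(start, end, text), ...]).

    Offsets index into the returned plain text; end is exclusive.
    Assumes brackets are balanced and not nested (an assumption of this sketch).
    """
    plain = []      # characters of the sentence without brackets
    spans = []
    start = None    # offset where the currently open span began
    for ch in marked:
        if ch == "[":
            if start is not None:
                raise ValueError("nested '[' is not supported")
            start = len(plain)
        elif ch == "]":
            if start is None:
                raise ValueError("unmatched ']'")
            spans.append((start, len(plain), "".join(plain[start:])))
            start = None
        else:
            plain.append(ch)
    if start is not None:
        raise ValueError("unmatched '['")
    return "".join(plain), spans


# Example a) from Table 1
text, spans = bracketed_to_spans(
    "MS-PCR revealed the [methylation] of the [p16] gene in 10(34%)of 29 [NSCLCs]"
)
for start, end, covered in spans:
    print(start, end, covered)
```

Storing offsets rather than the marked-up string keeps the original sentence intact, so several annotation layers (gene mentions, methylation statements, cancer mentions) can refer to the same text.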
Initial text-bound annotations in PubMeth were generated using keyword lookup, but the database annotations are manually reviewed. Table 1 shows example evidence sentences from PubMeth and their annotated spans. While the PubMeth annotation differs from the BioNLP ST representation in a number of ways, such as not separating coordinated entities (Table 1c) and not annotating methylation sites (Table 1d), it provides both a reference identifying annotation targets from a biologically motivated perspective and a potential starting point for full event annotation.

3   Annotation

For annotation, we adapted the representation applied in the BioNLP ST on event extraction with minimal changes in order to allow systems developed for the task to be applied also for the newly annotated corpus. Documents were selected following the basic motivation presented above, with reference to the requirements specified by the annotation scheme, and some automatic preprocessing was applied as annotator support. This section details the annotation approach.

3.1   Entity and Event Representation

For the core named entity annotation, we thus primarily follow the gene/gene product (GGP) annotation criteria applied for the shared task data (Ohta et al., 2009). In brief, the guidelines specify annotation of minimal contiguous spans containing mentions of specific gene or gene product (RNA/protein) names, where specific name is understood to be one allowing a biologist to identify the corresponding entry in a gene/protein database such as UniProt or Entrez Gene. The annotation thus excludes e.g. names of families and complexes. A single annotation type, Gene or gene product, is applied without distinction between genes and their products. In addition to the identification of the modified gene, it is important to identify the site of the modification. We marked mentions of sites relevant to the events as DNA domain or region terms following the original GENIA term corpus annotation guidelines (Ohta et al., 2002).

For representing DNA methylation events, the annotation applied to capture protein phosphorylation events in the BioNLP ST task 2 closely matched the needs for DNA methylation (Figure 2). While the Site arguments of the ST Phosphorylation events are protein domains, machine-learning based extraction methods should be able to associate this role with DNA domains given training data. We thus adopted a representation where DNA methylation events are associated with a gene/gene product as their Theme and a DNA domain or region as Site. Each event is also associated with a particular span of text expressing it, termed the event trigger.2 We further initially marked catalysts using Positive regulation events following the BioNLP ST model, but dropped this class of annotation as a sufficient number of examples was not found in the corpus.

  2 Annotators were instructed to always mark some trigger expression. We note that while we do not here specifically distinguish hypo- and hyper-methylation, the trigger annotations are expected to facilitate adding these distinctions if necessary.

The event types of the BioNLP ST are drawn from the GENIA Event ontology (Kim et al., 2008), which in turn draws its type definitions from the community-standard Gene Ontology (GO) (The Gene Ontology Consortium, 2000). To maintain compatibility with these resources, we opted to follow the GO also for the definition of the new event type considered here. GO defines DNA methylation as

      The covalent transfer of a methyl group to either N-6 of adenine or C-5 or N-4 of cytosine.

We note that while the definition may appear restrictive, methylation of adenine N-6 or cytosine C-5/N-4 encompasses the entire set of ways in which DNA can be methylated. This definition could thus be adopted without limitation to the scope of the annotation.

Figure 3: Number of citations with given number of automatically tagged gene/protein mentions. [Plot. Horizontal axis: Number of gene/protein mentions, 0 to 40. Vertical axis: Number of documents, 0 to 1800.]
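The representation adopted in Section 3.1 (an event trigger, a GGP Theme and an optional DNA domain or region Site) can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the corpus file format: the class names, the type labels `GGP`, `DNA_domain_or_region` and `DNA_methylation`, and the tab-separated output line (loosely modeled on BioNLP ST standoff event lines) are all hypothetical. The example sentence is Table 1d, and treating "promoter region" as its Site is one plausible reading.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextBound:
    """A typed span of text: an entity (GGP, site) or an event trigger."""
    id: str
    type: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    text: str


@dataclass(frozen=True)
class Event:
    """A DNA methylation event: trigger plus Theme and optional Site."""
    id: str
    trigger: TextBound
    theme: TextBound
    site: Optional[TextBound] = None

    def to_line(self) -> str:
        # Arguments are written as Role:ID pairs; the trigger uses the event type as role.
        args = [f"{self.trigger.type}:{self.trigger.id}", f"Theme:{self.theme.id}"]
        if self.site is not None:
            args.append(f"Site:{self.site.id}")
        return f"{self.id}\t{' '.join(args)}"


def span(id_, type_, sentence, phrase):
    """Locate the first occurrence of phrase and wrap it as a TextBound."""
    start = sentence.index(phrase)
    return TextBound(id_, type_, start, start + len(phrase), phrase)


sentence = ("The promoter region of the p16INK4 gene was hypermethylated "
            "in the tumor samples of the primary or metastatic site")
theme = span("T1", "GGP", sentence, "p16INK4")
site = span("T2", "DNA_domain_or_region", sentence, "promoter region")
trigger = span("T3", "DNA_methylation", sentence, "hypermethylated")
event = Event("E1", trigger, theme, site)
print(event.to_line())
```

Keeping Site optional mirrors the corpus, where events either state only a Theme or state both a Theme and a Site.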

3.2   Document Selection

The selection of source documents for an annotated corpus is critical for assuring that the corpus provides relevant and representative material for studying the phenomena of interest. Domain corpora frequently consist of documents from a particular subdomain of interest: for example, the GENIA corpus focuses on documents concerning transcription factors in human blood cells (Ohta et al., 2002). Methods trained and evaluated on such focused resources will not necessarily generalize well to broader domains. However, there has been little study of the effect of document selection on event extraction performance. Here, we applied two distinct strategies to get a representative sample of the full scope of DNA methylation events in the literature and to assure that our annotations are relevant to the interests of biologists.

In the first strategy, we aimed in particular to select a representative sample of documents relevant to the targeted event types. For this purpose, we directly searched the PubMed literature database. We further decided not to include any text-based query in the search to avoid biasing the selection toward particular entities or forms of event expression. Instead, we only queried for the single MeSH term DNA Methylation. While this search is expected to provide high-precision results for the full topic, not all such documents necessarily discuss events where specific genes are methylated. In initial efforts to annotate a random sample of these documents, we found that many did not mention specific gene names. To reduce wasted effort in examining documents that contain no markable events, we added a filter requiring a minimum number of (likely) gene mentions. We first tagged all 14350 citations tagged with DNA Methylation that have an abstract in PubMed using the BANNER tagger (Leaman and Gonzalez, 2008). We found that while the overwhelmingly most frequent number of tagged mentions per document is zero, a substantial mass of abstracts have large mention counts (Figure 3).3 We decided after brief preliminary experiments to filter the initial selection of documents to include only those in which at least 5 gene/protein mentions were marked by an automatic tagger. This excludes most documents without markable events without introducing obvious other biases.

  3 The tagger has been evaluated at 86% F-score on a broad-coverage corpus, suggesting this is unlikely to severely misestimate the true distribution.

In the second strategy, we extended and completed the annotation of a random selection of PubMeth evidence sentences, aiming to leverage existing resources and to select documents that had been previously judged relevant to the interests of biologists studying the topic. This provides an external definition of document relevance and allows us to estimate to what extent the applied annotation strategy can capture biologically relevant statements. This strategy is also expected to select a concentrated, event-rich set of documents. However, the selection may also necessarily carry over biases toward particular subsets of relevant documents from the original selection and will not be a representative sample of the overall distribution of such documents in the literature.

For producing the largest number of event annotations with the least effort, the most efficient way to use the PubMeth data would have been to simply extract the evidence sentences and complete the annotation for these. However, viewing the context in which event statements occur as centrally important, we opted to annotate complete abstracts, with initial annotations from PubMeth evidence sentences automatically transferred into the abstracts. We note that not all PubMeth evidence spans were drawn from abstracts, and not all that were matched a contiguous span of text. We could align PubMeth evidence annotations into 667 PubMed abstracts (approximately 57% of the PMIDs referenced in PubMeth) and completed event annotation for a random sample of these.

3.3   Document Preprocessing

To reduce annotation effort, we applied automatic systems to produce initial candidate sentence boundaries and GGP annotations for the corpus. For sentence splitting, we applied the GENIA sentence splitter4, and for gene/protein tagging, we applied the BANNER NER system (Leaman and Gonzalez, 2008) trained on the GENETAG corpus (Tanabe et al., 2005). The GENETAG guidelines and gene/protein entity annotation coverage are known to differ from those applied for GGP annotation here (Wang et al., 2009). However, the broad coverage of PubMed provided by GENETAG suggests taggers trained on the corpus are likely to generalize to new subdomains such as that considered here. By contrast, all annotations following GGP guidelines that we are aware of are subdomain-specific.

  4 http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/

We note that all annotations in the produced corpus are at a minimum confirmed by a human annotator and that events are annotated without performing initial automatic tagging to assure that no bias toward particular extraction methods or approaches is introduced.

4   Results

4.1   Corpus Statistics

Corpus statistics are given in Table 2. There are some notable differences between the subcorpora created using the different selection strategies. While the subcorpora are similar in size, the PubMeth GGP count is 1.4 times that of the PubMed subcorpus5, yet roughly equal numbers of methylation sites are annotated in the two. This difference is even more pronounced in the statistics for event arguments, where two thirds of PubMeth subcorpus events contain only a Theme argument identifying the GGP, while events where both Theme and Site are identified are more frequent in the other subcorpus.6 As the extraction of events specifying also sites is known to be particularly challenging (Kim et al., 2009), these statistics suggest the PubMed subcorpus may represent a more difficult extraction task. Only very few DNA demethylation events are found in either subcorpus. Overall, the PubMeth subcorpus contains nearly twice as many event annotations as the PubMed one, indicating that the focused document selection strategy was successful in identifying particularly event-rich abstracts.

  5 The differences in the number of GGP annotations may be affected by the PubMeth entity annotation criteria.
  6 The number of annotated sites is less than the number of events with a Site argument as the annotation criteria only call for annotating a site entity when it is referred to from an event, and multiple events can refer to the same site entity.

                        PubMeth   PubMed   Total
    Abstracts               100      100     200
    Sentences              1118     1009    2127
    Entities
      GGP                  1695     1195    2890
      Site                  240      234     474
      Total                1935     1429    3364
    Events
      Theme only            660      214     874
      Theme and Site        323      297     620
      DNA methylation       977      485    1462
      DNA demethyl.           6       26      32
      Total                 983      511    1494

Table 2: Corpus statistics.

4.2   Annotation Quality

To measure the consistency of the produced annotation, we performed independent double annotation for a sample of 40% of the abstracts selected from the PubMed subcorpus; 20% of all abstracts. As the PubMed subcorpus event annotation is created without initial human annotation as reference (unlike the PubMeth subcorpus), agreement is expected to be lower on this subcorpus. This experiment should thus provide a lower bound on the overall consistency of the corpus.

  7 The high agreement is not due to annotators simply agreeing with the automatic initial annotation: the F-score of the automatic tagger against the two sets of human annotations was 65%/66% for exact and 85%/86% for overlap match.

We first measured agreement on the gene/gene product (GGP) entity annotation, and found very high agreement among 935 entities marked in total by the two annotators: 91% F-score using exact match criteria and 97% F-score using the relaxed “overlap” criterion where any two overlapping annotations are considered to match.7 We then separately measured agreement on event annotations
for those events that involved GGPs on which the        the use of a machine learning module for com-
annotators agreed, using the standard evaluation        plex event construction, and the use of two parsers
criteria described in Section 4.4. Agreement on         for syntactic analysis (Miwa et al., 2010b). We
event annotations was also high: 84% F-score            follow Miwa et al. in applying the HPSG-based
overall (85% for DNA methylation and 75% for            deep parser Enju (Miyao and Tsujii, 2008) using
DNA demethylation) over a total of 442 annotated        the high-speed parsing setting (“mogura”) and the
events.                                                 GDep (Sagae and Tsujii, 2007) native dependency
   The overall consistency of the annotation de-        parser, both with biomedical domain models based
pends on joint annotator agreement on the GGP           on the GENIA treebank data (Tateisi et al., 2006).
and event annotations. However, in experimental            For evaluation, we applied a version of the
settings such as that of the BioNLP ST where gold       BioNLP’09 ST evaluation tools8 modified to rec-
GGP annotation is assumed as the starting point         ognize the novel DNA methylation event type.
for event extraction, measured performance is not
                                                        4.4   Evaluation Criteria
affected by agreement on GGPs and thus arguably
only the latter factor applies. As this setting is      We followed the basic task setup and primary eval-
adopted also in the present study, annotation con-      uation criteria of the BioNLP’09 ST. Specifically,
sistency suggests a human upper bound no lower          we followed task 2 (“event enrichment”) criteria,
than 84% F-score on extraction performance.             requiring for correct extraction of a DNA methy-
   Estimates of the annotation consistency of           lation event both the identification of the modi-
biomedical domain corpora are regrettably seldom        fied gene (GGP entity) and the identification of
provided, and to the best of our knowledge ours is      the modification site (DNA domain or region en-
the first estimate of inter-annotator agreement for     tity) when stated. As in the shared task, human
a corpus following the event representation of the      annotation for GGP entities was provided as part
BioNLP ST. Given the complexity of the annota-          of the system input but other entities were not, so
tion – typed associations of event trigger, theme       that the system was required to identify the spans
and site – the agreement compares favorably to          of the mentioned modification sites.
e.g. the 67% inter-annotator F-score re-                   The performance of the system was evalu-
ported for protein-protein interactions on the ITI      ated using the standard precision, recall and F-
TXM corpora (Alex et al., 2008) and the full event      score metrics for the recovery of events, with
agreement on the GREC corpus (Thompson et al.,          event equality defined following the “Approxi-
2009).                                                  mate span” matching criterion applied in the pri-
                                                        mary evaluation for the BioNLP’09 ST. This cri-
4.3   Event Extraction Method                           terion relaxes strict matching requirements so that
                                                        a detected event trigger or entity is considered to
To estimate the feasibility of automatic extrac-        match a gold trigger/entity if its span is entirely
tion of DNA methylation events and the suitabil-        contained within the span of the gold trigger, ex-
ity of presently available event extraction meth-       tended by one word both to the left and to the right.
ods to this task, we performed experiments using
the EventMine event extraction system of (Miwa          4.5   Experimental Setup
et al., 2010b). On task 2 of the BioNLP                 We divided the corpus into three parts, first setting
ST dataset, the benchmark most relevant to our          one third of the abstracts aside as a held-out test
task setting, the applied version of EventMine was      set and then splitting the remaining two thirds in a
recently evaluated at 55% F-score (Miwa et al.,         roughly 3:1 ratio into a training set and a develop-
2010a), outperforming the best task 2 system in         ment test set, giving 100 abstracts for training, 34
the original shared task (Riedel et al., 2009) by       for development, and 66 for final test. The splits
more than 10 percentage points. To the best of our knowl-   were performed randomly, with sampling stratified so that
edge, this system represents the state of the art for   each set has an equal number of abstracts drawn
this event extraction task.                             from the PubMeth and PubMed subcorpora.
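
The subcorpus-balanced random split described in the experimental setup can be illustrated with a minimal sketch. It is not the authors' actual procedure: the document identifiers, the function name `stratified_split`, the random seed, and the assumption of 100 abstracts per subcorpus are all hypothetical and chosen only so that the resulting set sizes (100 training, 34 development, 66 test) match those reported.

```python
import random


def stratified_split(docs_by_subcorpus, sizes, seed=0):
    """Randomly split every subcorpus into named parts of fixed size.

    `sizes` maps a part name to the number of documents taken from EACH
    subcorpus, so every part contains equally many documents per subcorpus.
    """
    rng = random.Random(seed)
    splits = {name: [] for name in sizes}
    for subcorpus, doc_ids in docs_by_subcorpus.items():
        if sum(sizes.values()) != len(doc_ids):
            raise ValueError(f"sizes do not cover subcorpus {subcorpus!r}")
        shuffled = list(doc_ids)
        rng.shuffle(shuffled)
        start = 0
        for name, n in sizes.items():
            part = shuffled[start:start + n]
            splits[name].extend((subcorpus, doc_id) for doc_id in part)
            start += n
    return splits


# Hypothetical identifiers; assumes 100 abstracts in each subcorpus.
corpus = {
    "PubMeth": [f"pubmeth-{i:03d}" for i in range(100)],
    "PubMed": [f"pubmed-{i:03d}" for i in range(100)],
}
# One third held out for testing, the rest split roughly 3:1 (per subcorpus).
per_subcorpus_sizes = {"test": 33, "train": 50, "dev": 17}
splits = stratified_split(corpus, per_subcorpus_sizes, seed=42)

for name, docs in splits.items():
    print(name, len(docs))
```

Fixing the per-subcorpus sizes, rather than sampling from the pooled 200 abstracts, is what guarantees that each set draws equally from both subcorpora.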
   EventMine is an SVM-based machine learning              The EventMine system has a single tunable
system following the pipeline design of the best        threshold parameter that controls the tradeoff be-
system in the BioNLP ST (Björne et al., 2009),           8
                                                            http://www-tsujii.is.s.u-tokyo.ac.jp/
extending it with refinements to the feature set,       GENIA/SharedTask/downloads.shtml
      Event type            prec.    recall    F-score
      DNA methylation      77.6%    77.2%       77.4%
      DNA demethylation   100.0%    11.1%       20.0%
      Total                77.7%    76.0%       76.8%

      Table 3: Overall extraction performance.

                                Test set
        Training set   PubMed   PubMeth        Both
        PubMed          64.9%    71.2%        71.6%
        PubMeth         62.9%    80.0%        74.0%
        Both            66.2%    82.5%        76.8%

            Table 4: F-score by subcorpus.

[Figure 4 plot: F-score (y-axis, 30 to 90) against fraction of training data in % (x-axis, 0 to 100), with one curve per test set: Both, PubMed, PubMeth.]

Figure 4: Learning curve for the two subcorpora and their combination. Both subcorpora used for training. Average and error bars calculated by 10 repetitions of random subsampling of training data, testing on the development set.

tween system precision and recall. We first set
the tradeoff using a sparse search of the parame-
ter space [0:1], evaluating the performance of the
system by training on the training set and evaluat-
ing on the development set. As these experiments
did not indicate any other parameter setting could       mance for all three test sets, indicating that the
provide significantly better performance, we chose       subcorpora are compatible.
the default threshold setting of 0.5. To study the          The learning curve (Figure 4) shows rela-
effect of training data size on performance, we per-     tively high performance and rapid improvement
formed extraction experiments randomly down-             for modest amounts of data, but performance im-
sampling the training data on the document level         provement with additional data levels out rela-
with testing on the development set. In final exper-     tively fast, nearly flattening as use of the training
iments EventMine was trained on the combined             data approaches 100%. This suggests that extrac-
training and development data and performance            tion performance for this task is not primarily lim-
evaluated on the held-out test data.                     ited by training data size and that additional an-
                                                         notation following the same protocol is unlikely
4.6    Extraction performance                            to yield notable improvement in F-score without
                                                         a substantial investment of resources. As perfor-
Table 3 shows extraction results on the held-out
                                                         mance for the PubMed subcorpus (for which inter-
test data. While DNA methylation events could
                                                         annotator agreement was measured) is not yet ap-
be extracted quite reliably, the system performed
                                                         proaching the limit implied by the corpus annota-
poorly for DNA demethylation events. The latter
                                                         tion consistency (Section 4.2), the results suggest
result is perhaps not surprising given their small
                                                         further need for the development of event extrac-
number – only 38 in total in the corpus – and indi-
                                                         tion methods to improve DNA methylation event
cates that a separate selection strategy is necessary
                                                         extraction.
to provide resources for learning the reverse reac-
tion. Overall performance shows a small prefer-          5          Related Work
ence for precision over recall at 77% F-score. We
view this level of performance as very good for a        DNA methylation and related epigenetic mech-
first result.                                            anisms of gene expression control have been
   To evaluate the relative difficulty of the extrac-    a focus of considerable recent research in
tion tasks that the two subcorpora represent and         biomedicine. There are many excellent reviews of
their merits as training material, we performed          this broad field; we refer the interested reader to
tests separating the two (Table 4). As predicted         (Jaenisch and Bird, 2003; Suzuki and Bird, 2008).
from corpus statistics (Section 4.1), the PubMed            There is a wealth of recent related work also
subcorpus represents the more challenging extrac-        on event extraction. In the BioNLP’09 shared
tion task. When testing on a single subcorpus, re-       task, 24 teams participated in the primary task and
sults are, unsurprisingly, better when training data     six teams in Task 2, which most closely resembles our
is drawn from the same subcorpus; however, train-        setup in that it also required the detection of mod-
ing on the combined data gives the best perfor-          ified gene/protein and modification site. The top-
performing system in Task 2 (Riedel et al., 2009)      precision and 76% recall. The learning curve sug-
achieved 44% F-score, and the highest perfor-          gested that the corpus size is sufficient and that
mance reported since that we are aware of is 55%       future efforts in DNA methylation event extraction
F-score for EventMine (Miwa et al., 2010b). The        should focus on extraction method development.
performance we achieved for DNA methylation is            One natural direction for future work is to ap-
considerably better than this overall result, essen-   ply event extraction systems trained on the newly
tially matching the best reported performance for      introduced data to abstracts available in PubMed
Phosphorylation events, which we previously ar-        and full texts available at PMC to create a detailed,
gued to be the closest shared task analogue to the     up-to-date repository of DNA methylation events
new event category studied here. Nevertheless, di-     at full literature scale. Such an effort would re-
rect comparison of these results may not be mean-      quire gene name normalization and event extrac-
ingful due to confounding factors. The only text       tion at PubMed scale, both of which have recently
mining effort specifically targeting DNA methy-        been shown to be technically feasible (Gerner et
lation that we are aware of is that performed for      al., 2010; Björne et al., 2010). Further combining
the initial annotation of the PubMeth and MeIn-        the extracted events with cancer mention detection
foText databases (Ongenaert et al., 2008; Fang et      could provide a valuable resource for epigenetics
al., 2008), both applying approaches based on key-     research.
word matching. However, neither of these stud-            The newly annotated corpus, the first re-
ies report results for instance-level extraction of    source annotated for DNA methylation using
methylation statements.                                the event representation, is freely available
   The present study is in many respects simi-         for use in research from the GENIA
lar to our previous work targeting protein post-       project homepage http://www-tsujii.is.
translational modification events (Ohta et al.,        s.u-tokyo.ac.jp/GENIA.
2010). In this work, we annotated 422 events
of 7 different types and showed that retraining        Acknowledgments
an existing event extraction system allowed these      We would like to thank Maté Ongenaert and other
to be extracted at 42% F-score. Our approach           creators of PubMeth for their generosity in al-
here clearly differs from this previous work in its    lowing the release of resources building on their
larger scale and concentrated focus on a particu-      work and the anonymous reviewers for their many
lar event type of high interest, reflected also in     insightful comments. This work was supported
results: while extraction performance in our pre-      by Grant-in-Aid for Specially Promoted Research
vious work was limited by training data size, in       (MEXT, Japan).
the present study notably higher extraction perfor-
mance was achieved and a plateau in performance
                                                       References
with increasing data reached.
                                                       Bea Alex, Claire Grover, Barry Haddow, Mijail Kabadjov,
                                                         Ewan Klein, Michael Matthews, Stuart Roebuck, Richard
6   Discussion and Future Work                           Tobin, and Xinglong Wang. 2008. The ITI TXM cor-
                                                         pora: Tissue expressions and protein-protein interactions.
We have presented a study of the automatic ex-           In Proceedings of LREC’08.
traction of DNA methylation events from litera-        Celine Amoreira, Winfried Hindermann, and Christoph
ture following the BioNLP’09 shared task event           Grunau. 2003. An improved version of the DNA methy-
                                                         lation database (MethDB). Nucl. Acids Res., 31(1):75–77.
representation and a state-of-the-art event extrac-
tion system. We created a corpus of 200 publica-       Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Dou-
                                                         glas B. Kell. 2010. Event extraction for systems biology
tion abstracts selected to include a representative      by text mining the literature. Trends in Biotechnology,
sample of DNA methylation statements from all of         28(7):381–390.
PubMed and manually annotated for nearly 3000          Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio
mentions of genes and gene products, 500 DNA              Pahikkala, and Tapio Salakoski. 2009. Extracting com-
domain or region mentions and 1500 DNA methy-             plex biological events with rich graph-based feature sets.
                                                          In Proceedings of BioNLP’09 Shared Task, pages 10–18.
lation and demethylation events. Evaluation using
the EventMine system showed that DNA methy-            Jari Björne, Filip Ginter, Sampo Pyysalo, Jun’ichi Tsujii,
                                                          and Tapio Salakoski. 2010. Scaling up biomedical
lation events can be extracted simply by retrain-         event extraction to the entire pubmed. In Proceedings of
ing an off-the-shelf event extraction system at 78%       BioNLP’10, pages 28–36.
Yu-Ching Fang, Hsuan-Cheng Huang, and Hsueh-Fen Juan.          Tomoko Ohta, Sampo Pyysalo, Makoto Miwa, Jin-Dong
  2008. Meinfotext: associated gene methylation and can-         Kim, and Jun’ichi Tsujii. 2010. Event extraction
  cer information from text mining. BMC Bioinformatics,          for post-translational modifications. In Proceedings of
  9(1):22.                                                       BioNLP’10, pages 19–27.

Martin Gerner, Goran Nenadic, and Casey M. Bergman.            Maté Ongenaert, Leander Van Neste, Tim De Meyer, Ger-
  2010. An exploration of mining gene expression mentions        ben Menschaert, Sofie Bekaert, and Wim Van Criekinge.
  and their anatomical locations from biomedical text. In        2008. PubMeth: a cancer methylation database combin-
  Proceedings of BioNLP 2010, pages 72–80.                       ing text-mining and expert annotation. Nucl. Acids Res.,
                                                                 36(suppl 1):D842–846.
Robin Holliday and JE Pugh. 1975. DNA modification mech-
  anisms and gene activity during development. Science,        Filip Pattyn, Jasmien Hoebeeck, Piet Robbrecht, Evi
  187:226–232.                                                    Michels, Anne De Paepe, Guy Bottu, David Coornaert,
                                                                  Robert Herzog, Frank Speleman, and Jo Vandesom-
Robin Holliday. 1987. The inheritance of epigenetic defects.      pele. 2006. methblast and methprimerdb: web-tools
  Science, 238:163–170.                                           for pcr based methylation analysis. BMC Bioinformatics,
                                                                  7(1):496.
Lawrence Hunter and K. Bretonnel Cohen. 2006. Biomed-
  ical language processing: What’s beyond PubMed?              Hoifung Poon and Lucy Vanderwende. 2010. Joint inference
  Molecular Cell, 21(5):589–594.                                 for knowledge extraction from biomedical literature. In
                                                                 Proceedings of NAACL/HLT’10, pages 813–821.
Rudolf Jaenisch and Adrian Bird. 2003. Epigenetic regula-
  tion of gene expression: how the genome integrates intrin-   Sampo Pyysalo, Antti Airola, Juho Heimonen, and Jari
  sic and environmental signals. Nature Genetics, 33:245–        Björne. 2008. Comparative analysis of five protein-
  254.                                                           protein interaction corpora.   BMC Bioinformatics,
                                                                 9(Suppl. 3):S6.
Jin-Dong Kim, Tomoko Ohta, and Jun’ichi Tsujii. 2008.
   Corpus annotation for mining biomedical events from lit-    Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi, and
   erature. BMC Bioinformatics, 9(10).                           Jun’ichi Tsujii. 2009. A markov logic approach to bio-
                                                                 molecular event extraction. In Proceedings of BioNLP’09
Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu              Shared Task, pages 41–49.
   Kano, and Jun’ichi Tsujii. 2009. Overview of bionlp’09
   shared task on event extraction. In Proceedings of          A.D. Riggs. 1975. X inactivation, differentiation, and dna
   BioNLP’09.                                                    methylation. Cytogenetic and Genome Research, 14:9–
                                                                 25.
Juha Kononen, Lukas Bubendorf, Anne Kallioniemi, Maarit
   Barlund, Peter Schraml, Stephen Leighton, Joachim           Kenji Sagae and Jun’ichi Tsujii. 2007. Dependency pars-
   Torhorst, Michael J Mihatsch, Guido Sauter, and Olli-         ing and domain adaptation with LR models and parser en-
   P. Kallioniemi. 1998. Tissue microarrays for high-            sembles. In Proceedings of EMNLP-CoNLL 2007, pages
   throughput molecular profiling of tumor specimens. Nat        1044–1050.
   Med, 4(7):844–847.
                                                               Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O.
R. Leaman and G. Gonzalez. 2008. Banner: An executable           Brown. 1995. Quantitative Monitoring of Gene Expres-
   survey of advances in biomedical named entity recogni-        sion Patterns with a Complementary DNA Microarray.
   tion. In Proceedings of PSB’08, pages 652–663.                Science, 270(5235):467–470.
Gavin MacBeath and Stuart L. Schreiber. 2000. Printing Pro-    Miho M. Suzuki and Adrian Bird. 2008. Dna methylation
  teins as Microarrays for High-Throughput Function Deter-       landscapes: provocative insights from epigenomics. Na-
  mination. Science, 289(5485):1760–1763.                        ture Review Genetics, 9:465–476.
Makoto Miwa, Sampo Pyysalo, Tadayoshi Hara, and Jun’ichi       Lorraine Tanabe, Natalie Xie, Lynne H Thom, Wayne Mat-
  Tsujii. 2010a. A comparative study of syntactic parsers        ten, and W John Wilbur. 2005. GENETAG: A tagged
  for event extraction. In Proceedings of BioNLP’10, pages       corpus for gene/protein named entity recognition. BMC
  37–45.                                                         Bioinformatics, 6(Suppl. 1):S3.
Makoto Miwa, Rune Sætre, Jin-Dong Kim, and Jun’ichi Tsu-       Yuka Tateisi, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2006.
  jii. 2010b. Event extraction with complex event classifi-      Subdomain adaptation of a POS tagger with a small corpus.
  cation using rich features. Journal of Bioinformatics and      In Proceedings of BioNLP’06, pages 136–137, New York,
  Computational Biology (JBCB), 8(1):131–146.                    USA, June.
Yusuke Miyao and Jun’ichi Tsujii. 2008. Feature forest mod-    The Gene Ontology Consortium. 2000. Gene ontology: tool
  els for probabilistic HPSG parsing. Computational Lin-         for the unification of biology. Nature Genetics, 25:25–29.
  guistics, 34(1):35–80.
                                                               Paul Thompson, Syed Iqbal, John McNaught, and Sophia
Tomoko Ohta, Yuka Tateisi, Hideki Mima, and Jun’ichi Tsu-        Ananiadou. 2009. Construction of an annotated corpus to
  jii. 2002. GENIA corpus: An annotated research abstract        support biomedical information extraction. BMC Bioin-
  corpus in molecular biology domain. In Proceedings of          formatics, 10(1):349.
  HLT’02, pages 73–77.
                                                               Yue Wang, Jin-Dong Kim, Rune Sætre, Sampo Pyysalo, and
Tomoko Ohta, Jin-Dong Kim, Sampo Pyysalo, Yue Wang,              Jun’ichi Tsujii. 2009. Investigating heterogeneous pro-
  and Jun’ichi Tsujii. 2009. Incorporating GENETAG-              tein annotations toward cross-corpora utilization. BMC
  style annotation to GENIA corpus. In Proceedings of            Bioinformatics, 10(403).
  BioNLP’09, pages 106–107.