Event Extraction for DNA Methylation

Tomoko Ohta∗ Sampo Pyysalo∗ Makoto Miwa∗ Jun’ichi Tsujii∗†‡
∗ Department of Computer Science, University of Tokyo, Tokyo, Japan
† School of Computer Science, University of Manchester, Manchester, UK
‡ National Centre for Text Mining, University of Manchester, Manchester, UK
{okap,smp,mmiwa,tsujii}@is.s.u-tokyo.ac.jp

Abstract

We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and is implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. We present an annotation scheme following the representation of the recent BioNLP’09 shared task on event extraction, select a set of 200 abstracts including a balanced sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction can be performed at 78% precision and 76% recall. The introduced resources are freely available for use in research from the GENIA project homepage.¹

¹ http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA

1 Introduction

During the previous decade of concentrated study of biomedical information extraction (IE), most efforts have focused on the foundational task of detecting mentions of entities of interest and the extraction of simple relations between these entities, typically represented as undifferentiated binary associations (Pyysalo et al., 2008). However, in recent years there has been increased interest in biomolecular event extraction using representations that capture typed, structured n-ary associations of entities in specific roles, such as regulation of the phosphorylation of a specific domain of a particular protein (Ananiadou et al., 2010). The state of the art in such extraction methods was evaluated in the BioNLP’09 Shared Task on Event Extraction (below, BioNLP ST) (Kim et al., 2009), and event extraction following the BioNLP ST model has continued to draw interest after the task, with recent work including advances in extraction methods (Miwa et al., 2010a; Poon and Vanderwende, 2010), the release of extraction system software and large-scale automatically annotated data (Björne et al., 2010) and the development of additional annotated resources following the event representation (Ohta et al., 2010).

Of the findings of the BioNLP ST evaluation, it is of particular interest to us that the highest-performing methods include many that are purely machine-learning based (Kim et al., 2009), learning what to extract directly from a corpus annotated with examples of the events of interest. This implies that state-of-the-art extraction methods for new types of events can be created by providing annotated resources to an existing system, without the need for direct development of natural language processing or IE methods. Here, we apply this approach to DNA methylation, a specific and biologically highly relevant event type not considered in previous event extraction studies.

In the following, we first outline the biological significance of DNA methylation and discuss existing resources. We then introduce the event extraction approach applied, present the new annotated corpus created in this study, and report event extraction results using a method trained on the corpus.

2 DNA Methylation

The term epigenetics refers to a set of molecular mechanisms “beyond genetics” – i.e. without change in DNA sequence – that are today understood to play an important role in several biological processes, including the genetic program for development, cell differentiation and tissue-specific gene expression. DNA methylation was first suggested as an epigenetic mechanism for the control of gene activity during development in 1975 (Riggs, 1975; Holliday and Pugh, 1975), and the role of DNA methylation in cancer was first reported in 1987 (Holliday, 1987). DNA methylation of CpG islands in promoter regions is now understood to be one of the most consistent genetic alterations in cancer, and DNA methylation is a prominent area of study.

Chemically, DNA methylation is a simple reaction adding a methyl group to a specific position of the cytosine pyrimidine ring or the adenine purine ring. While a single nucleotide can only be either methylated or unmethylated, in text the overall degree of promoter methylation is often reported as hypo- and hyper-methylation, with hyper-methylation implying that the expression of a gene is silenced. Because of the precise definition of the phenomenon and the relatively specific terms in which it is typically discussed in publications, we expected DNA methylation to provide a well-defined target for annotation and automatic extraction.

2.1 DNA Methylation in PubMed

We follow common practice in biomedical IE in drawing texts for our corpus from PubMed abstracts. Currently containing more than 20 million citations for biomedical literature (over 11M with abstracts) and growing exponentially (Hunter and Cohen, 2006), the literature database provides a rich resource for IE and text mining.

To facilitate access to documents relevant to specific topics, each PubMed citation is manually assigned terms that identify its primary topics using MeSH, a controlled vocabulary of over 25,000 terms. MeSH also contains a DNA Methylation term, allowing specific searches for citations on the topic. Figure 1 shows the number of citations per year of publication matching this term contrasted with overall citations, illustrating explosive growth of interest in DNA methylation, outstripping the overall growth of the literature. Particular increases can be seen after the introduction of DNA microarrays for monitoring gene expression (Schena et al., 1995) and the introduction of high-throughput screening methods (Kononen et al., 1998; MacBeath and Schreiber, 2000). The total number of PubMed citations tagged with DNA Methylation at the time of this writing is 15456 (14350 of which have an abstract). The large number of documents tagged for the DNA methylation MeSH term and the human judgments assuring their relevance make querying for this term a natural choice for selecting text. However, direct PubMed query as the only selection strategy would ignore significant existing resources, discussed in the following.

Figure 1: Citations tagged with the MeSH term DNA Methylation compared to all citations in PubMed by publication year. Note different scales. (x-axis: publication year, 1985–2005; left y-axis: all PubMed citations; right y-axis: DNA Methylation citations.)

2.2 DNA Methylation Databases

A growing number of databases collating information on DNA methylation are becoming available. The first such database, MethDB (Amoreira et al., 2003), was introduced in 2001 and remains actively developed. MethDB contains PubMed citation references as evidence for contained entries, but no more specific identification of the expressions stating DNA methylation events. The methPrimerDB (Pattyn et al., 2006) database provides additional information on PCR primers on top of MethDB, but does not add further specification of the methylated gene or text-bound annotation. PubMeth (Ongenaert et al., 2008) is a database of DNA methylation in cancer with evidence sentences from the literature. This database stores information on cancer types and subtypes, methylated genes and the experimental method used to identify methylation, as well as evidence sentences. MeInfoText (Fang et al., 2008) is a database of DNA methylation and cancer information automatically extracted from PubMed documents matching the query terms human, methylation and cancer using term co-occurrence statistics. Of the DNA methylation resources, only PubMeth and MeInfoText contain text-bound annotation identifying specific spans of characters containing the gene mention and expressing DNA methylation in evidence sentences supporting database entries.

a) MS-PCR revealed the [methylation] of the [p16] gene in 10 (34%) of 29 [NSCLCs]
b) 30% (27 of 91) of [lung tumors] showed [hypermethylation] of the 5’CpG region of the [p14ARF gene]
c) [Promotor hypermethylations] were detected in [O6-methylguanine-DNA methyltransferase (MGMT), RB1, estrogen receptor, p73, p16INK4a, death-associated protein kinase, p15INK4b, and p14ARF]
d) The promoter region of the [p16INK4] gene was [hypermethylated] in the tumor samples of the primary or metastatic site

Table 1: Examples of PubMeth evidence sentence annotation. Annotated spans delimited by brackets and statements expressing methylation underlined, gene mentions shown in italics, and cancer mentions in bold.

In this study, we consider specifically PubMeth as a source of reference text-bound annotations due to availability and the ability to redistribute derived data. Initial text-bound annotations in PubMeth were generated using keyword lookup, but the database annotations are manually reviewed. Table 1 shows example evidence sentences from PubMeth and their annotated spans. While the PubMeth annotation differs from the BioNLP ST representation in a number of ways, such as not separating coordinated entities (Table 1c) and not annotating methylation sites (Table 1d), it provides both a reference identifying annotation targets from a biologically motivated perspective and a potential starting point for full event annotation.

3 Annotation

For annotation, we adapted the representation applied in the BioNLP ST on event extraction with minimal changes in order to allow systems developed for the task to be applied also to the newly annotated corpus. Documents were selected following the basic motivation presented above, with reference to the requirements specified by the annotation scheme, and some automatic preprocessing was applied as annotator support. This section details the annotation approach.

3.1 Entity and Event Representation

For the core named entity annotation, we thus primarily follow the gene/gene product (GGP) annotation criteria applied for the shared task data (Ohta et al., 2009). In brief, the guidelines specify annotation of minimal contiguous spans containing mentions of specific gene or gene product (RNA/protein) names, where a specific name is understood to be one allowing a biologist to identify the corresponding entry in a gene/protein database such as Uniprot or Entrez Gene. The annotation thus excludes e.g. names of families and complexes. A single annotation type, Gene or gene product, is applied without distinction between genes and their products. In addition to the identification of the modified gene, it is important to identify the site of the modification. We marked mentions of sites relevant to the events as DNA domain or region terms following the original GENIA term corpus annotation guidelines (Ohta et al., 2002).

For representing DNA methylation events, the annotation applied to capture protein phosphorylation events in the BioNLP ST task 2 closely matched the needs for DNA methylation (Figure 2). While the Site arguments of the ST Phosphorylation events are protein domains, machine-learning based extraction methods should be able to associate this role with DNA domains given training data. We thus adopted a representation where DNA methylation events are associated with a gene/gene product as their Theme and a DNA domain or region as Site. Each event is also associated with a particular span of text expressing it, termed the event trigger.² We further initially marked catalysts using Positive regulation events following the BioNLP ST model, but dropped this class of annotation as a sufficient number of examples was not found in the corpus.

² Annotators were instructed to always mark some trigger expression. We note that while we do not here specifically distinguish hypo- and hyper-methylation, the trigger annotations are expected to facilitate adding these distinctions if necessary.

Figure 2: Event annotation for phosphorylation.

The event types of the BioNLP ST are drawn from the GENIA Event ontology (Kim et al., 2008), which in turn draws its type definitions from the community-standard Gene Ontology (GO) (The Gene Ontology Consortium, 2000). To maintain compatibility with these resources, we opted to follow the GO also for the definition of the new event type considered here.

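The event representation adopted in Section 3.1 (a trigger span, a GGP Theme and an optional DNA domain or region Site) can be sketched as a small data structure. This is only an illustrative sketch: the class names, field names and the "DNA_methylation" label are hypothetical and are not the corpus file format or any tool's API. The example sentence is Table 1d, with the promoter region taken as the Site.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextSpan:
    """A contiguous stretch of text, identified by character offsets."""
    start: int  # offset of the first character (inclusive)
    end: int    # offset one past the last character (exclusive)
    text: str


@dataclass(frozen=True)
class Event:
    """One annotated event; all names here are hypothetical."""
    event_type: str                   # illustrative label, e.g. "DNA_methylation"
    trigger: TextSpan                 # text span expressing the event
    theme: TextSpan                   # gene/gene product (GGP) that is modified
    site: Optional[TextSpan] = None   # DNA domain or region, when stated


def span_of(sentence: str, phrase: str) -> TextSpan:
    """Locate the first occurrence of phrase and return it as a TextSpan."""
    start = sentence.index(phrase)
    return TextSpan(start, start + len(phrase), phrase)


sentence = ("The promoter region of the p16INK4 gene was hypermethylated "
            "in the tumor samples")

event = Event(
    event_type="DNA_methylation",
    trigger=span_of(sentence, "hypermethylated"),
    theme=span_of(sentence, "p16INK4"),
    site=span_of(sentence, "promoter region"),
)
```

Keeping Site optional mirrors the corpus, in which many events carry only a Theme argument.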
GO defines DNA methylation as

    The covalent transfer of a methyl group to either N-6 of adenine or C-5 or N-4 of cytosine.

We note that while the definition may appear restrictive, methylation of adenine N-6 or cytosine C-5/N-4 encompasses the entire set of ways in which DNA can be methylated. This definition could thus be adopted without limitation to the scope of the annotation.

3.2 Document Selection

The selection of source documents for an annotated corpus is critical for assuring that the corpus provides relevant and representative material for studying the phenomena of interest. Domain corpora frequently consist of documents from a particular subdomain of interest: for example, the GENIA corpus focuses on documents concerning transcription factors in human blood cells (Ohta et al., 2002). Methods trained and evaluated on such focused resources will not necessarily generalize well to broader domains. However, there has been little study of the effect of document selection on event extraction performance. Here, we applied two distinct strategies to get a representative sample of the full scope of DNA methylation events in the literature and to assure that our annotations are relevant to the interests of biologists.

In the first strategy, we aimed in particular to select a representative sample of documents relevant to the targeted event types. For this purpose, we directly searched the PubMed literature database. We further decided not to include any text-based query in the search to avoid biasing the selection toward particular entities or forms of event expression. Instead, we only queried for the single MeSH term DNA Methylation. While this search is expected to provide high-precision results for the full topic, not all such documents necessarily discuss events where specific genes are methylated. In initial efforts to annotate a random sample of these documents, we found that many did not mention specific gene names. To reduce wasted effort in examining documents that contain no markable events, we added a filter requiring a minimum number of (likely) gene mentions. We first tagged all 14350 citations tagged with DNA Methylation that have an abstract in PubMed using the BANNER tagger (Leaman and Gonzalez, 2008). We found that while the overwhelmingly most frequent number of tagged mentions per document is zero, a substantial mass of abstracts have large mention counts (Figure 3).³ We decided after brief preliminary experiments to filter the initial selection of documents to include only those in which at least 5 gene/protein mentions were marked by an automatic tagger. This excludes most documents without markable events without introducing obvious other biases.

³ The tagger has been evaluated at 86% F-score on a broad-coverage corpus, suggesting this is unlikely to severely misestimate the true distribution.

Figure 3: Number of citations with given number of automatically tagged gene/protein mentions. (x-axis: number of gene/protein mentions; y-axis: number of documents.)

In the second strategy, we extended and completed the annotation of a random selection of PubMeth evidence sentences, aiming to leverage existing resources and to select documents that had been previously judged relevant to the interests of biologists studying the topic. This provides an external definition of document relevance and allows us to estimate to what extent the applied annotation strategy can capture biologically relevant statements. This strategy is also expected to select a concentrated, event-rich set of documents. However, the selection may also carry over biases toward particular subsets of relevant documents from the original selection and will not be a representative sample of the overall distribution of such documents in the literature.

For producing the largest number of event annotations with the least effort, the most efficient way to use the PubMeth data would have been to simply extract the evidence sentences and complete the annotation for these. However, viewing the context in which event statements occur as centrally important, we opted to annotate complete abstracts, with initial annotations from PubMeth evidence sentences automatically transferred into the abstracts. We note that not all PubMeth evidence spans were drawn from abstracts, and not all of those that were matched a contiguous span of text. We could align PubMeth evidence annotations into 667 PubMed abstracts (approximately 57% of the PMIDs referenced in PubMeth) and completed event annotation for a random sample of these.

3.3 Document Preprocessing

To reduce annotation effort, we applied automatic systems to produce initial candidate sentence boundaries and GGP annotations for the corpus. For sentence splitting, we applied the GENIA sentence splitter⁴, and for gene/protein tagging, we applied the BANNER NER system (Leaman and Gonzalez, 2008) trained on the GENETAG corpus (Tanabe et al., 2005). The GENETAG guidelines and gene/protein entity annotation coverage are known to differ from those applied for GGP annotation here (Wang et al., 2009). However, the broad coverage of PubMed provided by GENETAG suggests that taggers trained on the corpus are likely to generalize to new subdomains such as that considered here. By contrast, all annotations following GGP guidelines that we are aware of are subdomain-specific.

⁴ http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/

We note that all annotations in the produced corpus are at a minimum confirmed by a human annotator and that events are annotated without performing initial automatic tagging to assure that no bias toward particular extraction methods or approaches is introduced.

4 Results

4.1 Corpus Statistics

Corpus statistics are given in Table 2. There are some notable differences between the subcorpora created using the different selection strategies. While the subcorpora are similar in size, the PubMeth GGP count is 1.4 times that of the PubMed subcorpus⁵, yet roughly equal numbers of methylation sites are annotated in the two. This difference is even more pronounced in the statistics for event arguments, where two thirds of PubMeth subcorpus events contain only a Theme argument identifying the GGP, while events where both Theme and Site are identified are more frequent in the other subcorpus.⁶ As the extraction of events specifying also sites is known to be particularly challenging (Kim et al., 2009), these statistics suggest the PubMed subcorpus may represent a more difficult extraction task. Only very few DNA demethylation events are found in either subcorpus. Overall, the PubMeth subcorpus contains nearly twice as many event annotations as the PubMed one, indicating that the focused document selection strategy was successful in identifying particularly event-rich abstracts.

⁵ The differences in the number of GGP annotations may be affected by the PubMeth entity annotation criteria.
⁶ The number of annotated sites is less than the number of events with a Site argument as the annotation criteria only call for annotating a site entity when it is referred to from an event, and multiple events can refer to the same site entity.

                      PubMeth   PubMed   Total
Abstracts                 100      100     200
Sentences                1118     1009    2127
Entities
  GGP                    1695     1195    2890
  Site                    240      234     474
  Total                  1935     1429    3364
Events
  Theme only              660      214     874
  Theme and Site          323      297     620
  DNA methylation         977      485    1462
  DNA demethyl.             6       26      38
  Total                   983      511    1494

Table 2: Corpus statistics.

4.2 Annotation Quality

To measure the consistency of the produced annotation, we performed independent double annotation for a sample of 40% of the abstracts selected from the PubMed subcorpus (20% of all abstracts). As the PubMed subcorpus event annotation is created without initial human annotation as reference (unlike the PubMeth subcorpus), agreement is expected to be lower on this subcorpus. This experiment should thus provide a lower bound on the overall consistency of the corpus.

We first measured agreement on the gene/gene product (GGP) entity annotation, and found very high agreement among 935 entities marked in total by the two annotators: 91% F-score using exact match criteria and 97% F-score using the relaxed “overlap” criterion where any two overlapping annotations are considered to match.⁷ We then separately measured agreement on event annotations for those events that involved GGPs on which the annotators agreed, using the standard evaluation criteria described in Section 4.4. Agreement on event annotations was also high: 84% F-score overall (85% for DNA methylation and 75% for DNA demethylation) over a total of 442 annotated events.

⁷ The high agreement is not due to annotators simply agreeing with the automatic initial annotation: the F-score of the automatic tagger against the two sets of human annotations was 65%/66% for exact and 85%/86% for overlap match.

The overall consistency of the annotation depends on joint annotator agreement on the GGP and event annotations. However, in experimental settings such as that of the BioNLP ST where gold GGP annotation is assumed as the starting point for event extraction, measured performance is not affected by agreement on GGPs and thus arguably only the latter factor applies. As this setting is adopted also in the present study, annotation consistency suggests a human upper bound no lower than 84% F-score on extraction performance.

Estimates of the annotation consistency of biomedical domain corpora are regrettably seldom provided, and to the best of our knowledge ours is the first estimate of inter-annotator agreement for a corpus following the event representation of the BioNLP ST. Given the complexity of the annotation – typed associations of event trigger, theme and site – the agreement compares favorably to e.g. the 67% inter-annotator F-score reported for protein-protein interactions on the ITI TXM corpora (Alex et al., 2008) and the full event agreement on the GREC corpus (Thompson et al., 2009).

4.3 Event Extraction Method

To estimate the feasibility of automatic extraction of DNA methylation events and the suitability of presently available event extraction methods to this task, we performed experiments using the EventMine event extraction system of Miwa et al. (2010b). On task 2 of the BioNLP ST dataset, the benchmark most relevant to our task setting, the applied version of EventMine was recently evaluated at 55% F-score (Miwa et al., 2010a), outperforming the best task 2 system in the original shared task (Riedel et al., 2009) by more than 10 percentage points. To the best of our knowledge, this system represents the state of the art for this event extraction task.

EventMine is an SVM-based machine learning system following the pipeline design of the best system in the BioNLP ST (Björne et al., 2009), extending it with refinements to the feature set, the use of a machine learning module for complex event construction, and the use of two parsers for syntactic analysis (Miwa et al., 2010b). We follow Miwa et al. in applying the HPSG-based deep parser Enju (Miyao and Tsujii, 2008) using the high-speed parsing setting (“mogura”) and the GDep (Sagae and Tsujii, 2007) native dependency parser, both with biomedical domain models based on the GENIA treebank data (Tateisi et al., 2006).

For evaluation, we applied a version of the BioNLP’09 ST evaluation tools⁸ modified to recognize the novel DNA methylation event type.

⁸ http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/downloads.shtml

4.4 Evaluation Criteria

We followed the basic task setup and primary evaluation criteria of the BioNLP’09 ST. Specifically, we followed task 2 (“event enrichment”) criteria, requiring for correct extraction of a DNA methylation event both the identification of the modified gene (GGP entity) and the identification of the modification site (DNA domain or region entity) when stated. As in the shared task, human annotation for GGP entities was provided as part of the system input but other entities were not, so that the system was required to identify the spans of the mentioned modification sites.

The performance of the system was evaluated using the standard precision, recall and F-score metrics for the recovery of events, with event equality defined following the “Approximate span” matching criterion applied in the primary evaluation for the BioNLP’09 ST. This criterion relaxes strict matching requirements so that a detected event trigger or entity is considered to match a gold trigger/entity if its span is entirely contained within the span of the gold trigger/entity, extended by one word both to the left and to the right.

4.5 Experimental Setup

We divided the corpus into three parts, first setting one third of the abstracts aside as a held-out test set and then splitting the remaining two thirds in a roughly 3:1 ratio into a training set and a development set, giving 100 abstracts for training, 34 for development, and 66 for final test. The splits were performed randomly, but sampling so that each set has an equal number of abstracts drawn from the PubMeth and PubMed subcorpora.

The EventMine system has a single tunable threshold parameter that controls the tradeoff between system precision and recall. We first set the tradeoff using a sparse search of the parameter space [0, 1], evaluating the performance of the system by training on the training set and evaluating on the development set. As these experiments did not indicate that any other parameter setting could provide significantly better performance, we chose the default threshold setting of 0.5. To study the effect of training data size on performance, we performed extraction experiments randomly downsampling the training data on the document level with testing on the development set. In final experiments EventMine was trained on the combined training and development data and performance evaluated on the held-out test data.

4.6 Extraction Performance

Table 3 shows extraction results on the held-out test data. While DNA methylation events could be extracted quite reliably, the system performed poorly for DNA demethylation events. The latter result is perhaps not surprising given their small number – only 38 in total in the corpus – and indicates that a separate selection strategy is necessary to provide resources for learning the reverse reaction. Overall performance shows a small preference for precision over recall at 77% F-score. We consider this level of performance very good for a first result.

Event type            prec.    recall   F-score
DNA methylation       77.6%    77.2%    77.4%
DNA demethylation    100.0%    11.1%    20.0%
Total                 77.7%    76.0%    76.8%

Table 3: Overall extraction performance.

To evaluate the relative difficulty of the extraction tasks that the two subcorpora represent and their merits as training material, we performed tests separating the two (Table 4). As predicted from corpus statistics (Section 4.1), the PubMed subcorpus represents the more challenging extraction task. When testing on a single subcorpus, results are, unsurprisingly, better when training data is drawn from the same subcorpus; however, training on the combined data gives the best performance for all three test sets, indicating that the subcorpora are compatible.

                        Test set
Training set    PubMed   PubMeth   Both
PubMed          64.9%    71.2%     71.6%
PubMeth         62.9%    80.0%     74.0%
Both            66.2%    82.5%     76.8%

Table 4: F-score by subcorpus.

The learning curve (Figure 4) shows relatively high performance and rapid improvement for modest amounts of data, but performance improvement with additional data levels out relatively fast, nearly flattening as use of the training data approaches 100%. This suggests that extraction performance for this task is not primarily limited by training data size and that additional annotation following the same protocol is unlikely to yield notable improvement in F-score without a substantial investment of resources. As performance for the PubMed subcorpus (for which inter-annotator agreement was measured) is not yet approaching the limit implied by the corpus annotation consistency (Section 4.2), the results suggest further need for the development of event extraction methods to improve DNA methylation event extraction.

Figure 4: Learning curve for the two subcorpora and their combination. Both subcorpora used for training. Average and error bars calculated by 10 repetitions of random subsampling of training data, testing on the development set. (x-axis: fraction of training data (%); y-axis: F-score; one curve per test set: Both, PubMed, PubMeth.)

5 Related Work

DNA methylation and related epigenetic mechanisms of gene expression control have been a focus of considerable recent research in biomedicine. There are many excellent reviews of this broad field; we refer the interested reader to Jaenisch and Bird (2003) and Suzuki and Bird (2008).

There is a wealth of recent related work also on event extraction. In the BioNLP’09 shared task, 24 teams participated in the primary task and six teams in Task 2, which most resembles our setup in that it also required the detection of the modified gene/protein and the modification site. The top-performing system in Task 2 (Riedel et al., 2009) achieved 44% F-score, and the highest performance reported since that we are aware of is 55% F-score for EventMine (Miwa et al., 2010b). The performance we achieved for DNA methylation is considerably better than this overall result, essentially matching the best reported performance for Phosphorylation events, which we previously argued to be the closest shared task analogue to the new event category studied here. Nevertheless, direct comparison of these results may not be meaningful due to confounding factors. The only text mining effort specifically targeting DNA methylation that we are aware of is that performed for the initial annotation of the PubMeth and MeInfoText databases (Ongenaert et al., 2008; Fang et al., 2008), both applying approaches based on keyword matching. However, neither of these studies reports results for instance-level extraction of methylation statements.

The present study is in many respects similar to our previous work targeting protein post-translational modification events (Ohta et al., 2010). In that work, we annotated 422 events of 7 different types and showed that retraining an existing event extraction system allowed these to be extracted at 42% F-score. Our approach here clearly differs from this previous work in its larger scale and concentrated focus on a particular event type of high interest, reflected also in results: while extraction performance in our previous work was limited by training data size, in the present study notably higher extraction performance was achieved and a plateau in performance with increasing data reached.

6 Discussion and Future Work

We have presented a study of the automatic extraction of DNA methylation events from literature following the BioNLP’09 shared task event representation and a state-of-the-art event extraction system. We created a corpus of 200 publication abstracts selected to include a representative sample of DNA methylation statements from all of PubMed and manually annotated for nearly 3000 mentions of genes and gene products, 500 DNA domain or region mentions and 1500 DNA methylation and demethylation events. Evaluation using the EventMine system showed that DNA methylation events can be extracted simply by retraining an off-the-shelf event extraction system at 78% precision and 76% recall. The learning curve suggested that the corpus size is sufficient and that future efforts in DNA methylation event extraction should focus on extraction method development.

One natural direction for future work is to apply event extraction systems trained on the newly introduced data to abstracts available in PubMed and full texts available at PMC to create a detailed, up-to-date repository of DNA methylation events at full literature scale. Such an effort would require gene name normalization and event extraction at PubMed scale, both of which have recently been shown to be technically feasible (Gerner et al., 2010; Björne et al., 2010). Further combining the extracted events with cancer mention detection could provide a valuable resource for epigenetics research.

The newly annotated corpus, the first resource annotated for DNA methylation using the event representation, is freely available for use in research from the GENIA project homepage http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.

Acknowledgments

We would like to thank Maté Ongenaert and other creators of PubMeth for their generosity in allowing the release of resources building on their work and the anonymous reviewers for their many insightful comments. This work was supported by Grant-in-Aid for Specially Promoted Research (MEXT, Japan).

References

Bea Alex, Claire Grover, Barry Haddow, Mijail Kabadjov, Ewan Klein, Michael Matthews, Stuart Roebuck, Richard Tobin, and Xinglong Wang. 2008. The ITI TXM corpora: Tissue expressions and protein-protein interactions. In Proceedings of LREC’08.

Celine Amoreira, Winfried Hindermann, and Christoph Grunau. 2003. An improved version of the DNA methylation database (MethDB). Nucl. Acids Res., 31(1):75–77.

Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Douglas B. Kell. 2010. Event extraction for systems biology by text mining the literature. Trends in Biotechnology, 28(7):381–390.

Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, and Tapio Salakoski. 2009. Extracting complex biological events with rich graph-based feature sets. In Proceedings of BioNLP’09 Shared Task, pages 10–18.

Jari Björne, Filip Ginter, Sampo Pyysalo, Jun’ichi Tsujii, and Tapio Salakoski. 2010. Scaling up biomedical event extraction to the entire PubMed. In Proceedings of BioNLP’10, pages 28–36.

Yu-Ching Fang, Hsuan-Cheng Huang, and Hsueh-Fen Juan. 2008. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics, 9(1):22.

Martin Gerner, Goran Nenadic, and Casey M. Bergman. 2010. An exploration of mining gene expression mentions and their anatomical locations from biomedical text. In Proceedings of BioNLP 2010, pages 72–80.

Robin Holliday and JE Pugh. 1975. DNA modification mechanisms and gene activity during development. Science, 187:226–232.

Robin Holliday. 1987. The inheritance of epigenetic defects. Science, 238:163–170.

Lawrence Hunter and K. Bretonnel Cohen. 2006. Biomedical language processing: What’s beyond PubMed? Molecular Cell, 21(5):589–594.

Rudolf Jaenisch and Adrian Bird. 2003. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature Genetics, 33:245–254.

Jin-Dong Kim, Tomoko Ohta, and Jun’ichi Tsujii. 2008. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9(10).

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. In Proceedings of BioNLP’09.

Tomoko Ohta, Sampo Pyysalo, Makoto Miwa, Jin-Dong Kim, and Jun’ichi Tsujii. 2010. Event extraction for post-translational modifications. In Proceedings of BioNLP’10, pages 19–27.

Maté Ongenaert, Leander Van Neste, Tim De Meyer, Gerben Menschaert, Sofie Bekaert, and Wim Van Criekinge. 2008. PubMeth: a cancer methylation database combining text-mining and expert annotation. Nucl. Acids Res., 36(suppl 1):D842–846.

Filip Pattyn, Jasmien Hoebeeck, Piet Robbrecht, Evi Michels, Anne De Paepe, Guy Bottu, David Coornaert, Robert Herzog, Frank Speleman, and Jo Vandesompele. 2006. methBLAST and methPrimerDB: web-tools for PCR based methylation analysis. BMC Bioinformatics, 7(1):496.

Hoifung Poon and Lucy Vanderwende. 2010. Joint inference for knowledge extraction from biomedical literature. In Proceedings of NAACL/HLT’10, pages 813–821.

Sampo Pyysalo, Antti Airola, Juho Heimonen, and Jari Björne. 2008. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9(Suppl. 3):S6.

Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi, and Jun’ichi Tsujii. 2009. A Markov logic approach to bio-molecular event extraction. In Proceedings of BioNLP’09 Shared Task, pages 41–49.

A.D. Riggs. 1975. X inactivation, differentiation, and DNA methylation. Cytogenetic and Genome Research, 14:9–25.
Juha Kononen, Lukas Bubendorf, Anne Kallionimeni, Maarit Barlund, Peter Schraml, Stephen Leighton, Joachim Kenji Sagae and Jun’ichi Tsujii. 2007. Dependency pars- Torhorst, Michael J Mihatsch, Guido Sauter, and Olli- ing and domain adaptation with LR models and parser en- P. Kallionimeni. 1998. Tissue microarrays for high- sembles. In Proceedings of EMNLP-CoNLL 2007, pages throughput molecular profiling of tumor specimens. Nat 1044–1050. Med, 4(7):844–847. Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. R. Leaman and G. Gonzalez. 2008. Banner: An executable Brown. 1995. Quantitative Monitoring of Gene Expres- survey of advances in biomedical named entity recogni- sion Patterns with a Complementary DNA Microarray. tion. In Proceedings of PSB’08, pages 652–663. Science, 270(5235):467–470. Gavin MacBeath and Stuart L. Schreiber. 2000. Printing Pro- Miho M. Suzuki and Adrian Bird. 2008. Dna methylation teins as Microarrays for High-Throughput Function Deter- landscapes: provocative insights from epigenomics. Na- mination. Science, 289(5485):1760–1763. ture Review Genetics, 9:465–476. Makoto Miwa, Sampo Pyysalo, Tadayoshi Hara, and Jun’ichi Lorraine Tanabe, Natalie Xie, Lynne H Thom, Wayne Mat- Tsujii. 2010a. A comparative study of syntactic parsers ten, and W John Wilbur. 2005. GENETAG: A tagged for event extraction. In Proceedings of BioNLP’10, pages corpus for gene/protein named entity recognition. BMC 37–45. Bioinformatics, 6(Suppl. 1):S3. Makoto Miwa, Rune Sætre, Jin-Dong Kim, and Jun’ichi Tsu- Yuka Tateisi, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2006. jii. 2010b. Event extraction with complex event classifi- Subdomain adaptation of a pos tagger with a small corpus. cation using rich features. Journal of Bioinformatics and In Proceedings of BioNLP’06, page 136137, New York, Computational Biology (JBCB), 8(1):131–146. USA, June. Yusuke Miyao and Jun’ichi Tsujii. 2008. Feature forest mod- The Gene Ontology Consortium. 2000. 
Gene ontology: tool els for probabilistic HPSG parsing. Computational Lin- for the unification of biology. Nature Genetics, 25:25–29. guistics, 34(1):35–80. Paul Thompson, Syed Iqbal, John McNaught, and Sophia Tomoko Ohta, Yuka Tateisi, Hideki Mima, and Jun’ichi Tsu- Ananiadou. 2009. Construction of an annotated corpus to jii. 2002. GENIA corpus: An annotated research abstract support biomedical information extraction. BMC Bioin- corpus in molecular biology domain. In Proceedings of formatics, 10(1):349. HLT’02, pages 73–77. Yue Wang, Jin-Dong Kim, Rune Sætre, Sampo Pyysalo, and Tomoko Ohta, Jin-Dong Kim, Sampo Pyysalo, Yue Wang, Jun’ichi Tsujii. 2009. Investigating heterogeneous pro- and Jun’ichi Tsujii. 2009. Incorporating GENETAG- tein annotations toward cross-corpora utilization. BMC style annotation to GENIA corpus. In Proceedings of Bioinformatics, 10(403). BioNLP’09, pages 106–107.