<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Event Extraction for DNA Methylation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tomoko</forename><surname>Ohta</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Tokyo</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Tokyo</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Makoto</forename><surname>Miwa</surname></persName>
							<email>mmiwa@is.s.u-tokyo.ac.jp</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Tokyo</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Tokyo</orgName>
								<address>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">School of Computer Science</orgName>
								<orgName type="institution">University of Manchester</orgName>
								<address>
									<settlement>Manchester</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">National Centre for Text Mining</orgName>
								<orgName type="institution">University of Manchester</orgName>
								<address>
									<settlement>Manchester</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Event Extraction for DNA Methylation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">ABBE06524ECFD9F15B7781D7070C0DBE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and is implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. We present an annotation scheme following the representation of the recent BioNLP'09 shared task on event extraction, select a set of 200 abstracts including a balanced sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction can be performed at 78% precision and 76% recall. The introduced resources are freely available for use in research from the GENIA project homepage.<ref type="foot">1</ref></p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>During the previous decade of concentrated study of biomedical information extraction (IE), most efforts have focused on the foundational task of detecting mentions of entities of interest and the extraction of simple relations between these entities, typically represented as undifferentiated binary associations <ref type="bibr" target="#b25">(Pyysalo et al., 2008)</ref>. However, in recent years there has been increased interest in biomolecular event extraction using representations that capture typed, structured n-ary associations of entities in specific roles, such as regulation of the phosphorylation of a specific domain of a particular protein <ref type="bibr" target="#b2">(Ananiadou et al., 2010)</ref>. The state of the art in such extraction methods was evaluated in the BioNLP'09 Shared Task on Event Extraction (below, BioNLP ST) <ref type="bibr" target="#b12">(Kim et al., 2009)</ref>, and event extraction following the BioNLP ST model has continued to draw interest also after the task, with recent work including advances in extraction methods <ref type="bibr" target="#b16">(Miwa et al., 2010a;</ref><ref type="bibr" target="#b24">Poon and Vanderwende, 2010)</ref>, the release of extraction system software and large-scale automatically annotated data <ref type="bibr" target="#b4">(Björne et al., 2010)</ref> and the development of additional annotated resources following the event representation <ref type="bibr" target="#b21">(Ohta et al., 2010)</ref>.<note place="foot" n="1">http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA</note></p><p>Of the findings of the BioNLP ST evaluation, it is of particular interest to us that the highest-performing methods include many that are purely machine-learning based <ref type="bibr" target="#b12">(Kim et al., 2009)</ref>, learning what to extract directly from a corpus annotated with examples of the events of interest. 
This implies that state-of-the-art extraction methods for new types of events can be created by providing annotated resources to an existing system, without the need for direct development of natural language processing or IE methods. Here, we apply this approach to DNA methylation, a specific and biologically highly relevant event type not considered in previous event extraction studies.</p><p>In the following, we first outline the biological significance of DNA methylation and discuss existing resources. We then introduce the event extraction approach applied, present the new annotated corpus created in this study, and report event extraction results using a method trained on the corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DNA Methylation</head><p>The term epigenetics refers to a set of molecular mechanisms "beyond genetics" (i.e. without change in DNA sequence) that are today understood to play an important role in several biological processes, including the genetic program for development, cell differentiation and tissue-specific gene expression. DNA methylation was first suggested as an epigenetic mechanism for the control of gene activity during development in 1975 <ref type="bibr" target="#b27">(Riggs, 1975;</ref><ref type="bibr" target="#b7">Holliday and Pugh, 1975)</ref>, and the role of DNA methylation in cancer was first reported in 1987 <ref type="bibr" target="#b8">(Holliday, 1987)</ref>. DNA methylation of CpG islands in promoter regions is now understood to be one of the most consistent genetic alterations in cancer, and DNA methylation is a prominent area of study.</p><p>Chemically, DNA methylation is a simple reaction adding a methyl group to a specific position of the cytosine pyrimidine ring or the adenine purine ring. While a single nucleotide can only be either methylated or unmethylated, in text the overall degree of promoter methylation is often reported as hypo- and hyper-methylation, with hyper-methylation implying that the expression of a gene is silenced. Because of the precise definition of the phenomenon and the relatively specific terms in which it is typically discussed in publications, we expected it to provide a well-defined target for annotation and automatic extraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">DNA Methylation in PubMed</head><p>We follow common practice in biomedical IE in drawing texts for our corpus from PubMed abstracts. Currently containing more than 20 million citations for biomedical literature (over 11M with abstracts) and growing exponentially <ref type="bibr" target="#b9">(Hunter and Cohen, 2006)</ref>, the literature database provides a rich resource for IE and text mining.</p><p>To facilitate access to documents relevant to specific topics, each PubMed citation is manually assigned terms that identify its primary topics using MeSH, a controlled vocabulary of over 25,000 terms. MeSH also contains a DNA Methylation term, allowing specific searches for citations on the topic. Figure <ref type="figure" target="#fig_0">1</ref> shows the number of citations per year of publication matching this term contrasted with overall citations, illustrating explosive growth of interest in DNA methylation, outstripping the overall growth of the literature. Particular increases can be seen after the introduction of DNA microarrays for monitoring gene expression <ref type="bibr" target="#b29">(Schena et al., 1995)</ref> and the introduction of high-throughput screening methods <ref type="bibr" target="#b13">(Kononen et al., 1998;</ref><ref type="bibr" target="#b15">MacBeath and Schreiber, 2000)</ref>. The total number of PubMed citations tagged with DNA Methylation at the time of this writing is 15456 (14350 of which have an abstract). The large number of documents tagged for the DNA methylation MeSH term and the human judgments assuring their relevance make querying for this term a natural choice for selecting text. However, direct PubMed query as the only selection strategy would ignore significant existing resources, discussed in the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">DNA Methylation Databases</head><p>A growing number of databases collating information on DNA methylation are becoming available. The first such database, MethDB <ref type="bibr" target="#b1">(Amoreira et al., 2003)</ref>, was introduced in 2001 and remains actively developed. MethDB contains PubMed citation references as evidence for contained entries, but no more specific identification of the expressions stating DNA methylation events. The methPrimerDB <ref type="bibr" target="#b23">(Pattyn et al., 2006)</ref> database provides additional information on PCR primers on top of MethDB, but does not add further specification of the methylated gene or text-bound annotation. PubMeth <ref type="bibr" target="#b22">(Ongenaert et al., 2008)</ref> is a database of DNA methylation in cancer with evidence sentences from the literature. This database stores information on cancer types and subtypes, methylated genes and the experimental method used to identify methylation, as well as evidence sentences. MeInfoText <ref type="bibr" target="#b5">(Fang et al., 2008)</ref> is a database of DNA methylation and cancer information automatically extracted from PubMed documents matching the query terms human, methylation and cancer using term cooccurrence statistics. Of the DNA Methylation resources, only PubMeth and MeInfoText contain text-bound annotation identifying specific spans of characters containing the gene mention and expressing DNA methylation in evidence sentences supporting database entries. In this study, we consider specifically PubMeth as a source of reference text-bound annotations due to availability and the ability to redistribute derived data.</p><p>Initial text-bound annotations in PubMeth were generated using keyword lookup, but the database annotations are manually reviewed. Table <ref type="table" target="#tab_0">1</ref> shows example evidence sentences from PubMeth and their annotated spans. 
While the PubMeth annotation differs from the BioNLP ST representation in a number of ways, such as not separating coordinated entities (Table <ref type="table" target="#tab_0">1c</ref>) and not annotating methylation sites (Table <ref type="table" target="#tab_0">1d</ref>), it provides both a reference identifying annotation targets from a biologically motivated perspective and a potential starting point for full event annotation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Annotation</head><p>For annotation, we adapted the representation applied in the BioNLP ST on event extraction with minimal changes in order to allow systems developed for the task to be applied also for the newly annotated corpus. Documents were selected following the basic motivation presented above, with reference to the requirements specified by the annotation scheme, and some automatic preprocessing was applied as annotator support. This section details the annotation approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Entity and Event Representation</head><p>For the core named entity annotation, we thus primarily follow the gene/gene product (GGP) annotation criteria applied for the shared task data <ref type="bibr" target="#b20">(Ohta et al., 2009)</ref>. In brief, the guidelines specify annotation of minimal contiguous spans containing mentions of specific gene or gene product (RNA/protein) names, where specific name is understood to be one allowing a biologist to identify the corresponding entry in a gene/protein database such as Uniprot or Entrez Gene. The annotation thus excludes e.g. names of families and complexes. A single annotation type, Gene or gene product, is applied without distinction between genes and their products. In addition to the identification of the modified gene, it is important to identify the site of the modification. We marked mentions of sites relevant to the events as DNA domain or region terms following the original GENIA term corpus annotation guidelines <ref type="bibr" target="#b19">(Ohta et al., 2002)</ref>.</p><p>For representing DNA methylation events, the annotation applied to capture protein phosphorylation events in the BioNLP ST task 2 closely matched the needs for DNA methylation (Figure <ref type="figure" target="#fig_2">2</ref>). While the Site arguments of the ST Phosphorylation events are protein domains, machine-learning based extraction methods should be able to associate this role with DNA domains given training data. We thus adopted a representation where DNA methylation events are associated with a gene/gene product as their Theme and a DNA domain or region as Site. 
Each event is also associated with a particular span of text expressing it, termed the event trigger.<ref type="foot" target="#foot_0">2</ref> We further initially marked catalysts using Positive regulation events following the BioNLP ST model, but dropped this class of annotation as a sufficient number of examples was not found in the corpus.</p><p>The event types of the BioNLP ST are drawn from the GENIA Event ontology <ref type="bibr" target="#b11">(Kim et al., 2008)</ref>, which in turn draws its type definitions from the community-standard Gene Ontology (GO) (The Gene Ontology Consortium, 2000). To maintain compatibility with these resources, we opted to follow the GO also for the definition of the new event type considered here. GO defines DNA methylation as</p><p>The covalent transfer of a methyl group to either N-6 of adenine or C-5 or N-4 of cytosine.</p><p>We note that while the definition may appear restrictive, methylation of adenine N-6 or cytosine C-5/N-4 encompasses the entire set of ways in which DNA can be methylated. This definition could thus be adopted without limitation to the scope of the annotation.</p></div>
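<div xmlns="http://www.tei-c.org/ns/1.0"><p>The representation described above can be illustrated with a small data structure. The following is a minimal sketch in Python and not the storage format of the released corpus: the class names, the type labels (DNA_methylation, Gene_or_gene_product, DNA_domain_or_region), the example sentence and the helper span_of are all assumptions introduced for illustration.</p>
<p><code lang="python"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """A text-bound annotation: a typed, contiguous span of characters."""
    type: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str


@dataclass(frozen=True)
class Event:
    """An event with a trigger, a Theme (GGP) and an optional Site."""
    type: str
    trigger: Span
    theme: Span
    site: Optional[Span] = None


def span_of(sentence: str, type_: str, phrase: str) -> Span:
    """Build a Span for the first occurrence of `phrase` in `sentence`."""
    start = sentence.index(phrase)
    return Span(type_, start, start + len(phrase), phrase)


# Hypothetical example sentence (not taken from the corpus).
sentence = "Methylation of the RASSF1A promoter was detected."
trigger = span_of(sentence, "DNA_methylation", "Methylation")
theme = span_of(sentence, "Gene_or_gene_product", "RASSF1A")
site = span_of(sentence, "DNA_domain_or_region", "promoter")
event = Event("DNA_methylation", trigger, theme, site)
```
]]></code></p>
<p>Events that state only the modified gene would leave the Site argument empty, which is why it is modelled as optional in this sketch.</p></div>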
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Document Selection</head><p>The selection of source documents for an annotated corpus is critical for assuring that the corpus provides relevant and representative material for studying the phenomena of interest. Domain corpora frequently consist of documents from a particular subdomain of interest: for example, the GENIA corpus focuses on documents concerning transcription factors in human blood cells <ref type="bibr" target="#b19">(Ohta et al., 2002)</ref>. Methods trained and evaluated on such focused resources will not necessarily generalize well to broader domains. However, there has been little study of the effect of document selection on event extraction performance. Here, we applied two distinct strategies to get a representative sample of the full scope of DNA methylation events in the literature and to assure that our annotations are relevant to the interests of biologists.</p><p>In the first strategy, we aimed in particular to select a representative sample of documents relevant to the targeted event types. For this purpose, we directly searched the PubMed literature database. We further decided not to include any text-based query in the search to avoid biasing the selection toward particular entities or forms of event expression. Instead, we only queried for the single MeSH term DNA Methylation. While this search is expected to provide high-precision results for the full topic, not all such documents necessarily discuss events where specific genes are methylated. In initial efforts to annotate a random sample of these documents, we found that many did not mention specific gene names. To reduce wasted effort in examining documents that contain no markable events, we added a filter requiring a minimum number of (likely) gene mentions. We first tagged all 14350 citations tagged with DNA Methylation that have an abstract in PubMed using the BANNER tagger (Leaman and Gonzalez, 2008). 
We found that while the overwhelmingly most frequent number of tagged mentions per document is zero, a substantial mass of abstracts have large mention counts (Figure <ref type="figure" target="#fig_4">3</ref>). <ref type="foot" target="#foot_1">3</ref> We decided after brief preliminary experiments to filter the initial selection of documents to include only those in which at least 5 gene/protein mentions were marked by the automatic tagger. This excludes most documents without markable events without introducing other obvious biases.</p><p>In the second strategy, we extended and completed the annotation of a random selection of PubMeth evidence sentences, aiming to leverage existing resources and to select documents that had been previously judged relevant to the interests of biologists studying the topic. This provides an external definition of document relevance and allows us to estimate to what extent the applied annotation strategy can capture biologically relevant statements. This strategy is also expected to select a concentrated, event-rich set of documents. However, the selection may also necessarily carry over biases toward particular subsets of relevant documents from the original selection and will not be a representative sample of the overall distribution of such documents in the literature.</p><p>For producing the largest number of event annotations with the least effort, the most efficient way to use the PubMeth data would have been to simply extract the evidence sentences and complete the annotation for these. However, viewing the context in which event statements occur as centrally important, we opted to annotate complete abstracts, with initial annotations from PubMeth evidence sentences automatically transferred into the abstracts. We note that not all PubMeth evidence spans were drawn from abstracts, and not all of those that were matched a contiguous span of text. 
We could align PubMeth evidence annotations into 667 PubMed abstracts (approximately 57% of the PMIDs referenced in PubMeth) and completed event annotation for a random sample of these.</p></div>
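<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mention-count filter of the first selection strategy can be sketched as follows. This is an illustrative sketch only: the function name, the document identifiers and the mention counts are hypothetical, and the tagger output is assumed to be available as a mapping from document identifier to the number of tagged gene/protein mentions.</p>
<p><code lang="python"><![CDATA[
```python
import random


def select_documents(mention_counts, min_mentions=5, sample_size=None, seed=0):
    """Keep documents with at least `min_mentions` tagged gene/protein
    mentions, then optionally draw a random sample from those that remain."""
    eligible = sorted(doc for doc, n in mention_counts.items()
                      if n >= min_mentions)
    if sample_size is None or sample_size >= len(eligible):
        return eligible
    return sorted(random.Random(seed).sample(eligible, sample_size))


# Hypothetical tagger output: document identifier -> number of mentions.
counts = {"doc-a": 0, "doc-b": 5, "doc-c": 12, "doc-d": 3, "doc-e": 7}
selected = select_documents(counts)
```
]]></code></p>
<p>With the threshold of 5 used in this study, only the documents with five or more tagged mentions remain candidates for annotation.</p></div>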
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Document Preprocessing</head><p>To reduce annotation effort, we applied automatic systems to produce initial candidate sentence boundaries and GGP annotations for the corpus. For sentence splitting, we applied the GENIA sentence splitter,<note place="foot" n="4">http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/</note> and for gene/protein tagging, we applied the BANNER NER system (Leaman and Gonzalez, 2008) trained on the GENETAG corpus <ref type="bibr" target="#b31">(Tanabe et al., 2005)</ref>. The GENETAG guidelines and gene/protein entity annotation coverage are known to differ from those applied for GGP annotation here <ref type="bibr" target="#b35">(Wang et al., 2009)</ref>. However, the broad coverage of PubMed provided by the GENETAG suggests taggers trained on the corpus are likely to generalize to new subdomains such as that considered here. By contrast, all annotations following GGP guidelines that we are aware of are subdomain-specific.</p><p>We note that all annotations in the produced corpus are at a minimum confirmed by a human annotator and that events are annotated without performing initial automatic tagging to assure that no bias toward particular extraction methods or approaches is introduced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Corpus Statistics</head><p>Corpus statistics are given in Table <ref type="table" target="#tab_1">2</ref>. There are some notable differences between the subcorpora created using the different selection strategies. While the subcorpora are similar in size, the PubMeth GGP count is 1.4 times that of the PubMed subcorpus,<note place="foot" n="5">The differences in the number of GGP annotations may be affected by the PubMeth entity annotation criteria.</note> yet roughly equal numbers of methylation sites are annotated in the two. This difference is even more pronounced in the statistics for event arguments, where two thirds of PubMeth subcorpus events contain only a Theme argument identifying the GGP, while events where both Theme and Site are identified are more frequent in the other subcorpus.<note place="foot" n="6">The number of annotated sites is less than the number of events with a Site argument as the annotation criteria only call for annotating a site entity when it is referred to from an event, and multiple events can refer to the same site entity.</note> As the extraction of events specifying also sites is known to be particularly challenging <ref type="bibr" target="#b12">(Kim et al., 2009)</ref>, these statistics suggest the PubMed subcorpus may represent a more difficult extraction task. Only very few DNA demethylation events are found in either subcorpus. Overall, the PubMeth subcorpus contains nearly twice as many event annotations as the PubMed one, indicating that the focused document selection strategy was successful in identifying particularly event-rich abstracts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Annotation Quality</head><p>To measure the consistency of the produced annotation, we performed independent double annotation for a sample of 40% of the abstracts selected from the PubMed subcorpus; 20% of all abstracts.</p><p>As the PubMed subcorpus event annotation is created without initial human annotation as reference (unlike the PubMeth subcorpus), agreement is expected to be lower on this subcorpus. This experiment should thus provide a lower bound on the overall consistency of the corpus. We first measured agreement on the gene/gene product (GGP) entity annotation, and found very high agreement among 935 entities marked in total by the two annotators: 91% F-score using exact match criteria and 97% F-score using the relaxed "overlap" criterion where any two overlapping annotations are considered to match.<note place="foot" n="7">The high agreement is not due to annotators simply agreeing with the automatic initial annotation: the F-score of the automatic tagger against the two sets of human annotations was 65%/66% for exact and 85%/86% for overlap match.</note> We then separately measured agreement on event annotations for those events that involved GGPs on which the annotators agreed, using the standard evaluation criteria described in Section 4.4. Agreement on event annotations was also high: 84% F-score overall (85% for DNA methylation and 75% for DNA demethylation) over a total of 442 annotated events.</p><p>The overall consistency of the annotation depends on joint annotator agreement on the GGP and event annotations. 
However, in experimental settings such as that of the BioNLP ST where gold GGP annotation is assumed as the starting point for event extraction, measured performance is not affected by agreement on GGPs and thus arguably only the latter factor applies. As this setting is adopted also in the present study, annotation consistency suggests a human upper bound no lower than 84% F-score on extraction performance.</p><p>Estimates of the annotation consistency of biomedical domain corpora are regrettably seldom provided, and to the best of our knowledge ours is the first estimate of inter-annotator agreement for a corpus following the event representation of the BioNLP ST. Given the complexity of the annotation (typed associations of event trigger, theme and site), the agreement compares favorably to e.g. the 67% inter-annotator F-score reported for protein-protein interactions on the ITI TXM corpora <ref type="bibr" target="#b0">(Alex et al., 2008)</ref> and the full event agreement on the GREC corpus <ref type="bibr" target="#b34">(Thompson et al., 2009)</ref>.</p></div>
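<div xmlns="http://www.tei-c.org/ns/1.0"><p>The exact and relaxed "overlap" agreement measures used for the entity annotation can be sketched as below. This is a minimal illustration under stated assumptions, not the script used in the study: annotations are assumed to be (start, end) character offsets with an exclusive end, and the function names and example spans are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
def overlaps(a, b):
    """True if half-open character spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def agreement_f_score(spans_a, spans_b, relaxed=False):
    """F-score between two annotators' span lists, treating one annotator
    as the reference. With relaxed=True any overlap counts as a match."""
    if not spans_a or not spans_b:
        return 0.0
    if relaxed:
        matched_a = sum(any(overlaps(a, b) for b in spans_b) for a in spans_a)
        matched_b = sum(any(overlaps(b, a) for a in spans_a) for b in spans_b)
    else:
        matched_a = matched_b = len(set(spans_a) & set(spans_b))
    precision = matched_a / len(spans_a)
    recall = matched_b / len(spans_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same text by two annotators.
annotator_1 = [(0, 5), (10, 20), (30, 35)]
annotator_2 = [(0, 5), (12, 20), (40, 45)]
exact = agreement_f_score(annotator_1, annotator_2)
relaxed = agreement_f_score(annotator_1, annotator_2, relaxed=True)
```
]]></code></p>
<p>Because swapping the two annotators only exchanges precision and recall, the resulting F-score does not depend on which annotator is treated as the reference.</p></div>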
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Event Extraction Method</head><p>To estimate the feasibility of automatic extraction of DNA methylation events and the suitability of presently available event extraction methods to this task, we performed experiments using the EventMine event extraction system <ref type="bibr" target="#b17">(Miwa et al., 2010b)</ref>. On task 2 of the BioNLP ST dataset, the benchmark most relevant to our task setting, the applied version of EventMine was recently evaluated at 55% F-score <ref type="bibr" target="#b16">(Miwa et al., 2010a)</ref>, outperforming the best task 2 system in the original shared task <ref type="bibr" target="#b26">(Riedel et al., 2009)</ref> by more than 10 percentage points. To the best of our knowledge, this system represents the state of the art for this event extraction task.</p><p>EventMine is an SVM-based machine learning system following the pipeline design of the best system in the BioNLP ST <ref type="bibr" target="#b3">(Björne et al., 2009)</ref>, extending it with refinements to the feature set, the use of a machine learning module for complex event construction, and the use of two parsers for syntactic analysis <ref type="bibr" target="#b17">(Miwa et al., 2010b)</ref>. We follow Miwa et al. in applying the HPSG-based deep parser Enju <ref type="bibr" target="#b18">(Miyao and Tsujii, 2008)</ref> using the high-speed parsing setting ("mogura") and the GDep <ref type="bibr" target="#b28">(Sagae and Tsujii, 2007)</ref> native dependency parser, both with biomedical domain models based on the GENIA treebank data <ref type="bibr" target="#b32">(Tateisi et al., 2006)</ref>.</p><p>For evaluation, we applied a version of the BioNLP'09 ST evaluation tools<ref type="foot" target="#foot_2">8</ref> modified to recognize the novel DNA methylation event type.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Evaluation Criteria</head><p>We followed the basic task setup and primary evaluation criteria of the BioNLP'09 ST. Specifically, we followed task 2 ("event enrichment") criteria, requiring for correct extraction of a DNA methylation event both the identification of the modified gene (GGP entity) and the identification of the modification site (DNA domain or region entity) when stated. As in the shared task, human annotation for GGP entities was provided as part of the system input but other entities were not, so that the system was required to identify the spans of the mentioned modification sites.</p><p>The performance of the system was evaluated using the standard precision, recall and F-score metrics for the recovery of events, with event equality defined following the "Approximate span" matching criterion applied in the primary evaluation for the BioNLP'09 ST. This criterion relaxes strict matching requirements so that a detected event trigger or entity is considered to match a gold trigger/entity if its span is entirely contained within the span of the gold trigger/entity, extended by one word both to the left and to the right.</p></div>
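<div xmlns="http://www.tei-c.org/ns/1.0"><p>The approximate span criterion and the F-score metric can be sketched as follows. This is a simplified illustration and not the BioNLP'09 ST evaluation tools: words are assumed to be whitespace-delimited, which may differ from the tokenization of the official tools, and the function names and example sentence are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import re


def extend_by_one_word(text, start, end):
    """Extend the span [start, end) by one whitespace-delimited word on
    each side, where such a word exists."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    before = [s for s, e in words if e <= start]
    after = [e for s, e in words if s >= end]
    return (before[-1] if before else start,
            after[0] if after else end)


def approx_span_match(text, detected, gold):
    """True if `detected` lies entirely inside the extended gold span."""
    lo, hi = extend_by_one_word(text, *gold)
    return lo <= detected[0] and detected[1] <= hi


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def find_span(text, phrase):
    """Character offsets of the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return start, start + len(phrase)


# Hypothetical sentence with a gold Site annotation on "promoter".
text = "hypermethylation of the p16 promoter region was observed"
gold = find_span(text, "promoter")
accepted = approx_span_match(text, find_span(text, "promoter region"), gold)
rejected = approx_span_match(text, find_span(text, "the p16 promoter"), gold)
```
]]></code></p>
<p>In this sketch a detected span "promoter region" is accepted because it stays within one word of the gold span, while "the p16 promoter" is rejected because it reaches two words to the left. Applied to the 78% precision and 76% recall reported for the full task, f_score gives approximately 77%.</p></div>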
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Experimental Setup</head><p>We divided the corpus into three parts, first setting one third of the abstracts aside as a held-out test set and then splitting the remaining two thirds in a roughly 3:1 ratio into a training set and a development test set, giving 100 abstracts for training, 34 for development, and 66 for final test. The splits were performed randomly, but sampling so that each set has an equal number of abstracts drawn from the PubMeth and PubMed subcorpora.</p><p>The EventMine system has a single tunable threshold parameter that controls the tradeoff be-</p></div>
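<div xmlns="http://www.tei-c.org/ns/1.0"><p>A random split with equal representation of the two subcorpora, as described above, could be produced along the following lines. This is a sketch under stated assumptions rather than the procedure actually used: the document identifiers, the seed and the function name are hypothetical, and each subcorpus is assumed to contribute 100 abstracts, which is implied by the equal per-set counts but not stated explicitly.</p>
<p><code lang="python"><![CDATA[
```python
import random


def stratified_split(docs_by_subcorpus, sizes, seed=0):
    """Shuffle each subcorpus independently and take the same number of
    documents from each for every split (sizes are per subcorpus)."""
    rng = random.Random(seed)
    splits = {name: [] for name in sizes}
    for docs in docs_by_subcorpus.values():
        shuffled = list(docs)
        rng.shuffle(shuffled)
        offset = 0
        for name, size in sizes.items():
            splits[name].extend(shuffled[offset:offset + size])
            offset += size
    return splits


# Hypothetical document identifiers, 100 per subcorpus.
corpus = {
    "PubMeth": [f"pubmeth-{i}" for i in range(100)],
    "PubMed": [f"pubmed-{i}" for i in range(100)],
}
# Per-subcorpus sizes: 50 + 17 + 33 = 100, giving 100/34/66 overall.
splits = stratified_split(corpus, {"train": 50, "devel": 17, "test": 33})
```
]]></code></p>
<p>Sampling per subcorpus keeps the three sets comparable in composition, so that differences between development and test results are less likely to reflect a different mix of source documents.</p></div>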
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Extraction performance</head><p>Table <ref type="table" target="#tab_3">3</ref> shows extraction results on the held-out test data. While DNA methylation events could be extracted quite reliably, the system performed poorly for DNA demethylation events. The latter result is perhaps not surprising given their small number (only 38 in total in the corpus) and indicates that a separate selection strategy is necessary to provide resources for learning the reverse reaction. Overall performance shows a small preference for precision over recall at 77% F-score. We view this level of performance as very good for a first result.</p><p>To evaluate the relative difficulty of the extraction tasks that the two subcorpora represent and their merits as training material, we performed tests separating the two (Table <ref type="table" target="#tab_4">4</ref>). As predicted from corpus statistics (Section 4.1), the PubMed subcorpus represents the more challenging extraction task. When testing on a single subcorpus, results are, unsurprisingly, better when training data is drawn from the same subcorpus; however, training on the combined data gives the best performance for all three test sets, indicating that the subcorpora are compatible.</p><p>The learning curve (Figure <ref type="figure" target="#fig_6">4</ref>) shows relatively high performance and rapid improvement for modest amounts of data, but performance improvement with additional data levels out relatively fast, nearly flattening as use of the training data approaches 100%. This suggests that extraction performance for this task is not primarily limited by training data size and that additional annotation following the same protocol is unlikely to yield notable improvement in F-score without a substantial investment of resources. 
As performance for the PubMed subcorpus (for which interannotator agreement was measured) is not yet approaching the limit implied by the corpus annotation consistency (Section 4.2), the results suggest further need for the development of event extraction methods to improve DNA methylation event extraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>DNA methylation and related epigenetic mechanisms of gene expression control have been a focus of considerable recent research in biomedicine. There are many excellent reviews of this broad field; we refer the interested reader to <ref type="bibr" target="#b10">(Jaenisch and Bird, 2003;</ref><ref type="bibr" target="#b30">Suzuki and Bird, 2008)</ref>.</p><p>There is a wealth of recent related work also on event extraction. In the BioNLP'09 shared task, 24 teams participated in the primary task and six teams in Task 2 which mostly resembles our setup in that it also required the detection of modified gene/protein and modification site. The top-performing system in Task 2 <ref type="bibr" target="#b26">(Riedel et al., 2009)</ref> achieved 44% F-score, and the highest performance reported since that we are aware of is 55% F-score for EventMine <ref type="bibr" target="#b17">(Miwa et al., 2010b)</ref>. The performance we achieved for DNA methylation is considerably better than this overall result, essentially matching the best reported performance for Phosphorylation events, which we previously argued to be the closest shared task analogue to the new event category studied here. Nevertheless, direct comparison of these results may not be meaningful due to confounding factors. The only text mining effort specifically targeting DNA methylation that we are aware of is that performed for the initial annotation of the PubMeth and MeIn-foText databases <ref type="bibr" target="#b22">(Ongenaert et al., 2008;</ref><ref type="bibr" target="#b5">Fang et al., 2008)</ref>, both applying approaches based on keyword matching. 
However, neither of these studies report results for instance-level extraction of methylation statements.</p><p>The present study is in many aspects similar to our previous work targeting protein posttranslational modification events <ref type="bibr" target="#b21">(Ohta et al., 2010)</ref>. In this work, we annotated 422 events of 7 different types and showed that retraining an existing event extraction system allowed these to be extracted at 42% F-score. Our approach here clearly differs from this previous work in its larger scale and concentrated focus on a particular event type of high interest, reflected also in results: while extraction performance in our previous work was limited by training data size, in the present study notably higher extraction performance was achieved and a plateau in performance with increasing data reached.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and Future Work</head><p>We have presented a study of the automatic extraction of DNA methylation events from literature following the BioNLP'09 shared task event representation and a state-of-the-art event extraction system. We created an corpus of 200 publication abstracts selected to include a representative sample of DNA methylation statements from all of PubMed and manually annotated for nearly 3000 mentions of genes and gene products, 500 DNA domain or region mentions and 1500 DNA methylation and demethylation events. Evaluation using the EventMine system showed that DNA methylation events can be extracted simply by retraining an off-the-shelf event extraction system at 78% precision and 76% recall. The learning curve suggested that the corpus size is sufficient and that in future efforts in DNA methylation event extraction should focus on extraction method development.</p><p>One natural direction for future work is to apply event extraction systems trained on the newly introduced data to abstracts available in PubMed and full texts available at PMC to create a detailed, up-to-date repository of DNA methylation events at full literature scale. Such an effort would require gene name normalization and event extraction at PubMed scale, both of which have recently been shown to be technically feasible <ref type="bibr" target="#b6">(Gerner et al., 2010;</ref><ref type="bibr" target="#b4">Björne et al., 2010)</ref>. Further combining the extracted events with cancer mention detection could provide a valuable resource for epigenetics research.</p><p>The newly annotated corpus, the first resource annotated for DNA methylation using the event representation, is freely available for use in research from from the GENIA project homepage http://www-tsujii.is. 
s.u-tokyo.ac.jp/GENIA.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Citations tagged with the MeSH term DNA Methylation compared to all citations in PubMed by publication year. Note different scales.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>a) MS-PCR revealed the [methylation] of the [p16] gene in 10(34%)of 29 [NSCLCs] b) 30% (27 of 91) of [lung tumors] showed [hypermethylation] of the 5'CpG region of the [p14ARF gene] c) [Promotor hypermethylations] were detected in [O6-methylguanine-DNA methyltransferase (MGMT), RB1, estrogen receptor, p73, p16INK4a, death-associated protein kinase, p15INK4b, and p14ARF] d) The promoter region of the [p16INK4] gene was [hypermethylated] in the tumor samples of the primary or metastatic site</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Event annotation for phosphorylation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Number of citations with given number of automatically tagged gene/protein mentions.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Learning curve for the two subcorpora and their combination. Both subcorpora used for training. Average and error bars calculated by 10 repetitions of random subsampling of training data, testing on the development set.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Examples of PubMeth evidence sentence annotation. Annotated spans delimited by brackets and statements expressing methylation underlined, gene mentions shown in italics, and cancer mentions in bold.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Corpus statistics.</figDesc><table><row><cell></cell><cell cols="3">PubMeth PubMed Total</cell></row><row><cell>Abstracts</cell><cell>100</cell><cell>100</cell><cell>200</cell></row><row><cell>Sentences</cell><cell>1118</cell><cell>1009</cell><cell>2127</cell></row><row><cell>Entities</cell><cell></cell><cell></cell><cell></cell></row><row><cell>GGP</cell><cell>1695</cell><cell>1195</cell><cell>2890</cell></row><row><cell>Site</cell><cell>240</cell><cell>234</cell><cell>474</cell></row><row><cell>Total</cell><cell>1935</cell><cell>1429</cell><cell>3364</cell></row><row><cell>Events</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Theme only</cell><cell>660</cell><cell>214</cell><cell>874</cell></row><row><cell>Theme and Site</cell><cell>323</cell><cell>297</cell><cell>620</cell></row><row><cell cols="2">DNA methylation 977</cell><cell>485</cell><cell>1462</cell></row><row><cell>DNA demethyl.</cell><cell>6</cell><cell>26</cell><cell>38</cell></row><row><cell>Total</cell><cell>983</cell><cell>511</cell><cell>1494</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Overall extraction performance.</figDesc><table><row><cell></cell><cell></cell><cell>Test set</cell><cell></cell></row><row><cell cols="3">Training set PubMed PubMeth</cell><cell>Both</cell></row><row><cell>PubMed</cell><cell>64.9%</cell><cell>71.2%</cell><cell>71.6%</cell></row><row><cell>PubMeth</cell><cell>62.9%</cell><cell>80.0%</cell><cell>74.0%</cell></row><row><cell>Both</cell><cell>66.2%</cell><cell>82.5%</cell><cell>76.8%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc></figDesc><table><row><cell>: F-score by subcorpus.</cell></row><row><cell>tween system precision and recall. We first set</cell></row><row><cell>the tradeoff using a sparse search of the parame-</cell></row><row><cell>ter space [0:1], evaluating the performance of the</cell></row><row><cell>system by training on the training set and evaluat-</cell></row><row><cell>ing on the development set. As these experiments</cell></row><row><cell>did not indicate any other parameter setting could</cell></row><row><cell>provide significantly better performance, we chose</cell></row><row><cell>the default threshold setting of 0.5. To study the</cell></row><row><cell>effect of training data size on performance, we per-</cell></row><row><cell>formed extraction experiments randomly down-</cell></row><row><cell>sampling the training data on the document level</cell></row><row><cell>with testing on the development set. In final exper-</cell></row><row><cell>iments EventMine was trained on the combined</cell></row><row><cell>training and development data and performance</cell></row><row><cell>evaluated on the held-out test data.</cell></row></table></figure>
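<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a consistency check on the figures reported in Sections 4.6 and 6, the following minimal Python sketch recomputes the overall F-score from the reported 78% precision and 76% recall. It assumes that the reported F-score is the balanced harmonic mean of precision and recall (F1); the function name f_score and its beta parameter are illustrative and are not part of EventMine.</p><p>
```python
def f_score(precision, recall, beta=1.0):
    """Return the F-beta score; beta=1 gives the balanced harmonic mean (F1)."""
    # Guard against division by zero when nothing is extracted correctly.
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for overall DNA methylation event extraction.
overall = f_score(0.78, 0.76)
print(f"{overall:.1%}")  # prints 77.0%
```
</p><p>Under this assumption the result rounds to the 77% F-score quoted for overall performance.</p></div>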
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">Annotators were instructed to always mark some trigger expression. We note that while we do not specifically distinguish hypo- and hyper-methylation here, the trigger annotations are expected to facilitate adding these distinctions if necessary.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">The tagger has been evaluated at 86% F-score on a broad-coverage corpus, suggesting this is unlikely to severely misestimate the true distribution.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_2">http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/downloads.shtml</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to thank Maté Ongenaert and other creators of PubMeth for their generosity in allowing the release of resources building on their work and the anonymous reviewers for their many insightful comments. This work was supported by Grant-in-Aid for Specially Promoted Research (MEXT, Japan).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The ITI TXM corpora: Tissue expressions and protein-protein interactions</title>
		<author>
			<persName><forename type="first">Bea</forename><surname>Alex</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claire</forename><surname>Grover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barry</forename><surname>Haddow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mijail</forename><surname>Kabadjov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ewan</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Matthews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stuart</forename><surname>Roebuck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Tobin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xinglong</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of LREC&apos;08</title>
				<meeting>LREC&apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">An improved version of the DNA methylation database (MethDB)</title>
		<author>
			<persName><forename type="first">Celine</forename><surname>Amoreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Winfried</forename><surname>Hindermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christoph</forename><surname>Grunau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucl. Acids Res</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="75" to="77" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Event extraction for systems biology by text mining the literature</title>
		<author>
			<persName><forename type="first">Sophia</forename><surname>Ananiadou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Douglas</forename><forename type="middle">B</forename><surname>Kell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Trends in Biotechnology</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="381" to="390" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Extracting complex biological events with rich graph-based feature sets</title>
		<author>
			<persName><forename type="first">Jari</forename><surname>Björne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juho</forename><surname>Heimonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Filip</forename><surname>Ginter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antti</forename><surname>Airola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tapio</forename><surname>Pahikkala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tapio</forename><surname>Salakoski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;09 Shared Task</title>
				<meeting>BioNLP&apos;09 Shared Task</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="10" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Scaling up biomedical event extraction to the entire PubMed</title>
		<author>
			<persName><forename type="first">Jari</forename><surname>Björne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Filip</forename><surname>Ginter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tapio</forename><surname>Salakoski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;10</title>
				<meeting>BioNLP&apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="28" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">MeInfoText: associated gene methylation and cancer information from text mining</title>
		<author>
			<persName><forename type="first">Yu-Ching</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hsuan-Cheng</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hsueh-Fen</forename><surname>Juan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">22</biblScope>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An exploration of mining gene expression mentions and their anatomical locations from biomedical text</title>
		<author>
			<persName><forename type="first">Martin</forename><surname>Gerner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Goran</forename><surname>Nenadic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Casey</forename><forename type="middle">M</forename><surname>Bergman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP 2010</title>
				<meeting>BioNLP 2010</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="72" to="80" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">DNA modification mechanisms and gene activity during development</title>
		<author>
			<persName><forename type="first">Robin</forename><surname>Holliday</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Pugh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">187</biblScope>
			<biblScope unit="page" from="226" to="232" />
			<date type="published" when="1975">1975</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The inheritance of epigenetic defects</title>
		<author>
			<persName><forename type="first">Robin</forename><surname>Holliday</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">238</biblScope>
			<biblScope unit="page" from="163" to="170" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Biomedical language processing: What&apos;s beyond PubMed?</title>
		<author>
			<persName><forename type="first">Lawrence</forename><surname>Hunter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Bretonnel</forename><surname>Cohen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Molecular Cell</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="589" to="594" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals</title>
		<author>
			<persName><forename type="first">Rudolf</forename><surname>Jaenisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adrian</forename><surname>Bird</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Genetics</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="245" to="254" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Corpus annotation for mining biomedical events from literature</title>
		<author>
			<persName><forename type="first">Jin-Dong</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomoko</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">10</biblScope>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Overview of BioNLP&apos;09 shared task on event extraction</title>
		<author>
			<persName><forename type="first">Jin-Dong</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomoko</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshinobu</forename><surname>Kano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;09</title>
				<meeting>BioNLP&apos;09</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Tissue microarrays for high-throughput molecular profiling of tumor specimens</title>
		<author>
			<persName><forename type="first">Juha</forename><surname>Kononen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukas</forename><surname>Bubendorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anne</forename><surname>Kallioniemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maarit</forename><surname>Barlund</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Schraml</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Leighton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joachim</forename><surname>Torhorst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><forename type="middle">J</forename><surname>Mihatsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guido</forename><surname>Sauter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Olli-P</forename><surname>Kallioniemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nat Med</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="844" to="847" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">BANNER: An executable survey of advances in biomedical named entity recognition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Leaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gonzalez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of PSB&apos;08</title>
				<meeting>PSB&apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="652" to="663" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Printing Proteins as Microarrays for High-Throughput Function Determination</title>
		<author>
			<persName><forename type="first">Gavin</forename><surname>Macbeath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stuart</forename><forename type="middle">L</forename><surname>Schreiber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">289</biblScope>
			<biblScope unit="issue">5485</biblScope>
			<biblScope unit="page" from="1760" to="1763" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A comparative study of syntactic parsers for event extraction</title>
		<author>
			<persName><forename type="first">Makoto</forename><surname>Miwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tadayoshi</forename><surname>Hara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;10</title>
				<meeting>BioNLP&apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010a</date>
			<biblScope unit="page" from="37" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Event extraction with complex event classification using rich features</title>
		<author>
			<persName><forename type="first">Makoto</forename><surname>Miwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rune</forename><surname>Saetre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jin-Dong</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Bioinformatics and Computational Biology (JBCB)</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="131" to="146" />
			<date type="published" when="2010">2010b</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Feature forest models for probabilistic HPSG parsing</title>
		<author>
			<persName><forename type="first">Yusuke</forename><surname>Miyao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="35" to="80" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">GENIA corpus: An annotated research abstract corpus in molecular biology domain</title>
		<author>
			<persName><forename type="first">Tomoko</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuka</forename><surname>Tateisi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hideki</forename><surname>Mima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of HLT&apos;02</title>
				<meeting>HLT&apos;02</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="73" to="77" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Incorporating GENETAG-style annotation to GENIA corpus</title>
		<author>
			<persName><forename type="first">Tomoko</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jin-Dong</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yue</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;09</title>
				<meeting>BioNLP&apos;09</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="106" to="107" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Event extraction for post-translational modifications</title>
		<author>
			<persName><forename type="first">Tomoko</forename><surname>Ohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Makoto</forename><surname>Miwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jin-Dong</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;10</title>
				<meeting>BioNLP&apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="19" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">PubMeth: a cancer methylation database combining text-mining and expert annotation</title>
		<author>
			<persName><forename type="first">Maté</forename><surname>Ongenaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leander</forename><surname>Van Neste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>De Meyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gerben</forename><surname>Menschaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sofie</forename><surname>Bekaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wim</forename><surname>Van Criekinge</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucl. Acids Res</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="D842" to="846" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
	<note>suppl</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">methBLAST and methPrimerDB: web-tools for PCR based methylation analysis</title>
		<author>
			<persName><forename type="first">Filip</forename><surname>Pattyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jasmien</forename><surname>Hoebeeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piet</forename><surname>Robbrecht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evi</forename><surname>Michels</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anne</forename><surname>De Paepe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guy</forename><surname>Bottu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Coornaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Herzog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Speleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jo</forename><surname>Vandesompele</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">496</biblScope>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Joint inference for knowledge extraction from biomedical literature</title>
		<author>
			<persName><forename type="first">Hoifung</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucy</forename><surname>Vanderwende</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL/HLT&apos;10</title>
				<meeting>NAACL/HLT&apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="813" to="821" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Comparative analysis of five protein-protein interaction corpora</title>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antti</forename><surname>Airola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Juho</forename><surname>Heimonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jari</forename><surname>Björne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">S6</biblScope>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
	<note>Suppl. 3</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A Markov logic approach to biomolecular event extraction</title>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hong-Woo</forename><surname>Chun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Toshihisa</forename><surname>Takagi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;09 Shared Task</title>
				<meeting>BioNLP&apos;09 Shared Task</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="41" to="49" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">X inactivation, differentiation, and DNA methylation</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Riggs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cytogenetic and Genome Research</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="9" to="25" />
			<date type="published" when="1975">1975</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Dependency parsing and domain adaptation with LR models and parser ensembles</title>
		<author>
			<persName><forename type="first">Kenji</forename><surname>Sagae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EMNLP-CoNLL 2007</title>
				<meeting>EMNLP-CoNLL 2007</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1044" to="1050" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Schena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dari</forename><surname>Shalon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ronald</forename><forename type="middle">W</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrick</forename><forename type="middle">O</forename><surname>Brown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">270</biblScope>
			<biblScope unit="issue">5235</biblScope>
			<biblScope unit="page" from="467" to="470" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">DNA methylation landscapes: provocative insights from epigenomics</title>
		<author>
			<persName><forename type="first">Miho</forename><forename type="middle">M</forename><surname>Suzuki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adrian</forename><surname>Bird</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Reviews Genetics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="465" to="476" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">GENETAG: A tagged corpus for gene/protein named entity recognition</title>
		<author>
			<persName><forename type="first">Lorraine</forename><surname>Tanabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Natalie</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lynne</forename><forename type="middle">H</forename><surname>Thom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wayne</forename><surname>Matten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Wilbur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">S3</biblScope>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
	<note>Suppl. 1</note>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Subdomain adaptation of a POS tagger with a small corpus</title>
		<author>
			<persName><forename type="first">Yuka</forename><surname>Tateisi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshimasa</forename><surname>Tsuruoka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of BioNLP&apos;06</title>
				<meeting>BioNLP&apos;06<address><addrLine>New York, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006-06">June 2006</date>
			<biblScope unit="page" from="136" to="137" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Gene ontology: tool for the unification of biology</title>
	</analytic>
	<monogr>
		<title level="j">Nature Genetics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="25" to="29" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
	<note>The Gene Ontology Consortium</note>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Construction of an annotated corpus to support biomedical information extraction</title>
		<author>
			<persName><forename type="first">Paul</forename><surname>Thompson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Syed</forename><surname>Iqbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>McNaught</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sophia</forename><surname>Ananiadou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">349</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Investigating heterogeneous protein annotations toward cross-corpora utilization</title>
		<author>
			<persName><forename type="first">Yue</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jin-Dong</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rune</forename><surname>Saetre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun'ichi</forename><surname>Tsujii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BMC Bioinformatics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">403</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
