<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Analysis of Gene/Protein Associations at PubMed Scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sampo Pyysalo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomoko Ohta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun'ichi Tsujii</string-name>
          <email>tsujii@is.s.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Tokyo</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Centre for Text Mining, University of Manchester</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>54</fpage>
      <lpage>62</lpage>
      <abstract>
        <p>Event extraction following the GENIA Event corpus and BioNLP'09 shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed, far beyond the narrow subdomains of biomedicine for which annotated resources are available. We aim to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements is then manually analyzed with reference to the GENIA ontology. We provide a first estimate of the overall coverage of existing resources for event extraction and identify several classes of biologically significant associations of genes and proteins that are not addressed by these resources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In recent years, there has been a significant shift in
focus in biomedical information extraction from
simple pairwise relations representing
associations such as protein-protein interactions (PPI)
toward representations that capture typed, structured
associations of arbitrary numbers of entities in
specific roles, frequently termed event extraction
        <xref ref-type="bibr" rid="ref2">(Ananiadou et al., 2010)</xref>
        . Much of this work draws
on the GENIA Event corpus
        <xref ref-type="bibr" rid="ref16">(Kim et al., 2008)</xref>
        ,
a resource of 1500 PubMed abstracts in the
domain of transcription factors in human blood cells
annotated for genes, proteins and related entities,
events and syntax. This resource served also as
the source for the annotations in the BioNLP’09
shared task on event extraction (BioNLP ST), the
first collaborative evaluation of event extraction
methods
        <xref ref-type="bibr" rid="ref17 ref25 ref29">(Kim et al., 2009)</xref>
        .
      </p>
      <p>
        Another recent trend in the domain is a move
toward application of extraction methods to the
full scale of the existing literature, with results for
various targets covering the entire PubMed
literature database of nearly 20 million citations
being made available
        <xref ref-type="bibr" rid="ref13 ref13 ref14 ref14 ref22 ref23 ref5 ref6">(McIntosh and Curran, 2009;
Björne et al., 2010b; Gerner et al., 2010a; Gerner
et al., 2010b)</xref>
        . As event extraction methods
initially developed to target the set of events defined
in the GENIA / BioNLP ST corpora are now
being applied at PubMed scale, it makes sense to ask
how much of the full spectrum of gene/protein
associations found there they can maximally cover, a
question separate from issues relating to their
performance in extracting the targeted event types.
      </p>
      <p>
        In this study, we seek to characterize the full
range of gene/protein associations described in
the literature and estimate what coverage of these
associations state-of-the-art event extraction
systems can maximally achieve. We approach these
questions by assuming that associations are stated
through specific words, analogously to the widely
applied concepts of interaction words in
protein-protein interaction extraction and text binding
words in event extraction. We follow a
statistical approach to identifying such candidate words
using an automatically tagged corpus covering the
entire PubMed literature database.
We term our extraction target gene/protein
associations. So as not to limit the
applicability of our results, we define our target entities
(“genes/proteins”) broadly. The specific definition
of this entity type is provided by the GENETAG
corpus annotation
        <xref ref-type="bibr" rid="ref32">(Tanabe et al., 2005)</xref>
        on which
the applied automatic tagger is trained.
GENETAG annotates a single class of gene/protein
entities that encompasses genes and gene products as
well as related entities such as domains,
promoters, and complexes. This inclusiveness permits the
identification of associations between more than
only the gene and gene product entities included
in the GENIA / BioNLP ST annotation
        <xref ref-type="bibr" rid="ref17 ref25 ref29">(Ohta et
al., 2009)</xref>
        .
      </p>
      <p>
        We also intend “associations” broadly,
understanding it to encompass direct PPI-type
interactions as well as experimental findings
suggesting them, as pursued in the BioCreative PPI tasks
        <xref ref-type="bibr" rid="ref18">(Krallinger et al., 2007)</xref>
        , BioNLP-style events
(“things that happen”) such as expression and
localization, as well as static relations in the sense
of
        <xref ref-type="bibr" rid="ref17 ref25 ref29">(Pyysalo et al., 2009)</xref>
        , associations such as
part-of that hold between entities without necessarily
implying change. Indeed, while we take
“association” to exclude properties and states that involve
only a single entity, we do not set other specific
constraints, following instead a loose biologically
motivated definition that can be characterized
informally as “any association between genes, gene
products, or related entities that is of biological
interest.”
      </p>
      <p>We note that while our aims and approach share
a number of features with tasks such as
protein-protein interaction extraction, they differ in focus
on statements of association (as opposed to the
entities stated to be associated) and in that we do not
aim to find instances of the expressions of
interest with high recall, but rather identify association
types. Due to the large scale of the PubMed corpus
it is possible to pursue an approach that only
considers a small, high-reliability portion of the
available data (discarding most instances) and still finds
associations of interest. Thus, instead of
instance-level recall, we pay particular attention to not
introducing overt bias e.g. toward particular forms
of expression so as to be able to use the result to
estimate relative frequencies of the associations in
the full corpus.
</p>
    </sec>
    <sec id="sec-2">
      <title>3 Corpus resources</title>
      <p>
        This study is based on the 2009 distribution of
the full PubMed literature database,
encompassing approximately 18 million citations of
biomedical domain scientific articles. For the analysis
of this data, we make use of the Turku PubMed
Scale (TPS) corpus
        <xref ref-type="bibr" rid="ref5 ref6">(Björne et al., 2010b)</xref>
        , an
automatically annotated corpus covering the entire
PubMed. Note that while the original focus of
the corpus is on BioNLP-style events, we do not
use these annotations. Instead, we make use of
the automatically identified sentence boundaries,
named entities, and the syntactic analyses, briefly
presented in the following.
      </p>
      <p>
        All PubMed documents in the TPS corpus were
initially processed with the GENIA sentence
splitter with simple heuristic post-processing to
correct some errors from the machine learning-based
splitter.<sup>1</sup> The sentence splitter is estimated to
achieve an F-score of 99.7% on the GENIA
corpus. Gene/protein named entities were tagged in
all sentences using the BANNER named entity
recognition system
        <xref ref-type="bibr" rid="ref11 ref15 ref16 ref19 ref20 ref21 ref28 ref3">(Leaman and Gonzalez, 2008)</xref>
        trained on the GENETAG corpus and thus reflect
its inclusive definition of gene/protein. The release
of BANNER applied to tag the TPS corpus was
reported to achieve 86.4% F-score on the
GENETAG corpus, and an evaluation on a random
sample of tagged entities in TPS data found 87%
precision
        <xref ref-type="bibr" rid="ref5 ref6">(Björne et al., 2010a)</xref>
        , suggesting that the
tagger generalizes well to the whole PubMed.
      </p>
      <p>
        Finally, the TPS corpus distribution includes
syntactic analyses for all sentences in which at
least one named entity has been tagged.<sup>2</sup> Parses
were produced using the McClosky-Charniak
parser, a version of the Charniak-Johnson parser
        <xref ref-type="bibr" rid="ref35 ref7 ref9">(Charniak and Johnson, 2005)</xref>
        adapted to the
biomedical domain. The parser has shown
state-of-the-art performance in recent intrinsic
        <xref ref-type="bibr" rid="ref11 ref15 ref16 ref19 ref20 ref21 ref28 ref3">(McClosky and Charniak, 2008)</xref>
        and extrinsic
        <xref ref-type="bibr" rid="ref24 ref26">(Miwa
et al., 2010)</xref>
        evaluations. The McClosky-Charniak
parser produces constituency (phrase structure)
analyses in the Penn Treebank scheme, with Penn
part-of-speech tags. In addition to these
analyses, dependency analyses in the Stanford
Dependency (SD) scheme
        <xref ref-type="bibr" rid="ref11 ref15 ref16 ref19 ref20 ref21 ref28 ref3 ref34">(de Marneffe and Manning,
2008)</xref>
        , created from the constituency analyses by
automatic conversion using the Stanford
parser tools,<sup>3</sup> are provided in the TPS corpus.
      </p>
      <p><sup>1</sup> http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/</p>
      <p><sup>2</sup> Sentences not containing entities are not parsed, as
parsing was the most computationally intensive part of the
automatic corpus annotation and the system could only extract
events from sentences with entities.</p>
      <p><sup>3</sup> http://nlp.stanford.edu/software/lex-parser.shtml</p>
    </sec>
    <sec id="sec-4">
      <title>4 Identification of Gene/Protein Associations</title>
      <p>In this section, we present our approach to
identifying statements of gene/protein associations
through an extended analysis of word statistics in
PubMed.
</p>
      <sec id="sec-4-1">
        <title>4.1 Overall Statistics</title>
        <p>As expected for a corpus of English, the most
frequent words in PubMed are prepositions,
determiners, conjunctions and forms of the copula
(“is”) and, if non-word tokens are included,
punctuation. In this work, we focus on content words,
filtering closed class words and non-words and
applying a basic stopword list including the PubMed
stopword list. Table 1 shows the most frequent
such words in PubMed.<sup>4</sup> The distribution
suggests that medical topics dominate biomolecular
ones overall, with e.g. the word “patients”
occurring more than three times as often as the word
“protein”. Although general expressions such as
“activity” and “effect” can be used to describe
protein associations, the most frequent words contain
no word specific to protein associations.
</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Gene/Protein Mentions</title>
        <p>The automatic tagging for mentions of
gene/protein-related named entities in the TPS
corpus covers a total of 36.4 million gene/protein
mentions in 5.4 million documents, approximately
30% of all PubMed citations. These annotations
allow focus on texts likely relevant to gene/protein
associations. Here, as we are interested in
particular in texts describing associations between two
or more gene/protein related entities, we apply a
focused selection, picking only those individual
sentences in which two or more mentions
co-occur. While this excludes associations in which
the entities occur in different sentences, their
relative frequency is expected to be low: for example,
in the BioNLP ST data, all event participants
occurred within a single sentence in 95% of the
targeted biomolecular event statements. In the
TPS data, there are 9.0 million sentences with at
least two tagged entities. These sentences contain
25.4 million entity mentions, approximately 70%
of the total number.</p>
        <p><sup>4</sup> For this and other word statistics in this section, basic
tokenization separating punctuation from words and
lowercasing has been applied, but stemming or lemmatization is not
performed.</p>
        <p>[Table residue, word column only (counts not recovered): cells, protein, expression, activity, cell, gene, receptor, human, levels, factor]</p>
        <p>
          Table 2 shows the most frequent words in
sentences with at least two tagged protein mentions.
The list suggests that this simple selection is
sufficient to identify a subset of PubMed where
biomolecular topics are prominent: both “protein”
and “expression” appear ranked near the top.
        </p>
        <p>
The TPS corpus contains both constituency and
dependency analyses of sentence syntax. While
both forms of representation arguably capture
largely the same information, dependency
representations have been argued to make the relevant
syntactic relations more immediately accessible
and have been successfully employed in many
recent domain information extraction approaches,
frequently in conjunction with the use of the
shortest dependency path between two entities to
discover stated associations (see e.g.
          <xref ref-type="bibr" rid="ref12 ref23 ref35 ref4 ref7 ref9">(Bunescu and
Mooney, 2005; Fundel et al., 2007; Miwa et al.,
2009; Björne et al., 2009)</xref>
          ).
        </p>
        <p>
          Here, we follow the assumption that when two
entities are stated to be associated in some way,
the most important words expressing their
association will typically be found on the shortest
dependency path connecting the two entities (cf. the
shortest path hypothesis of
          <xref ref-type="bibr" rid="ref35 ref7 ref9">(Bunescu and Mooney,
2005)</xref>
          ). The specific dependency representation
applied here is the collapsed, coordination-processed
variant of the Stanford representation, which is
expressly oriented toward use in this type of
information extraction approach
          <xref ref-type="bibr" rid="ref11 ref15 ref16 ref19 ref20 ref21 ref28 ref3 ref34">(de Marneffe and
Manning, 2008)</xref>
          . When extracting the shortest paths,
we further avoid traversing coordinating
conjunction dependencies (conj*) to ensure that relevant
words are not excluded in sentences involving
coordination and that similar paths are extracted for
all coordinated words (Figure 1).
        </p>
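        <p>The path extraction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the edge-list input format, the function name shortest_path and the toy analysis of “p1 activates p2 and p3” are hypothetical. The sketch treats the dependency graph as undirected, leaves out conj* dependencies, and finds a path with the fewest dependencies by breadth-first search.</p>
        <preformat><![CDATA[
```python
from collections import deque


def shortest_path(edges, source, target, skip_prefix="conj"):
    """Return the tokens on a shortest dependency path, or None.

    edges: iterable of (head, dependent, label) triples (assumed format).
    Dependencies whose label starts with skip_prefix are not traversed.
    """
    # Build an undirected adjacency list, leaving out conj* dependencies.
    graph = {}
    for head, dep, label in edges:
        if label.startswith(skip_prefix):
            continue
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)

    # Breadth-first search returns a path with the fewest dependencies.
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Toy coordination-processed analysis of "p1 activates p2 and p3":
# the object dependency is propagated to the second conjunct.
EDGES = [
    ("activates", "p1", "nsubj"),
    ("activates", "p2", "dobj"),
    ("activates", "p3", "dobj"),
    ("p2", "p3", "conj_and"),
]

print(shortest_path(EDGES, "p1", "p3"))
print(shortest_path(EDGES, "p2", "p3"))
```
]]></preformat>
        <p>In this toy example the path between p2 and p3 passes through “activates” rather than following the direct conjunction dependency, so the verb is retained on the path for both coordinated entities.</p>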
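        <p>[Figure 1: dependency analyses of “p1 activates p2 and p3” illustrating shortest path extraction in the presence of coordination. Dependency labels recovered from the figure: nsubj, dobj, cc and conj in one panel; nsubj, dobj, dobj and conj in the other.]</p>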
        <p>The corpus contains 31.8 million pairs of
gene/protein mentions co-occurring in a sentence,
and a connecting shortest path could be extracted
for 97% of these.<sup>5</sup> Table 3 shows the words most
frequently occurring on these paths. This list again
suggests an increased focus on words relating to
gene/protein associations: expression is the most
frequent word on the paths, and binding appears
in the top-ranked words.
</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.4 Path probabilities</title>
        <p>Entities often co-occur in text without any
association being stated between them, but some shortest
dependency path can be found connecting (nearly)
all co-occurring entities. Distinguishing paths that
state associations from those that do not could
thus help identify words that are key to
expressing those associations.</p>
        <p><sup>5</sup> Failures to extract a path were primarily due to
clause-level coordination (e.g. “we study P1 and we find that P1 is
. . . ”) and, rarely, failures from the parser or the dependency
conversion.</p>
        <p>
          A wealth of approaches for distinguishing
relevant paths from irrelevant ones have been
proposed in the protein-protein interaction
extraction literature, including rule-based,
pattern-based (hand-written and learned) and supervised
classification-based methods (e.g.
          <xref ref-type="bibr" rid="ref1 ref10 ref12 ref23 ref28 ref30 ref31 ref35">(Ding et al.,
2003; Yakushiji et al., 2005; Rinaldi et al., 2006;
Fundel et al., 2007; Saetre et al., 2007; Airola et
al., 2008; Miwa et al., 2009)</xref>
          ). However, writing
explicit rules conflicts with our aim of discovering
associations (and statements of associations) that
we do not already know about, and application of
standard supervised learning methods would
similarly limit the scope of what can be extracted by
the (known) training data.
        </p>
        <p>
          Here, drawing on ideas from Open
Information Extraction
          <xref ref-type="bibr" rid="ref11 ref3">(Etzioni et al., 2008)</xref>
          , we adopt
a probabilistic approach using an “unlexicalized”
machine learning method. We defer detailed
description of the method to Section 5, now simply
assuming a way to assign to each path p an
(estimated) probability P (p) that the path expresses an
association between the entities it connects. We
make use of P(p) in two straightforward ways to refine
the pure frequency-based word rankings presented
above: first, counting words only when they occur
on paths whose estimated probability of being relevant
exceeds a given threshold, and
second, replacing the “raw” word count with the
expected number of times that word appears in a
relevant path, informally
<inline-formula><tex-math>E_w = \sum_{p : w \in p} P(p)</tex-math></inline-formula>.
        </p>
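        <p>As a minimal sketch of the expected count described above, the following Python code sums path probabilities per word, optionally keeping only path occurrences whose probability exceeds a threshold. The input format (a list of word lists paired with probabilities), the function name expected_counts and the example probabilities are assumptions made for illustration; they are not taken from the study's data or code.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def expected_counts(paths, threshold=None):
    """Sum P(p) over path occurrences for each word on the path.

    paths: iterable of (words, probability) pairs, one per path occurrence
    (an assumed input format). If threshold is given, only occurrences
    with probability above it contribute.
    """
    totals = defaultdict(float)
    for words, prob in paths:
        if threshold is not None and prob <= threshold:
            continue
        # Count each word once per path occurrence.
        for word in set(words):
            totals[word] += prob
    return dict(totals)


# Hypothetical path occurrences with made-up probabilities.
PATHS = [
    (["binds"], 0.9),
    (["binds", "to"], 0.8),
    (["and"], 0.1),
    (["expression", "of"], 0.6),
]

ranking = sorted(
    expected_counts(PATHS, threshold=0.5).items(),
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```
]]></preformat>
        <p>With the threshold applied, the low-probability occurrence of “and” contributes nothing, while “binds” accumulates the probabilities of both paths on which it occurs.</p>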
        <p>Table 4 shows the top-ranked words by Ew as
calculated using the method described below. The
listing contains only words that are regularly used
to express gene/protein associations, suggesting
that probabilistic ranking can allow clear focus on
the targeted statements.
</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Machine Learning</title>
      <p>
        We applied supervised machine learning to
estimate the probability that a dependency path
connecting gene/protein mentions expresses an
association of these entities, training with
“unlexicalized” features
        <xref ref-type="bibr" rid="ref11 ref15 ref16 ref19 ref20 ref21 ref28 ref3">(Banko and Etzioni, 2008)</xref>
        to force
the learning method to generalize and to learn
based on the patterns of expression only.
      </p>
      <sec id="sec-5-1">
        <title>5.1 Training Data</title>
        <p>
          For training data, we could potentially draw from
a wealth of corpus resources annotated for some
form of association between genes/proteins, such
as PPI corpora (see e.g.
          <xref ref-type="bibr" rid="ref1 ref15 ref28">(Pyysalo et al., 2008)</xref>
          ).
However, as we are in particular interested in
event extraction approaches, we chose to use the
BioNLP ST data. This dataset also identifies the
expressions stating the annotated events (“trigger
words”), providing test material for the method.
        </p>
        <p>
          As the BioNLP ST data does not explicitly
identify pairs of entities that are stated to be associated,
it was first necessary to derive a pairwise
representation from the event representation. We applied a
mapping similar to that introduced by
          <xref ref-type="bibr" rid="ref15 ref28">(Heimonen
et al., 2008)</xref>
          for deriving pairwise relations from
the event-style annotations of the BioInfer corpus
          <xref ref-type="bibr" rid="ref27">(Pyysalo et al., 2007)</xref>
          : for each co-occurring entity
pair, we identified all paths through event
structures connecting the two entities. If these paths
included at least one where the direction of causality
was not reversed on the path, the pair was marked
as a positive example of an association; otherwise
it was marked negative. Finally, we interpreted
the Equiv annotations identifying equivalent entity
references in the data: any pair where entities are
equivalent to those of at least one positive pair was
marked positive (see Figure 2).
        </p>
        <p>Finally, to make this pair data consistent with
the TPS event spans, tokenization and other
features, we aligned the entity annotations of the two
corpora, mapping a BioNLP ST entity to a TPS
entity if their spans matched or the source entity was
entirely contained within the span of the candidate
target entity. Unmatched entities were removed
from the data. This processing was applied to the
BioNLP ST training set, creating a corpus of 6889
entity pairs of which 1119 (16%) were marked as
expressing an association (positive).</p>
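        <p>The span alignment rule above can be illustrated with a short Python sketch. The representation of spans as (start, end) character offsets with an exclusive end, the function name align_entity and the preference for exact matches over containing spans are assumptions of this sketch; the text does not specify how ties between several candidate target entities were resolved.</p>
        <preformat><![CDATA[
```python
def align_entity(source_span, target_spans):
    """Map a source entity span to a target span, or return None.

    Spans are (start, end) character offsets with an exclusive end
    (an assumed representation). A target matches if the spans are equal
    or the source span lies entirely within the target span. Exact
    matches are preferred; this tie-breaking is an assumption.
    """
    start, end = source_span
    containing = None
    for target in target_spans:
        t_start, t_end = target
        if (t_start, t_end) == (start, end):
            return target
        if containing is None and t_start <= start and end <= t_end:
            containing = target
    return containing


# Hypothetical offsets for illustration only.
TPS_SPANS = [(0, 4), (10, 25), (30, 34)]

print(align_entity((0, 4), TPS_SPANS))    # identical span
print(align_entity((12, 20), TPS_SPANS))  # contained in a longer span
print(align_entity((26, 31), TPS_SPANS))  # no match: entity is dropped
```
]]></preformat>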
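        <p>[Figure 2: examples of deriving entity pairs from event annotations. Panel text and labels recovered from the figure: “IL-4 … induce … expression of CD86” (Protein, Pos.Reg, Gene expression, Protein; roles Cause, Theme, Theme); “MST1 and MST2 phosphorylate …” (Protein, Protein, Phosphorylation; roles Cause, Cause); “cytokine activate transcription factor (TF)” (Protein, Pos.Reg, Protein, Protein; roles Cause, Theme, Equiv).]</p>
        <p>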
We applied the libSVM Support Vector
Machine implementation using probabilistic outputs
          <xref ref-type="bibr" rid="ref8">(Chang and Lin, 2001)</xref>
          . For training the
classifier, we applied features derived only from the
words and dependencies along the shortest path
between any two entities. We first replaced each
word marked as a gene/protein mention with a
placeholder string and each other word with its
part of speech tag, using the Penn tags included
in TPS. We then derived a set of frequently used
dependency path features from this representation
(see e.g.
          <xref ref-type="bibr" rid="ref1 ref23 ref28 ref34">(Airola et al., 2008; Van Landeghem et
al., 2008; Miwa et al., 2009)</xref>
          ): path length, path
“tokens” (PoS/placeholder), dependency types on
the path, and “token”/dependency 2-grams and
3-grams. Preliminary experiments using
cross-validation on the training data suggested
performance was not sensitive to the details of the
feature representation. The SVM regularization
parameter was selected similarly, testing parameter
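values on the scale <inline-formula><tex-math>\ldots, 2^{-1}, 2^{0}, 2^{1}, \ldots</tex-math></inline-formula> and
selecting <inline-formula><tex-math>c = 2^{-3}</tex-math></inline-formula> for the final experiment.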
        </p>
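        <p>To make the unlexicalized representation concrete, the following Python sketch derives the listed feature types (path length, path tokens, dependency types, and 2-grams and 3-grams) from a path whose words have already been replaced by part-of-speech tags or an entity placeholder. The feature names, the PROT placeholder and the choice to build n-grams over an interleaved token/dependency sequence are assumptions made for illustration; the exact feature encoding used in the study is not reproduced here.</p>
        <preformat><![CDATA[
```python
def unlexicalized_features(tokens, dependencies, max_n=3):
    """Build a bag of features from an unlexicalized path.

    tokens: PoS tags or entity placeholders along the path.
    dependencies: dependency types between consecutive tokens.
    Feature names are hypothetical.
    """
    if len(dependencies) != len(tokens) - 1:
        raise ValueError("expected one dependency between each token pair")

    # Interleave tokens and dependency types: tok dep tok dep tok ...
    sequence = []
    for i, token in enumerate(tokens):
        sequence.append(token)
        if i < len(dependencies):
            sequence.append(dependencies[i])

    features = {"length=%d" % len(dependencies): 1}
    for token in tokens:
        features["tok=" + token] = features.get("tok=" + token, 0) + 1
    for dep in dependencies:
        features["dep=" + dep] = features.get("dep=" + dep, 0) + 1
    # n-grams over the interleaved token/dependency sequence.
    for n in range(2, max_n + 1):
        for i in range(len(sequence) - n + 1):
            key = "%dgram=%s" % (n, "|".join(sequence[i:i + n]))
            features[key] = features.get(key, 0) + 1
    return features


# "p1 activates p2": entity mentions become PROT, the verb its PoS tag.
feats = unlexicalized_features(["PROT", "VBZ", "PROT"], ["nsubj", "dobj"])
print(sorted(feats))
```
]]></preformat>
        <p>Because no surface word appears in any feature, two sentences using different verbs in the same syntactic configuration yield identical feature sets, which is what forces the classifier to generalize over patterns of expression.</p>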
        <p>
          The resulting classifier is intentionally weak,
being trained to recognize not the specific
properties of positive statements in its training set but
rather their general characteristics. Development
testing indicated an F-score of approximately 50% and
an AUC of approximately 70%, substantially below the state
of the art for the comparable PPI pair extraction
task
          <xref ref-type="bibr" rid="ref23">(Miwa et al., 2009)</xref>
          .
        </p>
        <p>
Ew, informally characterized as the expected
number of times a word w occurs on a dependency
path which is estimated to be likely to express a
gene/protein association, is central to the applied
probabilistic ranking. In technical detail, we
derived Ew as follows.
        </p>
        <p>We first extracted all instances of shortest
dependency paths connecting two genes/proteins.
We then combined all paths sharing the same
“unlexicalized” representation, giving a total of 6.8
million unique paths. To make storage and
processing more feasible, we removed paths
occurring only once in the entire corpus. This filtered
out 6.0 million paths – 88% of the total number
of unique paths – but due to the Zipfian
properties of the distribution, the remaining 0.8 million
unique paths account for 16.7 million occurrences,
or 74% of the total occurrences. We thus do not
expect this practically motivated filtering to
fundamentally alter the basic statistical properties of
the data.</p>
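        <p>The grouping and filtering step above amounts to counting identical unlexicalized paths and discarding those seen only once. The following Python sketch shows this on a handful of hypothetical path strings; the string encoding of paths and the function name filter_singletons are assumptions, and the example counts are not drawn from the corpus.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def filter_singletons(path_occurrences):
    """Group identical unlexicalized paths and drop those seen only once.

    path_occurrences: iterable of hashable path representations
    (here plain strings, an assumed encoding).
    Returns the retained counts and the fraction of occurrences they cover.
    """
    counts = Counter(path_occurrences)
    kept = {path: n for path, n in counts.items() if n > 1}
    total = sum(counts.values())
    coverage = sum(kept.values()) / total if total else 0.0
    return kept, coverage


# Hypothetical occurrences; real paths would come from the parsed corpus.
OCCURRENCES = [
    "PROT nsubj VBZ dobj PROT",
    "PROT nsubj VBZ dobj PROT",
    "PROT nsubj VBZ dobj PROT",
    "PROT prep_of NN prep_of PROT",
    "PROT nn PROT",
    "PROT nn PROT",
]

kept, coverage = filter_singletons(OCCURRENCES)
print(kept, round(coverage, 2))
```
]]></preformat>
        <p>In this toy input one of three unique paths is removed, yet the retained paths still cover five of the six occurrences, mirroring on a small scale how removing a large share of unique paths can leave most occurrences in place when the distribution is heavily skewed.</p>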
        <p>Each path was then assigned the estimated
probability P (p) using the probabilistic outputs of the
SVM trained as described above. At this stage,
we could potentially introduce a threshold
parameter into the method defining a tradeoff between
path quality and inclusiveness. However, as initial
testing suggested the method to be relatively
robust to the choice of cutoff, we simply take the
obvious choice of defining as a “likely positive” path
any path for which P(p) &gt; 0.5. We then removed any
path that did not meet this condition as not (likely)
expressing an association, leaving 46437 unique
unlexicalized paths (5.7% of the total) predicted
to express gene/protein associations. Finally, each
occurrence of a word w on one of these paths is
assigned the path probability P (p). In cases where
words appear on multiple paths, they are simply
assigned the maximum of the path probabilities.
Ew is then the sum of these probabilities over the
entire corpus.
</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Evaluation</title>
      <p>We first evaluated each of the word rankings
discussed in Section 4 by comparing the ranked lists
of words against the set of single words marked
as trigger expressions in the BioNLP ST
development data. These single-word triggers account for
92% of all trigger expressions marked in the data,
and there are 343 unique triggers. Figure 3 shows
precision/recall curves for each of the four
rankings generated by the word frequency/expected
value. The result supports the informal
observations made through the top-ranked words in
Tables 1-4: the later approaches provide a much
more relevant ranking for identifying words
expressing associations.</p>
      <p>[Figure 3: precision/recall curves for the four word rankings (Overall, Two protein sentences, Shortest paths, Path probabilities); axis label recovered from the figure: Recall.]</p>
      <p>We next performed a manual study of candidate
words for stating gene/protein associations using
the Ew ranking. Here, we take as known any
word for which the normalized, lemmatized form<sup>6</sup>
matches that of any word appearing as a trigger
expression in the BioNLP ST training or
development test data. We then selected the words ranked
highest by Ew that were not known, grouped by
normalized and lemmatized form, and added for
reference examples of frequent shortest
dependency paths on which any of these words appear.
These groups were evaluated by a PhD biologist
with expertise in event annotation and basic
understanding of the Stanford Dependency
representation of syntax (TO), with instructions to mark as
positive words that in contexts like those provided
can be understood to express a gene/protein
association, defined broadly as described in Section 2.</p>
      <p><sup>6</sup> Using the NLM LVG norm normalizer, available
at http://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2010/</p>
      <p>In total, 1200 candidate expressions were
manually evaluated, of which 660 were judged to
express an association. We then proceeded to
manually cluster them by the type of association they
would typically express. Following preliminary
analysis, we performed a top-level division into
three categories: events (“things that happen”)
involving gene/protein entities in their natural
environment (55% of associations), “static” relations
holding between the entities (28%), and
experimental observations and manipulations that do not
occur naturally (17%). We further grouped the
new event statements into event classes using the
        Gene Ontology
        <xref ref-type="bibr" rid="ref33">(The Gene Ontology Consortium,
2000)</xref>
        for reference and identified event classes
that were not previously included in the GENIA
event ontology. This process suggested 18 event
classes that were not previously considered in the
GENIA ontology, shown in Figure 4 with a
tentative proposal on how these classes could be
organized into the GENIA ontology, with examples of
identified words expressing each new event type.
      </p>
      <p>[Figure 4: proposed new event classes and their tentative placement in the GENIA ontology, with example words. Class labels recovered from the figure: Prenylation (“farnesylation”) and Sulfation (“sulfation”); four further labels are garbled in extraction and appear to read Palmitoylation, Biotinylation, Acylation and Oxidation. Example words recovered without their class labels: “co-immunoprecipitate, hybridize”; “immunoblotting, electrophoresis”; “apoptosis”; “chemotaxis”; “exocytosis”; “endocytosis, phagocytosis”; “depolymerization, dissociate”; “hydrolysis”; “replication”; “repair”; “homeostasis”.]</p>
      <p>Finally, to estimate the relative prominence of
the known (i.e. BioNLP ST) expressions of
associations in PubMed compared to those that were
newly identified, we compared the E values of
the unique lemmas, counted as the sum of Ew for
words sharing the lemma. Figure 5 shows a plot of
the values ranked from high to low E. The result
was unexpected: the estimate suggests that even
though the newly identified association words are
drawn from PubMed without subdomain
restrictions and include more than only event
expressions, expressions of event-type associations
using the previously known words are overall much
more prominent in PubMed. Specifically, the total
E value mass of all the newly identified
associations (the area under the curve in Figure 5) is just
22% of that of the known events, and the mass of
the newly identified events is 37% of that of all the new
associations, or only 8% of that of the known events.
</p>
    </sec>
    <sec id="sec-7">
      <title>7 Discussion</title>
      <p>
        We found that currently existing resources for
event extraction are lacking in coverage of e.g.
relatively rare but biologically important protein
post-translational modifications and experimental
outcomes that suggest (but do not state) causal
connections. However, the statistical analysis
suggests that resources already cover the clear
majority of gene/protein events in PubMed, indicating
that an annotation-based approach to extending
coverage of event types (e.g.
        <xref ref-type="bibr" rid="ref26">(Ohta et al., 2010)</xref>
        )
may offer a realistic path to near-complete
coverage of all major gene/protein events in the near
future. With resources for static relation
extraction
        <xref ref-type="bibr" rid="ref17 ref25 ref29">(Pyysalo et al., 2009)</xref>
        this coverage could be
further extended beyond event-type associations.
      </p>
      <p>However, the approach to identifying
gene/protein associations considered here is
limited in a number of ways: it excludes
associations stated across sentence boundaries,
does not treat multi-word expressions as wholes,
and only directly includes associations stated
between exactly two entities. The approach is also
fundamentally limited to associations expressed
through specific words and thus blind to e.g.
part-of relations implied by statements such as
CD14 Sp1-binding site. Further, our estimate of
overall association statement frequency ignored
the “long tail” of the distribution, thus excluding
rare expressions which may nevertheless add
up to a not insignificant fraction of the total.
These factors limit the reliability of the presented
coverage estimates. Finally, it should be noted
that while we have taken any expression of
association for which even a single annotation exists
as “known”, the performance at which many of
these associations can be extracted in practice may
be limited.
</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>We have presented an approach to
discovering expressions of gene/protein associations from
PubMed based on named entity co-occurrences,
shortest dependency paths and an
unlexicalized classifier to identify likely statements of
gene/protein associations. Drawing on the
automatically created full-PubMed annotations of the
Turku PubMed-Scale (TPS) corpus and using the
BioNLP’09 shared task data to define positive and
negative examples of association statements, we
distilled an initial set of over 30 million protein
mentions into a set of 46,000 unique unlexicalized
paths estimated likely to express gene/protein
associations. These paths were then used to rank
all words in PubMed by the expected number of
times they are predicted to express such
associations, and 1200 candidate association-expressing
words not appearing in the BioNLP’09 shared task
data were evaluated manually. The study of these
candidates suggested 18 new event classes for the
GENIA ontology and indicated that the majority of
statements of gene/protein associations not
covered by currently available resources are not
statements of biomolecular events but rather statements
of static relations or experimental manipulation.</p>
      <p>The event annotation of the GENIA corpus was
originally designed to cover events discussed in
publications on a limited subdomain of
biomolecular science. It could thus be assumed that the
event types and the specific statements annotated
in GENIA would have only modest coverage of
all gene/protein association types and statements
in PubMed. However, our results suggest that
even the BioNLP’09 shared task data, a subset
of GENIA, may represent a clear majority of all
gene/protein associations. Nevertheless, this estimate
of coverage is a first attempt and involves many
uncertain factors and potential sources of error,
calling for more research.</p>
      <p>The data derived from TPS created in this study,
including the shortest paths, their estimated
probabilities, and the word lists ranked by
probability of stating a gene/protein association, are
available for research purposes from the GENIA
project homepage http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was supported by Grant-in-Aid for
Specially Promoted Research (MEXT, Japan).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Antti</given-names>
            <surname>Airola</surname>
          </string-name>
          , Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, and
          <string-name>
            <given-names>Tapio</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 11</issue>
          ):
          <fpage>S2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Sophia</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          , Sampo Pyysalo,
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Douglas B.</given-names>
            <surname>Kell</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Event extraction for systems biology by text mining the literature</article-title>
          .
          <source>Trends in Biotechnology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Michele</given-names>
            <surname>Banko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The tradeoffs between open and traditional relation extraction</article-title>
          .
          <source>In Proceedings of ACL-08: HLT</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jari</given-names>
            <surname>Björne</surname>
          </string-name>
          , Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, and
          <string-name>
            <given-names>Tapio</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Extracting complex biological events with rich graph-based feature sets</article-title>
          .
          <source>In Proceedings of the BioNLP 2009 Shared Task</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jari</given-names>
            <surname>Björne</surname>
          </string-name>
          , Filip Ginter, Sampo Pyysalo, Jun'ichi Tsujii, and Tapio Salakoski
          .
          <year>2010a</year>
          .
          <article-title>Complex event extraction at PubMed scale</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>26</volume>
          (
          <issue>12</issue>
          ):
          <fpage>i382</fpage>
          -
          <lpage>390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jari</given-names>
            <surname>Björne</surname>
          </string-name>
          , Filip Ginter, Sampo Pyysalo, Jun'ichi Tsujii, and Tapio Salakoski
          .
          <year>2010b</year>
          .
          <article-title>Scaling up biomedical event extraction to the entire PubMed</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Razvan C.</given-names>
            <surname>Bunescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raymond J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>A shortest path dependency kernel for relation extraction</article-title>
          .
          <source>In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP'05)</source>
          , pages
          <fpage>724</fpage>
          -
          <lpage>731</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Chih-Chung</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <year>2001</year>
          .
          <article-title>LIBSVM: a library for support vector machines</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Charniak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Coarse-to-fine n-best parsing and maxent discriminative reranking</article-title>
          .
          <source>In Proceedings of ACL'05</source>
          , pages
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jing</given-names>
            <surname>Ding</surname>
          </string-name>
          , Daniel Berleant, Jun Xu, and
          <string-name>
            <given-names>Andy W.</given-names>
            <surname>Fulmer</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Extracting biochemical interactions from MEDLINE using a link grammar parser</article-title>
          .
          <source>In Proceedings of ICTAI'03</source>
          , pages
          <fpage>467</fpage>
          -
          <lpage>471</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michele</given-names>
            <surname>Banko</surname>
          </string-name>
          , Stephen Soderland, and
          <string-name>
            <given-names>Daniel S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Open information extraction from the web</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>51</volume>
          (
          <issue>12</issue>
          ):
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Fundel</surname>
          </string-name>
          , Robert Küffner, and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Zimmer</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>RelEx-Relation extraction using dependency parse trees</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>23</volume>
          (
          <issue>3</issue>
          ):
          <fpage>365</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Gerner</surname>
          </string-name>
          , Goran Nenadic, and
          <string-name>
            <given-names>Casey</given-names>
            <surname>Bergman</surname>
          </string-name>
          .
          <year>2010a</year>
          .
          <article-title>Linnaeus: A species name identification system for biomedical literature</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>85</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Gerner</surname>
          </string-name>
          , Goran Nenadic, and
          <string-name>
            <given-names>Casey M.</given-names>
            <surname>Bergman</surname>
          </string-name>
          .
          <year>2010b</year>
          .
          <article-title>An exploration of mining gene expression mentions and their anatomical locations from biomedical text</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>72</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Juho</given-names>
            <surname>Heimonen</surname>
          </string-name>
          , Sampo Pyysalo, Filip Ginter, and
          <string-name>
            <given-names>Tapio</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Complex-to-pairwise mapping of biological relationships using a semantic network representation</article-title>
          .
          <source>In Proceedings of SMBM'08.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , Tomoko Ohta, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Corpus annotation for mining biomedical events from literature</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>10</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Overview of BioNLP'09 shared task on event extraction</article-title>
          .
          <source>In Proceedings of BioNLP 2009 Shared Task.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Krallinger</surname>
          </string-name>
          , Florian Leitner, and
          <string-name>
            <given-names>Alfonso</given-names>
            <surname>Valencia</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Assessment of the second BioCreative PPI task: Automatic extraction of protein-protein interactions</article-title>
          .
          <source>In Proceedings of BioCreative II</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Robert</given-names>
            <surname>Leaman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Graciela</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Banner: An executable survey of advances in biomedical named entity recognition</article-title>
          .
          <source>In Proceedings of PSB'08</source>
          , pages
          <fpage>652</fpage>
          -
          <lpage>663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Marie-Catherine</given-names>
            <surname>de Marneffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The Stanford typed dependencies representation</article-title>
          .
          <source>In COLING Workshop on Cross-framework and Cross-domain Parser Evaluation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>McClosky</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Charniak</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Self-Training for Biomedical Parsing</article-title>
          .
          <source>In Proceedings of ACL-HLT'08</source>
          , pages
          <fpage>101</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Tara</given-names>
            <surname>McIntosh</surname>
          </string-name>
          and
          <string-name>
            <given-names>James R.</given-names>
            <surname>Curran</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Reducing semantic drift with bagging and distributional similarity</article-title>
          .
          <source>In Proceedings of ACL/IJCNLP'09</source>
          , pages
          <fpage>396</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Miwa</surname>
          </string-name>
          , Rune Saetre, Yusuke Miyao, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Protein-protein interaction extraction by leveraging multiple kernels and parsers</article-title>
          .
          <source>International Journal of Medical Informatics</source>
          ,
          <volume>78</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e39</fpage>
          -
          <lpage>e46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Makoto</given-names>
            <surname>Miwa</surname>
          </string-name>
          , Sampo Pyysalo, Tadayoshi Hara, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A comparative study of syntactic parsers for event extraction</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Tomoko</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , Sampo Pyysalo,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Incorporating GENETAG-style annotation to GENIA corpus</article-title>
          .
          <source>In Proceedings of BioNLP'09</source>
          , pages
          <fpage>106</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Tomoko</given-names>
            <surname>Ohta</surname>
          </string-name>
          , Sampo Pyysalo, Makoto Miwa, Jin-Dong Kim, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Event extraction for post-translational modifications</article-title>
          .
          <source>In Proceedings of BioNLP'10</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Sampo</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and
          <string-name>
            <given-names>Tapio</given-names>
            <surname>Salakoski</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>BioInfer: A corpus for information extraction in the biomedical domain</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>8</volume>
          (
          <issue>50</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Sampo</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , Antti Airola, Juho Heimonen, and Jari Björne.
          <year>2008</year>
          .
          <article-title>Comparative analysis of five protein-protein interaction corpora</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 3</issue>
          ):
          <fpage>S6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Sampo</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          , Tomoko Ohta,
          <string-name>
            <given-names>Jin-Dong</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Static relations: a piece in the biomedical information extraction puzzle</article-title>
          .
          <source>In Proceedings of BioNLP'09</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Gerold Schneider,
          <string-name>
            <given-names>Kaarel</given-names>
            <surname>Kaljurand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Hess</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Romacker</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>An environment for relation mining over richly annotated corpora: The case of GENIA</article-title>
          .
          <source>In Proceedings of SMBM'06</source>
          , pages
          <fpage>68</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Rune</given-names>
            <surname>Saetre</surname>
          </string-name>
          , Kenji Sagae, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Syntactic features for protein-protein interaction extraction</article-title>
          .
          <source>In Proceedings of LBM'07</source>
          , pages
          <fpage>6.1</fpage>
          -
          <lpage>6.14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Lorraine</given-names>
            <surname>Tanabe</surname>
          </string-name>
          , Natalie Xie, Lynne H Thom, Wayne Matten, and
          <string-name>
            <given-names>W John</given-names>
            <surname>Wilbur</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>GENETAG: A tagged corpus for gene/protein named entity recognition</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>6</volume>
          (
          <issue>Suppl 1</issue>
          ):
          <fpage>S3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <collab>The Gene Ontology Consortium</collab>
          .
          <year>2000</year>
          .
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          ,
          <volume>25</volume>
          :
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Sofie</given-names>
            <surname>Van Landeghem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yvan</given-names>
            <surname>Saeys</surname>
          </string-name>
          , Bernard De Baets, and Yves Van de Peer.
          <year>2008</year>
          .
          <article-title>Extracting protein-protein interactions from text using rich feature vectors and feature selection</article-title>
          .
          <source>In Proceedings of SMBM'08.</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Akane</given-names>
            <surname>Yakushiji</surname>
          </string-name>
          , Yusuke Miyao, Yuka Tateisi, and
          <string-name>
            <given-names>Jun'ichi</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Biomedical information extraction with predicate-argument structure patterns</article-title>
          .
          <source>In Proceedings of SMBM'05</source>
          , pages
          <fpage>60</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>