<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Protein-Protein Interaction Abstract Identification with Contextual Bag of Words</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Background</title>
      <p>protein-protein interactions. We propose a novel feature representation scheme,
contextual-bag-of-words, to exploit protein name information.</p>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>Our method outperforms well-known methods that use protein name information as
additional features. We further improve performance by extracting reliable and
informative instances from unlabeled and likely positive data to provide additional
training data. We employ F-measure and the area under a receiver operating
characteristic curve (AUC) to measure the classification and ranking abilities,
respectively. Our final model achieves an F-measure of 80.34% and an AUC score of
88.06%, which are higher than those of the top-ranking system in BioCreAtIvE-II by
2.34% and 2.52%, respectively.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>These results show the effectiveness of our contextual-bag-of-words scheme and
suggest that our system could serve as an efficient preprocessing tool for modern PPI
database curation.</p>
      <sec id="sec-3-1">
        <title>Background</title>
        <p>Most biological processes, including metabolism and signal transduction, involve
large numbers of proteins and are usually regulated through protein-protein
interactions (PPI). It is therefore important to understand not only the functional roles
of the individual proteins involved but also the overall organization of each biological
process [1].</p>
        <p>Several experimental methods can be employed to determine whether a protein
interacts with another protein. Experimental results are published and then stored in
protein-protein interaction databases such as BIND [2] and DIP [3]. These PPI
databases are now essential for biologists to design their experiments or verify their
results since they provide a global and systematic view of the large and complex
interaction networks in various organisms.</p>
        <p>Initially, the results were mainly verified and added to the databases manually. Since
1990, the development of large-scale and high-throughput experimental technologies
such as immunoprecipitation and the yeast two-hybrid model has boosted the output
of new experimental PPI data exponentially [4]. Performing the associated curation
task on the formidable number of existing and emerging publications has become
impossible if it relies solely on human effort. Therefore, information retrieval and extraction tools
are being developed to help curators. These tools should be able to examine enormous
volumes of unstructured texts to extract potential PPI information. They usually adopt
a general approach: finding articles relevant to PPI first, and then extracting the
relevant information from them. In this paper, we focus on the first step.</p>
      <p>Most methods in this approach formulate the article-finding step as a text
classification (TC) task, in which articles relevant to PPI are denoted as positive
instances while irrelevant ones are denoted as negative. We refer to this task as the
PPI-TC task from now on. One advantage of this formulation is that the machine
learning (ML) methods commonly used in general TC systems, such as support vector
machines [5] or Bayesian approaches [6], can be modified and applied to the problem
of identifying PPI-relevant articles. In spite of this advantage, there are still two main
differences between PPI-TC and TC that might be the key challenges for further
improving the performance of PPI-TC systems. We discuss them in the following two
paragraphs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Words may own different meanings according to contexts</title>
      <p>In TC, documents are usually represented by a "bag of words" (BoW).
However, in PPI-TC, some words are informative only in certain contexts. For
example, "bind" is more informative in indicating if an abstract is PPI-relevant when
it appears in a sentence that has at least two protein names. Thus, including such
contextual information in the feature representation of PPI-TC is very important.</p>
    </sec>
    <sec id="sec-5">
      <title>The existence of likely data</title>
      <p>Unlike in general TC, where documents are either categorized as relevant or irrelevant
to some topic, the situation is more complicated in PPI-TC. The definition of
"PPI-relevant" varies with the database for which we curate. Most PPI databases define
their standard according to Gene Ontology, a taxonomy that classifies all kinds of
protein-protein interactions. Each PPI database may only annotate a subset of PPI
types; therefore, only some of these types will overlap with a different PPI database.
In PPI databases, each existing PPI record is associated with its literature source
(PMID). Figure 1 shows a PPI record of the MINT database. It shows that the article
with PubMed ID:11238927 contains information about the interaction between
P19525 and O75569, where P19525 and O75569 are the primary accession numbers
of two proteins in the UniProt database. These articles can be treated as PPI-relevant
and as true positive data. However, to employ mainstream machine-learning
algorithms and improve their efficacy in PPI-TC, there are still two major challenges.
The first is how to exploit the articles recorded in other PPI databases. Since other
databases may partially annotate the same PPI types as the target database, articles
recorded in them can be treated as likely positive (LP) data. If more effective training
data are included, feature weights will be calculated more accurately and the number
of unseen features will be reduced. Considering these articles may increase the
generality of the original model. The second challenge is a consequence of the first:
To use likely positive data we must collect corresponding likely negative (LN) data,
or the ratio of positive to negative data will become unbalanced. In the following
sections, we will describe how we tackle these two challenges and discuss why our
methods are effective for PPI-TC.</p>
      <sec id="sec-5-1">
        <title>Synopsis</title>
        <p>To increase the readability of this paper and introduce the terminology that will be
used in the Results, Discussion, and Conclusions sections, we here summarize the
major methods, datasets, and evaluation metrics used in our experiments.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Formulation and term weighting schemes</title>
      <p>In this paper, PPI-TC is formulated as a classification problem. Each document is
transformed to a feature vector and then classified as either PPI-relevant or -irrelevant.
We adopt support vector machines (SVM) as our classification model because their
efficacy has been demonstrated for binary classification tasks and they allow non-binary
values in feature vectors.</p>
      <p>Following the classical BoW feature representation, a document d is represented as a
term vector v, in which each dimension vi corresponds to a term ti. vi is calculated by a
term-weighting function, which is very important for SVM-based TC because SVM
models are sensitive to the data scale, i.e. they are dominated by some dimensions
with very wide ranges.</p>
      <p>In addition to the simplest binary features, which only indicate the existence of a word
in a document, there are currently numerous term-weighting schemes that utilize term
frequency (TF), inverse document frequency (IDF) or statistical metrics information.
Lan et al. [7] pointed out that the popularly-used term frequency-inverse document
frequency (TFIDF) method has not performed uniformly well with respect to different
data corpora. The traditional IDF factor and its variants were introduced to improve
the discriminating power of terms in the traditional information-retrieval field.
However, in TC, this may not be the case since the IDF factor neglects the category
information of the training set. Hence, they proposed two new supervised weighting
schemes, relative frequency (RF) and term frequency-relative frequency (TFRF), to
improve the term's discriminating power. In these functions, each term is assigned
more appropriate weights in terms of different categories.</p>
      <p>In Table 1, we list the symbols representing the number of positive and negative
documents that contain and do not contain term ti. With this table, the schemes stated
above can be defined as follows:</p>
      <p>Binary(ti, d) = 1 if ti ∈ d, and 0 otherwise;
TF_d(ti) = ti's term frequency in d divided by |d|;
TFIDF(ti, d) = TF_d(ti) ⋅ log((w + x + y + z) / (w + y)), and</p>
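      <p>As a concrete illustration (the helper names and counts are ours, not the paper's), the weighting schemes above can be sketched as follows, where w, x, y, z are the Table 1 document counts and w + y is the number of documents containing ti:</p>
      <preformat>
```python
import math

def binary_weight(tf):
    # Binary(ti, d): 1 if the term occurs in d, else 0.
    return 1 if tf > 0 else 0

def tf_weight(tf, doc_len):
    # TF_d(ti): raw term frequency normalized by document length |d|.
    return tf / doc_len

def tfidf_weight(tf, doc_len, w, x, y, z):
    # TFIDF(ti, d) = TF_d(ti) * log((w + x + y + z) / (w + y)),
    # with w, x, y, z the positive/negative document counts of Table 1.
    return tf_weight(tf, doc_len) * math.log((w + x + y + z) / (w + y))
```
      </preformat>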
    </sec>
    <sec id="sec-7">
      <title>Methods of exploiting contextual information</title>
      <p>A PPI abstract must contain some protein names. Hence, recognition of protein names
in abstracts can improve the identification of PPI abstracts. In the following
paragraphs, we describe the three methods that extend the classical BoW scheme,
including our proposed CBoW, along with the other two well-known methods, BoP
and BoN.</p>
      <sec id="sec-7-1">
        <title>Contextual bag of words (CBoW)</title>
        <p>The number of protein names that exists in the context affects a word’s
informativeness for PPI relevance. Based on this fact, we divide the original
word bag into different contextual bags. The words in individual sentences are
bagged according to the number of protein names (PNs) in the sentence. If there are no
PNs, the words are put into contextual Bag 0; if 1 PN, then Bag 1; and if 2 or more
PNs, then Bag 2.</p>
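        <p>A minimal sketch of this bagging scheme (the sentence splitter and PN recognizer are assumed to exist upstream; here PNs are passed in as a set of surface strings):</p>
        <preformat>
```python
def cbow_features(sentences, protein_names):
    # Assign each word to contextual Bag 0, 1, or 2 according to how many
    # protein names (PNs) occur in its sentence.
    features = []
    for sentence in sentences:
        words = sentence.split()
        pn_count = sum(1 for wd in words if wd in protein_names)
        bag = min(pn_count, 2)  # 0 PNs -> Bag 0, 1 PN -> Bag 1, 2+ PNs -> Bag 2
        for wd in words:
            features.append(f"bag{bag}:{wd.lower()}")
    return features
```
        </preformat>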
      </sec>
      <sec id="sec-7-2">
        <title>Bag of phrases (BoP)</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref4">8</xref>
          ] suggested that adding phrases into the original bag can help retain some order
information which is lost in BoW. In our case, we add PN phrases into the bag.
        </p>
      </sec>
      <sec id="sec-7-3">
        <title>Bag of normalized PNs (BoN)</title>
        <p>The more protein names that appear in an abstract, the more likely it is to be
PPI-relevant. Following [9], we replace each PN in a given abstract with “PROTEIN_i”,
where i denotes the order of appearance in this abstract. Abstracts containing different
numbers of PNs have different normalized PN features.</p>
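        <p>A sketch of this normalization (PNs are given as a list of surface strings, which is an assumption on our part, and ordered by first appearance):</p>
        <preformat>
```python
def normalize_pns(abstract, protein_names):
    # Replace each PN with PROTEIN_i, where i is its order of first
    # appearance in the abstract, following the BoN scheme.
    found = [(abstract.find(n), n) for n in protein_names if abstract.find(n) >= 0]
    found.sort()  # order PNs by their first position in the abstract
    result = abstract
    for i, (_, name) in enumerate(found, start=1):
        result = result.replace(name, f"PROTEIN_{i}")
    return result
```
        </preformat>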
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Utilizing the likely data</title>
      <p>The key steps of utilizing the likely data include selecting the most effective ones and
exploiting them for improving the PPI-TC model. For the first step, the LP data can
be collected from other PPI databases while the LN data are not available. Therefore,
collecting LP data is much easier than collecting LN data. In our method, we choose MEDLINE
abstracts in the Genomic TREC 2004 collection that are not recorded in major PPI
databases to be the LN data. This is because we observe that most MEDLINE abstracts
are not relevant to PPI. Then, the method described in the "Selecting the most effective
likely positive and negative data" subsection is employed to pick the most effective
likely data. The selected LP and LN data are denoted as LP* and LN* from now on.
For the second step, we employ the hierarchical model that is detailed in the
"Exploiting the selected likely positive and negative data" subsection.</p>
    </sec>
    <sec id="sec-9">
      <title>Datasets</title>
      <p>In our experiment, we use the dataset of the BioCreAtIvE II IAS subtask [1] because
the training set contains not only the true positive data (TP) and true negative data
(TN) but also the likely positive data (LP), which is essential for our PPI-TC
system. The TP (PPI-relevant) data were derived from the content of the IntAct [10]
and MINT [11] databases, which are not organism specific. TN data were also
provided by MINT and IntAct database curators. The LP data comprise a collection of
PubMed identifiers of articles that have been used to annotate protein interactions by
other interaction databases (namely BIND [2], HPRD [12], MPACT [13] and GRID
[14]). Note that this additional collection is a noisy dataset and thus not part of the
ordinary TP collection, as these additional databases may have different annotation
standards from MINT and IntAct (e.g. regarding the curation of genetic interactions).
We randomly selected 105,000 abstracts from the Genomic TREC 2004 collection to be
the LN data. This collection consists of ten years (1994 to 2003) of published MEDLINE
abstracts (4,591,008 records). The test set is a balanced dataset, which contains 338
and 339 abstracts for TP and TN respectively. According to BioCreAtIvE-II’s official
statement, the keyword set of the test set differs from that of the training set in order
to prevent over-fitting systems from achieving unfairly high scores. The size of each
dataset is shown in Table 2.</p>
    </sec>
    <sec id="sec-10">
      <title>Evaluation metrics</title>
      <p>We employ the official evaluation metrics of BioCreAtIvE II, which assess not only
the accuracy of classification but also the quality of ranking of relevant abstracts.</p>
      <sec id="sec-10-1">
        <title>Classification metrics</title>
        <p>The classification metrics examine the prediction outcome from the perspective of
binary classification. The value terms used in the following formulas are defined as
follows: True Positive (TP) represents the number of correctly classified relevant
instances, False Positive (FP) the number of incorrectly classified irrelevant instances,
True Negative (TN) the number of correctly classified irrelevant instances, and finally,
False Negative (FN) the number of incorrectly classified relevant instances.
The classification metrics used in our experiments are precision, recall and F-measure.
The F-measure is a harmonic average of precision and recall. These three metrics are
defined as follows:</p>
        <p>Precision = TP / (TP + FP), Recall = TP / (TP + FN)</p>
        <p>F-measure = (2 ⋅ Precision ⋅ Recall) / (Precision + Recall)</p>
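        <p>These definitions translate directly into code (a straightforward restatement of the formulas above):</p>
        <preformat>
```python
def precision(tp, fp):
    # Fraction of predicted-relevant abstracts that are truly relevant.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of truly relevant abstracts that are predicted relevant.
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```
        </preformat>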
      </sec>
      <sec id="sec-10-2">
        <title>Ranking metrics</title>
        <p>Curation of PPI databases requires a classifier to output a ranked list (as opposed to a
binary decision) of all testing instances based on the likelihood that they will be in the
positive class. The curators can then either specify a cutoff to filter out some articles
on the basis of their experience, or give higher priority to more highly ranked
instances.</p>
        <p>The ranking metric used in our experiments is AUC, the area under the receiver
operating characteristic curve (ROC curve). The ROC curve is a graph of the fraction
of true positives (TPR, true positive rate) vs. the fraction of false positives (FPR, false
positive rate) for a classification system given various cutoffs for output likelihoods,
where
TPR = TP / (TP + FN), FPR = FP / (FP + TN)
When the cutoff is lowered, more instances are considered positive. Hence, both TPR
and FPR increase since their numerators become larger while their denominators, the
total numbers of positive and negative instances respectively, remain constant. The more positive
instances that are ranked above the negative ones by the classification system, the
faster that TPR grows in relation to FPR as the cutoff descends. Consequently, higher
AUC values indicate more reliable ranking results.</p>
        <p>As shown in Table 3, CBoW improves the
performance of BoW regardless of the weighting schemes. These results suggest that
our idea of dividing the word bag according to a word’s context is effective. Notably,
the RF weighting function consistently outperforms the other two in all methods.
These results demonstrate RF’s appropriateness for both TC and PPI-TC.</p>
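        <p>The AUC metric described in this section can be computed directly from ranked scores; it equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (a standard rank-based formulation, not the paper's own code):</p>
        <preformat>
```python
def auc(scores_pos, scores_neg):
    # Fraction of (positive, negative) pairs where the positive instance
    # receives the higher score; ties count half.
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```
        </preformat>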
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Expanding the training set</title>
      <p>In this section, we examine the effects of adding LP* and LN*. Using the procedure
described in Methods (note: it is in the last section of this paper), we select 8,862
abstracts from the original LP dataset and 10,000 abstracts from the unlabeled data set
to form the LP* and LN* datasets, respectively.</p>
      <p>Without loss of generality, we use the CBoW feature representation scheme. Table 4
shows that irrespective of the weighting scheme used, adding the selected data
improves both the F-measure and AUC. These results suggest that exploiting LP and
unlabeled data not only refines the filtering accuracy but also the ranking quality
effectively, which is critical for PPI database curation. Similar to the results shown in
Table 3, RF also outperforms the other weighting schemes.</p>
    </sec>
    <sec id="sec-12">
      <title>Compared with BioCreAtIvE-II systems</title>
      <p>Table 5 compares our scores with the best and median scores in BioCreAtIvE-II. We
can see that our system performs better than BioCreAtIvE-II's best system and
significantly better than the BioCreAtIvE-II median system. These results suggest that
our system has state-of-the-art ability to filter out PPI-irrelevant abstracts and rank
PPI-relevant ones.</p>
      <sec id="sec-12-1">
        <title>Discussion</title>
        <p>In this section, we explain CBoW's effectiveness by illustrating and analyzing feature
weights in different contextual bags. First, we list the words whose
discriminative power is most enhanced by CBoW. In an SVM model, a feature's
discriminative power correlates positively with its weight. Therefore, we list the words
with the largest weight variances among all bags, as shown in Table 6. We can see
that these words are indeed highly related to PPI when they appear in
sentences with two or more PNs.</p>
        <p>To further explain how CBoW correctly identifies a PPI-relevant abstract, we exhibit
two examples in Table 7. The words in Table 6 are marked in italic. In addition,
protein names are underlined to indicate context types.</p>
        <p>The first example (PMID=9707401) is mislabeled by BoW since it contains a PPI keyword,
interaction. In CBoW, however, only occurrences located in sentences with
two or more protein names carry a high weight indicating an abstract’s PPI-relevance.
This is not the case in the first example, so it is correctly classified by CBoW as
PPI-irrelevant.</p>
        <p>The second example (PMID=16286467) is misclassified as PPI-irrelevant by BoW
because it does not contain top discriminative words such as interaction. However, in
CBoW, the weights of stimulation, regulated, and phosphorylation are significantly
enlarged. Therefore, it can be correctly identified as PPI-relevant.</p>
        <p>After examining the weights of individual words in different bags, we compare the
mean and standard deviation of weights for different bags (Table 8). We can see that
Bag 2 has the largest mean weight. This result is in accordance with our intuition that
words in Bag 2 have the strongest discriminative power.</p>
        <p>We then use Mann-Whitney’s rank sum test and F-test to test the equality of means
and variances of weights between any two bags. The p-values of all the tests are listed
in Table 9. An extremely small p-value (&lt;0.01) is considered strong support for the
significant difference between the two compared distributions. According to the test
results, we can see that the weights in Bag 2 and Bag 1 are significantly greater than
those in Bag 0. Also, the variance of weights in Bag 2 is significantly greater than in
Bag 1 and Bag 0, suggesting that the weights in Bag 2 range more widely, thus
making the features in Bag 2 more discriminative and dominant.</p>
      </sec>
      <sec id="sec-12-2">
        <title>Conclusions</title>
        <p>In this paper, we propose a novel CBoW feature representation scheme and
demonstrate its effectiveness over other methods that also exploit PN information in
PPI-TC. We also develop a method to extract likely positive and likely negative data
which is applicable to PPI-TC. Recently, many advanced document representation
schemes have been developed. Most of them were produced by incorporating
NLP-based features. [15] pointed out that these features can help disambiguate words in the
bag but did not find features that are generally effective. The results of our
experiments on BoP and BoN support this claim. In our method, we need to split the
feature space according to different types of contexts defined by domain knowledge.
Our study of the PPI-TC problem presents a potential new way of exploiting
NLP-based contextual information. In the future, we will examine the generality of this
idea by applying it to TC in other domains.</p>
        <p>When targeting the annotation standard of a specific PPI database, all other related
resources can be regarded as likely-positive. In this case, the complicated dataset
integration problem can be converted into a simple filtration. Also, we can extract
abundant likely-negative instances from unlimited unlabeled data to balance the
training data.</p>
        <p>With our methods, our PPI-TC system has higher F-score and AUC than the rank 1
system of these metrics in the BioCreAtIvE-II IAS challenge, which suggests that our
system can serve as an efficient preprocessing tool for curating modern PPI databases.</p>
      </sec>
      <sec id="sec-12-3">
        <title>Methods</title>
        <p>In this section, we first introduce the machine-learning model used in our system:
support vector machines. Secondly, we describe how our system filters out ineffective
likely-positive data and selects effective likely-negative data from unlabeled data.
Finally, we explain how we exploit the selected likely-positive and negative data.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>Support vector machines</title>
      <p>The support vector machine (SVM) model is one of the best-known ML models for handling
sparse, high-dimensional data, and it has proved useful for text
classification [16]. It finds a maximal-margin separating hyperplane &lt;w, φ(x)&gt;
- b = 0 that separates the training instances, i.e.,
min ||w||² + C ∑i ξ(i) subject to
y(i)(&lt;w, φ(x(i))&gt; - b) ≥ 1 - ξ(i), ∀i
where x(i) is the ith training instance, which is mapped into a high-dimensional space by
φ(⋅), y(i) ∈ {1, -1} is its label, ξ(i) denotes its training error, and C is the cost factor
(the penalty for misclassified data). The mapping function φ(⋅) and the cost factor C are
the main parameters of an SVM model.</p>
      <p>When classifying an instance x, the decision function f(x) indicates whether x lies "above"
or "below" the hyperplane. [17] shows that f(x) can be converted into an
equivalent dual form which can be computed more easily:</p>
      <p>primal form: f(x) = sign(&lt;w, φ(x)&gt; - b)
dual form: f(x) = sign ( ∑iα (i) y(i) K(x(i) , x) − b)
where K(x(i), x) = &lt;φ(x(i)), φ(x)&gt; is the kernel function and α(i) can be thought of as w's
transformation.</p>
      <p>In our experiment, we choose the linear kernel because the literature has shown
that this kernel is efficient and effective for TC:</p>
      <p>K(x(i), x(j)) = &lt;x(i), x(j)&gt;
which is equivalent to</p>
      <p>φ(x(i)) = x(i)
Finally, the cost factor C is chosen to be 1, which is fairly suitable for most problems.</p>
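      <p>With the linear kernel, the dual-form decision function above can be sketched in a few lines (a pure-Python illustration with toy support vectors, not the paper's implementation):</p>
      <preformat>
```python
def dot(a, b):
    # Linear kernel K(x_i, x) = dot(x_i, x).
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_decision(x, support_vectors, alphas, labels, b):
    # Dual form: f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) - b).
    s = sum(a * y * dot(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels)) - b
    return 1 if s > 0 else -1
```
      </preformat>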
    </sec>
    <sec id="sec-14">
      <title>Selecting the most effective likely positive and negative data</title>
      <p>The training set contains only limited numbers of true-positive (TP) and
true-negative (TN) data. To increase the generality of the classification model, more
external resources should be introduced. One important resource is another PPI
database; abundant PPI articles are recorded in various such databases. However,
most of them only annotate a selection of all the PPI types defined in Gene Ontology.
Therefore, some annotations may match the criteria of the target PPI database while
others may not. This means that abstracts annotated in that database can only be
treated as likely-positive examples, some of which may need to be filtered out.
Another problem is that negative data, or even likely-negative data, are not available from any
curated database. We will obtain a model with a bias toward positive prediction if only those
instances in the PPI databases are used because most machine-learning-based
classifiers tend explicitly or implicitly to record the prior distribution of
positive/negative labels in the training data. As explained in the introduction, an
imbalance in training data can cause serious problems. However, a large proportion of
the biomedical literature is negative, which is exactly the opposite. Therefore, more
likely-negative (LN) instances should be incorporated to balance the training data, and
this can be carried out in a manner similar to filtering out LP instances.
Liu et al. [18] provide a survey of these bootstrapping techniques, which iteratively
tag unlabeled examples and add those with high confidence to the training set.
In the filtering process, two criteria must be considered: reliability and
informativeness. We only retain sufficiently reliable instances, or the remainder will
confuse the final model.</p>
      <p>The informativeness of an instance is also important. We do not need additional
instances if they are absolutely positive or negative. Deciding their labels is trivial for
our initial classification model. In the terminology of SVM, they are not support
vectors since they contribute nothing to the decision boundary in training. In testing,
their output values by SVM are always greater than 1 or less than -1, which means
they are distant from the separating hyperplane. Therefore, we can discard such
uninformative instances to reduce the size of the training set without diminishing
performance.</p>
      <p>Following these criteria, we now illustrate our filtration process. The flowchart of the
whole procedure is shown in Figure 2. We use the initial model trained with TP+TN
to label the LP data we collected. Those abstracts in the original LP with an SVM
output in [γ+, 1] are retained. The dataset after filtering out irrelevant instances in LP
is referred to as ‘selected likely-positive data’ (LP*).</p>
      <p>The construction of selected likely-negative (LN*) data is similar. We collect 50k
unlabeled abstracts from the PubMed biomedical literature database and classify them
by our initial model. The articles with an SVM output in [-1, γ-] are collected into the
LN* dataset.</p>
      <p>The two thresholds γ+ and γ- are empirically determined to be 0 and -0.9, respectively.
We use a looser threshold to filter LP data because of our prior knowledge of their
reliability: after all, they have been recorded as PPI-relevant in some databases.</p>
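      <p>The filtering step can be sketched as follows (γ+ = 0 and γ- = -0.9 as determined above; `initial_model` stands for any callable returning the SVM output value, an assumption on our part):</p>
      <preformat>
```python
def select_likely_data(lp_abstracts, unlabeled_abstracts, initial_model,
                       gamma_pos=0.0, gamma_neg=-0.9):
    # LP*: LP abstracts whose SVM output falls in [gamma_pos, 1], i.e.
    # reliable yet still informative (outputs above 1 lie far from the
    # hyperplane and are discarded as uninformative).
    lp_star = [a for a in lp_abstracts
               if 1.0 >= initial_model(a) >= gamma_pos]
    # LN*: unlabeled abstracts whose SVM output falls in [-1, gamma_neg].
    ln_star = [a for a in unlabeled_abstracts
               if gamma_neg >= initial_model(a) >= -1.0]
    return lp_star, ln_star
```
      </preformat>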
    </sec>
    <sec id="sec-15">
      <title>Exploiting likely positive and negative Data</title>
      <p>The final issue is how to utilize these filtered instances. As shown in Figure 2, the
likely data (LP* + LN*) are used to train a SVM model, the ancillary model, which is
completely independent of the original training set. Subsequently, we use the ancillary
model to predict all TP and TN instances, though their labels are already known, and
these predicted values are scaled by a factor κ and encoded as additional features in
the final model. In this manner, the final model can assign a suitable weight to the
output of the ancillary model based on its accuracy in predicting the training set,
which is assumed to be close to the accuracy in predicting the test set. The scaling
factor κ can be regarded as a prior confidence in the ancillary model.
</p>
      <p>Cohen KB, Hunter L: Natural Language Processing and Systems Biology.
In: Artificial Intelligence and Systems Biology. Edited by Dubitzky W, Azuaje
F: Springer; 2005.</p>
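      <p>The feature-encoding step of the hierarchical model described above can be sketched as follows (the helper name and list-based vectors are illustrative assumptions, not the paper's implementation):</p>
      <preformat>
```python
def encode_with_ancillary(feature_vectors, ancillary_scores, kappa):
    # Append the ancillary model's prediction, scaled by kappa, as one
    # extra feature per instance; the final SVM then learns how much
    # weight to give the ancillary model's output.
    return [fv + [kappa * s]
            for fv, s in zip(feature_vectors, ancillary_scores)]
```
      </preformat>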
      <p>Donaldson I, Martin J, Bruijn Bd, Wolting C, Lay V, Tuekam B, Zhang S,
Baskin B, Bader GD, Michalickova K et al: PreBIND and Textomy –
mining the biomedical literature for protein-protein interactions using a
support vector machine. BMC Bioinformatics 2003, 4(11).</p>
      <p>Marcotte EM, Xenarios I, Eisenberg D: Mining literature for protein–
protein interactions. Bioinformatics 2001, 17(4):359-363.</p>
      <p>Lan M, Tan CL, Low H-B: Proposing a New Term Weighting Scheme for
Text Categorization. In: AAAI-06: 2006; 2006.</p>
      <p>Scott S, Matwin S: Feature engineering for text classification. In: ICML-99:
1999; 1999.</p>
      <p>Paradis F, Nie J-Y: Filtering Contents with Bigrams and Named Entities to
Improve Text Classification. In: AIRS-05: 2005; 2005.</p>
      <p>Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S,
Orchard S, Vingron M, Roechert B, Roepstorf P, Valencia A et al: IntAct: an
open source molecular interaction database. Nucleic Acids Res 2004,
32(Database issue):D452–D455.</p>
      <p>Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich
M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett 2002,
513(1):135-140.</p>
      <p>Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK,
Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M
et al: Development of Human Protein Reference Database as an Initial
Platform for Approaching Systems Biology in Humans. Genome Res 2003,
13:2363-2371.</p>
      <p>Güldener U, Münsterkötter M, Oesterheld M, Pagel P, Ruepp A, Mewes H-W,
Stümpflen V: MPact: the MIPS protein interaction resource on yeast.
Nucleic Acids Res 2006, 34(Database Issue):D436-D441.</p>
      <p>Breitkreutz B-J, Stark C, Tyers M: The GRID: the General Repository for
Interaction Datasets. Genome Biol 2003, 4(3).</p>
      <p>Moschitti A, Basili R: Complex linguistic features for text classification: A
comprehensive study. In: ECIR-04: 2004; 2004.</p>
      <p>Joachims T: Text Categorization with Support Vector Machines: Learning
with Many Relevant Features. In: ECML-98: 1998; 1998.</p>
      <p>Cristianini N, Shawe-Taylor J: An Introduction to Support Vector
Machines: Cambridge University Press; 2000.</p>
      <p>Liu B, Lee WS, Yu PS, Li X: Partially Supervised Classification of Text
Documents In: Proceedings of the Nineteenth International Conference on
Machine Learning (ICML-2002): 2002; 2002.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Krallinger</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valencia</surname>
            <given-names>A</given-names>
          </string-name>
          :
          <article-title>Evaluating the Detection and Ranking of Protein Interaction Relevant Articles: the BioCreative Challenge Interaction Article Sub-task (IAS)</article-title>
          . In: Second BioCreAtIvE Challenge Workshop:
          <year>2007</year>
          ;
          <year>2007</year>
          :
          <fpage>29</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bader</surname>
            <given-names>GD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betel</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogue</surname>
            <given-names>CW</given-names>
          </string-name>
          :
          <article-title>BIND: the Biomolecular Interaction Network Database</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2003</year>
          ,
          <volume>31</volume>
          (
          <issue>1</issue>
          ):
          <fpage>248</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Xenarios</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rice</surname>
            <given-names>DW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salwinski</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baron</surname>
            <given-names>MK</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcotte</surname>
            <given-names>EM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D E</surname>
          </string-name>
          :
          <article-title>DIP: the database of interacting proteins</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2000</year>
          ,
          <volume>28</volume>
          (
          <issue>1</issue>
          ):
          <fpage>289</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Scott</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matwin</surname>
            <given-names>S</given-names>
          </string-name>
          :
          <article-title>Feature engineering for text classification</article-title>
          . In: ICML-99:
          <year>1999</year>
          ;
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>