<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emidio Capriotti*§ and Russ B. Altman*‡</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Stanford University</institution>
          ,
          <addr-line>Stanford (CA)</addr-line>
          ,
          <country country="US">United States of America</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics and Computer Sciences, University of Balearic Islands</institution>
          ,
          <addr-line>Palma de Mallorca</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Departments of Bioengineering</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>and Genetics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of the prediction can be further improved by introducing new features derived from the protein three-dimensional structure. Results: In this paper, we present a structure-based machine learning approach to predict disease-related SAPs. We trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. The SVM input features are derived from protein sequence, structure and function information. After dataset balancing, the structure-based method reaches an overall accuracy of 84%, a correlation coefficient of 0.67, and an area under the receiver operating characteristic curve (AUC) of 0.91. Compared with a similar sequence-based predictor, the structure-based method increases the overall accuracy and the AUC by about 3%, and the correlation coefficient by 0.06. Conclusion: This work demonstrates that structural information can increase the accuracy of detecting disease-related SAPs. Our results also quantify the magnitude of the improvement on a large dataset. This improvement is in agreement with previously observed results in the prediction of protein stability change upon mutation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>protein mutation [4-16]. These algorithms are able to predict the protein stability change
[10, 11, 16], the variation in protein functional activity [6] and the onset of human
pathologies [4, 5, 7-9, 12-15]. The majority of the methods rely on information derived
from protein sequence [4, 8, 9, 14]; others use protein structure data [12, 17] or
knowledge-based information [7, 13, 15]. In this paper we focus on SAPs and
present a new machine learning based method that predicts disease-related SAPs by
combining protein sequence, structural and functional information. We also quantify the
improvement in performance resulting from the use of protein structure information.</p>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
    </sec>
    <sec id="sec-3">
      <title>Performance of the method</title>
      <p>In the last decades machine learning approaches have been successfully used to
address several biological problems and to develop new prediction methods. We modified
a previously developed predictor by introducing new three-dimensional structure
information. In particular, we use new features to describe the structural environment of
the mutation, considering a shell of 6 Å radius around the C-α of the mutated residue.
To quantify the improvement in accuracy resulting from the use of 3D structure information, we
compare the performance of a structure-based method (SVM-3D) with that of a
sequence-based one (SVM-SEQ). Tab. 1 reports different accuracy measures for both
predictors. The structure-based method results in 3% better overall accuracy and 0.06
better correlation. Comparing the ROC curves (Fig. 1 A), SVM-3D results in 0.02 better
Area Under the Curve (AUC) with respect to SVM-SEQ. At a false positive rate of 10%,
SVM-3D yields 6% more true positives. The output returned by the SVM has been
used to calculate the Reliability Index (RI) and to filter the predictions. If only predictions
with RI&gt;5 are selected, the SVM-3D method reaches 90% overall accuracy and a 0.81 correlation
coefficient on 74% of the whole dataset (see Fig. 1 B). Analyzing the predictions of the
SVM-SEQ and SVM-3D methods, we found that their outputs agree in 88% of the cases. On this
subset the overall accuracy is 86% and the correlation coefficient is 0.73.
For the remaining 12% of the predictions, the SVM-SEQ method results in a very poor
overall accuracy and correlation, 37% and -0.25 respectively. SVM-3D performs slightly
better than a random predictor, resulting in 63% overall accuracy and a 0.25 correlation
(see Tab. 2).</p>
    </sec>
    <sec id="sec-4">
      <title>Structure environment analysis</title>
      <p>Protein three-dimensional structural information is an important feature for predicting the
effect of SAPs. The analysis of the protein structure provides information about the
environment of the mutation. In fact, the effect of a mutation depends on whether
the mutated residue is buried in the hydrophobic core or exposed on the surface of
the protein. In Fig. 2 panel A the distributions of the relative solvent accessible area
(RSA) for disease-related and neutral variants are plotted. The two distributions have
mean RSA values of 20.6 and 35.7 for disease-related and neutral variants, respectively.
They are significantly different: the Kolmogorov-Smirnov test
returns a p-value of 2.8×10<sup>-71</sup>. We calculated the overall accuracy and correlation
coefficient of our method after dividing the dataset into 10 bins according to the RSA value of the
mutated residue. The SVM-3D method shows better performance in the prediction of
buried (RSA&lt;20) and highly exposed (RSA&gt;80) residues (see Fig. 2 panel B).</p>
    </sec>
    <sec id="sec-5">
      <title>Scoring the residue interactions</title>
      <p>Protein three-dimensional structure information is needed to calculate the
interactions between residues that are far apart in the sequence but close in 3D space. We defined
two types of interactions: lost interactions, which are those of the wild-type residue that are
missing after the mutation, and new interactions, which are those formed by the mutant residue.
In this section we compare the frequencies of lost and new interactions associated with disease-related
or neutral mutations. We calculated the log-odds score for lost and new interactions, shown respectively in
panels A and B of Fig. 3. According to these results, the most deleterious lost
contacts are Cys-Cys interactions, and newly formed Trp-Trp interactions are
the most damaging ones. A missing Cys-Cys interaction could correspond to the loss of a
disulfide bond, and the mutation of a residue into a Tryptophan close to another
Tryptophan could result in stereo-chemical problems.</p>
      <p>An example of a missing Cys-Cys interaction is observed in the mutation of
Cys163 in Glycosylasparaginase (Swiss-Prot:ASPG_HUMAN). This mutation is
responsible for the onset of Aspartylglucosaminuria (MIM:208400). Looking at
the protein structure (Fig. 4), we found that the mutation of Cys163 to Serine results
in the loss of the disulfide bridge between Cys163 and Cys179 (respectively Cys140 and
Cys156 in the PDB structure 1APY chain A). An interesting example of a possibly damaging
newly formed interaction is observed in the Thyroid hormone receptor
(Swiss-Prot:THB_HUMAN), where the mutation of Arg243 into Tryptophan causes
Thyroid hormone resistance (MIM:188570,274300). Analyzing the protein structure
(1 !" chain A) we found that the new Tryptophan could be close to another one in
position 239. This mutation could result in stereo-chemical problems in the pocket
around position 243 (see Fig. 5). Both examples are correctly predicted by the
structure-based method and wrongly predicted by the sequence-based algorithm.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We developed a new machine learning approach based on protein structure information
to predict the effect of SAPs. The method has been compared to a previously developed
sequence-based predictor to quantify the increase in accuracy achieved with protein
structure information. Using a balanced set of 6,630 mutations, the structure-based
method results in about 3% higher accuracy and AUC and 0.06 higher correlation with
respect to the sequence-based one. Although the increase in accuracy is not extremely
high, the introduction of structure information can be particularly useful in specific
situations, providing insight into the disease mechanism as in the cases discussed
above. The prediction improvement is in agreement with the results previously observed
in the prediction of the protein stability change upon mutation [10].</p>
    </sec>
    <sec id="sec-7">
      <title>Methods</title>
    </sec>
    <sec id="sec-8">
      <title>Datasets</title>
      <p>The performance of machine learning methods strongly depends on the training set.
For this reason, the selection of a representative set of SAPs is a pivotal issue in
the development of predictive algorithms. A previous analysis of different SAP
databases has shown that the annotated set of variants from the SwissVar database is the best
available one [18]. According to this observation, we selected our set of SAPs from
SwissVar release 57.9 (Oct 2009) and mapped all the variants onto the protein structures
available in the Protein Data Bank (PDB) [19]. To reduce the number of sequence
alignments between Swiss-Prot sequences and sequences derived from the PDB, we
used a precompiled list of correspondences between Swiss-Prot and PDB codes available
at the ExPASy web site. Using this list we aligned each pair of sequences with the BLAST
algorithm [20] and filtered out alignments with: i) gaps, ii) sequence identity lower than
100% and iii) length shorter than 40 residues. The remaining alignments were used to calculate
the correspondence between the Swiss-Prot and PDB residue numbering. When a
mutation maps to more than one protein structure, the one with the best resolution is
selected. After this filtering procedure we obtained a set of 4,986 mutations from 784
protein chains. The dataset of variants mapped onto protein structures is composed of
3,342 disease-related SAPs and 1,644 neutral polymorphisms. To keep the dataset
balanced, we doubled the number of neutral variants by considering their reverse mutations
as neutral. The final set consists of 6,630 mutations, about equally distributed between
disease-related and neutral SAPs.</p>
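      <p>The balancing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this work: the Variant record, its field names and the add_reverse_neutrals function are hypothetical, and only the counts are taken from the text above.</p>
      <preformat><![CDATA[
```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical record for one SAP; field names are assumptions."""
    protein: str  # protein identifier
    wild: str     # wild-type residue, one-letter code
    pos: int      # residue position
    mutant: str   # mutant residue, one-letter code
    label: str    # "D" (disease-related) or "N" (neutral)


def add_reverse_neutrals(variants):
    """Return the input plus the reverse of every neutral variant, labeled neutral."""
    reversed_neutrals = [
        Variant(v.protein, v.mutant, v.pos, v.wild, "N")
        for v in variants
        if v.label == "N"
    ]
    return list(variants) + reversed_neutrals


# Counts reported in the text: disease-related and neutral variants.
n_disease, n_neutral = 3342, 1644
balanced_total = n_disease + 2 * n_neutral
```
]]></preformat>
      <p>With the reported counts, 3,342 + 2 × 1,644 gives the 6,630 mutations of the final set.</p>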
    </sec>
    <sec id="sec-9">
      <title>Implemented SVM-based predictors</title>
      <p>The task is to predict whether a given single amino acid polymorphism is
neutral or disease-related. It is treated as a binary classification problem for the
protein upon mutation. The Support Vector Machine (SVM) input features for the
structure-based predictor include: the amino acid mutation, the structural
environment of the mutation, the features derived from the sequence profile, and a function-based log-odds
score calculated considering the GO classification. The final input vector consists of 48
elements:</p>
      <list list-type="bullet">
        <list-item>
          <p>20 components encoding the mutation (Mut)</p>
        </list-item>
        <list-item>
          <p>21 components encoding the local protein structure information (3D)</p>
        </list-item>
        <list-item>
          <p>5 input features derived from the sequence profile (Prof)</p>
        </list-item>
        <list-item>
          <p>2 elements encoding the number of GO terms associated with the protein and the GO log-odds score (LGO).</p>
        </list-item>
      </list>
      <p>A similar sequence-based SVM predictor has been used to measure the increase in
accuracy resulting from the use of protein three-dimensional structure information. The
structure-based SVM differs only in the 21-element vector encoding the local protein
structure environment (3D), which replaces the 20-element vector encoding the
sequence environment. More details about the SVM input features are given
in the supplementary materials.</p>
    </sec>
    <sec id="sec-10">
      <title>Interaction score</title>
      <p>The residue interactions are defined considering all the residues within a shell of
6 Å radius around the C-α of the mutated residue. On this basis we calculate a log-odds
score by dividing the frequency of the interactions of a given type associated with disease by the
frequency of the same type of interactions that have no pathological effect.</p>
      <p>Although mutations could be responsible for protein structural changes, as a first
approximation we assume that the position of the C-α of the new residue does not change
significantly after the mutation. Hence, we consider as new interactions those between the
mutant residue and the residues previously interacting with the wild-type residue. A score of the
possible damaging effect of lost or new interactions is calculated as follows:
<disp-formula><label>[1]</label><tex-math><![CDATA[LC_k = \log_2\left[\frac{f(c_k(i,j),D)}{f(c_k(i,j),N)}\right] ]]></tex-math></disp-formula>
where f(c<sub>k</sub>(i,j),D) and f(c<sub>k</sub>(i,j),N) are the frequencies of contacts between residues i and j
for disease-related (D) and neutral (N) variants respectively, and k is equal to l or n
for lost and new interactions respectively.</p>
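      <p>As an illustration of the score above, the following Python sketch computes the log-odds for each contact type from two lists of residue pairs, one for disease-related and one for neutral variants. It is a sketch under assumptions, not the implementation used in this work: the function name, the representation of a contact as an ordered (mutated-site residue, neighbouring residue) pair, the normalization of counts by the class total, and the choice to skip contact types absent from either class are all introduced here for illustration.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def log_odds_scores(disease_contacts, neutral_contacts):
    """Compute log2[f(c, D) / f(c, N)] for each contact type seen in both classes.

    Each argument is a list of (residue_i, residue_j) pairs of one kind only
    (all lost, or all new). Frequencies are counts divided by the class total,
    which is an assumption of this sketch.
    """
    d_counts = Counter(disease_contacts)
    n_counts = Counter(neutral_contacts)
    d_total = sum(d_counts.values())
    n_total = sum(n_counts.values())
    scores = {}
    # Contact types missing from one class are skipped to avoid log(0).
    for contact in d_counts.keys() & n_counts.keys():
        f_d = d_counts[contact] / d_total
        f_n = n_counts[contact] / n_total
        scores[contact] = math.log2(f_d / f_n)
    return scores


# Toy data, invented for the example only.
disease = [("CYS", "CYS")] * 3 + [("ALA", "LEU")]
neutral = [("CYS", "CYS")] + [("ALA", "LEU")] * 3
scores = log_odds_scores(disease, neutral)
```
]]></preformat>
      <p>A positive score marks a contact type that is more frequent among disease-related variants, a negative score one that is more frequent among neutral variants.</p>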
    </sec>
    <sec id="sec-11">
      <title>Accuracy measures</title>
      <p>The performance of our methods is evaluated using a 20-fold cross-validation
procedure on the whole SAP dataset. The dataset has been divided keeping the ratio of
disease-related to neutral mutations similar to the original
distribution of the whole set. Furthermore, all the proteins in the dataset are clustered
according to their sequence similarity with the blastclust program in the BLAST suite [20],
adopting the default length coverage of 0.9 and a percentage
similarity threshold of 30%. We kept all the mutations belonging to a protein in the
same set to avoid overestimating the performance. Classical accuracy measures such
as the overall accuracy (Q2), the sensitivity (S), the probability of correct predictions (P),
the Matthews correlation coefficient (C), the false and true positive rates (FPR, TPR)
and the area under the ROC curve (AUC) are used to score the performance of our
predictors. A Reliability Index (RI) score has been calculated to select more reliable
predictions. More details about the definition of the statistical indexes used in this work are
provided in the supplementary materials.</p>
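      <p>The idea of keeping related mutations in the same cross-validation set can be sketched as follows. This is a simplified illustration, not the procedure used in this work: the grouped_folds function and its greedy assignment are assumptions, the cluster identifiers stand in for the output of a clustering step such as blastclust, and the sketch does not preserve the ratio between disease-related and neutral variants.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def grouped_folds(cluster_ids, n_folds=20):
    """Assign each sample to a fold so that samples sharing a cluster share a fold.

    Clusters are dealt out largest first to the currently smallest fold,
    a simple greedy heuristic chosen for this sketch.
    """
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    fold_sizes = [0] * n_folds
    fold_of = {}
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for cluster, indices in ordered:
        target = fold_sizes.index(min(fold_sizes))  # smallest fold so far
        fold_sizes[target] += len(indices)
        for index in indices:
            fold_of[index] = target
    return [fold_of[i] for i in range(len(cluster_ids))]


# Toy example: six mutations from three protein clusters, two folds.
folds = grouped_folds(["a", "a", "b", "c", "c", "c"], n_folds=2)
```
]]></preformat>
      <p>Because whole clusters are assigned at once, no protein contributes mutations to both the training and the test side of a split.</p>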
    </sec>
    <sec id="sec-12">
      <title>Acknowledgments</title>
      <p>EC acknowledges support from the Marie Curie International Outgoing Fellowship
program (PIOF-GA-2009-237225). RBA would like to acknowledge the following funding
sources: NIH LM05652 and GM61374.</p>
    </sec>
    <sec id="sec-13">
      <title>References</title>
      <p>Capriotti E, Calabrese R, Casadio R: Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics 2006, 22(22):2729-2734.</p>
      <p>Capriotti E, Fariselli P, Casadio R: I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 2005, 33(Web Server issue):W306-310.</p>
      <p>Guerois R, Nielsen JE, Serrano L: Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 2002, 320(2):369-387.</p>
      <p>Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A: LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 2005, 21(12):2814-2820.</p>
      <p>Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P: Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 2009, 25(21):2744-2750.</p>
      <p>Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res 2001, 11(5):863-874.</p>
      <p>Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res 2002, 30(17):3894-3900.</p>
      <p>Capriotti E, Fariselli P, Rossi I, Casadio R: A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9 Suppl 2:S6.</p>
      <p>Wong WS, Yang Z, Goldman N, Nielsen R: Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 2004, 168(2):1041-1051.</p>
      <p>Care MA, Needham CJ, Bulpitt AJ, Westhead DR: Deleterious SNP prediction: be mindful of your training data! Bioinformatics 2007, 23(6):664-672.</p>
      <p>Berman H, Henrick K, Nakamura H, Markley JL: The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res 2007, 35(Database issue):D301-303.</p>
      <p>Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. <xref ref-type="bibr" rid="ref4">Nucleic Acids Res 1997</xref>, 25(17):3389-3402.</p>
    </sec>
    <sec id="sec-18">
      <p>Fig. 3 Log-odds score for lost residue interactions (A) and for newly formed interactions (B). The
red zones correspond to damaging lost or new interactions. Blue points correspond to neutral
interactions.</p>
      <p>Tab. 1 Accuracy measures of the SVM-SEQ and SVM-3D predictors.
The accuracy measures are defined in the supplementary materials. D and N stand for disease-related
and neutral variants respectively.</p>
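      <table-wrap>
        <label>Tab. 2</label>
        <caption>
          <p>SEQ∩3D indicates the subset of predictions on which the two methods agree; SEQ-3D and 3D-SEQ are respectively the
predictions of SVM-SEQ and SVM-3D on the subset on which they disagree. The accuracy
measures are defined in the supplementary materials. PM is the percentage of the dataset. D and N stand
for disease-related and neutral variants respectively.</p>
        </caption>
        <table>
          <thead>
            <tr><th></th><th>Q2</th><th>P[D]</th><th>S[D]</th><th>P[N]</th><th>S[N]</th><th>C</th><th>AUC</th><th>PM</th></tr>
          </thead>
          <tbody>
            <tr><td>SEQ∩3D</td><td>0.86</td><td>0.85</td><td>0.89</td><td>0.88</td><td>0.84</td><td>0.73</td><td>0.92</td><td>88</td></tr>
            <tr><td>SEQ-3D</td><td>0.63</td><td>0.66</td><td>0.65</td><td>0.60</td><td>0.60</td><td>0.25</td><td>0.68</td><td>12</td></tr>
            <tr><td>3D-SEQ</td><td>0.37</td><td>0.40</td><td>0.35</td><td>0.34</td><td>0.40</td><td>-0.25</td><td>0.40</td><td>12</td></tr>
          </tbody>
        </table>
      </table-wrap>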
      <sec id="sec-18-1">
        <title>Supplementary Material: Improving the prediction of disease-related variants using protein three-dimensional structure</title>
        <p>Departments of Bioengineering* and Genetics‡, Stanford University, Stanford (CA), United States
of America; §Department of Mathematics and Computer Sciences, University of Balearic Islands,
Palma de Mallorca, Spain.</p>
        <p>{emidio, russ.altman}@stanford.edu</p>
      </sec>
      <sec id="sec-18-2">
        <title>Support Vector Machine (SVM) input features</title>
        <p>The SVM-based methods developed in this work take as input the following features:
i) residue mutation; ii) protein sequence profile; iii) functional score based on Gene
Ontology (GO) terms and iv) either the sequence or the structure environment of the mutation.</p>
        <sec id="sec-18-2-1">
          <title>Encoding residue mutation</title>
          <p>The input vector encoding the mutation consists of 20 values, one for each of the 20 residue
types. The mutation is defined explicitly by setting to -1 the element corresponding to the wild-type
residue and to 1 the element corresponding to the newly introduced residue (all the remaining elements are kept
equal to 0).</p>
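          <p>A minimal Python sketch of this encoding is given below. The alphabetical ordering of the one-letter residue codes and the encode_mutation function name are assumptions made for the example; the text does not specify the order of the 20 elements.</p>
          <preformat><![CDATA[
```python
# Residue ordering is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_mutation(wild, mutant):
    """Return the 20-element vector: -1 at the wild-type residue, 1 at the mutant."""
    if wild == mutant:
        raise ValueError("wild-type and mutant residues must differ")
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(wild)] = -1
    vector[AMINO_ACIDS.index(mutant)] = 1
    return vector


# Example: a Cys to Ser substitution.
cys_to_ser = encode_mutation("C", "S")
```
]]></preformat>
          <p>Swapping the two arguments flips the signs, which is how a reverse mutation would be represented with this encoding.</p>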
        </sec>
        <sec id="sec-18-2-2">
          <title>Encoding mutation structure environment</title>
          <p>The protein structural environment is encoded with a 21-element vector. The first 20
elements encode the number of residues of each type that have at least one heavy
atom within a shell around the C-α of the mutated residue. After an optimization
procedure, a shell of 6 Å radius has been selected. The 21st element is the relative
solvent accessible area calculated using the DSSP program [1].</p>
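          <p>The counting of neighbouring residues can be sketched in Python as follows. This is an illustration under assumptions: the structure_environment function, the input layout (a list of residue names with their heavy-atom coordinates, excluding the mutated residue itself), the residue ordering and the inclusive distance cutoff are all choices made here, and the RSA value is taken as given rather than computed.</p>
          <preformat><![CDATA[
```python
import math

# Residue ordering is an assumption of this sketch.
RESIDUES = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
            "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
INDEX = {name: i for i, name in enumerate(RESIDUES)}


def structure_environment(center_ca, neighbours, rsa, radius=6.0):
    """Build the 21-element vector: 20 residue-type counts plus the RSA.

    center_ca  -- (x, y, z) of the C-alpha of the mutated residue
    neighbours -- list of (residue_name, [heavy-atom (x, y, z), ...])
    rsa        -- relative solvent accessible area, e.g. derived from DSSP output
    """
    counts = [0] * len(RESIDUES)
    for name, heavy_atoms in neighbours:
        if name not in INDEX:
            continue  # non-standard residues are ignored in this sketch
        # A residue counts once if any of its heavy atoms lies within the shell.
        if any(math.dist(center_ca, atom) <= radius for atom in heavy_atoms):
            counts[INDEX[name]] += 1
    return counts + [rsa]


# Toy coordinates, invented for the example only.
env = structure_environment(
    (0.0, 0.0, 0.0),
    [("CYS", [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]),
     ("TRP", [(7.0, 0.0, 0.0)]),
     ("ALA", [(0.0, 6.0, 0.0)])],
    rsa=20.6,
)
```
]]></preformat>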
        </sec>
        <sec id="sec-18-2-3">
          <title>Encoding mutation sequence environment</title>
          <p>The 20 input values for the mutation sequence environment (one for each of the 20
residue types) encode the number of residues of each type
found inside a window that is centered at the residue undergoing mutation and
symmetrically spans the sequence to the left (N-terminus) and to the right (C-terminus)
with a length of 19 residues [2].</p>
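          <p>A possible Python sketch of the window count is shown below. The sequence_environment function, the 0-based position, the inclusion of the central residue in the counts and the truncation of the window at the sequence ends are assumptions of this example, as is the residue ordering.</p>
          <preformat><![CDATA[
```python
# Residue ordering is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_environment(sequence, position, window=19):
    """Count residue types in a window centered at a 0-based position.

    The window is truncated at the sequence ends and includes the central
    residue; both choices are assumptions of this sketch.
    """
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    counts = [0] * len(AMINO_ACIDS)
    for residue in sequence[start:end]:
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown residue symbols are skipped
            counts[index] += 1
    return counts


# Toy sequence containing each residue type once.
toy = "ACDEFGHIKLMNPQRSTVWY"
centered = sequence_environment(toy, 10)
at_start = sequence_environment(toy, 0)
```
]]></preformat>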
        </sec>
        <sec id="sec-18-2-4">
          <title>Encoding sequence profile information</title>
          <p>For each mutation we derive: the frequency of the wild-type residue, the frequency of the mutated
residue, the number of totally and locally aligned sequences and a conservation index
(CI) for the position at hand: the more functionally important a residue is, the more it is
conserved over evolution [3]. The conservation index is calculated as:
<disp-formula><tex-math><![CDATA[CI(i) = \left[\sum_{a=1}^{20} \left(f_a(i) - f_a\right)^2\right]^{1/2} ]]></tex-math></disp-formula>
where f<sub>a</sub>(i) is the relative frequency of residue a at mutated position i and f<sub>a</sub> is the overall
frequency of the same residue in the alignment. The sequence profile is computed from
the output of the BLAST program [4] run on the UniRef90 database (Oct 2009)
(E-value threshold=10<sup>-9</sup>, number of runs=1).</p>
        </sec>
        <sec id="sec-18-2-5">
          <title>Functional based score</title>
          <p>The Gene Ontology log-odds score (LGO) provides information about the correlation
between a given mutation type (disease-related or neutral) and the protein function. The
annotation data are taken from the GO Database (version Mar 2010) and are retrieved from
the web resource hosted at the European Bioinformatics Institute (EBI). To calculate the
LGO, we first derived the GO terms from all three branches (molecular function,
biological process and cellular component) for all the proteins in the dataset. For each
annotated term the appropriate ontology tree was traversed upward to retrieve all the
parent terms with the GO-TermFinder tool (http://search.cpan.org/dist/GO-TermFinder/)
[5], counting each GO term only once. The log-odds score associated with each protein is
calculated as:</p>
          <p><disp-formula><tex-math><![CDATA[LGO = \sum \log_2\left[\frac{f_{GO}(D)}{f_{GO}(N)}\right] ]]></tex-math></disp-formula>
where f<sub>GO</sub> is the frequency of occurrence of a given GO term for the disease-related (D)
and neutral (N) mutations, adding one pseudo-count to each class, and the sum runs over the GO
terms associated with the protein. To prevent
overfitting, the LGO scores are evaluated using f<sub>GO</sub> values computed over the
training sets only, without including the GO term counts of the corresponding test set.</p>
        </sec>
        <sec id="sec-18-2-6">
          <title>Support Vector Machine software</title>
          <p>The LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) has been used for the
SVM implementation [6]. The selected SVM kernel is a Radial Basis Function (RBF)
kernel K(x<sub>i</sub>,x<sub>j</sub>)=exp(-γ||x<sub>i</sub>-x<sub>j</sub>||<sup>2</sup>), and the γ and C parameters are optimized by performing a
grid-like search. After input rescaling, the values of the best parameters are C=8 and
γ=0.03125.</p>
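          <p>For readers unfamiliar with the RBF kernel, the following Python sketch evaluates it directly for two feature vectors. It only illustrates the kernel formula with the γ value reported above and does not reproduce LIBSVM itself; the rbf_kernel function name is introduced here for the example.</p>
          <preformat><![CDATA[
```python
import math


def rbf_kernel(x_i, x_j, gamma=0.03125):
    """Evaluate K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for two equal-length vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum kernel value of 1.
same = rbf_kernel([0.0, 1.0], [0.0, 1.0])
# Vectors at unit distance give exp(-gamma).
unit = rbf_kernel([1.0, 0.0], [0.0, 0.0])
```
]]></preformat>
          <p>The reported optimum corresponds to powers of two (C=2<sup>3</sup>, γ=2<sup>-5</sup>), which is typical of a grid search carried out on a logarithmic scale.</p>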
        </sec>
      </sec>
      <sec id="sec-18-3">
        <title>Statistical indexes for accuracy measure</title>
        <p>The prediction accuracy is scored with several measures. In this paper the efficiency of
our predictors has been scored using the following statistical indexes.</p>
        <p>The overall accuracy is:
<disp-formula><tex-math><![CDATA[Q2 = \frac{P}{N} ]]></tex-math></disp-formula>
where P is the total number of correctly predicted mutations and N is the total number of
mutations. The Matthews correlation coefficient C is defined as:</p>
        <p><disp-formula><tex-math><![CDATA[C(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{D} ]]></tex-math></disp-formula>
where D is the normalization factor:
<disp-formula><tex-math><![CDATA[D = \left[(p(s)+u(s))\,(p(s)+o(s))\,(n(s)+u(s))\,(n(s)+o(s))\right]^{1/2} ]]></tex-math></disp-formula>
for each class s (D and N stand for disease-related and neutral mutations respectively);
p(s) and n(s) are the total numbers of correct predictions and correctly rejected
assignments, respectively, and u(s) and o(s) are the numbers of false negatives and false
positives for the class s.</p>
        <p>The coverage S (sensitivity) for each discriminated class s is evaluated as:
<disp-formula><tex-math><![CDATA[S(s) = \frac{p(s)}{p(s) + u(s)} ]]></tex-math></disp-formula>
where p(s) and u(s) are the same as in Equation 5.</p>
        <p>The probability of correct predictions P (or positive predictive value) is computed as:
<disp-formula><tex-math><![CDATA[P(s) = \frac{p(s)}{p(s) + o(s)} ]]></tex-math></disp-formula>
where p(s) and o(s) are the same as in Equation 5; P(s) ranges from 0 to 1.
For each prediction a reliability index (RI) is calculated from O(D), the SVM output.
Other standard scoring measures, such as the area
under the ROC curve (AUC) and the true positive rate (TPR= S(s)) at 10% of False
Positive Rate (FPR= 1-P(s)) are also computed [7].</p>
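        <p>The indexes above can be computed from the four counts of a class with a short Python function. The sketch below is illustrative only: the class_measures function name and the example counts are invented, and returning 0 for the correlation when the normalization factor vanishes is a convention adopted here.</p>
        <preformat><![CDATA[
```python
import math


def class_measures(p, n, u, o):
    """Compute Q2, S(s), P(s) and C(s) for one class s.

    p -- correct predictions of class s
    n -- correctly rejected assignments
    u -- false negatives (under-predictions)
    o -- false positives (over-predictions)
    """
    total = p + n + u + o
    normalization = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return {
        "Q2": (p + n) / total,
        "S": p / (p + u),
        "P": p / (p + o),
        # Returning 0 when the normalization factor is 0 is a convention of this sketch.
        "C": (p * n - u * o) / normalization if normalization else 0.0,
    }


# Invented counts for a toy set of 100 predictions.
measures = class_measures(p=40, n=45, u=10, o=5)
```
]]></preformat>
        <p>For a two-class problem the correlation coefficient takes the same value for both classes, since exchanging the classes swaps p with n and u with o.</p>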
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Kabsch</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
            <given-names>C</given-names>
          </string-name>
          :
          <article-title>Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features</article-title>
          .
          <source>Biopolymers</source>
          <year>1983</year>
          ,
          <volume>22</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2577</fpage>
          -
          <lpage>2637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Capriotti</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calabrese</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casadio</surname>
            <given-names>R</given-names>
          </string-name>
          :
          <article-title>Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information</article-title>
          .
          <source>Bioinformatics</source>
          <year>2006</year>
          ,
          <volume>22</volume>
          (
          <issue>22</issue>
          ):
          <fpage>2729</fpage>
          -
          <lpage>2734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Pei</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishin</surname>
            <given-names>NV</given-names>
          </string-name>
          :
          <article-title>AL2CO: calculation of positional conservation in a protein sequence alignment</article-title>
          .
          <source>Bioinformatics</source>
          <year>2001</year>
          ,
          <volume>17</volume>
          (
          <issue>8</issue>
          ):
          <fpage>700</fpage>
          -
          <lpage>712</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
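          <string-name>
            <surname>Altschul</surname>
            <given-names>SF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            <given-names>TL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaffer</surname>
            <given-names>AA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipman</surname>
            <given-names>DJ</given-names>
          </string-name>
          :
          <article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
          .
          <source>Nucleic Acids Res</source>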
          <year>1997</year>
          ,
          <volume>25</volume>
          (
          <issue>17</issue>
          ):
          <fpage>3389</fpage>
          -
          <lpage>3402</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>Bioinformatics</source>
          <year>2004</year>
          ,
          <volume>20</volume>
          (
          <issue>18</issue>
          ):
          <fpage>3710</fpage>
          -
          <lpage>3715</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Neural Comput</source>
          <year>2001</year>
          ,
          <volume>13</volume>
          (
          <issue>9</issue>
          ):
          <fpage>2119</fpage>
          -
          <lpage>2147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Baldi</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunak</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chauvin</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andersen</surname>
            <given-names>CA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nielsen</surname>
            <given-names>H</given-names>
          </string-name>
          :
          <article-title>Assessing the accuracy of prediction algorithms for classification: an overview</article-title>
          .
          <source>Bioinformatics</source>
          <year>2000</year>
          ,
          <volume>16</volume>
          (
          <issue>5</issue>
          ):
          <fpage>412</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>