<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Meta-Learning for Escherichia Coli Bacteria Patterns Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hafida Bouziane</string-name>
          <email>h_bouziane@univ-usto.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Belhadri Messabih</string-name>
          <email>messabih@univ-usto.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdallah Chouarfia</string-name>
          <email>chouarfia@univ-usto.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>MB University</institution>
          ,
          <addr-line>BP 1505 El M'Naouer 3100 Oran</addr-line>
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <fpage>139</fpage>
      <lpage>150</lpage>
      <abstract>
<p>In the machine learning area, there has been great interest during the past decade in the theory of combining machine learning algorithms. The approaches proposed and implemented become increasingly attractive now that many challenging real-world problems remain difficult to solve, especially those characterized by imbalanced data. Learning from imbalanced datasets is problematic, since the uneven distribution of data influences the behavior of most machine learning algorithms and often leads to poor performance. It is on this type of data that our study focuses. In this paper, we investigate a meta-learning approach for classifying proteins into their various cellular locations based on their amino acid sequences. We evaluate, using cross-validation tests, a meta-learner system that combines the k-Nearest Neighbors (k-NN) algorithm as base-classifier, since it has shown good performance on this task as an individual classifier, with DECORATE as meta-classifier for classifying Escherichia coli bacteria proteins from amino acid sequence information. The paper also reports a comparison against Decision Tree induction as base-classifier. The experimental results show that the k-NN-based meta-learning model is more efficient than both the Decision Tree-based model and the individual k-NN classifier.</p>
      </abstract>
      <kwd-group>
        <kwd>Classification</kwd>
        <kwd>Meta-Learning</kwd>
        <kwd>Imbalanced Data</kwd>
        <kwd>Subcellular Localization</kwd>
        <kwd>E. coli</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Most current research projects in bioinformatics deal with structural
and functional aspects of genes and proteins. High-throughput genome
sequencing techniques have led to an explosion of newly generated protein
sequences. Nowadays, the function of a huge number of them is still
unknown. This challenge provides strong motivation for developing
computational methods that can infer a protein's function from its amino
acid sequence. Thus, many automated methods have been developed for
predicting protein structural and molecular properties such as domains,
active sites, secondary structure, interactions, and localization from the
amino acid sequence alone. One helpful step toward understanding, and
therefore elucidating, the biochemical and cellular function of proteins is to
identify their subcellular distributions within the cell. Most existing
predictors of protein localization sites assume that each protein in the cell
has one, and only one, subcellular location. In each cell compartment,
specific proteins fulfill specific roles that define their cellular function,
which is critical to a cell's survival. The knowledge of the compartment or
site in which a protein resides therefore allows its function to be inferred.
So far, many methods and systems have been developed to predict protein
subcellular locations, and one of the most thoroughly studied single-cell
organisms is the Escherichia coli (E.coli) bacterium.</p>
      <p>
        The first approach for predicting the localization sites of proteins from
their amino acid sequences was the rule-based expert system PSORT, developed
by Nakai and Kanehisa [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]; subsequently, the use of a probabilistic model by Horton
and Nakai [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which could learn its parameters from a set of training data,
significantly improved the prediction accuracy, achieving 81% on the E.coli
dataset. Later, the use of standard classification algorithms achieved higher
prediction accuracy; among these algorithms were the k-Nearest Neighbors
(k-NN), binary Decision Tree and Naïve Bayesian classifiers. The best
accuracy was achieved by the k-NN classifier: the classification of the E.coli
proteins into 8 classes reached an accuracy of 86% in cross-validation tests [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This significantly improved on the previously reported accuracy.
Since these works, many systems supporting automated prediction of
subcellular localization using a variety of machine learning techniques have
been proposed. With recent progress in this domain, various features of a
protein are considered, such as amino acid composition
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], pseudo amino acid composition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and dipeptide and physico-chemical properties
[
        <xref ref-type="bibr" rid="ref7">7,8</xref>
        ]. The performance of existing methods varies, and different prediction
accuracies are claimed. Most achieve high accuracy for the most
populated locations, but are generally less accurate on the locations
containing fewer specific proteins. Recently, there has been great interest
in the theory of combining classifiers to improve performance [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ]. Several
approaches, known as ensembles of classifiers (committee approaches), have
been proposed and investigated on a variety of artificial and real-world
datasets. The main idea behind them is that the ensemble often achieves
higher performance than any of its individual component classifiers. One can
distinguish two groups of methods: those that combine several
heterogeneous learning algorithms as base-level classifiers over the same
feature set [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ], such as stacking, grading and voting, and those that
construct homogeneous ensembles by applying a single learning algorithm
as base-classifier, sub-sampling the training sets or creating artificial data
to build several learning sets from the original
feature set, such as boosting [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ], bagging [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ] and Random Forests [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ].
In the protein localization site prediction problem, the data distribution is
often imbalanced. To the best of our knowledge, there are two major
approaches to the class imbalance problem: those that use resampling
methods and those that modify existing learning algorithms.
The resampling strategy balances the classes by adding artificial data to
improve the minority-class predictions of some classifiers. Here, we focus
on resampling methods, since they are the simplest way to increase the
size of the minority class. This article investigates the effectiveness of the
meta-learning approach DECORATE [14] in creating a meta-level dataset,
trained with a simple k-NN algorithm as base-classifier, for classifying
proteins into their subcellular locations on the E.coli benchmark dataset
using cross-validation, and compares the results with Decision Tree
induction as base-classifier.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 presents the
materials and methodology adopted, including a brief description of the
E.coli benchmark dataset and the evaluation measures used for performance
assessment. Section 3 then summarizes and discusses the experimental
results, and presents a comparison of Decision Tree induction against the
k-NN algorithm as base-classifiers for the meta-classifier DECORATE.
Finally, section 4 concludes this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Material and Methods</title>
      <p>2.1 E.coli Dataset</p>
      <p>
        The prokaryotic gram-negative bacterium Escherichia coli is an
important component of the biosphere; it colonizes the lower gut of animals
and humans. The Escherichia coli benchmark dataset has been submitted to
the UCI1 Machine Learning Data Repository [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ]. It is well described in
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1,2,3</xref>
        ]. The dataset patterns are characterized by attributes calculated from
the amino acid sequences. Protein patterns in the E.coli dataset are classified
into eight classes; it is a drastically imbalanced dataset of 336 patterns:
some classes contain more than 130 patterns while others contain only 2 or 5.
Each pattern has eight attributes (7 predictive and 1 name corresponding to
the accession number in the SWISSPROT2 database), where the predictive
attributes correspond to the following features: (1) mcg:
McGeoch's method for signal sequence recognition [
        <xref ref-type="bibr" rid="ref15">16</xref>
        ], where the signal sequence
is estimated by calculating a discriminant score using the length of the
N-terminal positively-charged region (H-region); (2) gvh: von Heijne's method [
        <xref ref-type="bibr" rid="ref16 ref17">17,18</xref>
        ] for signal sequence recognition, where the score estimating the
cleavage signal is evaluated using a weight matrix and the cleavage-site
consensus patterns to detect signal-anchor sequences; (3) lip: von Heijne's
Signal Peptidase II consensus sequence score; (4) chg: binary attribute
indicating the presence of charge on the N-terminus of predicted
lipoproteins; (5) aac: score of discriminant
1 Web site: http://archive.ics.uci.edu/ml
2 Web site: http://www.uniprot.org/
 
analysis of the amino acid content of outer membrane and periplasmic
proteins; (6) alm1: score of the ALOM membrane-spanning region prediction
program, which determines whether a segment is transmembrane or peripheral;
(7) alm2: score of the ALOM program after excluding putative cleavable
signal regions from the sequence.
      </p>
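As a concrete illustration, records in this whitespace-separated layout (sequence name, seven numeric features, class label) can be parsed with a few lines of Python. The sample lines below are made-up placeholders in the same shape, not entries copied from the actual UCI file:

```python
# Illustrative sketch: parse E.coli-style records (name, 7 features, class).
# The two sample rows are hypothetical, not real dataset entries.
SAMPLE = """\
AAT_ECOLI 0.49 0.29 0.48 0.50 0.56 0.24 0.35 cp
ACEA_ECOLI 0.07 0.40 0.48 0.50 0.54 0.35 0.44 im
"""

def parse_records(text):
    """Yield (name, features, label) triples from whitespace-separated rows."""
    for line in text.strip().splitlines():
        fields = line.split()
        name, label = fields[0], fields[-1]
        features = [float(v) for v in fields[1:-1]]
        yield name, features, label

records = list(parse_records(SAMPLE))
```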
      <p>
        Protein patterns in this dataset are organized as follows: 143 patterns of
cytoplasm (cp), 77 of inner membrane without signal sequence (im), 52 of
periplasm (pp), 35 of inner membrane with uncleavable signal sequence
(imU), 20 of outer membrane (om), 5 of outer membrane
lipoprotein (omL), 2 of inner membrane lipoprotein (imL) and 2
patterns of inner membrane with cleavable signal sequence (imS). The class
distribution is extremely imbalanced, especially for the imL and imS proteins.
2.2 Base-Classifiers
The problem considered here is multi-class. Let Q denote the number of
categories or classes, Q ≥ 3. Each object is represented by its description
x ∈ X, where X represents the feature space, and its category y ∈ Y, where Y
denotes the set of Q categories and can be identified with the set of category
indices: Y = {1, …, Q}. The assignment of descriptions to categories is
performed by a classifier. The chosen classifiers are described in the
following subsections.
2.2.1 k-Nearest Neighbors Classifier
The k-nearest neighbors (k-NN) rule [
        <xref ref-type="bibr" rid="ref18">19</xref>
        ] is considered a lazy approach. It
is one of the oldest and simplest supervised learning algorithms. Objects are
assigned to the class of the majority of their k nearest neighbors in the
training set. Usually, the Euclidean distance is used as the distance metric.
Given a test example x of unknown class, the algorithm assigns to x
the class that is most frequent among the k training examples nearest to
that query example, according to the distance metric. The classification
accuracy of the k-NN algorithm can be improved significantly if the distance
metric is learned with specialized algorithms, and many studies try to
improve k-NN performance by taking this factor into account. In
practice, k is usually chosen to be odd. The best choice of this parameter
depends on the data of the problem at hand. This algorithm has
shown good performance on biological and medical data classification
problems.
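The rule just described can be sketched in a few lines of Python, using Euclidean distance and a majority vote over the k nearest training pairs (a toy illustration, not the experimental implementation used in this study):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training examples.
    `train` is a list of (features, label) pairs; Euclidean distance is used."""
    neighbors = sorted(train, key=lambda fx: math.dist(fx[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage with two well-separated classes:
train = [([0.1, 0.2], "cp"), ([0.2, 0.1], "cp"),
         ([0.9, 0.8], "om"), ([0.8, 0.9], "om")]
knn_predict(train, [0.15, 0.15], k=3)  # → "cp"
```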
2.2.2 Decision Tree Induction
A Decision Tree [
        <xref ref-type="bibr" rid="ref19">20</xref>
        ] is a powerful form of knowledge representation. The
model produced by a decision tree classifier is represented as a tree
structure. The principle consists in building decision trees by recursively
selecting attributes on which to split; the criterion used for selecting an
attribute is information gain. A leaf node indicates the class of the examples.
Instances are classified by sorting them down the tree from the root node
to a leaf node. Posterior probabilities are estimated by the class
frequencies of the training set in each leaf node. In this study, we used a
decision tree built by C4.5 [
        <xref ref-type="bibr" rid="ref20">21</xref>
        ].
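The information gain criterion mentioned above can be sketched as follows: the entropy of the label distribution is compared before and after a candidate split (an illustrative sketch of the selection criterion, not the C4.5 implementation itself):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy reduction achieved by splitting `labels` into `partitions`."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder

# A perfectly separating split removes all uncertainty:
labels = ["cp", "cp", "om", "om"]
information_gain(labels, [["cp", "cp"], ["om", "om"]])  # → 1.0
```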
2.3 Meta-Classifier
Meta-learners such as Boosting, Bagging and Random Forests provide
diversity by sub-sampling or re-weighting the existing training examples [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ].
DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial
Training Examples) works by adding randomly constructed examples to
the training set when building new ensemble members (the committee). It was
conceived on the basis of a diversity measure introduced by its authors,
which expresses an ensemble member's disagreement with the
ensemble's prediction. If Cj is an ensemble member classifier, Cj(x) the class
label predicted by the classifier Cj for the example x and C*(x) the prediction
of the ensemble, the diversity dj of Cj on the example x is defined as follows:

dj(x) = 0 if Cj(x) = C*(x), and dj(x) = 1 otherwise. (1)

The diversity of an ensemble of M members, on a training set of N examples,
is computed as follows:

D = (1 / (N · M)) Σj=1..M Σi=1..N dj(xi). (2)
The approach consists in constructing an ensemble of classifiers that
maximizes the diversity measure D. Three parameters are needed: the
artificial size, which is a fraction of the original training set; the desired
number of member classifiers; and the maximum number of iterations to
perform. Initially, the ensemble contains the base-classifier trained on the
original data. The members added to the ensemble in successive iterations
are trained on the original training data combined with some artificial data.
To generate the artificial training examples, called diversity data, the
algorithm takes into account the specified fraction of the training set size.
The class labels assigned to the diversity data differ maximally from the
current predictions of the committee (completely opposite labels). The
current classifier is added to the committee if it increases the ensemble
diversity; otherwise it is rejected. The process is repeated until the desired
committee size is reached or the number of iterations equals the fixed
maximum. Each classifier Cj of the committee C* provides probabilities for
the class membership of each example to classify. If PCj,k(x) represents the
probability, estimated by the classifier Cj, that x belongs to the class
labeled k, then to classify an example x the algorithm selects the most
probable class as the label for x:

C*(x) = argmax k∈Y Pk(x), (3)

where Pk(x), the probability that x belongs to the class labeled k computed
over the entire ensemble, is expressed as:

Pk(x) = ( ΣCj∈C* PCj,k(x) ) / |C*|. (4)
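The loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the base learner is a 1-NN stand-in, artificial features are drawn uniformly from [0, 1) rather than from a distribution fitted to the training data, and the committee votes by simple majority instead of averaging class probabilities as in Eq. (3)-(4):

```python
import random
from collections import Counter

def train_1nn(data):
    """Base learner stand-in: a 1-nearest-neighbour 'model' that memorises
    its (features, label) training pairs."""
    def predict(x):
        return min(data, key=lambda fx: sum((a - b) ** 2 for a, b in zip(fx[0], x)))[1]
    return predict

def ensemble_predict(committee, x):
    # Majority vote: a simplification of the probability averaging of Eq. (3)-(4).
    return Counter(member(x) for member in committee).most_common(1)[0][0]

def diversity(committee, data):
    # Fraction of (member, example) pairs disagreeing with the ensemble, as in Eq. (2).
    disagreements = sum(member(x) != ensemble_predict(committee, x)
                        for member in committee for x, _ in data)
    return disagreements / (len(committee) * len(data))

def decorate(data, ensemble_size=5, max_iters=20, r_size=0.5, seed=0):
    rng = random.Random(seed)
    labels = sorted({y for _, y in data})
    committee = [train_1nn(data)]          # start from the base-classifier
    for _ in range(max_iters):
        if len(committee) >= ensemble_size:
            break
        # Diversity data: random feature vectors labelled in maximal
        # disagreement with the current committee prediction.
        artificial = []
        for _ in range(max(1, int(r_size * len(data)))):
            x = [rng.random() for _ in data[0][0]]
            wrong = [y for y in labels if y != ensemble_predict(committee, x)]
            artificial.append((x, rng.choice(wrong)))
        candidate = train_1nn(data + artificial)
        before = diversity(committee, data)
        committee.append(candidate)
        if diversity(committee, data) < before:
            committee.pop()                # reject: diversity decreased
    return committee

toy = [([0.1, 0.2], "cp"), ([0.2, 0.1], "cp"),
       ([0.9, 0.8], "om"), ([0.8, 0.9], "om")]
committee = decorate(toy, ensemble_size=3)
```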
In this paper, we performed two sets of experiments. In the first, we used
the k-NN classifier as base-classifier. In the second, we used a Decision
Tree as base-classifier, as in the original DECORATE design. Our goal was
to empirically evaluate the two models on the E.coli dataset. For this
purpose, we proceeded in two steps for each set of experiments. In the first
step, we evaluated the two individual classifiers on the E.coli dataset using
cross-validation; in the second step, we assessed the prediction performance
of the meta-learning system, also using cross-validation. For all experiments,
we made preliminary trials to select the appropriate parameters (model
selection).
2.4 Evaluation Measures
Any results obtained by machine learning algorithms must be evaluated
before one can have any confidence in their classifications; this aspect of
machine learning theory is not only useful but fundamental. There are several
standard methods for evaluation. In what follows, we present only the
measures used in this study.
2.4.1 Cross Validation
In this study, we used cross-validation tests to evaluate classifier
robustness; this methodology is well suited to avoiding biased results. The
whole training set was divided into five mutually exclusive and
approximately equal-sized subsets, and for each subset used for testing, the
classifier was trained on the union of all the other subsets. Cross-validation
was thus run five times for each classifier, and the average value over the
five runs was calculated to estimate the overall classification
accuracy.
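The five-fold procedure above amounts to the following partitioning scheme (an illustrative sketch of the fold construction):

```python
def k_fold_splits(examples, k=5):
    """Partition `examples` into k near-equal folds; yield (train, test) pairs
    where each fold serves once as the test set and the rest as training data."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Toy usage: 10 examples, 5 folds of 2 examples each.
splits = list(k_fold_splits(list(range(10)), k=5))
```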
2.4.2 Classification Accuracy Measurements
Some of the most relevant evaluation measures are precision, recall and
F-measure. In this study, we adopted these three measures for evaluating the
effectiveness of the classification for each class, and the classification
accuracy over all classes, as performance measures. A confusion matrix
(contingency table) of size Q×Q has been used, M = (mkl)1≤k,l≤Q, where mkl
denotes the number of examples observed in class k and classified in class l.
The rows indicate the different observed classes and the columns show the
result of the classification method for each class. The number of correctly
classified examples is the sum of the diagonal elements of the matrix; all the
others are incorrectly classified. The F-measure has two components: the
Recall and the Precision. The Recall is the ratio of the number of correctly
classified (positive) examples of class k to the number of all observed
examples of class k. We can express this ratio using the confusion matrix
elements as follows:

Recallk = mkk / Σl=1..Q mkl. (5)

The Precision is the ratio of the number of correctly classified examples of
class k to the number of examples assigned to class k; it can be formulated
as follows:

Precisionk = mkk / Σl=1..Q mlk. (6)

The F-measure is then defined as:

F-measurek = 2 · Precisionk · Recallk / (Precisionk + Recallk). (7)

The classification accuracy is the ratio of the number of all correctly
classified examples to the total number of examples (both positive and
negative); it is given by:

Accuracy = Σk=1..Q mkk / Σk=1..Q Σl=1..Q mkl. (8)
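These measures can be computed directly from the confusion matrix M = (mkl); a minimal sketch, assuming all denominators are nonzero:

```python
def per_class_metrics(m, k):
    """Recall, precision and F-measure for class k from confusion matrix m,
    where m[k][l] counts examples observed in class k and classified as l
    (Eq. (5)-(7)). Assumes row k and column k are not all zero."""
    q = len(m)
    recall = m[k][k] / sum(m[k][l] for l in range(q))
    precision = m[k][k] / sum(m[l][k] for l in range(q))
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

def accuracy(m):
    """Overall accuracy: sum of diagonal elements over the grand total (Eq. (8))."""
    q = len(m)
    return sum(m[k][k] for k in range(q)) / sum(sum(row) for row in m)

# Toy 2-class confusion matrix: rows = observed class, columns = predicted class.
m = [[8, 2],
     [1, 9]]
accuracy(m)  # → 0.85
```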
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Results</title>
      <p>In this section we report the results of each experiment, highlighting
the evaluation measure values at each step. The most important evaluation
values are shown in bold typeface. It is important to note that adding
training instances, which is a defining characteristic of DECORATE,
increases training time. This is noticeable when a large number of classifiers
is required for the ensemble and a large amount of artificial data must be
created for learning the meta-classifier.
The tables report the results of the ensembles versus the individual
classifiers. In this experiment, we applied 5-fold cross-validation: the E.coli
dataset is randomly partitioned into approximately equally sized subsets.
Table 1 and Table 3 summarize the test performance of each individual
classifier for each class. Table 2 and Table 4 give the number of patterns
obtained for each class using DECORATE-based k-NN (Dk-NN) and
DECORATE-based C4.5 (DC4.5). The best results for k-NN were obtained
when setting k=9. An improvement is observed for several classes, including
the imU proteins, whereas no improvement has been observed for the two
minority classes, namely imL and imS, which are the most difficult to
classify.
Table 3 and Table 5 show that the Decision Tree used as an individual
classifier performs more poorly than the individual k-NN. However, in
Table 4 an improvement is clearly observed for the cp, im and om proteins.
Not surprisingly, Dk-NN gives better results than DC4.5, which confirms
once again its power in this context. It is important to note that even the
ensembles Dk-NN and DC4.5 fail to classify pp and imU with high
confidence, and fail completely for imL and imS. The influence of the
ensemble size needed for the meta-classifier on the performance of the
two ensembles Dk-NN and DC4.5 is shown in Fig. 1.
The results reported in this study show that the classification of
inner membrane lipoprotein (imL) and inner membrane with cleavable
signal sequence (imS) proteins failed for every classifier, and consequently
also for Dk-NN and DC4.5. This is caused by the extremely low number
of examples in these classes (one example used for training and one example
for testing). On the other hand, outer membrane lipoprotein (omL)
proteins were classified with a 100% success rate by the k-NN classifier and
by both Dk-NN and DC4.5. The cytoplasm (cp) proteins were relatively well
classified by almost all classifiers. Fig. 2 highlights the test performance of
each classifier and clearly shows the superiority of the ensembles Dk-NN and
DC4.5 in classifying E.coli patterns. Finally, it should be emphasized that
these results are better than those obtained by combining heterogeneous
classifiers with a majority voting rule, for which an average classification
success of 88.3% was reported.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>
        The majority voting results are reported in [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ]. Nevertheless, all these results prove that combining classifiers
is indeed a fruitful strategy.
      </p>
      <p>More recently, several ensemble learning algorithms have emerged that
have different strengths regardless of the type of data involved in the
problem at hand, and it is often difficult to make an effective choice among
them. Protein cellular localization site prediction is among the most
challenging problems in modern computational biology. Various approaches
have been proposed and applied to solve this problem, but the extremely
imbalanced distribution of proteins over the cellular locations makes the
prediction much more difficult. In this study, we applied DECORATE
ensemble learning, investigating two standard machine learning approaches,
to improve the performance of classifying E.coli proteins into their cellular
locations based on their amino acid sequences. The experiments show that
the k-NN-based meta-learning model outperforms the individual k-NN
classifier and achieves better classification accuracy than the Decision
Tree-based model. Further investigations will be carried out to provide a
much more improved ensemble model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Nakai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanehisa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Expert system for predicting protein localization sites in gram-negative bacteria</article-title>
          .
          <source>Proteins: Structure, Function, and Genetics</source>
          .
          <volume>11</volume>
          ,
          <fpage>95</fpage>
          -
          <lpage>110</lpage>
          (
          <year>1991</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Nakai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanehisa</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A knowledge base for predicting protein localization sites in eukaryotic cells</article-title>
          .
          <source>Genomics</source>
          .
          <volume>14</volume>
          ,
          <fpage>897</fpage>
          -
          <lpage>911</lpage>
          (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Horton</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A probabilistic classification system for predicting the cellular localization sites of proteins</article-title>
          .
          <source>In: Proceedings of Intelligent Systems in Molecular Biology</source>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>115</lpage>
          . St. Louis, USA (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Horton</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Better prediction of protein cellular localization sites with the k Nearest Neighbors classifier</article-title>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>152</lpage>
          . AAAI Press, Halkidiki, Greece
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Nakashima</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikishawa</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Discrimination of intracellular and extracellular proteins using amino acid composition and residue pair frequencies</article-title>
          .
          <source>J. Mol. Biol</source>
          .
          <volume>238</volume>
          ,
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>K. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanehisa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs</article-title>
          .
          <source>Bioinformatics</source>
          .
          <volume>19</volume>
          ,
          <fpage>1656</fpage>
          -
          <lpage>1663</lpage>
          (
          <year>2003</year>
          ). [7]
          <string-name>
            <surname>Sarda</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chua</surname>
            ,
            <given-names>G. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <volume>6</volume>
          ,
          <issue>152</issue>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Rashid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghava</surname>
            ,
            <given-names>G. P. S.</given-names>
          </string-name>
          :
          <article-title>Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <volume>8</volume>
          ,
          <issue>337</issue>
          (
          <year>2007</year>
          ).   
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Dietterich</surname>
            ,
            <given-names>T. G.</given-names>
          </string-name>
          :
          <article-title>Ensemble methods in machine learning</article-title>
          . In:
          <string-name>
            <surname>Kittler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (eds.),
          <source>First International Workshop on Multiple Classifier Systems, LNCS</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer-Verlag (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Wolpert</surname>
            ,
            <given-names>D. H.</given-names>
          </string-name>
          :
          <article-title>Stacked generalization</article-title>
          .
          <source>Neural Networks</source>
          .
          <volume>5</volume>
          ,
          <fpage>241</fpage>
          -
          <lpage>259</lpage>
          (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          :
          <article-title>Experiments with a new boosting algorithm</article-title>
          . In:
          <string-name>
            <surname>Saitta</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (Ed.),
          <source>Proceedings of the Thirteenth International Conference on Machine Learning (ICML96)</source>
          . pp.
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Bagging predictors</article-title>
          .
          <source>Machine Learning</source>
          .
          <volume>24</volume>
          (
          <issue>2</issue>
          ),
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L. I.</given-names>
          </string-name>
          :
          <article-title>Rotation forest: A new classifier ensemble method</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          .
          <volume>28</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1619</fpage>
          -
          <lpage>1630</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Melville</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Constructing diverse classifier ensembles using artificial training examples</article-title>
          . In:
          <source>The Eighteenth International Joint Conference on Artificial Intelligence</source>
          , pp.
          <fpage>505</fpage>
          -
          <lpage>510</lpage>
          . Acapulco, Mexico (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merz</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>UCI repository of machine learning databases</article-title>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <surname>McGeoch</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donald</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rixon</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          :
          <article-title>Sequence determination and genetic content of the short unique region in the genome of herpes simplex virus type 1</article-title>
          .
          <source>J. Mol. Biol.</source>
          .
          <volume>181</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          (
          <year>1985</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <surname>von Heijne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A new method for predicting signal sequence cleavage sites</article-title>
          .
          <source>Nucleic Acids Research</source>
          .
          <volume>14</volume>
          ,
          <fpage>4683</fpage>
          -
          <lpage>4690</lpage>
          (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <surname>von Heijne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>The structure of signal peptides from bacterial lipoproteins</article-title>
          .
          <source>Protein Engineering</source>
          .
          <volume>2</volume>
          ,
          <fpage>531</fpage>
          -
          <lpage>534</lpage>
          (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Cover</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          :
          <article-title>Nearest neighbor pattern classification</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          .
          <volume>13</volume>
          (
          <issue>1</issue>
          ),
          <fpage>21</fpage>
          -
          <lpage>27</lpage>
          (
          <year>1967</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olshen</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stone</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          :
          <source>Classification and Regression Trees</source>
          . Chapman &amp; Hall, Monterey, CA (
          <year>1984</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>C4.5: Programs for Machine Learning</source>
          . Morgan Kaufmann Publishers, San Mateo, CA (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Bouziane</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Messabih</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chouarfia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Voting-Based Combination System for Protein Cellular Localization Sites Prediction</article-title>
          . In:
          <source>IEEE International Conference on Information and Computer Applications (ICICA)</source>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>173</lpage>
          ,
          Dubai
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>