A Multiple Classifier System for fast and accurate learning in the Neural Network context

E. F. Romero1, R. M. Valdovinos2, R. Alejo3, J. R. Marcial-Romero2, J. A. Carrasco-Ochoa4

Universidad Autónoma del Estado de Mexico
1 Centro Universitario Valle de Chalco, Hermenegildo Galena #3, Col. Ma. Isabel, Valle de Chalco, Mexico
2 Facultad de Ingeniería, Ciudad Universitaria, Cerro de Coatepec s/n, Toluca, Mexico
3 Tecnológico de Estudios Superiores de Jocotitlán, Carretera Toluca-Atlacomulco km 44.8, Col. Ejido de San Juan y San Agustín, Jocotitlán, Mexico
4 Instituto Nacional de Astrofísica, Óptica y Electrónica
{eliasfranck, li_rmvr, ralejoll}@hotmail.com, jrmarcialr@uaemex.mx, ariel@inaoep.mx

Abstract. Nowadays, Multiple Classifier Systems (MCS), also called ensembles of classifiers, committees of learners or mixtures of experts, constitute a well-established research field in Pattern Recognition and Machine Learning. An MCS is built either by dividing the whole problem with resampling methods or by using different models constructed over a single data set. A similar approach is studied in the Neural Network context with the Modular Neural Network. The main difference between these approaches is the processing cost associated with the training step of the Modular Neural Network (in its classical form), since each module has to be trained with the whole data set. In this paper, we analyze the performance of a Modular Neural Network and of a Multiple Classifier System whose individual members are small Modular Neural Networks, in order to identify the advantages of each one. The experiments were carried out on data sets from real problems and show the effectiveness of the Multiple Classifier System in terms of overall accuracy and processing time with respect to using a single Modular Neural Network.

Keywords: Artificial Neural Networks, Modular Neural Networks, Mixture of Experts, Linear Perceptron.

1 Introduction

Modular Neural Networks (MNN) represent a new trend in Neural Network (NN) architectural design. They are motivated by the highly modular nature of biological networks and are based on the "divide and conquer" approach [1]. The MNN bases its structure on the idea of cooperative or competitive work, fragmenting the problem into modules, where each module deals with a part of the whole problem [10]. Some advantages of this network with respect to other models are:

1. Learning speed. The number of iterations needed to train the individual modules is smaller than the number of iterations needed to train a non-modular NN for the same task [5].
2. Data processing. The MNN is useful when working with different data sources [2], or when the data have been preprocessed with different techniques.
3. Knowledge distribution. In an MNN, the network modules tend to specialize by learning from different regions of the input space [5], and the modules can be trained independently and in parallel.

There exist several implementations of the MNN, and the most important difference among them refers to the nature of the gating network. In some cases, the gating network corresponds to a single neuron that evaluates the performance of the other expert modules [5]; others are based on a NN trained with a data set different from the one used for training the expert networks [2]; finally, all modules, including the integrating module, may be trained with the same data set [6].
On the other hand, a multiple classifier system (MCS), also known as an ensemble of classifiers, a committee of learners, etc., is a set of individual classifiers whose decisions are combined when classifying new patterns. There are several reasons for combining multiple classifiers to solve a given learning problem. First, the MCS tries to exploit the different local behavior of the individual classifiers to improve the accuracy of the overall system. Second, in some cases the MCS might not be better than the single best classifier, but it can diminish or eliminate the risk of picking an inadequate single classifier. Finally, due to the limited representational capability of learning algorithms, it is possible that the classifier space considered for the problem does not contain the optimal classifier.

To ensure a high performance of the MCS, it is necessary to have enough diversity in the individual decisions and an acceptable individual accuracy for each member of the MCS.

Some aspects in which the MCS aims to improve upon a single classifier are [7]: the MCS takes advantage of the combined decision over the individual classifier decisions; the correlated errors of the individual components can be eliminated when the global decision is considered; the training patterns may not provide enough information to select the best classifier; the learning algorithm may be unsuitable for the problem; and, finally, the individual search space may not contain the objective function.

In this paper, a comparative study is carried out that aims to show the advantages of both methods, the MNN and the MCS, when used for classification tasks. In the first method (MNN), each member corresponds to a linear perceptron, whereas in the MCS each individual classifier corresponds to a single MNN; that is to say, the MCS is an ensemble built from MNNs.

2 Modular Neural Network

The MNN, also called committee of systems, Hierarchical Mixture of Experts or Hybrid System [6], bases its modular structure on the modularity of the human nervous system, in which each brain region has a specific function but, at the same time, the regions are interconnected. Therefore, we can say that an ANN is modular if the computation performed by the network can be decomposed into two or more modules or subsystems that work independently on the whole problem or on a part of it. Each module corresponds to a feed-forward artificial neural network, and the modules can be regarded as neurons of the network as a whole.

In its most basic implementation, all modules are of the same type [5], [2], but different schemes can be used. In the classical architecture, all modules, including the gating module, have n input units, that is, one per feature in the sample. The number of output neurons in the expert networks is equal to the number of classes c, whereas in the gating network it is equal to the number of experts r [6] (Fig. 1).

Fig. 1. Representation of the MNN architecture [6].

In the learning process, the network is trained by stochastic gradient descent on the error function

-\ln\left( \sum_{i=1}^{r} g_i \exp\left( -\tfrac{1}{2} \left\| s - Z_i \right\|^2 \right) \right)    (eq. 1)

where s is the desired output for the input x, Z_i is the output vector of the i-th expert network, g_i is the output of the gating network for the i-th expert, and u_i is the total weighted input received by output neuron i of the gating network.
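For clarity, eq. (1) can be read as the following minimal NumPy sketch; the function name and the array shapes are illustrative choices of ours, not part of the original formulation.

```python
import numpy as np

def mnn_error(g, Z, s):
    """Error of eq. (1) for a single training pattern.

    g : (r,)   outputs of the gating network (one weight per expert)
    Z : (r, c) output vectors Z_i of the r expert networks
    s : (c,)   desired output for the current input pattern
    """
    # -ln( sum_i g_i * exp(-0.5 * ||s - Z_i||^2) )
    return -np.log(np.sum(g * np.exp(-0.5 * np.sum((s - Z) ** 2, axis=1))))
```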
Given an n-dimensional pattern x as input, the overall learning process of the MNN comprises the following steps:

1. Random initialization of the synaptic weights of the different networks with small, uniformly distributed values. Henceforth, we denote by w_ji the weights of the expert networks and by w_ti the weights of the gating (integrating) network.

2. The pattern x is presented to each one of the networks (experts and gating network), so that the output of each expert network is given by

Z_i^m = x \cdot w_{ji}^m    (eq. 2)

where x is the input vector and the superscript m indicates the module. Similarly, the output of the gating network is obtained, with u_i = x \cdot w_{ti}, by

g_i = \frac{\exp(u_i)}{\sum_{j=1}^{r} \exp(u_j)}    (eq. 3)

3. Adjustment of the weights of the expert networks and of the gating network. Two update rules are used:

a. For the expert networks:

w_{ji}^m(I+1) = w_{ji}^m(I) + \eta \, h_i \, (s - Z_i^m) \, x    (eq. 4)

b. For the gating network:

w_{ti}(I+1) = w_{ti}(I) + \eta \, \big( h_i(I) - g_i(I) \big) \, x    (eq. 5)

where

h_i = \frac{ g_i \exp\left( -\tfrac{1}{2} \left\| s - Z_i^m \right\|^2 \right) }{ \sum_{j=1}^{r} g_j \exp\left( -\tfrac{1}{2} \left\| s - Z_j^m \right\|^2 \right) }    (eq. 6)

4. Finally, the outputs of the modules are combined to obtain the final output of the MNN:

Z = \sum_{i=1}^{r} g_i \, Z_i    (eq. 7)
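To make the above procedure concrete, the following minimal NumPy sketch implements one on-line learning step following eqs. (2)-(7). The function name, the array shapes (one c x n weight matrix per expert module and an r x n matrix for the gating network), the in-place updates and the learning rate value are our own illustrative assumptions, not specifications taken from the paper.

```python
import numpy as np

def mnn_train_step(x, s, W_exp, W_gate, eta=0.1):
    """One on-line learning step of the MNN for a single pattern.

    x      : (n,)      input pattern
    s      : (c,)      desired output (e.g. one-hot class label)
    W_exp  : (r, c, n) float weights of the r linear expert networks
    W_gate : (r, n)    float weights of the gating network
    eta    : learning rate
    Returns the combined output Z of eq. (7); weights are updated in place.
    """
    # eq. (2): output of each expert module, Z_i = x * w_ji
    Z = np.einsum('rcn,n->rc', W_exp, x)

    # eq. (3): softmax gating, g_i = exp(u_i) / sum_j exp(u_j)
    u = W_gate @ x
    g = np.exp(u - u.max())          # shifted for numerical stability
    g /= g.sum()

    # eq. (6): responsibility h_i of each expert for this pattern
    e = g * np.exp(-0.5 * np.sum((s - Z) ** 2, axis=1))
    h = e / e.sum()

    # eq. (4): move each expert towards the target, weighted by h_i
    W_exp += eta * h[:, None, None] * (s - Z)[:, :, None] * x[None, None, :]

    # eq. (5): move the gating weights so that g_i approaches h_i
    W_gate += eta * (h - g)[:, None] * x[None, :]

    # eq. (7): final output is the gate-weighted sum of the expert outputs
    return g @ Z
```

For instance, with r = 5 experts (as used in the experiments of Section 4), W_exp would have shape (5, c, n) and W_gate shape (5, n), both initialized with small random values; repeatedly calling the function over the training patterns corresponds to the per-pattern training described above.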
3 Multiple Classifier System

Let D = {D_1, ..., D_h} be a set of h classifiers. Each classifier D_i (i = 1, ..., h) gets as input a feature vector x in R^n and assigns it to one of the c problem classes. The output of the MCS is an h-dimensional vector [D_1(x), ..., D_h(x)]^T containing the decisions of the h individual classifiers. The individual decisions are then combined by some strategy [9], [8] in order to obtain the final decision.

The construction of an MCS is based on two aspects: the diversity of the individual decisions and the accuracy of the single classifiers. The methods used to achieve diversity can be grouped into five categories [4]: pattern manipulation, attribute manipulation, label manipulation, the use of different classification algorithms, and the use of randomness.

To build the MCS, in this study we use subsamples obtained by pattern manipulation, such that the resulting subsets have a size proportional to the number of classifiers that integrate the MCS. Thus, in the experiments reported here the MCS was integrated with 7 and 9 classifiers; following [11], each subsample then includes only a fraction (about one seventh or one ninth) of the samples of the original training data set.

To obtain the subsamples, we use random selection without replacement of patterns [12] and Bagging [3]. In the first method, the random selection is performed without replacement, so that a given pattern cannot be selected more than once, thereby reducing pattern redundancy. On the other hand, Bagging produces subsamples called bootstrap samples, each of which has the same size as the original data set. In each subsample obtained with Bagging, a pattern has a probability of 1 - (1 - 1/m)^m of being selected at least once in the m draws; that is to say, each pattern has approximately a 63% chance of appearing in the subsample.

Once the subsamples have been generated with some resampling method, each one is presented to a member of the MCS (Fig. 2). After that, for combining the individual classifier decisions, two strategies are proposed in the literature: fusion and selection. In classifier selection, each individual classifier is assumed to be an expert in a part of the feature space and, correspondingly, only one classifier is selected to label the input vector. In classifier fusion, each component is assumed to have knowledge of the whole feature space and thus all the individual classifiers are taken into account to decide the label of the input vector.

Fig. 2. MCS of MNN.
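As an illustration of the two resampling schemes and of decision fusion by simple majority voting (the rule used later in Section 4), a minimal sketch could look as follows; the function names, the seed handling and the assumption that class labels are the integers 0..c-1 are ours.

```python
import numpy as np

def subsamples_without_replacement(m, L, seed=0):
    """Random selection without replacement: split the m training indices into
    L disjoint subsamples of roughly m/L patterns, so that no pattern is
    selected more than once."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(m), L)

def bootstrap_subsamples(m, L, seed=0):
    """Bagging: L bootstrap subsamples of size m drawn with replacement; each
    pattern appears in a given subsample with probability 1-(1-1/m)^m (~63%)."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, m, size=m) for _ in range(L)]

def majority_vote(decisions):
    """Fusion by simple majority voting.

    decisions : (L, n_test) array of integer class labels in 0..c-1,
                one row per individual classifier.
    Returns the most voted class label for every test pattern."""
    decisions = np.asarray(decisions)
    n_classes = decisions.max() + 1
    votes = np.apply_along_axis(np.bincount, 0, decisions, minlength=n_classes)
    return votes.argmax(axis=0)
```

With L = 7 or L = 9, the first function yields disjoint subsets of about m/7 or m/9 patterns, while the second yields L subsets of size m drawn with replacement.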
4 Experimental Results

The results correspond to experiments carried out over 12 real data sets taken from the UCI Machine Learning Database Repository (http://archive.ics.uci.edu/ml/).

Table 1. A brief summary of the UCI databases used in this paper.

Dataset     Classes   Features   Training samples   Test samples
Cancer         2          9            546               137
German         2         24            800               200
Heart          2         13            216                54
Iris           3          4            120                30
Liver          2          6            276                69
Phoneme        2          5           4322              1082
Pima           2          8            615               153
Satimage       6         36           5147              1288
Segment        7         19           1848               462
Sonar          2         60            167                41
Vehicle        4         18            678               168
Waveform       3         21           4000              1000

For each database, we estimate the average predictive accuracy and processing time by 5-fold cross-validation, using 80% of the samples as the training set and the remaining 20% as the test set. According to the MNN and MCS schemes, the specifications are as follows:

1. Topology. Each expert in the MNN corresponds to a linear perceptron, in which the number of nodes in the input layer corresponds to the number of attributes of the input pattern. For the expert networks, the number of neurons in the output layer is equal to the number of categories of the problem, while for the gating network it is equal to the number of experts used.
2. Connection weights. The connection weights were initialized to random values in the range between -0.5 and 0.5.
3. Each MNN consists of 5 modules and a gating network.
4. For the final decision of the MCS, simple majority voting was used.

Only the result of the best technique on each database has been highlighted. Regarding the number of subsamples used to induce the individual classifiers, that is, the number of classifiers in the MCS, we have experimented with 7 and 9 elements, and the results are included in Table 2. Besides, the classification accuracy of a single MNN trained on each original training set is also reported as the baseline classifier.

Since the accuracies are very different for the distinct data sets, comparing these raw results across data sets would be inadequate. Instead, we calculate ranks for the methods. For each data set, the method with the best accuracy receives rank 1 and the worst receives rank 5; if there is a tie, the ranks are shared. The overall rank of a method is then obtained by accumulating its ranks across the 12 data sets; the smaller the overall rank, the better the method.

Table 2 has two sections: the first one includes the MNN results, and the second one shows the results when the MCS is used with 7 and 9 classifiers. In the latter case, the capital letter identifies the resampling method used for obtaining the subsamples: random selection without replacement (A) and Bagging (B). The results correspond to the overall accuracy, with the standard deviation in parentheses; values marked with an asterisk indicate the highest accuracy for each database.

Table 2. Overall accuracy results.

                           MCS 7 classifiers           MCS 9 classifiers
Dataset        MNN            A            B              A             B
Cancer      88.4 (4.6)*   88.4 (3.1)*   87.9 (3.0)     87.1 (4.7)    86.5 (4.2)
Heart       73.7 (8.6)    81.5 (5.4)*   81.5 (4.5)*    78.9 (4.7)    80.4 (7.4)
Liver       63.5 (5.4)    54.8 (8.1)    62.9 (6.9)     62.0 (4.9)    67.0 (3.8)*
Pima        66.5 (1.6)    68.0 (1.8)*   67.6 (3.2)     66.1 (2.3)    67.8 (2.4)
Sonar       65.9 (6.2)    73.7 (3.2)    67.8 (4.7)     77.1 (12.2)*  70.7 (7.1)
Iris        80.7 (11.4)   78.0 (6.9)    82.0 (8.0)*    78.0 (7.7)    80.0 (6.7)
Vehicle     36.4 (7.1)    47.1 (3.7)*   42.8 (10.9)    42.2 (4.0)    43.5 (3.8)
German      61.8 (18.0)   73.7 (1.3)*   72.4 (4.5)     73.2 (1.9)    72.7 (4.1)
Phoneme     67.9 (4.5)    67.2 (5.5)    68.9 (3.5)*    67.7 (4.2)    68.1 (4.2)
Waveform    77.2 (2.7)    81.6 (1.7)    82.0 (2.6)*    79.2 (3.8)    80.2 (3.3)
Segment     78.2 (5.6)*   75.0 (2.2)    74.9 (2.2)     76.9 (2.4)    74.5 (1.8)
Overall rank   46.5          34.0         28.5            37.5          33.0

From the results shown in Table 2, some comments may be drawn. First, except for the Cancer and Segment data sets, it is clear that some MCS schemes lead to better performance than the MNN. This is confirmed by the overall rank of the MNN, which is clearly the poorest. Second, comparing the MCS with 7 and 9 classifiers, it can be observed that the MCS with 7 classifiers often reaches an accuracy greater than (or equivalent to) the one obtained when nine classifiers are used. Finally, comparing the resampling methods, method A (random selection without replacement) generally behaves better than method B (Bagging), giving the best result with the MCS of 7 classifiers on 5 data sets; even where it is not the winner, its results remain very close to the best ones.

The Vehicle data set is a special case due to its poor performance, regardless of the scheme used. In this case, a thorough analysis of the data distribution is necessary in order to identify why neither the MNN nor the MCS is able to recognize the classes of the problem as required.

Another aspect to be analyzed is the computational cost associated with each model. To this end, Table 3 shows the time, in minutes, required by each classifier model for the training and classification process.

Table 3. Training time (in minutes).

                        MCS 7 classifiers        MCS 9 classifiers
Dataset       MNN          A         B              A         B
Cancer        11.3        9.4       9.2            9.5       9.2
Heart          5.5        4.3       4.1            3.7       2.5
Liver          3.3        1.5       1.3            1.7       1.7
Pima           9.8        9.8       9.9            9.6       9.8
Sonar          9.1        0.3       0.4            0.3       0.3
Iris           2.3        2.3       2.2            1.8       1.8
Vehicle        0.3        0.3       0.3            0.3       0.3
German        31.4       21.6      22.0           25.2      22.9
Phoneme       66.8       61.6      55.6           57.4      59.0
Waveform     174.5      131.5     124.7          133.0     127.2
Segment        1.4        2.0       2.1            2.1       2.1

The results in Table 3 clearly show large differences between the processing times obtained by the three models used. It is interesting to note that, in the majority of cases, the time required by the MNN is almost two times longer than the time required by any MCS; for example, with the Sonar data set the MNN requires nine times more time than the MCS. These differences may be explained by the fact that the MCS trains each member on a small subsample of m/L patterns, where m is the number of training patterns and L the number of subsamples [12], which reduces the computational cost in terms of runtime. In fact, using 9 classifiers requires less time in most cases, because the subsamples are smaller.

Finally, regarding the performance of the schemes used, we can note that the best classification results were obtained with the MCS of 7 classifiers, which requires less processing time than the single MNN and shows only small differences with respect to the MCS with 9 members.
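As a closing illustration of the rank-based comparison used for Table 2, the sketch below computes per-dataset ranks (rank 1 for the highest accuracy, shared ranks for ties) and accumulates them per method; the function name and the array layout are assumptions of ours, and dividing the result by the number of data sets would give the average rank instead of the accumulated one.

```python
import numpy as np
from scipy.stats import rankdata

def overall_ranks(accuracy):
    """Rank-based comparison of methods as used for Table 2.

    accuracy : (n_datasets, n_methods) array of overall accuracies.
    On each data set the best method gets rank 1 and ties share the (average)
    rank; the per-dataset ranks are then accumulated per method, so a smaller
    overall rank indicates a better method."""
    acc = np.asarray(accuracy, dtype=float)
    # negate so that the highest accuracy receives rank 1 on each data set
    per_dataset = np.array([rankdata(-row) for row in acc])
    return per_dataset.sum(axis=0)
```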
5 Concluding Remarks and Future Work

The design of an MCS with MNNs as individual classifiers has been analyzed here. Two MCS were used, with 7 and with 9 classifiers. For the single MNN architecture, we employed five expert networks and one gating network. The experimental results allow comparing these models in terms of processing time and predictive accuracy. From this, it has been possible to corroborate that, in general, the MCS clearly outperforms the classifier obtained with the single MNN. In addition, when comparing the behavior of the resampling methods, it has been empirically shown that random selection without replacement offers the best performance, with greater precision and lower computational cost.

Finally, by comparing the classification results and the processing time required by each model, the use of the MCS provides the best performance, being the best option to improve the time-accuracy trade-off.

Future work will mainly aim at improving the performance of the single MNN. In this context, other architectures with different parameters, as well as mechanisms such as regularization and cross-validation, must be analyzed. The relationship between the individual classifiers and the resampling methods should also be further investigated in order to determine the "optimal" scenario.

Acknowledgements. This work has been partially supported by the project grant 3834/2014/CIA from the Mexican UAEM.

References

1. Alejo, R.: Análisis del Error en Redes Neuronales: Corrección del error de los datos y distribuciones no balanceadas. Doctoral thesis, Universitat Jaume I, Castelló de la Plana, Spain (2010).
2. Bauckhage, C., Thurau, C.: Towards a Fair 'n Square Aimbot: Using Mixtures of Experts to Learn Context Aware Weapon Handling. In: Proceedings of GAME-ON'04, Ghent, Belgium, pp. 20-24 (2004).
3. Breiman, L.: Bagging predictors. Machine Learning 24(2), pp. 123-140 (1996).
4. Dietterich, T. G.: Ensemble methods in machine learning. Lecture Notes in Computer Science, vol. 1857, pp. 1-15 (2000).
5. Hartono, P., Hashimoto, S.: Ensemble of Linear Perceptrons with Confidence Level Output. In: Proceedings of the 4th International (USA), pp. 97-106 (2000).
6. Kadlec, P., Gabrys, B.: Learnt Topology Gating Artificial Neural Networks. In: Proceedings of IJCNN 2008, pp. 2604-2611. IEEE (2008).
7. Kuncheva, L. I.: Using measures of similarity and inclusion for multiple classifier fusion by decision templates. Fuzzy Sets and Systems 122(3), pp. 401-407 (2001).
8. Kuncheva, L. I., Bezdek, J. C., Duin, R. P. W.: Decision templates for multiple classifier fusion. Pattern Recognition 34, pp. 299-314 (2001).
9. Kuncheva, L. I., Kountchev, R. K.: Generating classifier outputs of fixed accuracy and diversity. Pattern Recognition Letters 23, pp. 593-600 (2002).
10. Martínez, L. M., Rodríguez, P. A.: Modelado de sus funciones cognitivas para entidades artificiales mediante redes neuronales modulares. Doctoral thesis, Universidad Politécnica de Madrid, Spain (2008).
11. Valdovinos, R. M., Sánchez, J. S.: Sistemas Múltiples de Clasificación: preprocesado, construcción, fusión y evaluación. Académica Española, Germany (2011).
12. Valdovinos, R. M., Sánchez, J. S.: Class-dependant resampling for medical applications. In: Proceedings of the 4th International Conference on Machine Learning and Applications, pp. 351-356 (2005).