Pruning AdaBoost for Continuous Sensors Mining Applications

M. Rastgoo, G. Lemaitre, X. Rafael Palou, F. Miralles and P. Casale 1

1 Barcelona Digital Technology Center, Barcelona, Spain, email: plcasale@bdigital.org

Abstract. In this work, pruning techniques for the AdaBoost classifier are evaluated, specifically aimed at a continuous learning framework in sensors mining applications. To assess the methods, three pruning schemes are evaluated using standard machine-learning benchmark datasets, simulated drifting datasets and real cases. Early results show that the pruning methodologies approach, and sometimes outperform, the non-pruned version of the classifier, while being more easily adaptable to drift in the training distribution. Future work is planned to evaluate the approach in terms of time efficiency and extension to big-data analysis.

1 Introduction

As the number of sensors deployed every day in the real world increases, mining these continuous data-streams becomes a crucial part of many applications. In recent years, data mining techniques have become very popular in sensors mining tasks, especially when related to learning from data streams [3, 10]. These techniques, built upon the machine learning framework, are designed to generate a predictive model from a well-sampled training dataset distribution. The model is then used to classify any future instance of data, without the possibility of being updated if the value distribution of the data-stream changes. In other words, the paradigm provided by the typical machine learning setting is not suitable for continuous mining of data streams [5]. The AdaBoost learning function [2] provides a suitable framework for mining continuous streams [8]. Being an incremental ensemble of classifiers, this learning function grows its knowledge simply by adding new classifiers to the previous models. Nevertheless, when many subsequent batches of data are provided, AdaBoost tends to create large ensembles that suffer from two main drawbacks: (i) increasing memory needed to store the decision model and (ii) over-fitting. Pruning techniques are well suited to reducing the size of the ensemble by selecting only specific models. The first attempt at pruning an AdaBoost classifier was made by Margineantu and Dietterich [6], who compared five different methods, namely (i) early stopping, (ii) KL divergence, (iii) Kappa statistics, (iv) Kappa-error convex hull and (v) Reduce Error with back-fitting. Hernández-Lobato et al. [4] used Genetic Algorithms to prune the AdaBoost ensemble, searching the space of all possible subsets of classifiers created by AdaBoost. Zhang et al. [11] defined pruning as a quadratic integer programming problem, with the aim of finding a fixed-size subset of k classifiers with minimum misclassification and maximum diversity. Nevertheless, these works are not suitable solutions for pruning AdaBoost in a continuous learning framework.

In this paper, experiments on pruning methods for continuous data-streams mining are performed. The AdaBoost algorithm is trained on subsequent batches of incoming data, followed by consecutive pruning steps. The advantage of this approach is twofold: (i) when new concepts are learned, pruning keeps the ensemble as small as possible in memory and (ii) pruning provides a first attempt to retain only the significant information acquired from previous knowledge. The remainder of this paper is organized as follows. In Section 2, the continuous learning framework, the AdaBoost algorithm and the pruning methods used are introduced and explained in detail. In Section 3, the validation protocols are described and, in Section 4, results are presented. Finally, Section 5 discusses the obtained results and concludes the paper.

[Figure 1. Continuous Learning Framework]

2 Pruning AdaBoost in Continuous Learning

In a continuous learning framework, as shown in Fig. 1, new knowledge is acquired only when the current model no longer fits the incoming data-stream distribution [9]. This decision is made by evaluating the current classifier with the function g as performance measure, obtaining a performance value e. When e is not good enough, the current model h_i is updated by training the learning function f on the new incoming data D_{i+1}. Incremental learning functions should be preferred: in this way, only the new incoming data are used to maintain the previously acquired knowledge, without having to store historical data. The AdaBoost algorithm is an incremental learning function able to properly meet these requirements. Nevertheless, the ensemble created by AdaBoost grows linearly as subsequent learning steps are performed. Here, the pruning function p keeps the model computationally optimal. The aim of this work is to compare different pruning functions p in terms of classifier performance. In the following subsections, AdaBoost and the pruning methods are presented and explained in detail.
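The decision logic of Fig. 1 can be condensed into a few lines. Below is a minimal Python sketch of the continuous learning loop; train, evaluate and prune stand for the functions f, g and p of the framework and are passed in as parameters, so all names here are illustrative rather than the authors' implementation.

```python
def continuous_learning(batches, Th, train, evaluate, prune):
    """Sketch of Fig. 1: update and prune the model only when it drifts.
    g -> evaluate(model, batch): error of the current model on new data
    f -> train(model, batch):    incremental update of the ensemble
    p -> prune(model, batch):    keep only the most significant learners"""
    model = None
    for batch in batches:                      # incoming batches D_1, D_2, ...
        if model is None or evaluate(model, batch) > Th:
            model = train(model, batch)        # learn the new concept
            model = prune(model, batch)        # keep the ensemble small
        yield model                            # current model h_i
```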
2.1 AdaBoost

AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm that obtains a high-performance classifier from a linear combination of weak learners. Algorithm 1 shows the pseudocode for AdaBoost. The algorithm takes as input a training set (x_k, y_k), where x_k is an N-dimensional feature vector and y_k is the class label. After T rounds of training, T weak classifiers h_t and T weights α_t are combined to assemble the final strong classifier. Higher weights α_t are assigned to the best weak classifiers h_t.

Algorithm 1 AdaBoost Algorithm
Input:
  - Training set of N samples (x_k, y_k), with k = 1, ..., N, x_k ∈ R^N, y_k ∈ Y = {−1, +1};
  - Weak learning algorithm WeakLearn;
  - Number of learning iterations T;
Initialize W_1(k) = 1/N, k = 1, ..., N;
for t = 1, ..., T do
  1. Train WeakLearn using distribution W_t and get weak hypothesis h_t;
  2. Compute the classification error ε_t = Pr_{k∼W_t}[h_t(x_k) ≠ y_k];
  3. Compute α_t = (1/2) ln((1 − ε_t) / ε_t);
  4. Update the distribution:
       W_{t+1}(k) = W_t(k) exp(−α_t y_k h_t(x_k)) / Z_t,
     where Z_t is a normalization factor chosen so that W_{t+1} is a proper distribution function;
end for
Output: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x));

Instantiations of AdaBoost may differ in the choice of the weak learning algorithm, defined as a learner performing slightly better than random guessing (> 50% correct classification). A variety of weak learners, e.g., neural networks or decision trees, can be used. Decision stumps are the most common weak classifiers used in AdaBoost. Decision stumps are one-level decision trees, equivalent to a threshold that best splits the data. Each stump learner is characterized by three parameters: (i) the n-th dimension of the feature set on which the classifier is applied, (ii) the decision level, i.e., the threshold splitting the data in the given n-th dimension, and (iii) the decision sign (−1 or +1) determining the direction of the inequality for the thresholding. For a given batch of data with a feature set of size n, at each iteration of AdaBoost the decision stump that minimizes the error ε_t in a single dimension of the training distribution is selected. The information provided by the final set of decision stumps selected by AdaBoost can therefore be used to mine which features of the data-stream are significant and, more importantly, where the best split in the data lies.
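For concreteness, the following is a minimal NumPy sketch of Algorithm 1 with decision stumps as weak learners, using the stump parametrization just described (dimension, threshold, sign). It is an illustrative re-implementation, not the code used in the paper; the exhaustive threshold search is the simplest possible WeakLearn.

```python
import numpy as np

def adaboost_stumps(X, y, T):
    """Train T decision stumps following Algorithm 1. X: (N, D) array, y in {-1, +1}."""
    N, D = X.shape
    W = np.full(N, 1.0 / N)                  # W_1(k) = 1/N
    ensemble = []                            # stumps as (dim, threshold, sign, alpha)
    for _ in range(T):
        best = None
        # WeakLearn: exhaustively pick the stump minimizing the weighted error
        for d in range(D):
            for thr in np.unique(X[:, d]):
                for sign in (-1, 1):
                    pred = np.where(X[:, d] > thr, sign, -sign)
                    err = W[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, d, thr, sign, pred)
        err, d, thr, sign, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))   # step 3
        W = W * np.exp(-alpha * y * pred)                     # step 4
        W = W / W.sum()                                       # normalize by Z_t
        ensemble.append((d, thr, sign, alpha))
    return ensemble

def predict(ensemble, X):
    """Strong classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(a * np.where(X[:, d] > thr, s, -s) for d, thr, s, a in ensemble)
    return np.sign(score)
```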
2.2 Pruning methods

Three different pruning methods have been used and compared, namely (i) Reduce Error, (ii) Learner Weights Analysis and (iii) Pareto Analysis. The Reduce Error algorithm was first used in [6]. Since the original implementation is not suitable for a continuous learning framework, an improved version is proposed in this work in order to speed up the process. Pruning has also been performed using the Learner Weights Analysis and Pareto Analysis methodologies, both able to extract a set of the most discriminative learners from the whole ensemble. To the best of our knowledge, these methodologies have not previously been applied to the task of pruning an AdaBoost ensemble.

2.2.1 Reduce Error (RE)

In this algorithm, a first step initializes the pruning distribution W_t and selects the weak classifier h_t from the ensemble H that minimizes the classification error ε_t on the W_t distribution. This classifier is added to the pruned ensemble P, a weight α_t is assigned to it, and the W_{t+1} distribution is updated as in the AdaBoost routine. Then, iteratively, each remaining classifier h_t is individually added to the ensemble P, and the classification error ε_t of this new ensemble is evaluated on the pruning set using the W_{t+1} distribution. The classifier h_t whose combination with P minimizes the classification error ε_t is definitively added to P, a weight α_t is assigned to it, and the W_{t+2} distribution is again updated as in the AdaBoost routine. The routine stops when the number of classifiers in the sub-ensemble P reaches a pre-specified size. The two main changes with respect to the original RE algorithm are the following. In the original version, a final back-fitting approach is performed after the selection of each weak classifier, while in our approach selection is done at each step. In addition, each weak classifier is added to the pruned ensemble P only after being re-weighted. This procedure ensures better classification results than the original RE formulation.
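A possible reading of this modified RE routine is sketched below, reusing the stump representation of the AdaBoost sketch above. The provisional weight given to each candidate while scoring the combined ensemble is our assumption, since the text does not specify it; the re-weighting of the selected classifier follows the AdaBoost routine as described.

```python
import numpy as np

def reduce_error_prune(ensemble, Xp, yp, size):
    """Greedy RE pruning on a pruning set (Xp, yp); stops at the requested size."""
    W = np.full(len(yp), 1.0 / len(yp))              # pruning distribution W_t
    remaining, P = list(ensemble), []
    while len(P) < size and remaining:
        # candidate whose addition to P minimizes the weighted ensemble error;
        # the provisional weight 1.0 for the candidate is our assumption
        def weighted_err(h):
            d, thr, s, _ = h
            score = sum(a * np.where(Xp[:, di] > t, si, -si)
                        for di, t, si, a in P + [(d, thr, s, 1.0)])
            return W[np.sign(score) != yp].sum()
        best = min(remaining, key=weighted_err)
        remaining.remove(best)
        d, thr, s, _ = best
        pred = np.where(Xp[:, d] > thr, s, -s)
        err = W[pred != yp].sum()
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))  # re-weight as in AdaBoost
        P.append((d, thr, s, alpha))
        W = W * np.exp(-alpha * yp * pred)                   # update W_{t+1}
        W = W / W.sum()
    return P
```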
2.2.2 Learner Weights Analysis (WA)

From the distribution of the weights α_t in the ensemble, weak learners are selected based on the following assumptions: (i) weak learners with a higher ensemble weight α_t are the best weak learners of the ensemble, and (ii) an ensemble is better the more diversified its classifiers are. The technique works as follows. AdaBoost is applied on the batch of data to obtain an ensemble of T classifiers. Then, a matrix M is built by grouping the ensemble weights α_t of each decision stump classifier according to their dimension parameter. M is of size n × D, where n is the number of elements for each of the D dimensions. In order to select the best classifiers, M is sorted first by row and subsequently by column, always in descending order. M is then transformed into a vector V by concatenating all its columns. Finally, the t classifiers corresponding to the first t weights of V, with t ≪ T, are selected. The value of t determines the pruning percentage.

2.2.3 Pareto Analysis (PA)

PA is based on the assumption that a few key actions produce most of the overall effect. Applied to ensemble learning, this implies that only a few key weak classifiers have a high impact on the overall performance of the ensemble. PA offers a statistical point of view for selecting these key classifiers. The technique estimates the effectiveness of each feature dimension and accordingly selects the classifiers from the feature dimensions with high impact; the effectiveness can be adjusted using a threshold. First, the features are grouped based on the total number of ensemble weights considered as outliers in each dimension. The outliers are found with reference to the first and third quartiles (Q1, Q3) and the inter-quartile range (IQR): values above Q3 + 1.5 × IQR are considered outliers in each case. The frequency distribution of these outliers is sorted in descending order and the cumulative distribution is computed. Then, the feature dimensions are selected based on a threshold level corresponding to the number of classifiers to keep: all dimensions with a cumulative percentage lower than the threshold (i.e., the desired percentage of the maximum cumulative value) are taken into account. From the selected feature dimensions, the maximum weights are used to single out the learners. The technique can be seen as a principal dimension selection, where the dimensions considered most important are retained.
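Since the paper describes PA only in prose, the following sketch is a speculative reading of the procedure: alphas are grouped per dimension, outliers are counted with the Q3 + 1.5 × IQR rule, dimensions are kept up to a cumulative-percentage threshold, and the top-weighted stumps of those dimensions are retained. The function name, the default threshold and the exact cut semantics are all assumptions.

```python
import numpy as np

def pareto_select(ensemble, keep, threshold=0.8):
    """Speculative PA pruning: keep stumps from the few 'key' dimensions."""
    alphas = {}                                   # dimension -> list of alpha_t
    for d, thr, s, a in ensemble:
        alphas.setdefault(d, []).append(a)
    # per dimension, count alpha values above Q3 + 1.5 * IQR (outliers)
    counts = {}
    for d, a in alphas.items():
        q1, q3 = np.percentile(a, [25, 75])
        counts[d] = sum(x > q3 + 1.5 * (q3 - q1) for x in a)
    # Pareto cut: sort dimensions by outlier frequency, keep those whose
    # cumulative share stays below the threshold percentage
    dims = sorted(counts, key=counts.get, reverse=True)
    total = sum(counts.values()) or 1
    cum, selected = 0, []
    for d in dims:
        if cum / total >= threshold:
            break
        selected.append(d)
        cum += counts[d]
    # within the selected dimensions, retain the stumps with maximum weight
    candidates = [h for h in ensemble if h[0] in selected]
    return sorted(candidates, key=lambda h: h[3], reverse=True)[:keep]
```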
3 Validation Protocol

Three types of experiments have been performed in order to validate the effectiveness of the pruning methods on both static and drifting distributions. A cross-validation approach has been used to validate the methods. At each step of the cross-validation process, the dataset is randomly divided into three subsets: a training set (50%), a pruning set (40%) and a testing set (10%). In the following sections, the validation protocol adopted for each type of experiment is described. Under the model described in Fig. 1, a suitable threshold Th has been chosen so that the model is always trained on the new incoming data.

3.1 UCI Datasets Repository

Five datasets from the UCI repository [1] have been used to evaluate the effectiveness of the pruning methods. In this validation step, the KL divergence method, as originally proposed in [6], has been added in order to have a baseline comparison. The datasets considered are Australian, Breast, Diabetes, Heart and Pima. The mean number of instances in the datasets is around 700, except for Heart, which has 270 instances. The aim of the experiment is to analyse the results of pruning an initial ensemble at 90%. The average error rate for each technique was computed using a modified version of ten-fold cross-validation able to include the pruning sets in the evaluation process, with the percentages outlined above. The AdaBoost algorithm was used to create an ensemble of one hundred weak classifiers; each pruning method was then applied to create a pruned sub-ensemble containing only ten classifiers.

3.2 Simulated Drifting Datasets

The second set of experiments focused on testing the pruning methods in a continuous learning framework. These experiments were performed using three sets of simulated data-streams that include drifting. The datasets are generated using the software provided by [7]. Figure 2 shows the three different settings for each experiment. Four linear drifts have been considered for the first dataset, and three circular drifts have been created for the remaining two datasets. The ensemble was incrementally grown using all the drifted distributions. The experiments performed using the simulated datasets are described in the following.

[Figure 2. Artificial data with drifts: (a) Linear Drift, (b) Circular Drift, (c) Circular Narrow Drift]

Exp. 1: In the first experiment, it is assumed that the data distribution is subject to change due to different drifts, and the ensemble is incrementally grown over the drifted batches of incoming data with the main aim of classifying the current batch of information. After training, pruning and testing are applied on different samplings of the same drifted batch. The experiment is repeated five times following a 5-fold cross-validation paradigm.

Exp. 2: The aim of the second experiment is to evaluate the potential of pruning in classifying both previous and current information. With the training kept as in the previous experiment, at each step i the ensemble is pruned and tested on the pruning and testing sets of the joint distribution C_0 ∪ ... ∪ C_i. The experiment has been performed on five different runs, following a 5-fold cross-validation paradigm.

3.3 Real World Datasets

Three real-world datasets have been used to evaluate the proposed methodology in a real-world scenario. The datasets considered are described in the following.

• The Sensor Stream (SS) dataset [12] contains sensor information (temperature, humidity, light and sensor voltage) collected from fifty-four sensors deployed at the Intel Berkeley Research Lab. The whole stream contains two months of consecutive information (2,219,803 instances). The experiment aims to infer the illuminance state based on the measurements provided by each sensor: illuminance higher than 200 lux is labeled as class +1, otherwise as class −1. Every fifteen days a new batch of data is collected, which leads to three drifts, reflecting changes in the lab environment due to weather, humidity and office work. The experiment was performed using a 4-fold cross-validation paradigm.

• Power Supply (PS) [12] is the second dataset used. It contains the hourly power supply consumption of the Italian electricity company. The stream contains three years of power supply records, from 1995 to 1998 (29,928 instances). The experiment aims to predict the day state, morning (+1) or night (−1), based on the raw consumption value. The drift in this stream is mainly driven by features such as the season, the weather, the hour of the day and the day of the week. The data were split into three batches, representing one drift per year. The experiment was performed using a 3-fold cross-validation paradigm.

• Elec 2 (E2) is the third dataset used. This dataset, containing 27,549 instances, is composed of seven drifts, each representing a day of the week. The drifts are due to changes in power consumption over the weekdays. The experiment was performed using a 7-fold cross-validation paradigm.

As in Exp. 2 on simulated data, AdaBoost is trained for each drift on the training set of the current data. The pruning function is applied on a pruning set that contains samples of both previous and new batches of data.
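Putting the pieces together, one cross-validation step of this protocol might look as follows. This is illustrative only: adaboost_stumps, reduce_error_prune and predict are the hypothetical helpers sketched in Section 2, while the 50/40/10 split and the 90% pruning follow the text above.

```python
import numpy as np

def evaluate_fold(X, y, T=100, prune_frac=0.10, seed=0):
    """One cross-validation step: 50% train / 40% prune / 10% test (Section 3)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr, n_pr = int(0.5 * len(y)), int(0.4 * len(y))
    tr, pr, te = idx[:n_tr], idx[n_tr:n_tr + n_pr], idx[n_tr + n_pr:]

    ensemble = adaboost_stumps(X[tr], y[tr], T)              # train on 50%
    pruned = reduce_error_prune(ensemble, X[pr], y[pr],      # prune at 90%
                                size=int(prune_frac * T))
    err_full = np.mean(predict(ensemble, X[te]) != y[te])    # test on 10%
    err_pruned = np.mean(predict(pruned, X[te]) != y[te])
    return err_full, err_pruned
```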
4 Experimental Results

In this section, the results obtained in the experiments described in the previous section are reported. The misclassification error has been chosen as performance measure. In particular, the pruning methods have been evaluated using the relative error ε_rel with respect to the error of the non-pruned version of AdaBoost, computed as shown in Eq. 1. Hence, methods with negative relative errors perform better than the reference model.

    ε_rel = −1 · (ε_no-pruned − ε_pruned) / ε_no-pruned        (1)

4.1 UCI Datasets Repository

In Fig. 4, the results obtained on the five UCI datasets are reported. RE is the method that performs better than the others, beating the reference on the Aus, Dia and Hea datasets and being slightly worse than the reference on Pim. Similar behavior is obtained by WA. Pruning always performs badly on Bre, where the best result is provided by PA.

[Figure 4. Performance of pruning methods on UCI datasets with 90% pruning]

4.2 Simulated Drifting Datasets

Results obtained on the simulated drifting datasets with Exp. 1 are reported in Fig. 3. RE is the best pruning method for the linear and circular drifting datasets, as the previous experiments suggest. In both linear and circular drifting, WA performs better than PA. None of the pruning methods works better than the non-pruned version for high percentages of pruning. Nevertheless, WA works slightly better than non-pruned AdaBoost when the percentage of pruning is around 50%. As might be expected, the performance of the pruned ensemble generally gets worse as the percentage of pruning increases. Nevertheless, RE is able to keep its performance constant over the pruning percentage on the circular dataset and almost constant on the narrow circular dataset.

[Figure 3. Results obtained on simulated drifting datasets for Exp. 1: (a) Linear Drift, (b) Circular Drift, (c) Circular Narrow Drift]

For Exp. 2, the results obtained on the simulated drifting datasets are reported in Fig. 6. In this setting, all the pruned ensembles behave better than their corresponding non-pruned classifiers. As all the previous experiments suggest, RE is the best method, followed by WA. Also in this case, although the performance of the methods decreases as the percentage of pruning increases, RE remains almost constant regardless of the percentage. It should also be noted that the AdaBoost performance in this experiment is rather poor, reaching a global error of up to 40%; the pruning methods improve on this performance, reaching an error of 25%.

[Figure 6. Results obtained on simulated drifting datasets for Exp. 2: (a) Linear Drift, (b) Circular Drift, (c) Circular Narrow Drift]

4.3 Real World Datasets

Results obtained on the real-world datasets are shown in Fig. 5, and results obtained on the PS dataset are shown in Fig. 7. RE is confirmed to be the best pruning method, followed by WA. For the SS and E2 datasets, WA and PA provide the same performance. It should be noted that RE performs better than the non-pruned version in all the experiments.

[Figure 5. Performance of pruning methods on real world datasets]

[Figure 7. Performance of pruning methods on the PS dataset]
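To make the reading of Eq. 1 used throughout the figures above concrete, a small helper with invented numbers: an error drop from 0.20 to 0.15 yields ε_rel = −0.25 (pruning helps), while an increase to 0.25 yields +0.25 (pruning hurts).

```python
def relative_error(err_no_pruned, err_pruned):
    """Eq. 1: negative values mean the pruned ensemble beats the reference."""
    return -1.0 * (err_no_pruned - err_pruned) / err_no_pruned

print(relative_error(0.20, 0.15))   # -0.25: pruned ensemble 25% better
print(relative_error(0.20, 0.25))   # +0.25: pruned ensemble 25% worse
```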
5 Discussions and Conclusions

In this work, experiments have been carried out to evaluate the potential of different pruning methods and their performance in the framework of continuous learning. The Reduce Error method is the most consistent method, followed by Learner Weights Analysis. The use of Pareto Analysis does not seem to be justified by the experiments. Nevertheless, one important characteristic of this method is its capability of automatically defining the number of classifiers in the pruned ensemble: PA may be automated by thresholding the performance, and early results show that this automatic version performs better than the original method in most cases. The experiments on simulated datasets in Exp. 1 show that the pruning methods are more effective on widely drifted distributions than on narrowly drifted ones. Due to the nature of the narrow circular dataset, the drift stages share a larger common area and, since in this experiment the current stage has more effect on pruning than the previous stages, the pruning performance is slightly lower. At the same time, Exp. 2 shows that the pruning methods perform better than the original classifier when the whole drifting distribution is presented: based on Fig. 6, pruning the ensemble throughout the incremental learning definitely improves the final results. Finally, the results obtained in the experiments on real datasets show that pruning through the continuous learning process provides results very close to, or better than, those of AdaBoost.

As future work, an evaluation of the efficiency of the methods in terms of computational complexity will be considered, since this parameter is of great importance in a continuous learning framework. For this main motivation, the Reduce Error method was modified in our research so as to be conceptually capable of running under time-efficiency guidelines, and methods based on genetic algorithms and semi-definite programming were not used for comparison. Finally, a study of the extension of the proposed methods towards a big-data approach is planned. This research shows that pruning by selecting weak classifiers from different pools of sub-sampled data may improve the final ensemble in terms of accuracy, diversity and ability to adapt to drift. The procedures employed in this work can easily be adapted to large datasets and to continuous learning environments with a high quantity of incoming data.

6 Acknowledgments

This work is supported by the Information and Communication Technologies Collaborative Project action BrainAble within the Seventh Framework Programme of the European Commission, project number ICT-2010-247447.

REFERENCES

[1] A. Asuncion and D.J. Newman, 'UCI machine learning repository', (2007).
[2] Y. Freund and R. Schapire, 'A short introduction to boosting', J. Japan. Soc. for Artif. Intel., 14(5), 771–780, (1999).
[3] J. Gama and M. Gaber (Eds.), Learning from Data Streams – Processing Techniques in Sensor Networks, Springer, (2007).
[4] Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Rubén Ruiz-Torrubiano, and Ángel Valle, 'Pruning adaptive boosting ensembles by means of a genetic algorithm', in IDEAL, pp. 322–329, (2006).
[5] Ludmila I. Kuncheva, 'Classifier ensembles for changing environments', in Multiple Classifier Systems, pp. 1–15, Springer, (2004).
[6] Dragos D. Margineantu and Thomas G. Dietterich, 'Pruning adaptive boosting', (1997).
[7] Leandro L. Minku, Allan P. White, and Xin Yao, 'The impact of diversity on online ensemble learning in the presence of concept drift', IEEE Trans. on Knowl. and Data Eng., 22(5), 730–742, (May 2010).
[8] Martin Scholz and Ralf Klinkenberg, 'Boosting classifiers for drifting concepts', Intelligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data Streams, (2006).
[9] Jan Tozicka, Michael Rovatsos, Michal Pechoucek, and Stepan Urban, 'MALEF: framework for distributed machine learning and data mining', Int. J. Intell. Inf. Database Syst., 2(1), 6–24, (February 2008).
[10] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han, 'Mining concept-drifting data streams using ensemble classifiers', in Proceedings of the ninth ACM SIGKDD, KDD '03, pp. 226–235, New York, NY, USA, (2003). ACM.
[11] Yi Zhang, Samuel Burer, and W. Nick Street, 'Ensemble pruning via semi-definite programming', Journal of Machine Learning Research, 7, 1315–1338, (2006).
[12] X. Zhu, 'Stream data mining repository', (2010).