Pruning AdaBoost for Continuous Sensors Mining Applications

M. Rastgoo, G. Lemaitre, X. Rafael Palou, F. Miralles and P. Casale 1

1 Barcelona Digital Technology Center, Barcelona, Spain, email: plcasale@bdigital.org

Abstract. In this work, pruning techniques for the AdaBoost classifier are evaluated, specifically aimed at a continuous learning framework in sensors mining applications. To assess the methods, three pruning schemes are evaluated using standard machine-learning benchmark datasets, simulated drifting datasets and real cases. Early results show that the pruning methodologies approach, and sometimes outperform, the non-pruned version of the classifier, while being more easily adaptable to drift in the training distribution. Future work is planned to evaluate the approach in terms of time efficiency and extension to big-data analysis.

1 Introduction

As the number of sensors deployed every day in the real world increases, mining these continuous data-streams becomes a crucial part of many applications. In recent years, data mining techniques have become very popular in sensors mining tasks, especially when related to learning from data streams [3, 10]. These techniques, built upon the machine learning framework, are designed to generate a predictive model from a well-sampled training dataset distribution. The model is then used to classify any future instance of data, without the possibility of being updated if the value distribution of the data-stream changes. In other words, the paradigm provided by the typical machine learning setting is not suitable for continuous mining of data streams [5]. The AdaBoost learning function [2] provides a suitable framework for mining continuous streams [8]. Being an incremental ensemble of classifiers, this learning function grows its knowledge simply by adding new classifiers to the previous models. Nevertheless, when many subsequent batches of data are provided, AdaBoost tends to create large ensembles that suffer from two main drawbacks: (i) increasing memory needed to store the decision model and (ii) over-fitting. Pruning techniques are well suited to reducing the size of the ensemble by selecting only specific models. The first attempt at pruning an AdaBoost classifier was made by Margineantu and Dietterich [6], who compared five different methods, namely (i) early stopping, (ii) KL divergence, (iii) Kappa statistics, (iv) Kappa-error convex hull and (v) Reduce Error with back-fitting. Hernández-Lobato et al. [4] used Genetic Algorithms to prune the AdaBoost ensemble, searching the space of all possible subsets of classifiers created by AdaBoost. Zhang et al. [11] defined pruning as a quadratic integer programming problem, with the aim of finding a fixed-size subset of k classifiers with minimum misclassification and maximum diversity. Nevertheless, these works are not suitable solutions for pruning AdaBoost in a continuous learning framework.

In this paper, experiments on pruning methods for continuous data-streams mining are performed. The AdaBoost algorithm is trained on subsequent batches of incoming data, followed by consecutive pruning steps. The advantage of this approach is twofold: (i) when new concepts are learned, pruning keeps the ensemble as small as possible in memory and (ii) pruning provides a first attempt to retain only the significant information acquired from previous knowledge. The remainder of this paper is organized as follows. In Section 2, the continuous learning framework, the AdaBoost algorithm and the pruning methods used are introduced and explained in detail. In Section 3, the validation protocols are described and, in Section 4, results are presented. Finally, Section 5 discusses the obtained results and concludes the paper.

[Figure 1. Continuous Learning Framework]

2 Pruning AdaBoost in Continuous Learning

In a continuous learning framework, as shown in Fig. 1, new knowledge is acquired only when the current model no longer fits the incoming data-stream distribution [9]. This decision is made by evaluating the current classifier with the function g as performance measure, obtaining a performance value e. When e is not good enough, the current model h_i is updated by training the learning function f on the new incoming data D_{i+1}. Incremental learning functions should be preferred: in this way, only the new incoming data are used to maintain the previously acquired knowledge, without having to store historical data. The AdaBoost algorithm is an incremental learning function able to properly meet these requirements. Nevertheless, the ensemble created by AdaBoost grows linearly as subsequent learning steps are performed. Here, the pruning function p keeps the model computationally optimal. The aim of this work is to compare different pruning functions p in terms of classifier performance. In the following subsections, AdaBoost and the pruning methods are presented and explained in detail.
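The decision logic of Fig. 1 can be condensed into a few lines. Below is a minimal Python sketch of the continuous learning loop; train, evaluate and prune stand for the functions f, g and p of the framework and are passed in as parameters, so all names here are illustrative rather than the authors' implementation.

```python
def continuous_learning(batches, Th, train, evaluate, prune):
    """Sketch of Fig. 1: update and prune the model only when it drifts.
    g -> evaluate(model, batch): error of the current model on new data
    f -> train(model, batch):    incremental update of the ensemble
    p -> prune(model, batch):    keep only the most significant learners"""
    model = None
    for batch in batches:                      # incoming batches D_1, D_2, ...
        if model is None or evaluate(model, batch) > Th:
            model = train(model, batch)        # learn the new concept
            model = prune(model, batch)        # keep the ensemble small
        yield model                            # current model h_i
```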
2.1 AdaBoost

AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm that obtains a high-performance classifier from a linear combination of weak learners. Algorithm 1 shows the pseudocode for AdaBoost. The algorithm takes as input a training set (x_k, y_k), where x_k is an N-dimensional feature vector and y_k is the class label. After T rounds of training, T weak classifiers h_t and T weights α_t are combined to assemble the final strong classifier. Higher weights α_t are assigned to the best weak classifiers h_t.

Algorithm 1 AdaBoost Algorithm
Input:
  - Training set of N samples (x_k, y_k), with k = 1, ..., N, x_k ∈ R^N, y_k ∈ Y = {−1, +1};
  - Weak learning algorithm WeakLearn;
  - Number of learning iterations T;
Initialize W_1(k) = 1/N, k = 1, ..., N;
for t = 1, ..., T do
  1. Train WeakLearn using distribution W_t and get weak hypothesis h_t;
  2. Compute the classification error ε_t = Pr_{k∼W_t}[h_t(x_k) ≠ y_k];
  3. Compute α_t = (1/2) ln((1 − ε_t) / ε_t);
  4. Update the distribution:
       W_{t+1}(k) = W_t(k) exp(−α_t y_k h_t(x_k)) / Z_t,
     where Z_t is a normalization factor chosen so that W_{t+1} is a proper distribution function;
end for
Output: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x));

Instantiations of AdaBoost may differ in the choice of the weak learning algorithm, defined as a learner performing slightly better than random guessing (> 50% correct classification). A variety of weak learners, e.g., neural networks or decision trees, can be used. Decision stumps are the most common weak classifiers used in AdaBoost. Decision stumps are one-level decision trees, equivalent to a threshold that best splits the data. Each stump learner is characterized by three parameters: (i) the n-th dimension of the feature set on which the classifier is applied, (ii) the decision level, i.e., the threshold splitting the data in the given n-th dimension, and (iii) the decision sign (−1 or +1) determining the direction of the inequality for the thresholding. For a given batch of data with a feature set of size n, at each iteration of AdaBoost the decision stump that minimizes the error ε_t in a single dimension of the training distribution is selected. The information provided by the final set of decision stumps selected by AdaBoost can therefore be used to mine which features of the data-stream are significant and, more importantly, where the best split in the data lies.
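For concreteness, the following is a minimal NumPy sketch of Algorithm 1 with decision stumps as weak learners, using the stump parametrization just described (dimension, threshold, sign). It is an illustrative re-implementation, not the code used in the paper; the exhaustive threshold search is the simplest possible WeakLearn.

```python
import numpy as np

def adaboost_stumps(X, y, T):
    """Train T decision stumps following Algorithm 1. X: (N, D) array, y in {-1, +1}."""
    N, D = X.shape
    W = np.full(N, 1.0 / N)                  # W_1(k) = 1/N
    ensemble = []                            # stumps as (dim, threshold, sign, alpha)
    for _ in range(T):
        best = None
        # WeakLearn: exhaustively pick the stump minimizing the weighted error
        for d in range(D):
            for thr in np.unique(X[:, d]):
                for sign in (-1, 1):
                    pred = np.where(X[:, d] > thr, sign, -sign)
                    err = W[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, d, thr, sign, pred)
        err, d, thr, sign, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))   # step 3
        W = W * np.exp(-alpha * y * pred)                     # step 4
        W = W / W.sum()                                       # normalize by Z_t
        ensemble.append((d, thr, sign, alpha))
    return ensemble

def predict(ensemble, X):
    """Strong classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(a * np.where(X[:, d] > thr, s, -s) for d, thr, s, a in ensemble)
    return np.sign(score)
```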
2.2 Pruning methods

Three different pruning methods have been used and compared, namely (i) Reduce Error, (ii) Learner Weights Analysis and (iii) Pareto Analysis. The Reduce Error algorithm was first used in [6]. Since the original implementation is not suitable for a continuous learning framework, an improved version is proposed in this work in order to speed up the process. Pruning has also been performed using the Learner Weights Analysis and Pareto Analysis methodologies, both able to extract a set of the most discriminative learners from the whole ensemble. To the best of our knowledge, these methodologies have not previously been applied to the task of pruning an AdaBoost ensemble.

2.2.1 Reduce Error (RE)

In this algorithm, a first step initializes the pruning distribution W_t and selects the weak classifier h_t from the ensemble H that minimizes the classification error ε_t on the W_t distribution. This classifier is added to the pruned ensemble P, a weight α_t is assigned to it, and the W_{t+1} distribution is updated as in the AdaBoost routine. Then, iteratively, each remaining classifier h_t is individually added to the ensemble P, and the classification error ε_t of this new ensemble is evaluated on the pruning set using the W_{t+1} distribution. The classifier h_t whose combination with P minimizes the classification error ε_t is definitively added to P, a weight α_t is assigned to it, and the W_{t+2} distribution is again updated as in the AdaBoost routine. The routine stops when the number of classifiers in the sub-ensemble P reaches a pre-specified size. The two main changes with respect to the original RE algorithm are the following. In the original version, a final back-fitting approach is performed after the selection of each weak classifier, while in our approach selection is done at each step. In addition, each weak classifier is added to the pruned ensemble P only after being re-weighted. This procedure ensures better classification results than the original RE formulation.
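A possible reading of this modified RE routine is sketched below, reusing the stump representation of the AdaBoost sketch above. The provisional weight given to each candidate while scoring the combined ensemble is our assumption, since the text does not specify it; the re-weighting of the selected classifier follows the AdaBoost routine as described.

```python
import numpy as np

def reduce_error_prune(ensemble, Xp, yp, size):
    """Greedy RE pruning on a pruning set (Xp, yp); stops at the requested size."""
    W = np.full(len(yp), 1.0 / len(yp))              # pruning distribution W_t
    remaining, P = list(ensemble), []
    while len(P) < size and remaining:
        # candidate whose addition to P minimizes the weighted ensemble error;
        # the provisional weight 1.0 for the candidate is our assumption
        def weighted_err(h):
            d, thr, s, _ = h
            score = sum(a * np.where(Xp[:, di] > t, si, -si)
                        for di, t, si, a in P + [(d, thr, s, 1.0)])
            return W[np.sign(score) != yp].sum()
        best = min(remaining, key=weighted_err)
        remaining.remove(best)
        d, thr, s, _ = best
        pred = np.where(Xp[:, d] > thr, s, -s)
        err = W[pred != yp].sum()
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))  # re-weight as in AdaBoost
        P.append((d, thr, s, alpha))
        W = W * np.exp(-alpha * yp * pred)                   # update W_{t+1}
        W = W / W.sum()
    return P
```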
2.2.2 Learner Weights Analysis (WA)

From the distribution of the weights α_t in the ensemble, weak learners are selected based on the following assumptions: (i) weak learners with a higher ensemble weight α_t are the best weak learners of the ensemble, and (ii) an ensemble is better the more diversified its classifiers are. The technique works as follows. AdaBoost is applied on the batch of data to obtain an ensemble of T classifiers. Then, a matrix M is built by grouping the ensemble weights α_t of each decision stump classifier according to their dimension parameter. M is of size n × D, where n is the number of elements for each of the D dimensions. In order to select the best classifiers, M is sorted first by row and subsequently by column, always in descending order. M is then transformed into a vector V by concatenating all its columns. Finally, the t classifiers corresponding to the first t weights of V, with t ≪ T, are selected. The value of t determines the pruning percentage.

2.2.3 Pareto Analysis (PA)

PA is based on the assumption that a few key actions produce most of the overall effect. Applied to ensemble learning, this implies that only a few key weak classifiers have a high impact on the overall performance of the ensemble. PA offers a statistical point of view for selecting these key classifiers. The technique estimates the effectiveness of each feature dimension and accordingly selects the classifiers from the feature dimensions with high impact; the effectiveness can be adjusted using a threshold. First, the features are grouped based on the total number of ensemble weights considered as outliers in each dimension. The outliers are found with reference to the first and third quartiles (Q1, Q3) and the inter-quartile range (IQR): values above Q3 + 1.5 × IQR are considered outliers in each case. The frequency distribution of these outliers is sorted in descending order and the cumulative distribution is computed. Then, the feature dimensions are selected based on a threshold level corresponding to the number of classifiers to keep: all dimensions with a cumulative percentage lower than the threshold (i.e., the desired percentage of the maximum cumulative value) are taken into account. From the selected feature dimensions, the maximum weights are used to single out the learners. The technique can be seen as a principal dimension selection, where the dimensions considered most important are retained.
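Since the paper describes PA only in prose, the following sketch is a speculative reading of the procedure: alphas are grouped per dimension, outliers are counted with the Q3 + 1.5 × IQR rule, dimensions are kept up to a cumulative-percentage threshold, and the top-weighted stumps of those dimensions are retained. The function name, the default threshold and the exact cut semantics are all assumptions.

```python
import numpy as np

def pareto_select(ensemble, keep, threshold=0.8):
    """Speculative PA pruning: keep stumps from the few 'key' dimensions."""
    alphas = {}                                   # dimension -> list of alpha_t
    for d, thr, s, a in ensemble:
        alphas.setdefault(d, []).append(a)
    # per dimension, count alpha values above Q3 + 1.5 * IQR (outliers)
    counts = {}
    for d, a in alphas.items():
        q1, q3 = np.percentile(a, [25, 75])
        counts[d] = sum(x > q3 + 1.5 * (q3 - q1) for x in a)
    # Pareto cut: sort dimensions by outlier frequency, keep those whose
    # cumulative share stays below the threshold percentage
    dims = sorted(counts, key=counts.get, reverse=True)
    total = sum(counts.values()) or 1
    cum, selected = 0, []
    for d in dims:
        if cum / total >= threshold:
            break
        selected.append(d)
        cum += counts[d]
    # within the selected dimensions, retain the stumps with maximum weight
    candidates = [h for h in ensemble if h[0] in selected]
    return sorted(candidates, key=lambda h: h[3], reverse=True)[:keep]
```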
3 Validation Protocol

Three types of experiments have been performed in order to validate the effectiveness of the pruning methods on both static and drifting distributions. A cross-validation approach has been used to validate the methods. At each step of the cross-validation process, the dataset is randomly divided into three subsets: a training set (50%), a pruning set (40%) and a testing set (10%). In the following sections, the validation protocol adopted for each type of experiment is described. Under the model described in Fig. 1, a suitable threshold Th has been chosen so that the model is always trained on the new incoming data.

3.1 UCI Datasets Repository

Five datasets from the UCI repository [1] have been used to evaluate the effectiveness of the pruning methods. In this validation step, the KL divergence method, as originally proposed in [6], has been added in order to have a baseline comparison. The datasets considered are Australian, Breast, Diabetes, Heart and Pima. The mean number of instances in the datasets is around 700, except for Heart, which has 270 instances. The aim of the experiment is to analyse the results of pruning an initial ensemble at 90%. The average error rate for each technique was computed using a modified version of ten-fold cross-validation able to include the pruning sets in the evaluation process, with the percentages outlined above. The AdaBoost algorithm was used to create an ensemble of one hundred weak classifiers; each pruning method was then applied to create a pruned sub-ensemble containing only ten classifiers.

3.2 Simulated Drifting Datasets

The second set of experiments focused on testing the pruning methods in a continuous learning framework. These experiments were performed using three sets of simulated data-streams that include drifting. The datasets are generated using the software provided by [7]. Figure 2 shows the three different settings for each experiment. Four linear drifts have been considered for the first dataset, and three circular drifts have been created for the remaining two datasets. The ensemble was incrementally grown using all the drifted distributions. The experiments performed using the simulated datasets are described in the following.

[Figure 2. Artificial data with drifts: (a) Linear Drift, (b) Circular Drift, (c) Circular Narrow Drift]

Exp. 1: In the first experiment, it is assumed that the data distribution is subject to change due to different drifts, and the ensemble is incrementally grown over the drifted batches of incoming data with the main aim of classifying the current batch of information. After training, pruning and testing are applied on different samplings of the same drifted batch. The experiment is repeated five times following a 5-fold cross-validation paradigm.

Exp. 2: The aim of the second experiment is to evaluate the potential of pruning in classifying both previous and current information. With the training kept as in the previous experiment, at each step i the ensemble is pruned and tested on the pruning and testing sets of the joint distribution C_0 ∪ ... ∪ C_i. The experiment has been performed on five different runs, following a 5-fold cross-validation paradigm.

3.3 Real World Datasets

Three real-world datasets have been used to evaluate the proposed methodology in a real-world scenario. The datasets considered are described in the following.

• The Sensor Stream (SS) dataset [12] contains sensor information (temperature, humidity, light and sensor voltage) collected from fifty-four sensors deployed at the Intel Berkeley Research Lab. The whole stream contains two months of consecutive information (2,219,803 instances). The experiment aims to infer the illuminance state based on the measurements provided by each sensor: illuminance higher than 200 lux is labeled as class +1, otherwise as class −1. Every fifteen days a new batch of data is collected, which leads to three drifts, reflecting changes in the lab environment due to weather, humidity and office work. The experiment was performed using a 4-fold cross-validation paradigm.

• Power Supply (PS) [12] is the second dataset used. It contains the hourly power supply consumption of the Italian electricity company. The stream contains three years of power supply records, from 1995 to 1998 (29,928 instances). The experiment aims to predict the day state, morning (+1) or night (−1), based on the raw consumption value. The drift in this stream is mainly driven by features such as the season, the weather, the hour of the day and the day of the week. The data were split into three batches, representing one drift per year. The experiment was performed using a 3-fold cross-validation paradigm.

• Elec 2 (E2) is the third dataset used. This dataset, containing 27,549 instances, is composed of seven drifts, each representing a day of the week. The drifts are due to changes in power consumption over the weekdays. The experiment was performed using a 7-fold cross-validation paradigm.

As in Exp. 2 on simulated data, AdaBoost is trained for each drift on the training set of the current data. The pruning function is applied on a pruning set that contains samples of both previous and new batches of data.
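Putting the pieces together, one cross-validation step of this protocol might look as follows. This is illustrative only: adaboost_stumps, reduce_error_prune and predict are the hypothetical helpers sketched in Section 2, while the 50/40/10 split and the 90% pruning follow the text above.

```python
import numpy as np

def evaluate_fold(X, y, T=100, prune_frac=0.10, seed=0):
    """One cross-validation step: 50% train / 40% prune / 10% test (Section 3)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr, n_pr = int(0.5 * len(y)), int(0.4 * len(y))
    tr, pr, te = idx[:n_tr], idx[n_tr:n_tr + n_pr], idx[n_tr + n_pr:]

    ensemble = adaboost_stumps(X[tr], y[tr], T)              # train on 50%
    pruned = reduce_error_prune(ensemble, X[pr], y[pr],      # prune at 90%
                                size=int(prune_frac * T))
    err_full = np.mean(predict(ensemble, X[te]) != y[te])    # test on 10%
    err_pruned = np.mean(predict(pruned, X[te]) != y[te])
    return err_full, err_pruned
```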
4 Experimental Results

In this section, the results obtained in the experiments described in the previous section are reported. The misclassification error has been chosen as performance measure. In particular, the pruning methods have been evaluated using the relative error ε_rel with respect to the error of the non-pruned version of AdaBoost, computed as shown in Eq. 1. Hence, methods with negative relative errors perform better than the reference model.

    ε_rel = −1 · (ε_no-pruned − ε_pruned) / ε_no-pruned        (1)

4.1 UCI Datasets Repository

In Fig. 4, the results obtained on the five UCI datasets are reported. RE is the method that performs better than the others, beating the reference on the Aus, Dia and Hea datasets and being slightly worse than the reference on Pim. Similar behavior is obtained by WA. Pruning always performs badly on Bre, where the best result is provided by PA.

[Figure 4. Performance of pruning methods on UCI datasets with 90% pruning]

4.2 Simulated Drifting Datasets

Results obtained on the simulated drifting datasets with Exp. 1 are reported in Fig. 3. RE is the best pruning method for the linear and circular drifting datasets, as the previous experiments suggest. In both linear and circular drifting, WA performs better than PA. None of the pruning methods works better than the non-pruned version for high percentages of pruning. Nevertheless, WA works slightly better than non-pruned AdaBoost when the percentage of pruning is around 50%. As might be expected, the performance of the pruned ensemble generally gets worse as the percentage of pruning increases. Nevertheless, RE is able to keep its performance constant over the pruning percentage on the circular dataset and almost constant on the narrow circular dataset.

[Figure 3. Results obtained on simulated drifting datasets for Exp. 1: (a) Linear Drift, (b) Circular Drift, (c) Circular Narrow Drift]

For Exp. 2, the results obtained on the simulated drifting datasets are reported in Fig. 6. In this setting, all the pruned ensembles behave better than their corresponding non-pruned classifiers. As all the previous experiments suggest, RE is the best method, followed by WA. Also in this case, although the performance of the methods decreases as the percentage of pruning increases, RE remains almost constant regardless of the percentage. It should also be noted that the AdaBoost performance in this experiment is rather poor, reaching a global error of up to 40%; the pruning methods improve on this performance, reaching an error of 25%.

[Figure 6. Results obtained on simulated drifting datasets for Exp. 2: (a) Linear Drift, (b) Circular Drift, (c) Circular Narrow Drift]

4.3 Real World Datasets

Results obtained on the real-world datasets are shown in Fig. 5, and results obtained on the PS dataset are shown in Fig. 7. RE is confirmed to be the best pruning method, followed by WA. For the SS and E2 datasets, WA and PA provide the same performance. It should be noted that RE performs better than the non-pruned version in all the experiments.

[Figure 5. Performance of pruning methods on real world datasets]

[Figure 7. Performance of pruning methods on the PS dataset]
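To make the reading of Eq. 1 used throughout the figures above concrete, a small helper with invented numbers: an error drop from 0.20 to 0.15 yields ε_rel = −0.25 (pruning helps), while an increase to 0.25 yields +0.25 (pruning hurts).

```python
def relative_error(err_no_pruned, err_pruned):
    """Eq. 1: negative values mean the pruned ensemble beats the reference."""
    return -1.0 * (err_no_pruned - err_pruned) / err_no_pruned

print(relative_error(0.20, 0.15))   # -0.25: pruned ensemble 25% better
print(relative_error(0.20, 0.25))   # +0.25: pruned ensemble 25% worse
```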
5 Discussions and Conclusions

In this work, experiments have been carried out to evaluate the potential of different pruning methods and their performance in the framework of continuous learning. The Reduce Error method is the most consistent method, followed by Learner Weights Analysis. The use of Pareto Analysis does not seem to be justified by the experiments. Nevertheless, one important characteristic of this method is its capability of automatically defining the number of classifiers in the pruned ensemble: PA may be automated by thresholding the performance, and early results show that this automatic version performs better than the original method in most cases. The experiments on simulated datasets in Exp. 1 show that the pruning methods are more effective on widely drifted distributions than on narrowly drifted ones. Due to the nature of the narrow circular dataset, the drift stages share a larger common area and, since in this experiment the current stage has more effect on pruning than the previous stages, the pruning performance is slightly lower. At the same time, Exp. 2 shows that the pruning methods perform better than the original classifier when the whole drifting distribution is presented: based on Fig. 6, pruning the ensemble throughout the incremental learning definitely improves the final results. Finally, the results obtained in the experiments on real datasets show that pruning through the continuous learning process provides results very close to, or better than, those of AdaBoost.

As future work, an evaluation of the efficiency of the methods in terms of computational complexity will be considered, since this parameter is of great importance in a continuous learning framework. For this main motivation, the Reduce Error method was modified in our research so as to be conceptually capable of running under time-efficiency guidelines, and methods based on genetic algorithms and semi-definite programming were not used for comparison. Finally, a study of the extension of the proposed methods towards a big-data approach is planned. This research shows that pruning by selecting weak classifiers from different pools of sub-sampled data may improve the final ensemble in terms of accuracy, diversity and ability to adapt to drift. The procedures employed in this work can easily be adapted to large datasets and to continuous learning environments with a high quantity of incoming data.

6 Acknowledgments

This work is supported by the Information and Communication Technologies Collaborative Project action BrainAble within the Seventh Framework Programme of the European Commission, project number ICT-2010-247447.

REFERENCES

[1] A. Asuncion and D.J. Newman, 'UCI machine learning repository', (2007).
[2] Y. Freund and R. Schapire, 'A short introduction to boosting', J. Japan. Soc. for Artif. Intel., 14(5), 771–780, (1999).
[3] J. Gama and M. Gaber (Eds.), Learning from Data Streams – Processing Techniques in Sensor Networks, Springer, (2007).
[4] Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Rubén Ruiz-Torrubiano, and Ángel Valle, 'Pruning adaptive boosting ensembles by means of a genetic algorithm', in IDEAL, pp. 322–329, (2006).
[5] Ludmila I. Kuncheva, 'Classifier ensembles for changing environments', in Multiple Classifier Systems, pp. 1–15, Springer, (2004).
[6] Dragos D. Margineantu and Thomas G. Dietterich, 'Pruning adaptive boosting', (1997).
[7] Leandro L. Minku, Allan P. White, and Xin Yao, 'The impact of diversity on online ensemble learning in the presence of concept drift', IEEE Trans. on Knowl. and Data Eng., 22(5), 730–742, (May 2010).
[8] Martin Scholz and Ralf Klinkenberg, 'Boosting classifiers for drifting concepts', Intelligent Data Analysis (IDA), Special Issue on Knowledge Discovery from Data Streams, (2006).
[9] Jan Tozicka, Michael Rovatsos, Michal Pechoucek, and Stepan Urban, 'MALEF: framework for distributed machine learning and data mining', Int. J. Intell. Inf. Database Syst., 2(1), 6–24, (February 2008).
[10] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han, 'Mining concept-drifting data streams using ensemble classifiers', in Proceedings of the ninth ACM SIGKDD, KDD '03, pp. 226–235, New York, NY, USA, (2003). ACM.
[11] Yi Zhang, Samuel Burer, and W. Nick Street, 'Ensemble pruning via semi-definite programming', Journal of Machine Learning Research, 7, 1315–1338, (2006).
[12] X. Zhu, 'Stream data mining repository', (2010).