=Paper=
{{Paper
|id=Vol-1998/paper_03
|storemode=property
|title=autoBagging: Learning to Rank Bagging Workflows with Metalearning
|pdfUrl=https://ceur-ws.org/Vol-1998/paper_03.pdf
|volume=Vol-1998
|authors=Fábio Pinto,Vitor Cerqueira,Carlos Soares,João Mendes-Moreira
|dblpUrl=https://dblp.org/rec/conf/pkdd/PintoCSM17
}}
==autoBagging: Learning to Rank Bagging Workflows with Metalearning==
Fábio Pinto, Vítor Cerqueira, Carlos Soares and João Mendes-Moreira

INESC TEC / Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal

fhpinto@inesctec.pt, vmac@inesctec.pt, csoares@fe.up.pt, jmoreira@fe.up.pt

Abstract. Machine Learning (ML) has been successfully applied to a wide range of domains and applications. One of the techniques behind most of these successful applications is Ensemble Learning (EL), the field of ML that gave birth to methods such as Random Forests or Boosting. The complexity of applying these techniques, together with the market scarcity of ML experts, has created the need for systems that enable a fast and easy drop-in replacement for ML libraries. Automated machine learning (autoML) is the field of ML that attempts to answer these needs. We propose autoBagging, an autoML system that automatically ranks 63 bagging workflows by exploiting past performance and metalearning. Results on 140 classification datasets from the OpenML platform show that autoBagging can yield better performance than the Average Rank method and achieve results that are not statistically different from an ideal model that systematically selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package on CRAN.

Keywords: automated machine learning, metalearning, bagging, classification

1 Introduction

Ensemble learning (EL) has proven itself one of the most powerful techniques in Machine Learning (ML), leading to state-of-the-art results across several domains. Methods such as bagging, boosting or Random Forests are considered some of the favourite algorithms among data science practitioners. However, getting the most out of these techniques still requires significant expertise, and it is often a complex and time-consuming task.
Furthermore, since the number of ML applications is growing exponentially, there is a need for tools that boost the data scientist's productivity. The research field that aims to answer these needs is Automated Machine Learning (autoML). In this paper, we address the problem of how to automatically tune an EL algorithm, covering all of its components: generation (how to generate the models and how many), pruning (which technique should be used to prune the ensemble and how many models should be discarded) and integration (which model(s) should be selected and combined for each prediction). We focus specifically on the bagging algorithm [2] and on four components of the algorithm: 1) the number of models that should be generated, 2) the pruning method, 3) how many models should be pruned and 4) which dynamic integration method should be used. For the remainder of this paper, we call a set of these four elements a bagging workflow.

Our proposal is autoBagging, a system that combines a learning to rank approach with metalearning to tackle the problem of automatically generating bagging workflows. Ranking is a common task in information retrieval. For instance, to answer a user's query, a search engine ranks a plethora of documents according to their relevance. In our case, the query is replaced by a new dataset and autoBagging acts as the ranking engine. Figure 1 shows an overall schema of the proposed system. We leverage the historical predictive performance of each workflow on several datasets, where each dataset is characterised by a set of metafeatures. This metadata is then used to generate a metamodel, using a learning to rank approach. Given a new dataset, we collect its metafeatures and feed them to the metamodel. Finally, the metamodel outputs an ordered list of the workflows, taking into account the characteristics of the new dataset.
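The ranking-engine analogy can be sketched as follows. This is a minimal illustration, not autoBagging's actual implementation: `extract_metafeatures` and `metamodel_score` are hypothetical stand-ins for the system's 143 metafeatures and learned metamodel.

```python
# Sketch of the online phase: score each candidate bagging workflow for a
# new dataset with a (here, toy) metamodel and return them ranked.

def extract_metafeatures(dataset):
    # Hypothetical stand-in: autoBagging extracts 143 metafeatures.
    return {"n_rows": len(dataset), "n_cols": len(dataset[0])}

def metamodel_score(meta, workflow):
    # Hypothetical stand-in for the learned ranking model f(x).
    return meta["n_rows"] % 7 + workflow["n_trees"] / 100.0

def rank_workflows(dataset, workflows):
    meta = extract_metafeatures(dataset)
    return sorted(workflows, key=lambda w: metamodel_score(meta, w), reverse=True)

workflows = [{"n_trees": t} for t in (50, 100, 200)]
toy_dataset = [[0, 1], [1, 0], [1, 1]]
ranking = rank_workflows(toy_dataset, workflows)
```

The key design point carried over from information retrieval is that the "query" (the new dataset) is never matched against workflows directly; it is first reduced to a metafeature vector, and the metamodel scores each query-workflow pair.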
Fig. 1. Learning to Rank with Metalearning. The red lines represent offline tasks and the green ones represent online tasks.

We tested the approach on 140 classification datasets from the OpenML platform for collaborative ML [12] and 63 bagging workflows, which include two pruning techniques and two dynamic selection techniques. Results show that autoBagging performs better than two strong baselines: Bagging with 100 trees and the average rank method. Furthermore, testing the top 5 workflows recommended by autoBagging guarantees an outcome that is not statistically different from the Oracle, an ideal method that always selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is available as an R package (https://github.com/fhpinto/autoBagging).

2 autoBagging

We approach the problem of algorithm selection as a learning to rank problem [7]. Let D be the dataset set and A the algorithm set. Y = {1, 2, ..., l} is the label set, where each value represents a relevance score reflecting the relative performance of a given algorithm; therefore l ≺ l−1 ≺ ... ≺ 1, where ≺ represents an order relationship. Furthermore, D_m = {d_1, d_2, ..., d_m} is the set of datasets for training, where d_i is the i-th dataset; A_i = {a_{i,1}, a_{i,2}, ..., a_{i,n_i}} is the set of algorithms associated with dataset d_i; and y_i = {y_{i,1}, y_{i,2}, ..., y_{i,n_i}} is the set of labels associated with dataset d_i, where n_i represents the size of A_i and y_i, a_{i,j} represents the j-th algorithm in A_i, and y_{i,j} ∈ Y represents the j-th label in y_i, i.e. the relevance score of a_{i,j} with respect to d_i. Finally, the meta-dataset is denoted as S = {((d_i, A_i), y_i)}_{i=1}^{m}.
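A small sketch of how the meta-dataset S might be assembled. This is illustrative: the kappa scores are made up, and the labeling assumes higher labels mean more relevant (the direction of ≺ is a modeling choice), with the best workflow receiving label l and the worst label 1.

```python
# Sketch of building S = {((d_i, A_i), y_i)}: for each training dataset,
# workflows receive relevance labels derived from estimated performance
# (better performance -> higher label, under the assumption above).

def relevance_labels(scores):
    # Rank scores descending; best gets label l = len(scores), worst gets 1.
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    labels = [0] * len(scores)
    for rank, j in enumerate(order):
        labels[j] = len(scores) - rank
    return labels

kappa = [0.42, 0.61, 0.55]        # made-up performance of 3 workflows on d_1
y = relevance_labels(kappa)
S = [(("dataset_1", ["wf_a", "wf_b", "wf_c"]), y)]
```

Each element of `S` pairs a dataset with its candidate workflows and their per-dataset relevance labels, which is the structure a learning-to-rank algorithm consumes.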
We use metalearning to generate the metafeature vectors x_{i,j} = φ(d_i, a_{i,j}) for each dataset-algorithm pair, where i = 1, 2, ..., m; j = 1, 2, ..., n_i; and φ represents the metafeature extraction functions. These metafeatures can describe d_i, a_{i,j} or even the relationship between the two. Therefore, taking x_i = {x_{i,1}, x_{i,2}, ..., x_{i,n_i}}, we can represent the meta-dataset as S′ = {(x_i, y_i)}_{i=1}^{m}. Our goal is to train a meta ranking model f(d, a) = f(x) that is able to assign a relevance score to a new dataset-algorithm pair d and a, given x.

2.1 Metafeatures

We approach the problem of generating metafeatures to characterize d and a with the aid of a framework for systematic metafeature generation [10]. Essentially, this framework regards a metafeature as a combination of three components: a meta-function, a set of input objects and a post-processing function. The framework establishes how to systematically generate metafeatures from all possible combinations of object and post-processing alternatives that are compatible with a given meta-function. Thus, developing metafeatures for a metalearning approach simply consists of selecting a set of meta-functions (e.g. entropy, mutual information and correlation), and the framework systematically generates the set of metafeatures that represent all the information that can be obtained with those meta-functions from the data. For this task in particular, we selected a set of meta-functions able to characterize, as completely as possible, the datasets (measuring information regarding the target variable, the categorical and numerical features, etc.), the algorithms, and the relationship between the datasets and the algorithms (which can be seen as landmarkers [9]). Therefore, the set of meta-functions used is: skewness, Pearson's correlation, Maximal Information Coefficient (MIC [11]), entropy, mutual information, eta squared (from ANOVA test) and the rank of each algorithm [1].
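The systematic generation idea can be sketched as follows: one meta-function (entropy, taken from the list above) is applied to every compatible input object. The object names and column layout are illustrative assumptions, not autoBagging's internals.

```python
# Sketch of the framework's first component pairing: a meta-function
# (entropy) applied to each compatible input object of a toy dataset.
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy of a discrete column, in bits.
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# Compatible input objects for entropy: discrete descriptive data,
# discrete output data, and discrete predicted data (landmarker output).
objects = {
    "discrete_attr_1":  ["a", "a", "b", "b"],
    "target":           [0, 1, 0, 1],
    "naive_bayes_pred": [0, 0, 0, 1],
}
metafeatures = {f"entropy({name})": entropy(col) for name, col in objects.items()}
```

Repeating this pairing for every meta-function in the list yields the raw values that the post-processing step (described next in the paper) condenses into fixed-size metafeatures.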
Each meta-function is used to systematically measure information from all possible combinations of input objects available for this task. We defined the available input objects as: discrete descriptive data of the datasets, continuous descriptive data of the datasets, discrete output data of the datasets, and five sets of predictions (discrete predicted data) for each dataset (naive Bayes; decision trees with depth 1, 2 and 3; and majority class). For instance, taking entropy as the meta-function, it is possible to measure information in discrete descriptive data, discrete output data and discrete predicted data (if the base-level problem is a classification task). After computing the entropy of all these objects, it might be necessary to aggregate the information in order to keep the tabular form of the data. Take, for example, the aggregation required for the entropy values computed for each discrete attribute. Therefore, we chose a palette of aggregation functions to capture several dimensions of these values and minimize the loss of information caused by aggregation. In that sense, the post-processing functions chosen were: average, maximum, minimum, standard deviation, variance and histogram binning.

Given these meta-functions, the available input objects and the post-processing functions, we are able to generate a set of 131 metafeatures. To this set we add eight metafeatures: the number of examples of the dataset, the number of attributes and the number of classes of the target variable, plus five landmarkers (the ones described above) estimated using accuracy as the error measure. Furthermore, we add four metafeatures to describe the components of each workflow: the number of trees, the pruning method, the pruning cut point and the dynamic selection method. In total, autoBagging uses a set of 143 metafeatures.
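The aggregation step can be sketched with made-up per-attribute entropy values; the point is that a variable-length list of values collapses into a fixed set of summary metafeatures so the meta-dataset stays tabular.

```python
# Sketch of the post-processing step: per-attribute values (e.g. one
# entropy value per discrete attribute) are aggregated into fixed-size
# metafeatures. The input values are made up for illustration, and
# histogram binning is omitted for brevity.
from statistics import mean, stdev, variance

per_attribute_entropy = [0.81, 1.0, 0.55, 0.92]

post_processed = {
    "avg": mean(per_attribute_entropy),
    "max": max(per_attribute_entropy),
    "min": min(per_attribute_entropy),
    "sd":  stdev(per_attribute_entropy),
    "var": variance(per_attribute_entropy),
}

# Metafeature count reported in the paper: 131 systematically generated,
# plus 8 dataset/landmarker metafeatures and 4 workflow descriptors.
total_metafeatures = 131 + 8 + 4
```

Note that datasets with different numbers of attributes would otherwise yield metafeature vectors of different lengths, which is exactly what the aggregation avoids.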
2.2 Metatarget

In order to learn a ranking meta-model f(d, a), we need to compute a metatarget that assigns a score z to each dataset-algorithm pair (d, a), so that F : (D, A) → Z, where F is the set of ranking meta-models and Z is the metatarget set. To compute z, we use a cross-validation error estimation methodology (4-fold cross-validation in the experiments reported in this paper, Section 3), in which we estimate the performance of each bagging workflow on each dataset using Cohen's kappa score [4]. On top of the estimated kappa scores, for each dataset, we rank the bagging workflows. This ranking is the final form of the metatarget, and it is then used for learning the meta-model.

3 Experiments

Our experimental setup comprises 140 classification datasets extracted from the OpenML platform for collaborative machine learning [12]. We limited the extracted datasets to a maximum of 5000 instances, a minimum of 300 instances and a maximum of 1000 attributes, in order to speed up the experiments and exclude datasets that could be too small for some of the bagging workflows that we wanted to test. Regarding bagging workflows, we limited their hyperparameters to four: number of models generated, pruning method, pruning cut point and dynamic selection method. Specifically, each hyperparameter could take the following values:

– Number of models: 50, 100 or 200. Decision trees were chosen as the learning algorithm.
– Pruning method: Margin Distance Minimization (MDSQ) [8], Boosting-Based Pruning (BB) [8] or none.
– Pruning cut point: 25%, 50% or 75%.
– Dynamic integration method: Overall Local Accuracy (OLA), a dynamic selection method [13]; K-nearest-oracles-eliminate (KNORA-E) [6], a dynamic combination method; or none.

The combination of all the hyperparameters described above generated 63 valid workflows.
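The workflow count above can be reproduced by enumerating the grid. The paper does not spell out which combinations are invalid; a natural reading, sketched here, is that the cut point does not apply when the pruning method is "none", which gives 3 × (2×3 + 1) × 3 = 63.

```python
# Sketch enumerating the bagging-workflow space described above. The
# assumption (not stated explicitly in the paper) is that the pruning cut
# point is not applicable when the pruning method is "none".
from itertools import product

n_models    = [50, 100, 200]
pruning     = ["MDSQ", "BB", "none"]
cut_points  = [0.25, 0.50, 0.75]
integration = ["OLA", "KNORA-E", "none"]

workflows = []
for m, p, i in product(n_models, pruning, integration):
    if p == "none":
        workflows.append((m, p, None, i))       # cut point irrelevant
    else:
        workflows.extend((m, p, c, i) for c in cut_points)
```

A full Cartesian product would give 81 combinations; collapsing the cut point for unpruned ensembles recovers the 63 valid workflows reported.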
We tested these bagging workflows on the datasets extracted from OpenML with 4-fold cross-validation, using Cohen's kappa as the evaluation metric. We used the XGBoost learning to rank implementation for gradient boosting of decision trees [3] to learn the metamodel, as described in Section 2. As baselines at the base-level, we use: 1) bagging with 100 decision trees; 2) the average rank method, which is a model that always predicts the bagging workflow with the best average rank in the meta training set; and 3) the oracle, an ideal model that always selects the best bagging workflow for each dataset. At the meta-level, we use the average rank method as the baseline.

As evaluation methodology, we use an approach similar to leave-one-out. However, each test fold consists of all the algorithm-dataset pairs associated with the test dataset; the remaining examples are used for training. The evaluation metric at the meta-level is the Mean Average Precision at 10 (MAP@10) and, at the base-level, as mentioned before, we use Cohen's kappa. The methodology recommended by Demšar [5] was used for statistical validation of the results.

3.1 Results

Figure 2 shows a loss curve, relating the average loss in performance to the number of workflows tested following the ranking suggested by each method. The loss is calculated as the difference between the performance of the best algorithm ranked by the method and that of the ground truth ranking. The losses over all datasets are then averaged for aggregation purposes. We can see, as expected, that the average loss decreases for both methods as the number of workflows tested increases. Comparing autoBagging with the Average Rank method, autoBagging shows superior performance for all values of the x axis. Interestingly, this result is particularly noticeable in the first tests.
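For reference, the meta-level metric can be sketched as below. MAP has a few variant definitions; this sketch uses one common form, where precision is accumulated at each rank position (up to 10) that holds a relevant workflow. The rankings and relevance sets are made up for illustration.

```python
# Sketch of Mean Average Precision at 10 (MAP@10), under one common
# definition; the paper does not specify which variant it uses.
def average_precision_at_k(ranked, relevant, k=10):
    hits, score = 0, 0.0
    for pos, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / pos        # precision at this hit position
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(rankings, relevants, k=10):
    # Average the per-dataset AP@k over all test datasets.
    return sum(average_precision_at_k(r, rel, k)
               for r, rel in zip(rankings, relevants)) / len(rankings)

# Toy example: two test datasets, each with one truly relevant workflow.
rankings  = [["wf1", "wf2", "wf3"], ["wf2", "wf1", "wf3"]]
relevants = [{"wf1"}, {"wf3"}]
```

In the first toy ranking the relevant workflow sits at position 1 (AP = 1), in the second at position 3 (AP = 1/3), so MAP@10 = 2/3.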
For instance, if we test only the top 1 workflow recommended by autoBagging, on average, the kappa loss is half of the one we should expect from the suggestion made by the average rank method.

Fig. 2. Loss curve comparing autoBagging with the Average Rank baseline.

We evaluated these results to assess their statistical significance using Demšar's methodology. Figures 3 and 4 show the Critical Difference (CD) diagrams for both the meta and the base-level.

Fig. 3. Critical Difference diagram (with α = 0.05) of the experiments at the meta-level. Fig. 4. Critical Difference diagram (with α = 0.05) of the experiments at the base-level.

At the meta-level, using MAP@10 as the evaluation metric, autoBagging presents clearly superior performance in comparison with the Average Rank. The difference is statistically significant, as one can see in the CD diagram. This result is in accordance with the performance visualized in Figure 2 for both methods.

At the base-level, we compared autoBagging with the three baselines mentioned before: bagging with 100 decision trees, the Average Rank method and the oracle. We test three versions of autoBagging, taking the top 1, 3 and 5 bagging workflows ranked by the meta-model. For instance, in autoBagging@3, we test the top 3 bagging workflows ranked by the meta-model and choose the best. Starting at the tail of the CD diagram, both the Average Rank method and autoBagging@1 show superior performance to Bagging with 100 decision trees. Furthermore, autoBagging@1 also shows superior performance to the Average Rank method. This result confirms the indications visualized in Figure 2. The CD diagram also shows that autoBagging@3 and autoBagging@5 have similar performance.
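The loss-curve construction described for Figure 2 can be sketched as follows; the kappa scores and the recommended ranking are made up for illustration.

```python
# Sketch of the loss curve: for each dataset, the loss after testing the
# top-k recommended workflows is the gap between the best achievable kappa
# and the best kappa among those k. Per-dataset losses are then averaged
# across datasets (here, a single toy dataset).
def loss_at_k(recommended, true_kappa, k):
    best_possible = max(true_kappa.values())
    best_tested = max(true_kappa[w] for w in recommended[:k])
    return best_possible - best_tested

true_kappa  = {"wf_a": 0.70, "wf_b": 0.64, "wf_c": 0.58}   # made up
recommended = ["wf_b", "wf_a", "wf_c"]   # hypothetical recommended ranking

curve = [loss_at_k(recommended, true_kappa, k) for k in (1, 2, 3)]
```

By construction the curve is non-increasing in k, which matches the behaviour observed in Figure 2: testing more of the recommended workflows can only close the gap to the best workflow, never widen it.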
However, and we must highlight this result, autoBagging@5 shows a performance that is not statistically different from the oracle. This is extremely promising, since it shows that the performance of autoBagging excels if the user is able to test the top 5 bagging workflows ranked by the system.

4 Conclusion

This paper presents autoBagging, an autoML system that makes use of a learning to rank approach and metalearning to automatically suggest a bagging ensemble specifically designed for a given dataset. We tested the approach on 140 classification datasets, and the results show that autoBagging is clearly better than the baselines to which it was compared. In fact, if the top five workflows suggested by autoBagging are tested, results show that the system achieves a performance that is not statistically different from the oracle, a method that systematically selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package.

Acknowledgements

This work was partly funded by the ECSEL Joint Undertaking, the framework programme for research and innovation Horizon 2020 (2014-2020), under grant agreement number 662189-MANTIS-2014-1.

References

1. Brazdil, P., Carrier, C.G., Soares, C., Vilalta, R.: Metalearning: Applications to data mining. Springer Science & Business Media (2008)
2. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
3. Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: KDD '16, pp. 785–794. ACM (2016)
4. Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960)
5. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. JMLR 7(Jan), 1–30 (2006)
6. Ko, A.H., Sabourin, R., Britto Jr, A.S.: From dynamic classifier selection to dynamic ensemble selection. Pattern Recognition 41(5), 1718–1731 (2008)
7. Liu, T.Y.: Learning to rank for information retrieval.
Foundations and Trends® in Information Retrieval 3(3), 225–331 (2009)
8. Martínez-Muñoz, G., Hernández-Lobato, D., Suárez, A.: An analysis of ensemble pruning techniques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2), 245–259 (2009)
9. Pfahringer, B., Bensusan, H., Giraud-Carrier, C.: Tell me who can learn you and I can tell you who you are: Landmarking various learning algorithms. In: ICML. pp. 743–750 (2000)
10. Pinto, F., Soares, C., Mendes-Moreira, J.: Towards automatic generation of metafeatures. In: PAKDD. pp. 215–226. Springer (2016)
11. Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., Sabeti, P.C.: Detecting novel associations in large data sets. Science 334(6062), 1518–1524 (2011)
12. Vanschoren, J., Van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)
13. Woods, K., Kegelmeyer Jr, W.P., Bowyer, K.: Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 405–410 (1997)