<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Feature and Algorithm Hyperparameter Selection using some Metalearning Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miguel Viana Cachada</string-name>
          <email>mcachada@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salisu Mamman Abdulrahman</string-name>
          <email>salisu.abdul@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Brazdil</string-name>
          <email>pbrazdil@inesctec.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIAAD - INESC TEC/Faculdade de Ciências da Universidade do Porto and Kano University of Science and Technology Wudil</institution>
          ,
          <addr-line>Kano State</addr-line>
          ,
          <country country="NG">Nigeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIAAD - INESC TEC/Faculdade de Economia, Universidade do Porto</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Machine learning users need methods that can help them identify algorithms or even workflows (combinations of algorithms with preprocessing tasks, possibly using hyperparameter configurations that are different from the defaults) that achieve the potentially best performance. Our study was oriented towards average ranking (AR), an algorithm selection method that exploits meta-data obtained on prior datasets. We focused on extending the use of a variant, AR*, that takes A3R as the relevant metric (combining accuracy and run time). The extension is made at the level of the diversity of the portfolio of workflows made available to AR. Our aim was to establish whether feature selection and different hyperparameter configurations improve the process of identifying a good solution. To evaluate our proposal we carried out extensive experiments in a leave-one-out mode. The results show that AR* was able to select workflows that are likely to lead to good results, especially when the portfolio is diverse. We additionally performed a comparison of AR* with Auto-WEKA, running with different time budgets. Our proposed method shows some advantage over Auto-WEKA, particularly when the time budgets are small.</p>
      </abstract>
      <kwd-group>
        <kwd>Average Ranking</kwd>
        <kwd>Selection of Classification Algorithms</kwd>
        <kwd>Combining Feature and Algorithm Selection</kwd>
        <kwd>Hyperparameter Configuration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Users of machine learning systems are facing the problem of how to choose a
combination of data processing tools and algorithms. The goal is usually
defined as maximizing or minimizing some quantitative measure. In classification
problems, the goal could be optimizing the classification accuracy, the lift score
or the area under the ROC curve (AUC). A typical practical data mining task consists of many
sub-tasks, which result in an extremely large search space that could be very time
consuming for humans to explore manually. These sub-tasks correspond, mostly,
to the workflow phases highlighted in Fig. 1: preprocessing (e.g., feature
selection), algorithm selection and parametrization. Therefore, strategies and methods
are needed that can suggest or select an optimized data mining
solution for the user.</p>
      <p>[Fig. 1. Machine learning workflow: data extraction and cleansing, data transformation (preprocessing), model configuration (algorithm selection and hyperparameters), model evaluation, and model deployment.]</p>
      <p>
        Our work aimed to use average ranking, a very simple algorithm selection
method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and study how the performance of this method is affected by a more
diversified portfolio of workflows. These, in addition to using algorithm default
configurations (as in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), include workflows where the application of classification
algorithms is preceded by a feature selection method and/or alternative
hyperparameter configurations are used. It is expected that average ranking would
perform better with a larger portfolio. On the other hand, it can be argued that
this does not come without a cost in terms of resources and time.
      </p>
      <p>
        However, with the emergence of on-line sources such as OpenML [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], the
results from applying different workflows to a wide range of datasets are
becoming available. Our work can be used as a proof of concept that average ranking
serves as a good baseline recommendation method with which alternative
methods could be compared. With that objective in mind, we performed a small-scale
comparison between average ranking and Auto-WEKA [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] and present those
results here.
      </p>
      <p>
        The remainder of this paper is organized as follows: in Section 2 we present
an overview of existing work in related areas. Section 3 describes the average
ranking method with a focus on the variants of ranking methods that
incorporate both accuracy and run time in the evaluation strategy. Section 4 provides
the experimental results and an empirical comparison with Auto-WEKA [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
Section 5 presents conclusions and discusses future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In this paper we address a particular case of the algorithm selection problem,
oriented towards the selection of classification algorithms, which has been
thoroughly investigated over the last 25 years. One approach to algorithm
selection/recommendation relies on metalearning. The simplest method uses just
performance results on different datasets in the form of rankings. Some
commonly used measures of performance are accuracy, AUC and A3R, which combines
accuracy and run time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The rankings are then aggregated to obtain a single
aggregated ranking. The aggregated ranking can be used as a simple model to
identify the top algorithms to be used. This strategy is sometimes referred to as
the Top-N strategy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        A more advanced approach, often considered as the classical metalearning
approach, uses, in addition to performance results, a set of measures that
characterize datasets [
        <xref ref-type="bibr" rid="ref28 ref32 ref5">28, 5, 32</xref>
        ]. Other approaches exploit estimates of performance
based on past tests, in the so-called active testing method for algorithm selection
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        The main idea of feature subset selection (FS) is to remove redundant or
irrelevant features from the dataset, as they can lead to a reduction of
classification accuracy or clustering quality and to an unnecessary increase of
computational cost. Feature selection and dimensionality reduction approaches are
discussed extensively in the literature [
        <xref ref-type="bibr" rid="ref14 ref27">27, 14</xref>
        ]. Many practical studies were performed
to evaluate these techniques in different fields, for example: e-mail filtering and
drug discovery [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], e-mail spam detection [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], bioinformatics [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and
healthcare [
        <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
        ]. In some of these studies it was observed that dimensionality reduction
techniques improved the classifiers' accuracy, while others concluded the opposite.
It can be observed that the usefulness of such techniques is likely to
depend on the type of data and problem. A question arises as to whether the inclusion
of such workflows in the given portfolio improves the overall quality of the search
for the best solution.
      </p>
      <p>
        The choice of algorithm hyperparameters has typically been formalized as
an optimization problem, with the objective function being the same metric used
to evaluate the performance of the corresponding algorithm. A simple method
is grid search, which consists in exhaustively searching within a predefined set
of hyperparameter values. Another method, random search, was introduced by
Bergstra and Bengio [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to overcome the potentially large computational cost of
grid search. Some classes of algorithms, like neural networks, allow for the use
of gradient descent for hyperparameter optimization [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Classical heuristics for
optimized search have also been suggested, for example, genetic algorithms [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ],
tabu search [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and particle swarm optimisation [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
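        <p>The contrast between the two simple search strategies mentioned above can be sketched as follows. This is a minimal illustration, not any specific tool's implementation: the objective function and the hyperparameter ranges (named C and gamma here, as in an SVM) are hypothetical.</p>

```python
import itertools
import random

def grid_search(objective, grid):
    """Exhaustively evaluate every combination in a predefined grid."""
    candidates = [dict(zip(grid, values))
                  for values in itertools.product(*grid.values())]
    return max(candidates, key=objective)

def random_search(objective, sample, n_iter=20, seed=0):
    """Evaluate n_iter randomly drawn configurations instead."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n_iter)]
    return max(candidates, key=objective)

# Hypothetical objective: performance peaks at C=1.0, gamma=0.1.
def objective(cfg):
    return -((cfg["C"] - 1.0) ** 2 + (cfg["gamma"] - 0.1) ** 2)

best_grid = grid_search(objective, {"C": [0.1, 1.0, 10.0],
                                    "gamma": [0.01, 0.1, 1.0]})
# Random search draws from continuous ranges on a log scale.
best_rand = random_search(
    objective,
    lambda rng: {"C": 10 ** rng.uniform(-1, 1),
                 "gamma": 10 ** rng.uniform(-2, 0)})
```

        <p>In practice the objective would be a cross-validated performance estimate, which is what makes exhaustive grid search expensive.</p>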
      <p>
        An approach that has recently been attracting attention for its good results
is Bayesian optimization. It consists in iteratively fitting a probabilistic model
as each hyperparameter combination is tested. The aim is that this model gives
good suggestions on what combinations should be tried next. Several such
model-based optimizers have been suggested, namely SMAC -
Sequential Model-based Algorithm Configuration [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Spearmint [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] and TPE -
Tree-structured Parzen Estimator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. An optimizer benchmarking framework and
comparison of the three methods is presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Auto-WEKA [
        <xref ref-type="bibr" rid="ref20 ref34">20, 34</xref>
        ] is a tool designed to help novice users of ML by
automatically searching through the joint space of WEKA learning algorithms and
their respective hyperparameter settings to maximize a given performance
measure (for instance accuracy, AUC, etc.) by using a SMAC [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] optimizer.
      </p>
      <p>
        Other algorithm selection tools include auto-sklearn [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] that, apart from
using a SMAC optimizer, uses classical metalearning techniques, namely, dataset
similarity through metafeatures and automatic ensemble construction, ASlib [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
a benchmark library for algorithm selection containing 17 algorithm selection
scenarios from six different areas, with a focus on (but not limited to) constraint
satisfaction problems, AutoFolio [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] that uses SMAC to automatically
determine a well-performing algorithm selection approach and its hyperparameters
for a given algorithm selection dataset, and Leveraging Learning to Automatically
Manage Algorithms (LLAMA) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], an R package for algorithm portfolios and
selection.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Overview of the AR Method</title>
      <p>
        This section presents a brief review of the average ranking method that is often
used in comparative studies in the machine learning literature. This method can
be regarded as a variant of Borda's method [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. For each dataset the algorithms
are ordered according to the performance measure chosen (e.g., predictive
accuracy) and assigned ranks. Among popular ranking criteria we find, for instance,
success rates, AUC, and significant wins [
        <xref ref-type="bibr" rid="ref21 ref6 ref9">6, 9, 21</xref>
        ]. The best algorithm is assigned
rank 1, the runner-up is assigned rank 2, and so on. Let r^j_{Ak} be the rank of
algorithm Ak on dataset j. In this work we use the average rank, obtained as

    r̄_{Ak} = (Σ_{j=1}^{n} r^j_{Ak}) / n    (1)

where n is the number of datasets. The final ranking is obtained by ordering the
average ranks and assigning ranks to the individual algorithms accordingly.</p>
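      <p>As a concrete illustration, the computation of average ranks and the resulting ordering can be sketched as follows. The accuracy values are hypothetical, and the way ties in average rank are broken here (by insertion order) is an assumption:</p>

```python
def average_ranking(performance):
    """Compute average ranks (Eq. 1) and the resulting ordering.

    performance: dict mapping algorithm name -> list of per-dataset
    scores (higher is better); all lists cover the same n datasets.
    """
    algos = list(performance)
    n = len(next(iter(performance.values())))
    # Sum of per-dataset ranks for each algorithm (rank 1 = best).
    rank_sums = {a: 0 for a in algos}
    for j in range(n):
        ordered = sorted(algos, key=lambda a: -performance[a][j])
        for rank, a in enumerate(ordered, start=1):
            rank_sums[a] += rank
    avg_ranks = {a: rank_sums[a] / n for a in algos}
    # Final ranking: order algorithms by their average rank.
    return sorted(algos, key=lambda a: avg_ranks[a]), avg_ranks

# Hypothetical accuracies of three algorithms on three datasets.
scores = {"J48": [0.81, 0.70, 0.88],
          "RandomForest": [0.85, 0.74, 0.90],
          "NaiveBayes": [0.78, 0.76, 0.84]}
ranking, avg = average_ranking(scores)
```

      <p>Here RandomForest is ranked first on two of the three datasets and second on the other, giving it the lowest (best) average rank.</p>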
      <p>Average ranking represents a useful method for deciding which algorithm
should be used. It would normally be followed on a new, target dataset: first the
algorithm with rank 1 is evaluated, then the one with rank 2, and so on. In each
step the better one is maintained as the potentially best option. In this context,
average ranking can be referred to as the recommended ranking.</p>
      <sec id="sec-4-1">
        <title>Average Ranking AR* that gives preference to fast tests</title>
        <p>
          Ranking methods use a particular performance measure to construct an ordering
of algorithms. Some commonly used measures of performance are accuracy, AUC
or A3R that combines accuracy and run time [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Although other measures could
have been used instead, here we focus on the ranking that exploits a combined
measure of accuracy and run time, A3R. As the authors of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] have shown, this
leads to good results when loss time curves [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] are used in the evaluation. Here
we use the following formulation for this measure:
        </p>
        <p>A3R^{d_i}_{a_ref, a_j} = (SR^{d_i}_{a_j} / SR^{d_i}_{a_ref}) / (T^{d_i}_{a_j} / T^{d_i}_{a_ref})^P    (2)
where SR^{d_i}_{a_j} and SR^{d_i}_{a_ref} represent the success rates (accuracies) of algorithms a_j
and a_ref on dataset d_i, with a_ref a given reference algorithm. Instead
of accuracy, AUC or another measure can be used as well. Similarly, T^{d_i}_{a_j} and T^{d_i}_{a_ref}
represent the run times of the algorithms, in seconds. As the average ranking
method does not require pairwise comparisons, the values of SR^{d_i}_{a_ref} and T^{d_i}_{a_ref}
can be set to 1. Hence, we can use the simplified formula A3R'_{a_j} = SR^{d_i}_{a_j} / (T^{d_i}_{a_j})^P,
as in [<xref ref-type="bibr" rid="ref30">30</xref>].</p>
        <p>
          To trade off the importance of time, the denominator is raised to the power of
P, where P is usually some small number, such as 1/64, representing in effect the
64th root. This is motivated by the observation that run times vary much more
than accuracies. It is not uncommon for one particular algorithm to be several
orders of magnitude slower (or faster) than another. Obviously, we do not want
the time ratios to dominate the equation completely. If we take P = 1/N (i.e.,
the Nth root), the time ratio raised to P approaches 1 as P approaches 0. In our
experiments described in Section 4, we follow [
          <xref ref-type="bibr" rid="ref2 ref30">30, 2</xref>
          ], and choose P = 1/64, as
it was shown to lead to better results than some other settings. This version of
average ranking is referred to as AR*.
        </p>
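        <p>The simplified score A3R' = SR / T^P and the effect of P = 1/64 can be sketched as follows; the success rates and run times are hypothetical values chosen only to illustrate the damping effect:</p>

```python
def a3r_prime(success_rate, run_time, p=1/64):
    """Simplified A3R' score: SR / T^P, with T in seconds.

    With p = 1/64, run times differing by orders of magnitude only
    moderately discount the success rate, as intended.
    """
    return success_rate / (run_time ** p)

# A fast, slightly less accurate workflow vs. a slow, more accurate
# one (hypothetical numbers): the fast one wins under A3R'.
fast = a3r_prime(0.85, 2.0)       # 2 seconds
slow = a3r_prime(0.87, 2000.0)    # 2000 seconds
```

        <p>With a run time of 1 second the denominator is 1, so the score reduces to the success rate itself.</p>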
      </sec>
      <sec id="sec-4-2">
        <title>Evaluation using loss time curves</title>
        <p>
          Our aim was to investigate the impact of using feature selection and different
hyperparameter configurations on the average ranking method. The results are
presented in the form of loss time curves [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] which show how the performance loss
depends on time. The loss is calculated as the difference between the
performance of the algorithm identified using the proposed ranking method and the
ideal choice (the best performance known, identified within the largest
set of performance results available in our study). The individual loss time curves
are aggregated into a mean loss time curve.[1] The mean loss time curve can be
characterized by a number representing the mean loss in a given interval (MIL).
This characteristic is similar to AUC, but there is an important difference: AUC
values fall in the 0-1 interval, whereas our loss time curves span
the interval Tmin - Tmax, which is specified by the user. Typically the user only
worries about run times when they exceed a minimum value. In the experiments
here we have set Tmin to 10 seconds. The value of Tmax was set to 10^4 seconds,
i.e. about 2.78 hours.
[1] Another possibility would be to use median loss time curves. These could be
accompanied by 25% and 75% percentile bands.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>The Methodology Adopted and Experimental Results</title>
      <sec id="sec-5-1">
        <title>Overview of the methodology adopted</title>
        <p>Our first aim is to study the effect of using feature selection with given
classification algorithms when using the A3R-based average ranking method. We
accomplished this by constructing portfolios of algorithms that are or are not preceded
by feature selection. We then performed a similar analysis with portfolios of
workflows that also include different hyperparameter configurations for some of
the algorithms.</p>
        <p>The workflows are evaluated over each of the 37 datasets we use (refer to
Table 4 in the Appendix). In each study, loss time curves and MIL values are
generated using leave-one-out (LOO) cross-validation, where 36 datasets are
used to generate the model (i.e., the average ranking) and the corresponding loss
time curve is computed on the dataset left out.</p>
        <p>The average ranking is followed sequentially. The AR* method uses the
concept of current best (abest). When this method comes to testing algorithm ai in
the ranking, the test is carried out using 10-fold cross validation (CV). If the
performance of algorithm ai is better than that of abest, ai is used as the new
abest. This way we obtain one loss time curve per dataset in every LOO cycle.
The individual loss time curves are used to generate the mean loss time curve.
Using LOO cross-validation helps to gain confidence that the average ranking
can be effectively transferred to new datasets and produce satisfactory outcomes.</p>
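        <p>The sequential procedure just described, which keeps track of the current best workflow while following the ranking, can be sketched as follows. The cross-validation itself is abstracted into a caller-supplied evaluation function, and all names and numbers are illustrative:</p>

```python
def follow_ranking(ranking, evaluate):
    """Follow an average ranking on a new dataset, maintaining the
    current best workflow (abest).

    ranking:  workflows ordered by average rank (rank 1 first).
    evaluate: function workflow -> (accuracy, test_time), e.g. a
              10-fold CV wrapped by the caller.
    Returns the (elapsed_time, best_accuracy) points that make up
    one loss time curve for this dataset.
    """
    elapsed, best_acc, curve = 0.0, 0.0, []
    for workflow in ranking:
        acc, test_time = evaluate(workflow)
        elapsed += test_time
        if acc > best_acc:          # workflow becomes the new abest
            best_acc = acc
        curve.append((elapsed, best_acc))
    return curve

# Hypothetical per-workflow CV results: (accuracy, run time in s).
results = {"w1": (0.80, 5.0), "w2": (0.78, 2.0), "w3": (0.86, 20.0)}
curve = follow_ranking(["w1", "w2", "w3"], lambda w: results[w])
```

        <p>Subtracting each best_accuracy from the ideal accuracy would turn this curve into the loss time curve used in the evaluation.</p>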
        <p>Note that the loss is computed in reference to the best workflow overall for
the dataset left out, which is identified within the largest portfolio of algorithms
available. In our study, this portfolio includes all variants that take different
hyperparameter configurations into consideration, with and without feature
selection.</p>
        <p>
          All experiments were performed using Weka software [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and its algorithm
implementations.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Effect of feature selection</title>
        <p>
          To study the effect of feature selection, we use a total of 124 workflows. The
setting for the experiment uses two different portfolios. The first one contains 62
classification algorithms with their default hyperparameter settings, while the
second includes workflows consisting of Correlation-based Feature Selection (CFS) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
followed by one of the 62 classification algorithms.
        </p>
        <p>Fig. 2 shows the results of 3 different variants of the A3R-based average
ranking method. The variant AR* uses just our set of classification algorithms
with no feature selection method and default hyperparameter configurations.
AR*+FS+A is the variant where our algorithms are always preceded by CFS.
The variant AR*±FS+A uses all 124 algorithm configurations.</p>
        <p>Overall, the variant AR*±FS+A yields the smallest MIL value (Table 1).
However, the MIL of the variant AR* is very close and, in addition, both loss
time curves are quite similar (Fig. 2).</p>
        <p>Regarding AR*+FS+A, although the algorithms preceded by feature
selection are, in general, faster than their counterparts, as they deal with datasets
with fewer features, their accuracy tends to be negatively affected. In Fig. 2 it
can be observed that the AR*+FS+A loss time curve is able to compete with the
other two variants up to approximately the 6-second mark, but eventually reaches
its limit in accuracy improvement. This happens because the set of algorithms
used does not include the ones with performance closest to the best available.
In our study, AR*+FS+A provides the best accuracy for just 4 datasets, while
AR*±FS+A achieves the best accuracy in 13 datasets.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Effect of diversity of hyperparameter configurations</title>
        <p>We have additionally experimented with several versions of some of the
algorithms in our algorithm portfolio. Each version was associated with a particular
configuration of the corresponding hyperparameters. The extended portfolio
included 3 versions of Multilayer Perceptron, 7 of Support Vector Machines (with
polynomial and radial basis function kernels), 7 of Random Forests, 8 of J48 and
5 of K-Nearest Neighbors, totalling 30 additions (refer to Tables 5 and 6 in the
Appendix for more information regarding the algorithms used).</p>
        <p>[Fig. 3: loss time curves (accuracy loss, in percentage points, versus time) for the variants AR*, AR*+Hyp+A and AR*±FS+Hyp+A.]</p>
        <p>To study the effect of different algorithm parametrizations we again define
AR* as the baseline portfolio and compare it with AR*+Hyp+A, which includes
the 30 extra algorithm versions mentioned above, and the variant AR*±FS+Hyp+A,
which also contemplates feature selection.</p>
        <p>AR*±FS+Hyp+A is the portfolio with the largest number of workflows, 184,
corresponding to 62 algorithms with default configurations, 30 versions with
alternative hyperparameter configurations and the same 92 previous configurations
preceded by feature selection.</p>
        <p>From Table 2 it can be observed that, similarly to the previous study
(Section 4.2), the variant with more diversity, AR*±FS+Hyp+A in this case,
is the one with the lowest MIL. However, when compared with AR*+Hyp+A,
we note that the difference in terms of MIL is not very high and also, when
observing both loss time curves (Fig. 3), we cannot say that one is better than
the other. In fact, up to the 100-second mark AR*±FS+Hyp+A appears to be
worse than AR*+Hyp+A. In our view, this is because, in many cases, average
ranking includes different variants of the same algorithm in close succession.
This may lead to a loss of time spent on testing variants that do not result in
any gain. Nevertheless, the variant AR*±FS+Hyp+A is the most complete set
of algorithm alternatives and includes all best possible accuracies achievable per
dataset, hence its loss time curve eventually reaches 0.</p>
        <p>When compared with AR*, the other two portfolios clearly achieve
better loss time curves and lower MIL, confirming that a portfolio that is richer
in algorithm versions has increased chances of providing better
recommendations. Our results also suggest, however, that it is desirable to provide methods
that construct portfolios including all important variants but excluding
others, in order to avoid losing time testing many "similar" alternatives.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Comparison between A3R-based AR and Auto-WEKA</title>
        <p>
For the purpose of evaluating the effectiveness of our proposed method in
practice, we have decided to carry out similar experiments using Auto-WEKA [
          <xref ref-type="bibr" rid="ref20 ref34">20,
34</xref>
          ]. An empirical study was done by comparing the accuracy obtained by each
system for the same run time. The Auto-WEKA runs were made using
cross-validation with 10 folds: the reference accuracy results from the average of all folds;
reference run time is the sum of training and testing time, measured using all
data to train and test the model.
        </p>
        <p>Regarding the Auto-WEKA experiment, a required parameter is the time limit,
which sets a time budget for the run. The actual run time is, however, somewhat
different from the one defined in the parameter, so it needs to be measured.
After running, Auto-WEKA outputs one or more recommended model
configurations (which always include an algorithm and its hyperparameters, and
may or may not include a feature selection procedure). For the purpose of this
empirical study, we used only the first Auto-WEKA recommendation for each
dataset. The train and test time spent needs to be added to the given budget. In
this study we have used four time budgets: 5, 15, 30 and 60 minutes. The steps
to compare both methods are detailed below. They refer to one dataset, but the
same steps were repeated for all datasets.
1. Auto-WEKA is run for a predefined time budget. Its actual run time (search
time) is measured and the recommended configuration is recorded.
2. The recommended configuration is run within a WEKA Experiment class,
using cross-validation with 10 folds. The predictive accuracy for the
recommended algorithm returned by Auto-WEKA is obtained.
3. The Auto-WEKA model run time is measured from the tasks of training and
testing the recommended configuration over all data.
4. The Auto-WEKA total run time is computed by adding the search time to the
recommended model run time. The sum of the two times is used to retrieve
the actual performance of our system.</p>
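        <p>The four steps above can be sketched as follows for a single dataset. The helper functions are hypothetical stand-ins for the actual tool invocations, not real Auto-WEKA or WEKA APIs:</p>

```python
def compare_on_dataset(budget, run_autoweka, cross_validate,
                       time_train_test, ar_accuracy_at):
    """Steps 1-4 for one dataset. Hypothetical helpers:
      run_autoweka(budget) -> (search_time, configuration)
      cross_validate(configuration) -> accuracy (10-fold CV)
      time_train_test(configuration) -> run time over all data
      ar_accuracy_at(t) -> accuracy reached by AR* at total time t
    """
    # 1. Run Auto-WEKA for the budget; record the actual search time
    #    and the first recommended configuration.
    search_time, config = run_autoweka(budget)
    # 2. Evaluate the recommended configuration with 10-fold CV.
    aw_accuracy = cross_validate(config)
    # 3. Measure the model run time (train + test over all data).
    model_time = time_train_test(config)
    # 4. Compare with AR* at the matching total run time.
    total_time = search_time + model_time
    return ar_accuracy_at(total_time) > aw_accuracy  # True = AR* "Win"

# Illustrative call with dummy stand-ins (budget of 300 s).
win = compare_on_dataset(300,
                         lambda b: (b + 12.0, "cfg"),
                         lambda c: 0.82,
                         lambda c: 8.0,
                         lambda t: 0.85)
```

        <p>Counting such wins over all datasets and budgets yields the tallies reported in Table 3.</p>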
        <p>One way to compare the overall performance of the algorithm recommendation
methods is to count the number of datasets on which a particular method is
the overall winner. Table 3 shows the aggregated results for the four Auto-WEKA
time budgets, where Win refers to the number of cases where AR*±FS+Hyp+A
has higher accuracy than Auto-WEKA. This is a simple comparison that does
not consider the statistical significance of the differences. Still, it can be
observed that AR*±FS+Hyp+A has more wins for lower Auto-WEKA time
budgets.</p>
        <p>In this paper we have presented a study that exploits the average ranking method
AR*, which gives preference to well-performing workflows that are also fast to
test. Our portfolios combine algorithm selection with feature selection and a
set of hyperparameter configurations. The portfolio with the most diverse
set of workflow configurations (AR*±FS+Hyp+A) achieved the best
results. The runner-up was AR*+Hyp+A, with quite similar performance.
These results confirm that it is indeed important to incorporate
hyperparameter configurations into the portfolio, as this can lead to improved performance.
Although the addition of extra variants to the given portfolio could, in
principle, slow down the process of identifying the best or near-best solution, the
usage of AR* mitigates this adverse effect: it opts for fast and well-performing
workflows before the others.</p>
        <p>Additionally, we have compared the A3R-based average ranking against
Auto-WEKA and showed that the proposed method competes quite well with the
latter, especially when smaller time budgets are used.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Future work</title>
        <p>Our future plan is to devise a method of pruning portfolios of workflows with
the aim of avoiding investing time in testing variants of algorithms that are of
a similar nature and have similar performance. Another line that could
be followed is to explore approaches based on active testing (AT) and/or
surrogate models, whose aim is to select the most promising candidate to test.</p>
        <p>
          It would be desirable to use a much larger portfolio to guarantee that the
potentially best solutions are not left out. We could collect more test
results with alternative feature selection and preprocessing methods,
as well as hyperparameter configurations, or even reuse a larger set of results
already available through OpenML [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
        <p>
          We could also extend the comparison with Auto-WEKA to higher time
budgets, as well as use other currently available methods to perform further
comparisons (e.g., auto-sklearn [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]).
        </p>
        <p>We are also considering investigating approaches that take
into account the costs (run time) of off-line tests used to gather the meta-data
that our proposed method exploits. Their cost could be set to some fraction
of the cost of an on-line test (i.e., a test on a new dataset), rather than being ignored
altogether.</p>
        <p>Acknowledgements. The authors wish to express their gratitude to the following
institutions, which have provided funding to support this work:
- Federal Government of Nigeria Tertiary Education Trust Fund, under the
TETFund 2012 AST&amp;D Intervention for Kano University of Science and
Technology, Wudil, Kano State, Nigeria, for PhD Overseas Training;
- Project NanoSTIMA: Macro-to-Nano Human Sensing: Towards Integrated
Multimodal Health Monitoring and Analytics/NORTE-01-0145-FEDER-000016,
which is financed by the North Portugal Regional Operational Programme
(NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and
through the European Regional Development Fund (ERDF).
</p>
        <p>Tables 5 and 6 (Appendix) list the additional algorithm versions, for example SMO with PolyKernel and with RBFKernel, IBk, and MultilayerPerceptron with H = [a, o]. Parameters in bold are the defaults.</p>
        <p>Parameter intervals and descriptions (adapted from the WEKA documentation):
- P = [80, 100]: size of each bag, as a percentage of the training set size.
- K = [0, 250]: number of attributes to randomly investigate, where 0 = int(log2(#attributes)+1).
- M = [1, 10]: minimum number of instances per leaf.
- C = [0.01, 0.25, 0.5]: confidence threshold for pruning (smaller values incur more pruning).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abdulrahman</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Measures for Combining Accuracy and Time for Meta-learning</article-title>
          .
          <source>In: Meta-Learning and Algorithm Selection Workshop at ECAI 2014</source>
          . pp.
          <volume>49</volume>
          -
          <issue>50</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abdulrahman</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Rijn</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanschoren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Speeding up algorithm selection using average ranking and active testing by introducing runtime</article-title>
          .
          <source>Machine Learning, Special Issue on Metalearning and Algorithm Selection</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bardenet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kegl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Algorithms for hyper-parameter optimization</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>2546</fpage>
          -
          <lpage>2554</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bischl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kerschke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kottho</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lindauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malitsky</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frechette</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tierney</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>ASlib: A benchmark library for algorithm selection</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>237</volume>
          ,
          <fpage>41</fpage>
          -
          <lpage>58</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giraud-Carrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilalta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Metalearning: Applications to data mining</article-title>
          . Springer Science &amp; Business Media
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Da Costa</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results</article-title>
          .
          <source>Machine Learning</source>
          <volume>50</volume>
          (
          <issue>3</issue>
          ),
          <fpage>251</fpage>
          -
          <lpage>277</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chou</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandettini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Initiative</surname>
            ,
            <given-names>A.D.N.</given-names>
          </string-name>
          :
          <article-title>Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images</article-title>
          .
          <source>Neuroimage</source>
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <fpage>59</fpage>
          -
          <lpage>70</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cuingnet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chupin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benali</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colliot</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Spatial and anatomical regularization of SVM for brain image analysis</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          pp.
          <fpage>460</fpage>
          -
          <lpage>468</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Demšar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Statistical Comparisons of Classifiers over Multiple Data Sets</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>7</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Eggensperger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Leyton-Brown, K.:
          <article-title>Towards an empirical foundation for assessing Bayesian optimization of hyperparameters</article-title>
          .
          <source>In: NIPS workshop on Bayesian Optimization in Theory and Practice</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eggensperger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blum</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Efficient and robust automated machine learning</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>2962</fpage>
          -
          <lpage>2970</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gansterer</surname>
            ,
            <given-names>W.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janecek</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumayer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Spam filtering based on latent semantic indexing</article-title>
          .
          <source>In: Survey of Text Mining II</source>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>183</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gomes</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prudêncio</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Combining meta-learning and search techniques to select parameters for support vector machines</article-title>
          .
          <source>Neurocomputing</source>
          <volume>75</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Guyon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elisseeff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An introduction to variable and feature selection</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          . pp.
          <fpage>1157</fpage>
          -
          <lpage>1182</lpage>
          . JMLR. (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Correlation-based feature selection for machine learning</article-title>
          .
          <source>Ph.D. thesis</source>
          , The University of Waikato (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          .
          <source>International Conference on Learning and Intelligent Optimization</source>
          . pp.
          <fpage>507</fpage>
          -
          <lpage>523</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Kohavi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>John</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          :
          <article-title>Wrappers for feature subset selection</article-title>
          .
          <source>Artificial Intelligence</source>
          .
          <volume>97</volume>
          (
          <issue>1</issue>
          ),
          <fpage>273</fpage>
          -
          <lpage>324</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Kotthoff</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Llama: leveraging learning to automatically manage algorithms</article-title>
          .
          <source>arXiv preprint arXiv:1306.1031</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kotthoff</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thornton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>17</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Leite</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Active Testing Strategy to Predict the Best Classification Algorithm via Sampling and Metalearning</article-title>
          .
          <source>In: ECAI</source>
          . pp.
          <fpage>309</fpage>
          -
          <lpage>314</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Leite</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanschoren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Selecting Classification Algorithms with Active Testing</article-title>
          .
          <source>In: Machine Learning and Data Mining in Pattern Recognition</source>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>131</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Rank aggregation methods</article-title>
          .
          <source>WIREs Computational Statistics</source>
          <volume>2</volume>
          ,
          <fpage>555</fpage>
          -
          <lpage>570</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Lindauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>AutoFolio: An automatically configured algorithm selector</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>53</volume>
          ,
          <fpage>745</fpage>
          -
          <lpage>778</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Maclaurin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duvenaud</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          :
          <article-title>Gradient-based hyperparameter optimization through reversible learning</article-title>
          .
          <source>In: Proceedings of the 32nd International Conference on Machine Learning</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26. de Miranda, P.B., Prudêncio, R.B.,
          <string-name>
            <surname>de Carvalho</surname>
            ,
            <given-names>A.C.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Combining a multi-objective optimization approach with meta-learning for svm parameter selection</article-title>
          .
          <source>In: Systems, Man, and Cybernetics (SMC)</source>
          . pp.
          <fpage>2909</fpage>
          -
          <lpage>2914</lpage>
          . IEEE. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Molina</surname>
            ,
            <given-names>L.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belanche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nebot</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Feature selection algorithms: a survey and experimental evaluation</article-title>
          .
          <source>In: Proceedings of the 2002 IEEE International Conference</source>
          . pp.
          <fpage>306</fpage>
          -
          <lpage>313</lpage>
          . IEEE
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bensusan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giraud-Carrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Tell me who can learn you and I can tell you who you are: Landmarking various learning algorithms</article-title>
          .
          <source>In: Proceedings of the 17th International Conference on Machine Learning</source>
          . pp.
          <fpage>743</fpage>
          -
          <lpage>750</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Reif</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafait</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Meta-learning for evolutionary parameter optimization of classifiers</article-title>
          .
          <source>Machine Learning</source>
          <volume>87</volume>
          (
          <issue>3</issue>
          ),
          <fpage>357</fpage>
          -
          <lpage>380</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>van Rijn</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdulrahman</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanschoren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Fast Algorithm Selection using Learning Curves</article-title>
          .
          <source>In: Advances in Intelligent Data Analysis XIV</source>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Saeys</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inza</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larrañaga</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A review of feature selection techniques in bioinformatics</article-title>
          .
          <source>Bioinformatics</source>
          <volume>23</volume>
          (
          <issue>19</issue>
          ),
          <fpage>2507</fpage>
          -
          <lpage>2517</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Smith-Miles</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          :
          <article-title>Cross-disciplinary Perspectives on Meta-Learning for Algorithm Selection</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>41</volume>
          (
          <issue>1</issue>
          ),
          <fpage>6:1</fpage>
          -
          <lpage>6:25</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          :
          <article-title>Practical Bayesian optimization of machine learning algorithms</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>2951</fpage>
          -
          <lpage>2959</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Thornton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms</article-title>
          .
          <source>In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>847</fpage>
          -
          <lpage>855</lpage>
          . ACM
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Vanschoren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Rijn</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bischl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torgo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>OpenML: networked science in machine learning</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>