Combining Feature and Algorithm Hyperparameter Selection using some Metalearning Methods

Miguel Viana Cachada¹, Salisu Mamman Abdulrahman², and Pavel Brazdil¹

¹ LIAAD - INESC TEC / Faculdade de Economia, Universidade do Porto
mcachada@gmail.com, pbrazdil@inesctec.pt
² LIAAD - INESC TEC / Faculdade de Ciências da Universidade do Porto and Kano University of Science and Technology Wudil, Kano State, Nigeria
salisu.abdul@gmail.com

Abstract. Machine learning users need methods that can help them identify algorithms or even workflows (combinations of algorithms with preprocessing tasks, possibly using hyperparameter configurations that differ from the defaults) that achieve the potentially best performance. Our study was oriented towards average ranking (AR), an algorithm selection method that exploits meta-data obtained on prior datasets. We focused on extending the use of a variant of AR, referred to as AR*, that takes A3R as the relevant metric (a measure combining accuracy and run time). The extension is made at the level of the diversity of the portfolio of workflows that is made available to AR. Our aim was to establish whether feature selection and different hyperparameter configurations improve the process of identifying a good solution. To evaluate our proposal we carried out extensive experiments in a leave-one-out mode. The results show that AR* was able to select workflows that are likely to lead to good results, especially when the portfolio is diverse. We additionally performed a comparison of AR* with Auto-WEKA, running with different time budgets. Our proposed method shows some advantage over Auto-WEKA, particularly when the time budgets are small.

Keywords: Average Ranking, Selection of Classification Algorithms, Combining Feature and Algorithm Selection, Hyperparameter Configuration.

1 Introduction

Users of machine learning systems face the problem of how to choose a combination of data processing tools and algorithms. The goal is usually defined as maximizing or minimizing some quantitative measure. In classification problems, the goal could be optimizing the classification accuracy, the lift score or the area under the ROC curve (AUC). A typical practical data mining task consists of many sub-tasks, which together result in an extremely large search space that could be very time consuming for humans to explore manually. These sub-tasks correspond, mostly, to the workflow phases highlighted in Fig. 1: preprocessing (e.g., feature selection), algorithm selection and parametrization. Therefore, strategies and methods are needed that can suggest or select an optimized data mining solution for the user.

Fig. 1. A typical machine learning workflow: data cleansing, data extraction and pre-processing (data transformation), followed by algorithm selection and hyperparameters (model configuration), model evaluation and model deployment.

Our work aimed to use average ranking, a very simple algorithm selection method [2], and to study how the performance of this method is affected by a more diversified portfolio of workflows. These, in addition to using algorithm default configurations (as in [2]), include workflows where the application of classification algorithms is preceded by a feature selection method and/or alternative hyperparameter configurations are used. It is expected that average ranking would perform better with a larger portfolio. On the other hand, it can be argued that this does not come without a cost, in terms of resources and time.
However, with the emergence of on-line sources such as OpenML [35], the results from applying different workflows to a wide range of datasets are becoming available. Our work can be used as a proof of concept that average ranking serves as a good baseline recommendation method with which alternative methods can be compared. With that objective in mind, we performed a small-scale comparison between average ranking and Auto-WEKA [34] and present those results here.

The remainder of this paper is organized as follows: in Section 2 we present an overview of existing work in related areas. Section 3 describes the average ranking method, with a focus on the variants of ranking methods that incorporate both accuracy and run time in the evaluation strategy. Section 4 provides the experimental results and an empirical comparison with Auto-WEKA [34]. Section 5 presents conclusions and discusses future work.

2 Related Work

In this paper we address a particular case of the algorithm selection problem, oriented towards the selection of classification algorithms, which has been thoroughly investigated over the last 25 years. One approach to algorithm selection/recommendation relies on metalearning. The simplest method uses just performance results on different datasets in the form of rankings. Some commonly used measures of performance are accuracy, AUC or A3R, which combines accuracy and run time [1]. The rankings are then aggregated to obtain a single aggregated ranking. The aggregated ranking can be used as a simple model to identify the top algorithms to be used. This strategy is sometimes referred to as the Top-N strategy [5].

A more advanced approach, often considered the classical metalearning approach, uses, in addition to performance results, a set of measures that characterize datasets [28, 5, 32]. Other approaches exploit estimates of performance based on past tests, as in the so-called active testing method for algorithm selection [22].

The main idea of feature subset selection (FS) is to remove redundant or irrelevant features from the dataset, as they can lead to a reduction of the classification accuracy or clustering quality and to an unnecessary increase of computational cost. Feature selection and dimensionality reduction approaches are discussed extensively in the literature [27, 14]. Many practical studies have evaluated these techniques in different fields, for example e-mail filtering and drug discovery [18], e-mail spam detection [12], bioinformatics [31] and healthcare [8, 7]. In some of these studies it was observed that dimensionality reduction techniques improved classifier accuracy, while others concluded the opposite. It can be observed that the usefulness of such techniques is likely to depend on the type of data and problem. A question arises whether the inclusion of such workflows in the given portfolio improves the overall quality of the search for the best solution.

The choice of algorithm hyperparameters has typically been formalized as an optimization problem, with the objective function being the same metric used to evaluate the performance of the corresponding algorithm. A simple method is grid search, which consists of exhaustively searching a predefined set of hyperparameter values. Another method, random search, was introduced by Bergstra and Bengio [3] to overcome the potentially large computational cost of grid search.
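As a toy illustration of the difference between the two strategies (not tied to any of the systems cited here; the objective function below is just a placeholder for a cross-validated performance estimate), grid search enumerates every combination, whereas random search evaluates only a fixed budget of sampled configurations:

```python
import itertools
import random

# Hypothetical hyperparameter space for an SVM-like learner (illustration only).
space = {
    "C":     [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}

def evaluate(config):
    # Placeholder objective: in a real setting this would be a cross-validated
    # performance estimate obtained by training with the given configuration.
    return -abs(config["C"] - 10) - abs(config["gamma"] - 1e-2)

# Grid search: exhaustively evaluate every combination (here 4 x 4 = 16 runs).
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
best_grid = max(grid, key=evaluate)

# Random search: evaluate only a fixed budget of independently sampled configurations.
random.seed(0)
budget = 6
sampled = [{name: random.choice(values) for name, values in space.items()}
           for _ in range(budget)]
best_random = max(sampled, key=evaluate)

print("grid search best:  ", best_grid)
print("random search best:", best_random)
```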
Some classes of algorithms, like neural networks, allow the use of gradient descent for hyperparameter optimization [25]. Classical heuristics for optimized search have also been suggested, for example genetic algorithms [29], tabu search [13] and particle swarm optimisation [26].

An approach that has recently been attracting attention for its good results is Bayesian optimization. It consists of iteratively fitting a probabilistic model as each hyperparameter combination is tested. The aim is that this model gives good suggestions on which combinations should be tried next. Several optimizers that construct such models have been proposed, namely SMAC - Sequential Model-based Algorithm Configuration [17], Spearmint [33] and TPE - Tree Parzen Estimator [3]. An optimizer benchmarking framework and a comparison of the three methods are presented in [10].

Auto-WEKA [20, 34] is a tool designed to help novice users of ML by automatically searching through the joint space of WEKA learning algorithms and their respective hyperparameter settings to maximize a given performance measure (for instance accuracy, AUC, etc.), using the SMAC [17] optimizer.

Other algorithm selection tools include: auto-sklearn [11], which, apart from using a SMAC optimizer, uses classical metalearning techniques, namely dataset similarity through metafeatures and automatic ensemble construction; ASlib [4], a benchmark library for algorithm selection containing 17 algorithm selection scenarios from six different areas with a focus on (but not limited to) constraint satisfaction problems; AutoFolio [24], which uses SMAC to automatically determine a well-performing algorithm selection approach and its hyperparameters for given algorithm selection data; and Leveraging Learning to Automatically Manage Algorithms (LLAMA) [19], an R package for algorithm portfolios and selection.

3 Overview of the AR Method

This section presents a brief review of the average ranking method that is often used in comparative studies in the machine learning literature. This method can be regarded as a variant of Borda's method [23]. For each dataset the algorithms are ordered according to the performance measure chosen (e.g., predictive accuracy) and assigned ranks. Among popular ranking criteria we find, for instance, success rates, AUC and significant wins [6, 9, 21]. The best algorithm is assigned rank 1, the runner-up is assigned rank 2, and so on. Let $r^j_{A_k}$ be the rank of algorithm $A_k$ on dataset $j$. In this work we use average ranks, which are obtained using

$\hat{r}_{A_k} = \Big( \sum_{j=1}^{n} r^j_{A_k} \Big) \div n$   (1)

where $n$ is the number of datasets. The final ranking is obtained by ordering the average ranks and assigning ranks to the individual algorithms accordingly.

Average ranking represents a useful method for deciding which algorithm should be used. It would normally be followed on a new, target dataset: first the algorithm with rank 1 is evaluated, then the one with rank 2, and so on. In each step the better one is maintained as the potentially best option. In this context, the average ranking can be referred to as the recommended ranking.

3.1 Average Ranking AR* that gives preference to fast tests

Ranking methods use a particular performance measure to construct an ordering of algorithms. Some commonly used measures of performance are accuracy, AUC or A3R, which combines accuracy and run time [1].
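As an illustration of the aggregation in Eq. (1), the following sketch uses a small made-up accuracy matrix (not the authors' code or data) to rank three algorithms on three datasets and average the ranks:

```python
import numpy as np

# Toy meta-data: rows = datasets, columns = algorithms,
# entries = predictive accuracy observed on that dataset (made-up values).
accuracy = np.array([
    [0.81, 0.79, 0.85],   # dataset 1
    [0.92, 0.94, 0.90],   # dataset 2
    [0.70, 0.72, 0.72],   # dataset 3
])

# Rank the algorithms within each dataset: rank 1 = highest accuracy.
# A double argsort turns scores into ranks (ties are broken arbitrarily here).
ranks = np.argsort(np.argsort(-accuracy, axis=1), axis=1) + 1

# Eq. (1): average each algorithm's rank over the n datasets.
average_rank = ranks.mean(axis=0)               # e.g. [2.33, 1.67, 2.0]
recommended_ranking = np.argsort(average_rank)  # algorithm indices, best first

print(ranks)
print(average_rank)
print(recommended_ranking)
```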
Although other measures could have been used instead, here we focus on the ranking that exploits a combined measure of accuracy and run time, A3R. As the authors of [1] have shown, this leads to good results when loss time curves [30] are used in the evaluation. Here we use the following formulation for this measure:

$A3R^{d_i}_{a_{ref},a_j} = \dfrac{SR^{d_i}_{a_j} / SR^{d_i}_{a_{ref}}}{\big(T^{d_i}_{a_j} / T^{d_i}_{a_{ref}}\big)^P}$   (2)

where $SR^{d_i}_{a_j}$ and $SR^{d_i}_{a_{ref}}$ represent the success rates (accuracies) of algorithms $a_j$ and $a_{ref}$ on dataset $d_i$, and $a_{ref}$ represents a given reference algorithm. Instead of accuracy, AUC or another measure can be used as well. Similarly, $T^{d_i}_{a_j}$ and $T^{d_i}_{a_{ref}}$ represent the run times of the algorithms, in seconds.

As the average ranking method does not require pairwise comparisons, the values of $SR^{d_i}_{a_{ref}}$ and $T^{d_i}_{a_{ref}}$ can be set to 1. Hence, we can use the simplified formula $A3R^{d_i}_{a_j} = SR^{d_i}_{a_j} / (T^{d_i}_{a_j})^P$, as in [30]. To trade off the importance of time, the denominator is raised to the power of P, where P is usually some small number, such as 1/64, representing in effect the 64th root. This is motivated by the observation that run times vary much more than accuracies. It is not uncommon that one particular algorithm is several orders of magnitude slower (or faster) than another. Obviously, we do not want the time ratios to completely dominate the equation. If we take P = 1/N (i.e., the Nth root), the time term approaches 1 as P approaches 0. In our experiments described in Section 4, we follow [30, 2] and choose P = 1/64, as it was shown to lead to better results than some other settings. This version of average ranking is referred to as AR*.

3.2 Evaluation using loss time curves

Our aim was to investigate the impact of using feature selection and different hyperparameter configurations on the average ranking method. The results are presented in the form of loss time curves [30], which show how the performance loss depends on time. The loss is calculated as the difference between the performance of the algorithm identified using the proposed ranking method and the ideal choice (the best performance known, identified within the largest set of performance results available in our study). The individual loss time curves are aggregated into a mean loss time curve¹.

The mean loss time curve can be characterized by a number representing the mean loss in a given interval (MIL). This characteristic is similar to AUC, but there is an important difference. AUC values fall in the 0-1 interval, whereas our loss time curves span the interval Tmin-Tmax, which is specified by the user. Typically the user only worries about run times when they exceed a minimum value. In the experiments here we have set Tmin to 10 seconds. The value of Tmax was set to 10^4 seconds, i.e. about 2.78 hours.

¹ Another possibility would be to use median loss time curves. These could be accompanied by 25% and 75% percentile bands.

4 The Methodology Adopted and Experimental Results

4.1 Overview of the methodology adopted

Our first aim is to study the effect of using feature selection with given classification algorithms when using the A3R-based average ranking method. We accomplished this by constructing portfolios of algorithms that either are or are not preceded by feature selection. We then performed a similar analysis with portfolios of workflows that also include different hyperparameter configurations for some of the algorithms. The workflows are evaluated over each of the 37 datasets we use (refer to Table 4 in the Appendix).
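To make the A3R-based ordering concrete, the following sketch (hypothetical workflow names and measurements, not the authors' code) computes the simplified A3R with P = 1/64 for a few workflows on one dataset and sorts them by it; repeating this per dataset and averaging the resulting ranks would yield an AR*-style ranking:

```python
# Minimal sketch of the simplified A3R score (Section 3.1) with P = 1/64.
P = 1 / 64  # larger P gives run time more influence on the score

def a3r(success_rate, run_time_seconds, p=P):
    """Simplified A3R: accuracy divided by run time raised to the power p
    (the reference-algorithm terms are set to 1, as in the text)."""
    return success_rate / (run_time_seconds ** p)

# Hypothetical 10-fold CV results of three workflows on one dataset.
workflows = {
    "J48 (default)":      {"acc": 0.86, "time": 2.0},
    "RandomForest":       {"acc": 0.90, "time": 60.0},
    "CFS + RandomForest": {"acc": 0.89, "time": 25.0},
}

scores = {name: a3r(w["acc"], w["time"]) for name, w in workflows.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} A3R = {score:.4f}")
```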
In each study, loss time curves and MIL values are generated using leave-one-out (LOO) cross-validation, where 36 datasets are used to generate the model (i.e., the average ranking), and the corresponding loss time curve is obtained on the dataset left out. The average ranking is followed sequentially. The AR* method uses the concept of current best (abest). When this method comes to testing algorithm ai in the ranking, the test is carried out using 10-fold cross-validation (CV). If the performance of algorithm ai is better than that of abest, ai becomes the new abest. This way we obtain one loss time curve per dataset in every LOO cycle. The individual loss time curves are used to generate the mean loss time curve. Using LOO cross-validation helps to gain confidence that the average ranking can be effectively transferred to new datasets and produce satisfactory outcomes.

Note that the loss is computed in reference to the best workflow overall for the dataset left out, which is identified within the largest portfolio of algorithms available. In our study, this portfolio includes all variants that take different hyperparameter configurations into consideration, with and without feature selection. All experiments were performed using the Weka software [15] and its algorithm implementations.

4.2 Effect of feature selection

To study the effect of feature selection, we use a total of 124 workflows. The setting for the experiment uses two different portfolios. The first one contains 62 classification algorithms with their default hyperparameter settings, while the second includes workflows consisting of Correlation-based Feature Selection (CFS) [16] followed by one of the 62 classification algorithms.

Fig. 2 shows the results of three different variants of the A3R-based average ranking method. The variant AR* uses just our set of classification algorithms with no feature selection method and default hyperparameter configurations. AR*+FS+A is the variant where our algorithms are always preceded by CFS. The variant AR*±FS+A uses all 124 algorithm configurations. Overall, the variant AR*±FS+A yields the smallest MIL value (Table 1). However, the MIL of the variant AR* is very close and, in addition, both loss time curves are quite similar (Fig. 2).

Fig. 2. Loss time curves of the variants of the A3R-based average ranking method used to study the effect of feature selection (accuracy loss in percentage points versus time in seconds, log scale).

Table 1. Mean interval loss associated with the variants of the A3R-based average ranking method used to study the effect of feature selection

AR-Variant  AR*     AR*+FS+A  AR*±FS+A
MIL         0.7735  2.3396    0.7476

Regarding AR*+FS+A, although the algorithms preceded by feature selection are, in general, faster than their counterparts, as they deal with datasets with fewer features, their accuracy tends to be negatively affected. In Fig. 2 it can be observed that the AR*+FS+A loss time curve is able to compete with the other two variants up to approximately the 6-second mark, but eventually reaches its limit in accuracy improvement. This happens because the set of algorithms used does not include the ones with performance closest to the best available. In our study, AR*+FS+A provides the best accuracy for just 4 datasets, while AR*±FS+A achieves the best accuracy in 13 datasets.
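For readers who wish to reproduce curves and MIL values of this kind, the sketch below (made-up numbers and a simplified integration scheme, since the paper does not spell out the exact one) follows a ranking sequentially on one left-out dataset, keeps the current best accuracy, builds the loss time curve and averages the loss over the interval [Tmin, Tmax]:

```python
import numpy as np

# Hypothetical evaluation of a recommended ranking on one left-out dataset.
# Each tuple: (10-fold CV run time in seconds, accuracy) of the workflow tested
# at that position of the ranking.
ranking_results = [(5.0, 0.78), (20.0, 0.84), (120.0, 0.83), (600.0, 0.88)]
best_known_accuracy = 0.90  # best workflow in the full portfolio for this dataset

# Follow the ranking sequentially, keeping the current best accuracy (abest).
elapsed, current_best = 0.0, 0.0
curve = []  # loss time curve points: (cumulative time, loss in percentage points)
for run_time, acc in ranking_results:
    elapsed += run_time
    current_best = max(current_best, acc)
    curve.append((elapsed, 100.0 * (best_known_accuracy - current_best)))

def mean_interval_loss(curve, t_min=10.0, t_max=1e4, n_points=1000):
    """Approximate MIL: average the step-shaped loss curve over a log-spaced
    time grid between t_min and t_max."""
    times = np.logspace(np.log10(t_min), np.log10(t_max), n_points)
    losses = []
    for t in times:
        completed = [loss for finish_time, loss in curve if finish_time <= t]
        # Before the first test finishes, the loss is the full gap to the best.
        losses.append(completed[-1] if completed else 100.0 * best_known_accuracy)
    return float(np.mean(losses))

print(curve)
print(round(mean_interval_loss(curve), 4))
```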
4.3 Effect of diversity of hyperparameter configurations

We have additionally experimented with several versions of some of the algorithms in our algorithm portfolio. Each version was associated with a particular configuration of the corresponding hyperparameters. The extended portfolio included 3 versions of Multilayer Perceptron, 7 of Support Vector Machines (with polynomial and radial basis function kernels), 7 of Random Forests, 8 of J48 and 5 of K-Nearest Neighbors, totalling 30 additions (refer to Tables 5 and 6 in the Appendix for more information regarding the algorithms used).

To study the effect of different algorithm parametrizations we again define AR* as the baseline portfolio and compare it with AR*+Hyp+A, which includes the 30 extra algorithm versions mentioned above, and with the variant AR*±FS+Hyp+A, which also includes feature selection. AR*±FS+Hyp+A is the portfolio with the largest number of workflows, 184, corresponding to 62 algorithms with default configurations, 30 versions with alternative hyperparameter configurations and the same 92 configurations preceded by feature selection.

Fig. 3. Loss time curves of the variants used to study the effect of hyperparameter configuration diversity (accuracy loss in percentage points versus time in seconds, log scale).

Table 2. Mean interval loss associated with the variants of the A3R-based average ranking method used to study the effect of hyperparameter configuration diversity

AR-Variant  AR*     AR*+Hyp+A  AR*±FS+Hyp+A
MIL         0.7735  0.5035     0.4691

From Table 2 it can be observed that, as in the previous study (Section 4.2), the variant with more diversity, AR*±FS+Hyp+A in this case, is the one with the lowest MIL. However, when compared with AR*+Hyp+A, we note that the difference in terms of MIL is not very large and, when observing both loss time curves (Fig. 3), we cannot say that one is better than the other. In fact, up to the 100-second mark AR*±FS+Hyp+A appears to be worse than AR*+Hyp+A. In our view, this is because, in many cases, the average ranking includes different variants of the same algorithm in close succession. This may lead to a loss of time spent on testing variants that do not result in any gain. Nevertheless, the variant AR*±FS+Hyp+A is the most complete set of algorithm alternatives and includes all best possible accuracies achievable per dataset, hence its loss time curve eventually reaches 0.

When compared with AR*, the other two portfolios clearly achieve better loss time curves and lower MIL values, confirming that a portfolio that is richer in algorithm versions increases the chances of providing better recommendations. Our results also suggest, however, that it is desirable to provide methods for constructing portfolios that include all important variants but exclude others, in order to avoid losing time testing many "similar" alternatives.

4.4 Comparison between A3R-based AR and Auto-WEKA

To evaluate the effectiveness of our proposed method in practice, we decided to carry out similar experiments using Auto-WEKA [20, 34]. An empirical study was done by comparing the accuracy obtained by each system for the same run time. The Auto-WEKA runs were made using cross-validation with 10 folds: the reference accuracy is the average over all folds; the reference run time is the sum of training and testing time, measured using all data to train and test the model.
Regarding the Auto-WEKA experiment, a required parameter is the time limit, which sets a time budget for the run. The actual run time is, however, somewhat different from the one defined in the parameter, so it needs to be measured. After running, Auto-WEKA outputs one or more recommended model configurations (which always include an algorithm and its hyperparameters and may or may not include a feature selection procedure). For the purpose of this empirical study, we used only the first Auto-WEKA recommendation for each dataset. The time spent training and testing the recommended configuration needs to be added to the given budget. In this study we have used four time budgets: 5, 15, 30 and 60 minutes. The steps to compare both methods are detailed below. They refer to one dataset, but the same steps were repeated for all datasets.

1. Auto-WEKA is run for a predefined time budget. Its actual run time (search time) is measured and the recommended configuration is recorded.
2. The recommended configuration is run within a WEKA Experiment class, using cross-validation with 10 folds. The predictive accuracy of the configuration recommended by Auto-WEKA is obtained.
3. The Auto-WEKA model run time is measured from the tasks of training and testing the recommended configuration over all data.
4. The Auto-WEKA total run time is computed by adding the search time to the recommended model run time. This sum is used to retrieve the accuracy achieved by our system at the same point in time.

One way to compare the overall performances of the algorithm recommendation methods is to count the number of datasets on which a particular method is the overall winner. Table 3 shows the aggregated results for the four Auto-WEKA time budgets, where Win refers to the number of cases where AR*±FS+Hyp+A has higher accuracy than Auto-WEKA. This is a simple comparison that does not consider the statistical significance of the differences. Still, it can be observed that AR*±FS+Hyp+A has more wins for lower Auto-WEKA time budgets.

Table 3. Comparing AR*±FS+Hyp+A with Auto-WEKA

Budget Time (min)  Win  Loss  Ties
5                  35   1     1
15                 31   5     1
30                 29   7     1
60                 25   11    1

5 Conclusion and Future Work

In this paper we have presented a study that exploits the average ranking method AR*, which gives preference to well-performing workflows that are also fast to test. Our portfolios combine algorithm selection with feature selection and a set of hyperparameter configurations. The portfolio with the most diverse set of workflow configurations (AR*±FS+Hyp+A) achieved the best results. The runner-up was AR*+Hyp+A, with quite similar performance. These results confirm that it is indeed important to incorporate hyperparameter configurations into the portfolio, as this can lead to improved performance. Although the addition of extra variants to a given portfolio could, in principle, slow down the process of identifying the best or near-best solution, the use of AR* mitigates this adverse effect, as it opts for fast and well-performing workflows before the others.

Additionally, we have compared the A3R-based average ranking against Auto-WEKA and showed that the proposed method competes quite well with the latter, especially when smaller time budgets are used.

Future work. Our future plan is to devise a method of pruning portfolios of workflows, with the aim of avoiding investing time in testing variants of algorithms that are of a similar nature and have similar performance.
Another line of work could explore approaches based on active testing (AT) and/or surrogate models, whose aim is to select the most promising candidate to test. It would be desirable to use a much larger portfolio to guarantee that the potentially best solutions are not left out. We could collect more test results with alternative feature selection and preprocessing methods, as well as hyperparameter configurations, or even reuse a larger set of results already available through OpenML [35].

We could also extend the comparison with Auto-WEKA to higher time budgets, as well as use other currently available methods to perform further comparisons (e.g., auto-sklearn [11]). We are also considering the investigation of approaches that take into account the costs (run time) of the off-line tests used to gather the meta-data that our proposed method exploits. Their cost could be set to some fraction of the cost of on-line tests (i.e., tests on a new dataset), rather than being ignored altogether.

Acknowledgements

The authors wish to express their gratitude to the following institutions, which have provided funding to support this work:
- Federal Government of Nigeria Tertiary Education Trust Fund, under the TETFund 2012 AST&D Intervention for Kano University of Science and Technology, Wudil, Kano State, Nigeria, for PhD Overseas Training;
- Project NanoSTIMA: Macro-to-Nano Human Sensing: Towards Integrated Multimodal Health Monitoring and Analytics / NORTE-01-0145-FEDER-000016, which is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

1. Abdulrahman, S.M., Brazdil, P.: Measures for Combining Accuracy and Time for Meta-learning. In: Meta-Learning and Algorithm Selection Workshop at ECAI 2014. pp. 49–50 (2014)
2. Abdulrahman, S.M., Brazdil, P., van Rijn, J.N., Vanschoren, J.: Speeding up algorithm selection using average ranking and active testing by introducing runtime. Machine Learning, Special Issue on Metalearning and Algorithm Selection (2017)
3. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems. pp. 2546–2554 (2011)
4. Bischl, B., Kerschke, P., Kotthoff, L., Lindauer, M., Malitsky, Y., Fréchette, A., Hoos, H., Hutter, F., Leyton-Brown, K., Tierney, K., et al.: ASlib: A benchmark library for algorithm selection. Artificial Intelligence 237, 41–58 (2016)
5. Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer Science & Business Media (2008)
6. Brazdil, P., Soares, C., Da Costa, J.P.: Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning 50(3), 251–277 (2003)
7. Chu, C., Hsu, A.L., Chou, K.H., Bandettini, P., Lin, C., Alzheimer's Disease Neuroimaging Initiative: Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images. Neuroimage 60(1), 59–70 (2012)
8. Cuingnet, R., Chupin, M., Benali, H., Colliot, O.: Spatial and anatomical regularization of SVM for brain image analysis. In: Advances in Neural Information Processing Systems. pp. 460–468 (2010)
9. Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research 7, 1–30 (2006)
10. Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K.: Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS Workshop on Bayesian Optimization in Theory and Practice. pp. 1–5 (2013)
11. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. pp. 2962–2970 (2015)
12. Gansterer, W.N., Janecek, A.G., Neumayer, R.: Spam filtering based on latent semantic indexing. In: Survey of Text Mining II, pp. 165–183. Springer (2008)
13. Gomes, T.A., Prudêncio, R.B., Soares, C., Rossi, A.L., Carvalho, A.: Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing 75(1), 3–13 (2012)
14. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
15. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter 11(1), 10–18 (2009)
16. Hall, M.A.: Correlation-based feature selection for machine learning. Ph.D. thesis, The University of Waikato (1999)
17. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507–523 (2011)
18. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1), 273–324 (1997)
19. Kotthoff, L.: LLAMA: Leveraging learning to automatically manage algorithms. arXiv preprint arXiv:1306.1031 (2013)
20. Kotthoff, L., Thornton, C., Hoos, H.H., Hutter, F., Leyton-Brown, K.: Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 17, 1–5 (2016)
21. Leite, R., Brazdil, P.: Active Testing Strategy to Predict the Best Classification Algorithm via Sampling and Metalearning. In: ECAI. pp. 309–314 (2010)
22. Leite, R., Brazdil, P., Vanschoren, J.: Selecting Classification Algorithms with Active Testing. In: Machine Learning and Data Mining in Pattern Recognition, pp. 117–131. Springer (2012)
23. Lin, S.: Rank aggregation methods. WIREs Computational Statistics 2, 555–570 (2010)
24. Lindauer, M., Hoos, H.H., Hutter, F., Schaub, T.: AutoFolio: An automatically configured algorithm selector. Journal of Artificial Intelligence Research 53, 745–778 (2015)
25. Maclaurin, D., Duvenaud, D., Adams, R.P.: Gradient-based hyperparameter optimization through reversible learning. In: Proceedings of the 32nd International Conference on Machine Learning (2015)
26. de Miranda, P.B., Prudêncio, R.B., de Carvalho, A.C.P., Soares, C.: Combining a multi-objective optimization approach with meta-learning for SVM parameter selection. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC). pp. 2909–2914. IEEE (2012)
27. Molina, L.C., Belanche, L., Nebot, A.: Feature selection algorithms: a survey and experimental evaluation. In: Proceedings of the 2002 IEEE International Conference on Data Mining. pp. 306–313. IEEE (2002)
28. Pfahringer, B., Bensusan, H., Giraud-Carrier, C.: Tell me who can learn you and I can tell you who you are: Landmarking various learning algorithms. In: Proceedings of the 17th International Conference on Machine Learning. pp. 743–750 (2000)
29. Reif, M., Shafait, F., Dengel, A.: Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning 87(3), 357–380 (2012)
30. van Rijn, J.N., Abdulrahman, S.M., Brazdil, P., Vanschoren, J.: Fast Algorithm Selection using Learning Curves. In: Advances in Intelligent Data Analysis XIV. Springer (2015)
31. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
32. Smith-Miles, K.A.: Cross-disciplinary Perspectives on Meta-Learning for Algorithm Selection. ACM Computing Surveys (CSUR) 41(1), 6:1–6:25 (2008)
33. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems. pp. 2951–2959 (2012)
34. Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 847–855. ACM (2013)
35. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2013)

Appendix

Table 4. Datasets used in the study.

Dataset               OpenML task id  #Instances  #Attributes  #Classes
anneal                2                   898          39          6
kr-vs-kp              3                 3 196          37          2
letter                6                20 000          17         26
balance-scale         11                  625           5          3
mfeat-factors         12                2 000         217         10
mfeat-fourier         14                2 000          77         10
breast-w              15                  699          10          2
mfeat-karhunen        16                2 000          65         10
mfeat-morphological   18                2 000           7         10
mfeat-pixel           20                2 000         241         10
car                   21                1 728           7          4
mfeat-zernike         22                2 000          48         10
cmc                   23                1 473          10          3
mushroom              24                8 124          23          2
nursery               26               12 960           9          5
optdigits             28                5 620          65         10
credit-a              29                  690          16          2
page-blocks           30                5 473          11          5
credit-g              31                1 000          21          2
pendigits             32               10 992          17         10
cylinder-bands        33                  540          40          2
segment               36                2 310          20          7
diabetes              37                  768           9          2
soybean               41                  683          36         19
spambase              43                4 601          58          2
splice                45                3 190          62          3
tic-tac-toe           49                  958          10          2
vehicle               53                  846          19          4
waveform-5000         58                5 000          41          3
electricity           219              45 312           9          2
solar-flare           2068              1 066          13          3
yeast                 2073              1 484           9         10
satimage              2074              6 430          37          6
abalone               2075              4 177           9         29
kropt                 2076             28 056           7         18
baseball              2077              1 340          18          3
eucalyptus            2079                736          20          5

Table 5. Algorithms used in the study.

#  Algorithm                                              #  Algorithm
1  A1DE                                                   32 Kstar
2  AdaBoostM1 DecisionStump                               33 LADTree
3  AdaBoostM1 IBk                                         34 LMT
4  AdaBoostM1 J48                                         35 LogitBoost DecisionStump
5  AdaBoostM1 LMT                                         36 LWL DecisionStump
6  AdaBoostM1 NaiveBayes                                  37 LWL J48
7  AdaBoostM1 OneR                                        38 LWL RandomTree
8  AdaBoostM1 RandomTree                                  39 MultiBoostAB DecisionStump
9  AdaBoostM1 REPTree                                     40 MultiBoostAB IBk
10 Bagging DecisionStump                                  41 MultiBoostAB J48
11 Bagging IBk                                            42 MultiBoostAB JRip
12 Bagging J48                                            43 MultiBoostAB NaiveBayes
13 Bagging JRip                                           44 MultiBoostAB OneR
14 Bagging LMT                                            45 MultiBoostAB RandomTree
15 Bagging LWL DecisionStump                              46 MultiBoostAB REPTree
16 Bagging NaiveBayes                                     47 MultilayerPerceptron
17 Bagging OneR                                           48 NaiveBayes
18 Bagging RandomTree                                     49 NaiveBayesUpdateable
19 Bagging REPTree                                        50 NBTree
20 BayesNet                                               51 OneR
21 ClassificationViaRegression M5P                        52 PART
22 CVParameterSelection ZeroR                             53 RacedIncrementalLogitBoost DecisionStump
23 Dagging DecisionStump                                  54 RandomCommittee RandomTree
24 DecisionStump                                          55 RandomForest
25 DTNB                                                   56 RandomSubSpace REPTree
26 END ND J48                                             57 RandomTree
27 HoeffdingTree                                          58 REPTree
28 IBk                                                    59 SimpleLogistic
29 IterativeClassifierOptimizer LogitBoost DecisionStump  60 SMO PolyKernel
30 J48                                                    61 SMO RBFKernel
31 JRip                                                   62 ZeroR

Table 6. Alternative hyperparameter configurations.
The hyperparameters that are not mentioned in the table were used with their default values, with the only exception of the number of training epochs (N) in MultilayerPerceptron, which was set to 100 for all experiments instead of the default 500.

Algorithm             Parameter interval*     Parameter description**
RandomForest          P = [80, 100]           Size of each bag, as a percentage of the training set size.
                      K = [0, 250]            Number of attributes to randomly investigate, where 0 = int(log2(#attributes)+1).
                      M = [1, 10]             Minimum number of instances per leaf.
J48                   C = [0.01, 0.25, 0.5]   Confidence threshold for pruning (smaller values incur more pruning).
                      M = [2, 10, 20]         Minimum number of instances per leaf.
SMO PolyKernel        C = [0.1, 1, 10]        Complexity parameter.
                      E = [1, 2]              The exponent of the polynomial kernel.
SMO RBFKernel         C = [0.1, 1, 10]        Complexity parameter.
IBk                   K = [1, 5, 20]          The number of neighbors to use.
                      I = [No, Yes]           Whether to perform distance weighting by 1/distance.
MultilayerPerceptron  H = [a, o]              Number of hidden layers, where 'a' = (#attributes + #classes)/2 and 'o' = #classes.
                      D = [No, Yes]           Whether the learning rate decays as training progresses.

*Parameters in bold are the default. **Adapted from WEKA documentation.