An Empirical Study of Hyperparameter Importance Across Datasets

Jan N. van Rijn and Frank Hutter
University of Freiburg, Germany
{vanrijn,fh}@cs.uni-freiburg.de

Abstract. With the advent of automated machine learning, automated hyperparameter optimization methods are by now routinely used. However, this progress is not yet matched by equal progress on automatic analyses that yield information beyond performance-optimizing hyperparameter settings. Various post-hoc analysis techniques exist to analyze hyperparameter importance, but to the best of our knowledge, so far these have only been applied at a very small scale. To fill this gap, we conduct a large-scale experiment to discover general trends across 100 datasets. The results of case studies with random forests and Adaboost show that the same hyperparameters typically remain most important across datasets. Overall, these results, obtained fully automatically, provide a quantitative basis to focus efforts in both manual algorithm design and in automated hyperparameter optimization.

Keywords: Hyperparameter importance, empirical study

1 Introduction

The performance of most machine learning algorithms depends heavily on their hyperparameter settings. Various methods exist to automatically optimize hyperparameters, including random search [1], Bayesian optimization [13,16,22], evolutionary optimization [18], meta-learning [21,23] and bandit-based methods [17]. All of these techniques require a search space that specifies which hyperparameters to optimize and which ranges to consider for each of them. Currently, these search spaces are designed by experienced machine learning researchers, with decisions typically based on a combination of intuition and trial and error.

Various post-hoc analysis techniques exist that, for a given dataset and algorithm, determine which hyperparameters were most important; examples include Forward Selection [14], functional ANOVA [12], and Ablation Analysis [2,7]. However, to the best of our knowledge, all of these techniques have only been applied to individual datasets. Here, to obtain results that can be deemed more representative, we analyze hyperparameter importance across many different datasets. Specifically, we employ the 100 datasets from the OpenML [19,25] benchmark suite OpenML-100 [3] to determine the most important hyperparameters of random forests [4] and Adaboost [9]. In this preliminary work, we focus on the hyperparameter importance analysis framework of functional ANOVA [12], but in the future we would like to expand it to other methods.

2 Background: Functional ANOVA

The functional ANOVA framework for assessing the hyperparameter importance of algorithms is based on efficient computations of marginal performance: different hyperparameter configurations of an algorithm result in different performances, and functional ANOVA determines, per hyperparameter, how much it contributes to the variance in performance. After introducing notation, we briefly describe these marginal performance computations and then summarize how they can be plugged into the standard functional ANOVA framework to yield the importance of all hyperparameters. For details, we refer the reader to the original paper [12].

Notation. Following the notation of [12], algorithm A has n hyperparameters with domains Θ1, . . . , Θn and configuration space Θ = Θ1 × . . . × Θn. Let N be the set of all hyperparameters of A. An instantiation of A is a vector θ = ⟨θ1, . . . , θn⟩ with θi ∈ Θi (this is also called a configuration of A). A partial instantiation of A is a vector θU = ⟨θ1, . . . , θn⟩ in which the values for a subset U ⊆ N of the hyperparameters are fixed and the values for the other hyperparameters are left unspecified. (Note that from this it follows that θN = θ.)
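To make this notation concrete, the following minimal sketch (our illustration, not from the paper; the hyperparameter names and domains are made up) represents a configuration space Θ, a configuration θ, and a partial instantiation θU as Python dictionaries.

```python
# Illustrative only: a configuration space Theta as a dict of (made-up)
# hyperparameter domains, a full configuration theta, and a partial
# instantiation theta_U that fixes only the hyperparameters in U.
Theta = {
    "bootstrap": [True, False],      # categorical domain
    "max_features": (0.1, 0.9),      # continuous range
    "min_samples_leaf": (1, 20),     # integer range
}
N = set(Theta)                       # the set of all hyperparameters

theta = {"bootstrap": True, "max_features": 0.5, "min_samples_leaf": 2}   # theta_N
U = {"max_features", "min_samples_leaf"}
theta_U = {name: theta[name] for name in U}   # hyperparameters outside U unspecified
```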
Efficient marginal predictions. The marginal performance âU(θU) is defined as the average performance of all complete instantiations θ that agree with θU in the instantiations of the hyperparameters in U. We note that this marginal involves a very large number of terms (even in the case of finite hyperparameter ranges, it is exponential in the number of remaining hyperparameters N \ U). However, for a tree-based model approximating the performance of configurations, the average over these terms can be computed exactly by a procedure that is linear in the number of leaves in the model [12].

Functional ANOVA. Functional ANOVA [10,11,15,24] decomposes a function ŷ : Θ1 × · · · × Θn → R into additive components that only depend on subsets of the hyperparameters N:

\[
\hat{y}(\theta) = \sum_{U \subseteq N} \hat{f}_U(\theta_U) \tag{1}
\]

The components f̂U(θU) are defined as follows:

\[
\hat{f}_U(\theta_U) =
\begin{cases}
\hat{f}_\emptyset & \text{if } U = \emptyset,\\
\hat{a}_U(\theta_U) - \sum_{W \subsetneq U} \hat{f}_W(\theta_W) & \text{otherwise,}
\end{cases} \tag{2}
\]

where the constant f̂∅ is the mean value of the function over its domain. Our main interest is the result of the unary functions f̂{j}(θ{j}), which capture the effect of varying hyperparameter j, averaging across all possible values of all other hyperparameters. Additionally, the functions f̂U(θU) for |U| > 1 capture the interaction effects between all variables in U (excluding effects of subsets W ⊊ U), but studying these higher-order effects is beyond the scope of this preliminary work.

Given the individual components, functional ANOVA decomposes the variance V of ŷ into the contributions VU of all subsets of hyperparameters:

\[
\mathbb{V} = \sum_{U \subseteq N} \mathbb{V}_U, \qquad \text{where} \qquad \mathbb{V}_U = \frac{1}{||\Theta_U||} \int \hat{f}_U(\theta_U)^2 \, d\theta_U, \tag{3}
\]

where 1/||ΘU|| is the probability density of the uniform distribution across ΘU.

Putting it all together. To apply functional ANOVA, we first collect performance data ⟨θi, yi⟩, i = 1, . . . , K, that captures the performance yi (e.g., misclassification rate or RMSE) of an algorithm A with hyperparameter settings θi. We then fit a random forest model to this data and use functional ANOVA to decompose the variance of each of the forest's trees ŷ into contributions due to each subset of hyperparameters. Importantly, based on the fast prediction of marginal performance available for tree-based models, this is an efficient operation requiring only seconds in the experiments for this paper. Overall, based on the performance data ⟨θi, yi⟩, i = 1, . . . , K, functional ANOVA thus provides us with the relative variance contribution of each individual hyperparameter (with the relative variance contributions of all subsets of hyperparameters summing to one).
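To make Eqs. (1)–(3) concrete for unary effects, the following self-contained sketch (our own toy example; the discretized domains and the stand-in performance function are made up, and the paper's actual computation uses the tree-based marginals of [12] rather than full enumeration) computes each hyperparameter's relative variance contribution Vj/V on a fully enumerated grid.

```python
# Toy illustration of the unary functional ANOVA decomposition (Eqs. 1-3):
# all domains are discretized and fully enumerated, so marginals are exact
# averages. The "performance" function is a made-up stand-in for y_hat.
import itertools
import numpy as np

domains = {                      # hypothetical, discretized hyperparameter domains
    "min_samples_leaf": [1, 5, 10, 20],
    "max_features": [0.1, 0.5, 0.9],
    "bootstrap": [0, 1],
}

def performance(min_samples_leaf, max_features, bootstrap):
    """Stand-in for the surrogate prediction y_hat(theta)."""
    return 0.9 - 0.01 * min_samples_leaf - 0.2 * (max_features - 0.5) ** 2 + 0.002 * bootstrap

grid = np.array(list(itertools.product(*domains.values())))   # all configurations
y = np.array([performance(*theta) for theta in grid])

f_empty = y.mean()               # constant component f_hat_emptyset (Eq. 2)
V = y.var()                      # total variance of y_hat over the grid

for j, (name, values) in enumerate(domains.items()):
    # Marginal a_j(theta_j): average over all values of the other hyperparameters.
    marginal = np.array([y[grid[:, j] == v].mean() for v in values])
    f_j = marginal - f_empty     # unary component f_hat_{j} (Eq. 2)
    V_j = np.mean(f_j ** 2)      # its variance contribution (Eq. 3, uniform weights)
    print(f"{name:17s} V_j / V = {V_j / V:.3f}")
```

In this additive toy example the unary contributions sum to one; with interaction effects, part of the variance would instead be attributed to terms with |U| > 1.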
3 Hyperparameter Importance Across Datasets

We report the results of running functional ANOVA on 100 datasets for two classifiers implemented in scikit-learn [5]: random forests [4] and Adaboost (using decision trees as base classifier) [9]. We use almost the same hyperparameters and ranges as auto-sklearn [8], described in detail in Tables 1 and 2, respectively. The only difference with the auto-sklearn search space is the maximal features hyperparameter of random forests, which we model as a fraction of the number of available features (range [0.1, 0.9]). Both algorithms apply the following data preprocessing steps: missing values are imputed (categorical features with the mode; for numerical features, the imputation strategy is itself one of the hyperparameters), categorical features are one-hot-encoded, and constant features are removed.

For each of the 100 datasets in the OpenML-100 benchmark suite [3], we created performance data for these two algorithms by executing random configurations on a large compute cluster. We ensured that each dataset had at least 200 runs to make functional ANOVA's model reliable enough for the small hyperparameter spaces considered here. (We note that for larger hyperparameter spaces, more sophisticated data-gathering strategies are likely required to accurately model the performance of the best configurations.) The exact performance data we used is available on OpenML (RF: https://www.openml.org/f/6969; Adaboost: https://www.openml.org/f/6970).
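As an illustration of this setup, the sketch below (our reconstruction in plain scikit-learn, not the authors' code or the auto-sklearn implementation) draws one random configuration from the ranges of Table 1 and wraps it in the preprocessing steps described above; the column indices are placeholders and the random-search loop is omitted.

```python
# Sketch: one random configuration from Table 1's ranges, wrapped in the
# described preprocessing (imputation, one-hot encoding, constant-feature
# removal). Column indices are placeholders; the random search loop is omitted.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(42)

config = {                                        # one random draw from Table 1
    "bootstrap": bool(rng.choice([True, False])),
    "criterion": str(rng.choice(["gini", "entropy"])),
    "max_features": rng.uniform(0.1, 0.9),        # fraction of features per split
    "min_samples_leaf": rng.randint(1, 21),
    "min_samples_split": rng.randint(2, 21),
}
numeric_imputation = str(rng.choice(["mean", "median", "most_frequent"]))

categorical_cols, numeric_cols = [0, 3], [1, 2, 4]    # placeholder column indices

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),   # mode imputation
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
    ("num", SimpleImputer(strategy=numeric_imputation), numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("drop_constant", VarianceThreshold(threshold=0.0)),       # remove constants
    ("rf", RandomForestClassifier(**config)),
])
# model.fit(X_train, y_train) would then evaluate this configuration on a dataset.
```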
Table 1. Random Forest Hyperparameters.

hyperparameter     | values               | description
bootstrap          | {true, false}        | Whether to use bootstrap samples or the full train set.
split criterion    | {gini, entropy}      | Function to determine the quality of a possible split.
max. features      | [0.1, 0.9]           | Fraction of random features sampled per node.
min. samples leaf  | [1, 20]              | The minimal number of data points required per leaf.
min. samples split | [2, 20]              | The minimal number of data points required to split an internal node.
imputation         | {mean, median, mode} | Strategy for imputing numeric variables.

Fig. 1. Marginal contribution of random forest hyperparameters per dataset. [Violin plots: variance contribution (log scale) per hyperparameter.]

Fig. 2. Most important hyperparameter plotted against dataset dimensions. [Scatter plot: number of data points (x-axis) vs. number of features (y-axis); color indicates the most important hyperparameter.]

Fig. 3. Ranked hyperparameter importance, critical distance based on α = 0.05. [Critical-distance diagram over the average ranks of the six hyperparameters.]

3.1 Hyperparameter Importance of Random Forests

Table 1 and Figures 1–3 present the results of the random forest case study. Table 1 shows the precise hyperparameters and ranges we used.

Figure 1 shows violin plots of the marginal contribution per hyperparameter per dataset. The x-axis shows the hyperparameter under investigation, and each data point represents Vj/V for hyperparameter j. A high value implies that this hyperparameter accounted for a large fraction of the variance, and would therefore cause a large loss in accuracy if not set properly. For two datasets ('tamilnadu-electricity' and 'Mice Protein') we could not perform a functional ANOVA analysis, because the random forest classifier performed very similarly across the various hyperparameter values (almost always perfect accuracy) and functional ANOVA's internal model predicted zero variance.

The results reveal that most of the variance can be attributed to a small set of hyperparameters: the minimal samples per leaf and the maximal number of features considered per split were the most important hyperparameters. It is reasonable to assume that these hyperparameters have some regions that are clearly sub-optimal. For example, at every split point, the random forest should always have at least some features to choose from; if this number is too low, this clearly affects performance. This was also conjectured in earlier work (see Figure 4.8 of [20]).

Figure 2 shows the most important hyperparameter per dataset, plotted against the dataset dimensions (number of data points on the x-axis; number of features on the y-axis). Each data point represents a dataset, the color indicates which of the hyperparameters was most important, and the size indicates the marginal contribution (formally, Vj/V). The results of Figure 2 add to the results presented in Figure 1: minimal samples per leaf and maximal features dominate the other hyperparameters. Only in a few cases was bootstrap ('balance-scale', 'credit-a', 'kc1', 'Australian', 'profb' and 'climate-model-simulation-crashes') or the split criterion ('scene') most important. The imputation strategy was never the most important hyperparameter. This is slightly surprising, since the benchmark suite also contains datasets with many missing values.

Figure 3 shows the result of a Nemenyi test over the average ranks of the hyperparameters (for details, see [6]). A statistically significant difference was measured for every pair of hyperparameters that is not connected by a horizontal black line. The results support the observations made in Figure 1, giving statistical evidence that the minimal samples per leaf and the maximal number of features per split are more important than the other hyperparameters.

3.2 Hyperparameter Importance of Adaboost

Table 2 and Figures 4–6 present the results for the Adaboost case study, structured exactly like the study for random forests. Figure 4 shows that, similarly to the random forest, most of the variance can be explained by a small set of hyperparameters, in this case the maximal depth of the decision trees and, to a lesser degree, the learning rate. Figure 5 shows a similar result: the maximal depth and learning rate were the most important hyperparameters for almost all datasets; there are only a few exceptions where the boosting algorithm ('madelon' and 'LED-display-domain-7digit') or the number of iterations ('steel-plates-fault') was the most important hyperparameter. The results of the Nemenyi test (Figure 6) support the notion that the aforementioned hyperparameters indeed contributed most to the total variance. The maximal depth hyperparameter contributed significantly more than the other hyperparameters, followed by the learning rate. Again, the imputation strategy did not seem to matter.

Table 2. Adaboost Hyperparameters.

hyperparameter | values               | description
algorithm      | {SAMME, SAMME.R}     | Boosting algorithm to use.
learning rate  | [0.01, 2.0]          | Learning rate; shrinks the contribution of each classifier.
max. depth     | [1, 10]              | The maximal depth of the decision trees.
iterations     | [50, 500]            | Number of estimators to build.
imputation     | {mean, median, mode} | Strategy for imputing numeric variables.

Fig. 4. Marginal contribution of Adaboost hyperparameters per dataset. [Violin plots: variance contribution (log scale) per hyperparameter.]

Fig. 5. Most important hyperparameter plotted against dataset dimensions. [Scatter plot: number of data points (x-axis) vs. number of features (y-axis); color indicates the most important hyperparameter.]

Fig. 6. Ranked hyperparameter importance, critical distance based on α = 0.05. [Critical-distance diagram over the average ranks of the five hyperparameters.]
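The critical distances in Figures 3 and 6 follow the Nemenyi test as described by Demšar [6]. The sketch below (our own illustration; the helper name nemenyi_cd and the toy input matrix are hypothetical, and the tabulated q values are the α = 0.05 critical values from that paper) shows how the average ranks and the critical distance can be computed from a datasets × hyperparameters matrix of Vj/V values.

```python
# Sketch (not the authors' code) of the Nemenyi critical-distance computation
# behind Figures 3 and 6, following Demsar [6]: CD = q_alpha * sqrt(k(k+1)/(6N)).
import numpy as np
from scipy.stats import rankdata

# Critical values q_alpha for alpha = 0.05, as tabulated by Demsar [6], keyed by k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def nemenyi_cd(importance):
    """importance: (n_datasets x k) matrix of variance contributions V_j / V."""
    n_datasets, k = importance.shape
    # Rank hyperparameters per dataset; rank 1 = largest variance contribution.
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, importance)
    avg_ranks = ranks.mean(axis=0)
    cd = Q_ALPHA_005[k] * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
    return avg_ranks, cd

# Hypothetical toy input: 100 datasets, 6 hyperparameters.
rng = np.random.RandomState(0)
toy_importance = rng.dirichlet(np.ones(6), size=100)
avg_ranks, cd = nemenyi_cd(toy_importance)
print(avg_ranks, cd)
```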
We note that the results presented in this paper by no means imply that it suffices to tune just the aforementioned hyperparameters. On the contrary, when enough budget is available it is still advisable to tune all hyperparameters. However, the results in [12] indicated that focusing on tuning only the most important hyperparameters yields improvements faster; thus, when the tuning budget is small this might be advisable. This could be implemented as a multi-stage process, in which the tuning initially focuses on the most important hyperparameters and gradually incorporates the less important ones as well. We leave it to future work to test under which circumstances focusing on a subset of hyperparameters indeed yields better results.

4 Conclusions and Future Work

We conducted a large-scale study of hyperparameter importance using functional ANOVA on OpenML datasets. Indeed, functional ANOVA and OpenML complement each other quite well: the experimental results available on OpenML can serve as input for functional ANOVA (or any other model-based hyperparameter importance analysis technique) and can be used to assess which hyperparameters are most important. To the best of our knowledge, this is the first large-scale study over many datasets assessing which hyperparameters are commonly important. We did this for two popular tree-based methods, random forests and Adaboost, resulting in quantifiable measures of their hyperparameters' importance across datasets.

In future work, we aim to assess hyperparameter importance in good parts of the space (this is already supported by functional ANOVA by assessing improvements over a baseline [12]), to also use other analysis techniques for hyperparameter importance across datasets, and to extend this research towards other model types. In particular, deep neural networks are known to be very sensitive to some of their hyperparameters, and quantifiable recommendations about which hyperparameters to focus on would be very useful for the community.

Finally, we aim to use the knowledge obtained from hyperparameter importance analysis to prune the hyperparameter search space (see also [26]). Specifically, experiments across datasets could point out that for new datasets, certain hyperparameters or ranges are not worthwhile to explore; exploiting this could potentially lead to large speed-ups for hyperparameter optimization.

Acknowledgements. This work has partly been supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721.

References

1. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012)
2. Biedenkapp, A., Lindauer, M., Eggensperger, K., Fawcett, C., Hoos, H., Hutter, F.: Efficient parameter importance analysis via ablation with surrogates. In: Proc. of AAAI 2017. pp. 773–779 (2017)
3. Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: OpenML Benchmarking Suites and the OpenML100. ArXiv [stat.ML] 1708.03731v1, 6 pages (2017)
4. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
5. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., Varoquaux, G.: API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning. pp. 108–122 (2013)
6. Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research 7, 1–30 (2006)
7. Fawcett, C., Hoos, H.H.: Analysing differences between algorithm configurations through ablation. Journal of Heuristics 22(4), 431–458 (2016)
8. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems 28, pp. 2962–2970. Curran Associates, Inc. (2015)
9. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory. pp. 23–37. Springer (1995)
10. Hooker, G.: Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics 16(3), 709–732 (2007)
11. Huang, J.Z., et al.: Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics 26(1), 242–272 (1998)
12. Hutter, F., Hoos, H., Leyton-Brown, K.: An efficient approach for assessing hyperparameter importance. In: Proc. of ICML 2014. pp. 754–762 (2014)
13. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507–523. Springer (2011)
14. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Identifying key algorithm parameters and instance features using forward selection. In: International Conference on Learning and Intelligent Optimization. pp. 364–381. Springer (2013)
15. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4), 455–492 (1998)
16. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian optimization of machine learning hyperparameters on large datasets. In: Proc. of AISTATS 2017 (2017)
17. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization. In: Proc. of ICLR 2017 (2017)
18. Loshchilov, I., Hutter, F.: CMA-ES for hyperparameter optimization of deep neural networks. ArXiv [cs.NE] 1604.07269v1, 8 pages (2016)
19. van Rijn, J.N., Bischl, B., Torgo, L., Gao, B., Umaashankar, V., Fischer, S., Winter, P., Wiswedel, B., Berthold, M.R., Vanschoren, J.: OpenML: A Collaborative Science Platform. In: Proc. of ECML/PKDD 2013, pp. 645–649. Springer (2013)
20. van Rijn, J.N.: Massively Collaborative Machine Learning. Ph.D. thesis, Leiden University (2016)
21. van Rijn, J.N., Abdulrahman, S.M., Brazdil, P., Vanschoren, J.: Fast Algorithm Selection using Learning Curves. In: Advances in Intelligent Data Analysis XIV. pp. 298–309. Springer (2015)
22. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems 25. pp. 2951–2959 (2012)
23. Soares, C., Brazdil, P.B., Kuba, P.: A meta-learning method to select the kernel width in support vector regression. Machine Learning 54(3), 195–209 (2004)
24. Sobol, I.M.: Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments 1(4), 407–414 (1993)
25. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)
26. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Hyperparameter search space pruning – a new component for sequential model-based hyperparameter optimization. In: Proc. of ECML/PKDD 2015. pp. 104–119. Springer (2015)