An Empirical Study of Hyperparameter Importance Across Datasets

Jan N. van Rijn and Frank Hutter
University of Freiburg, Germany
{vanrijn,fh}@cs.uni-freiburg.de

Abstract. With the advent of automated machine learning, automated hyperparameter optimization methods are by now routinely used. However, this progress is not yet matched by equal progress on automatic analyses that yield information beyond performance-optimizing hyperparameter settings. Various post-hoc analysis techniques exist to analyze hyperparameter importance, but to the best of our knowledge, so far these have only been applied at a very small scale. To fill this gap, we conduct a large-scale experiment to discover general trends across 100 datasets. The results of case studies with random forests and Adaboost show that the same hyperparameters typically remain most important across datasets. Overall, these results, obtained fully automatically, provide a quantitative basis to focus efforts in both manual algorithm design and in automated hyperparameter optimization.

Keywords: Hyperparameter importance, empirical study

1 Introduction

The performance of most machine learning algorithms depends heavily on their hyperparameter settings. Various methods exist to automatically optimize hyperparameters, including random search [1], Bayesian optimization [13,16,22], evolutionary optimization [18], meta-learning [21,23] and bandit-based methods [17]. All of these techniques require a search space that specifies which hyperparameters to optimize and which ranges to consider for each of them. Currently, these search spaces are designed by experienced machine learning researchers, with decisions typically based on a combination of intuition and trial and error.

Various post-hoc analysis techniques exist that, for a given dataset and algorithm, determine which hyperparameters were most important; examples include Forward Selection [14], functional ANOVA [12], and Ablation Analysis [2,7]. However, to the best of our knowledge, all of these techniques have only been applied to individual datasets. Here, to obtain results that can be deemed more representative, we analyze hyperparameter importance across many different datasets. Specifically, we employ the 100 datasets from the OpenML [19,25] benchmark suite OpenML-100 [3] to determine the most important hyperparameters of random forests [4] and Adaboost [9]. In this preliminary work, we focus on the hyperparameter importance analysis framework of functional ANOVA [12], but in the future we would like to expand it to other methods.

2 Background: Functional ANOVA

The functional ANOVA framework for assessing the hyperparameter importance of algorithms is based on efficient computations of marginal performance: different hyperparameter configurations of an algorithm result in different performances, and functional ANOVA determines, per hyperparameter, how much it contributes to the variance in performance. After introducing notation, we briefly describe these marginal performance computations and then summarize how they can be plugged into the standard functional ANOVA framework to yield the importance of all hyperparameters. For details, we refer the reader to the original paper [12].

Notation. Following the notation of [12], algorithm A has n hyperparameters with domains Θ1, . . . , Θn and configuration space Θ = Θ1 × . . . × Θn. Let N be the set of all hyperparameters of A. An instantiation of A is a vector θ = ⟨θ1, . . . , θn⟩ with θi ∈ Θi (this is also called a configuration of A). A partial instantiation of A is a vector θU = ⟨θ1, . . . , θn⟩ in which the values for a subset U ⊆ N of the hyperparameters are fixed and the values for the other hyperparameters are left unspecified. (Note that from this it follows that θN = θ.)
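To make this notation concrete, the following minimal sketch (our illustration, not from the paper; the hyperparameter names and domains are made up) represents a configuration space Θ, a configuration θ, and a partial instantiation θU as Python dictionaries.

```python
# Illustrative only: a configuration space Theta as a dict of (made-up)
# hyperparameter domains, a full configuration theta, and a partial
# instantiation theta_U that fixes only the hyperparameters in U.
Theta = {
    "bootstrap": [True, False],      # categorical domain
    "max_features": (0.1, 0.9),      # continuous range
    "min_samples_leaf": (1, 20),     # integer range
}
N = set(Theta)                       # the set of all hyperparameters

theta = {"bootstrap": True, "max_features": 0.5, "min_samples_leaf": 2}   # theta_N
U = {"max_features", "min_samples_leaf"}
theta_U = {name: theta[name] for name in U}   # hyperparameters outside U unspecified
```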
Efficient marginal predictions. The marginal performance âU(θU) is defined as the average performance of all complete instantiations θ that agree with θU in the instantiations of the hyperparameters in U. We note that this marginal involves a very large number of terms (even in the case of finite hyperparameter ranges, it is exponential in the number of remaining hyperparameters N \ U). However, for a tree-based model approximating the performance of configurations, the average over these terms can be computed exactly by a procedure that is linear in the number of leaves in the model [12].

Functional ANOVA. Functional ANOVA [10,11,15,24] decomposes a function ŷ : Θ1 × · · · × Θn → R into additive components that only depend on subsets of the hyperparameters N:

\[
\hat{y}(\theta) = \sum_{U \subseteq N} \hat{f}_U(\theta_U) \tag{1}
\]

The components f̂U(θU) are defined as follows:

\[
\hat{f}_U(\theta_U) =
\begin{cases}
\hat{f}_\emptyset & \text{if } U = \emptyset,\\
\hat{a}_U(\theta_U) - \sum_{W \subsetneq U} \hat{f}_W(\theta_W) & \text{otherwise,}
\end{cases} \tag{2}
\]

where the constant f̂∅ is the mean value of the function over its domain. Our main interest is the result of the unary functions f̂{j}(θ{j}), which capture the effect of varying hyperparameter j, averaging across all possible values of all other hyperparameters. Additionally, the functions f̂U(θU) for |U| > 1 capture the interaction effects between all variables in U (excluding effects of subsets W ⊊ U), but studying these higher-order effects is beyond the scope of this preliminary work.

Given the individual components, functional ANOVA decomposes the variance V of ŷ into the contributions VU of all subsets of hyperparameters:

\[
\mathbb{V} = \sum_{U \subseteq N} \mathbb{V}_U, \qquad \text{where} \qquad \mathbb{V}_U = \frac{1}{||\Theta_U||} \int \hat{f}_U(\theta_U)^2 \, d\theta_U, \tag{3}
\]

where 1/||ΘU|| is the probability density of the uniform distribution across ΘU.

Putting it all together. To apply functional ANOVA, we first collect performance data ⟨θi, yi⟩, i = 1, . . . , K, that captures the performance yi (e.g., misclassification rate or RMSE) of an algorithm A with hyperparameter settings θi. We then fit a random forest model to this data and use functional ANOVA to decompose the variance of each of the forest's trees ŷ into contributions due to each subset of hyperparameters. Importantly, based on the fast prediction of marginal performance available for tree-based models, this is an efficient operation requiring only seconds in the experiments for this paper. Overall, based on the performance data ⟨θi, yi⟩, i = 1, . . . , K, functional ANOVA thus provides us with the relative variance contribution of each individual hyperparameter (with the relative variance contributions of all subsets of hyperparameters summing to one).
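To make Eqs. (1)–(3) concrete for unary effects, the following self-contained sketch (our own toy example; the discretized domains and the stand-in performance function are made up, and the paper's actual computation uses the tree-based marginals of [12] rather than full enumeration) computes each hyperparameter's relative variance contribution Vj/V on a fully enumerated grid.

```python
# Toy illustration of the unary functional ANOVA decomposition (Eqs. 1-3):
# all domains are discretized and fully enumerated, so marginals are exact
# averages. The "performance" function is a made-up stand-in for y_hat.
import itertools
import numpy as np

domains = {                      # hypothetical, discretized hyperparameter domains
    "min_samples_leaf": [1, 5, 10, 20],
    "max_features": [0.1, 0.5, 0.9],
    "bootstrap": [0, 1],
}

def performance(min_samples_leaf, max_features, bootstrap):
    """Stand-in for the surrogate prediction y_hat(theta)."""
    return 0.9 - 0.01 * min_samples_leaf - 0.2 * (max_features - 0.5) ** 2 + 0.002 * bootstrap

grid = np.array(list(itertools.product(*domains.values())))   # all configurations
y = np.array([performance(*theta) for theta in grid])

f_empty = y.mean()               # constant component f_hat_emptyset (Eq. 2)
V = y.var()                      # total variance of y_hat over the grid

for j, (name, values) in enumerate(domains.items()):
    # Marginal a_j(theta_j): average over all values of the other hyperparameters.
    marginal = np.array([y[grid[:, j] == v].mean() for v in values])
    f_j = marginal - f_empty     # unary component f_hat_{j} (Eq. 2)
    V_j = np.mean(f_j ** 2)      # its variance contribution (Eq. 3, uniform weights)
    print(f"{name:17s} V_j / V = {V_j / V:.3f}")
```

In this additive toy example the unary contributions sum to one; with interaction effects, part of the variance would instead be attributed to terms with |U| > 1.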
3 Hyperparameter Importance Across Datasets

We report the results of running functional ANOVA on 100 datasets for two classifiers implemented in scikit-learn [5]: random forests [4] and Adaboost (using decision trees as base classifier) [9]. We use almost the same hyperparameters and ranges as auto-sklearn [8], described in detail in Tables 1 and 2, respectively. The only difference with the auto-sklearn search space is the maximal features hyperparameter of random forests, which we model as a fraction of the number of available features (range [0.1, 0.9]). Both algorithms apply the following data preprocessing steps: missing values are imputed (categorical features with the mode; for numerical features, the imputation strategy is itself one of the hyperparameters), categorical features are one-hot-encoded, and constant features are removed.

For each of the 100 datasets in the OpenML-100 benchmark suite [3], we created performance data for these two algorithms by executing random configurations on a large compute cluster. We ensured that each dataset had at least 200 runs to make functional ANOVA's model reliable enough for the small hyperparameter spaces considered here. (We note that for larger hyperparameter spaces, more sophisticated data-gathering strategies are likely required to accurately model the performance of the best configurations.) The exact performance data we used is available on OpenML (RF: https://www.openml.org/f/6969; Adaboost: https://www.openml.org/f/6970).
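As an illustration of this setup, the sketch below (our reconstruction in plain scikit-learn, not the authors' code or the auto-sklearn implementation) draws one random configuration from the ranges of Table 1 and wraps it in the preprocessing steps described above; the column indices are placeholders and the random-search loop is omitted.

```python
# Sketch: one random configuration from Table 1's ranges, wrapped in the
# described preprocessing (imputation, one-hot encoding, constant-feature
# removal). Column indices are placeholders; the random search loop is omitted.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(42)

config = {                                        # one random draw from Table 1
    "bootstrap": bool(rng.choice([True, False])),
    "criterion": str(rng.choice(["gini", "entropy"])),
    "max_features": rng.uniform(0.1, 0.9),        # fraction of features per split
    "min_samples_leaf": rng.randint(1, 21),
    "min_samples_split": rng.randint(2, 21),
}
numeric_imputation = str(rng.choice(["mean", "median", "most_frequent"]))

categorical_cols, numeric_cols = [0, 3], [1, 2, 4]    # placeholder column indices

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),   # mode imputation
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
    ("num", SimpleImputer(strategy=numeric_imputation), numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("drop_constant", VarianceThreshold(threshold=0.0)),       # remove constants
    ("rf", RandomForestClassifier(**config)),
])
# model.fit(X_train, y_train) would then evaluate this configuration on a dataset.
```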
Table 1. Random Forest Hyperparameters.

hyperparameter     | values               | description
bootstrap          | {true, false}        | Whether to use bootstrap samples or the full train set.
split criterion    | {gini, entropy}      | Function to determine the quality of a possible split.
max. features      | [0.1, 0.9]           | Fraction of random features sampled per node.
min. samples leaf  | [1, 20]              | The minimal number of data points required per leaf.
min. samples split | [2, 20]              | The minimal number of data points required to split an internal node.
imputation         | {mean, median, mode} | Strategy for imputing numeric variables.

Fig. 1. Marginal contribution of random forest hyperparameters per dataset. [Violin plots: variance contribution (log scale) per hyperparameter.]

Fig. 2. Most important hyperparameter plotted against dataset dimensions. [Scatter plot: number of data points (x-axis) vs. number of features (y-axis); color indicates the most important hyperparameter.]

Fig. 3. Ranked hyperparameter importance, critical distance based on α = 0.05. [Critical-distance diagram over the average ranks of the six hyperparameters.]

3.1 Hyperparameter Importance of Random Forests

Table 1 and Figures 1–3 present the results of the random forest case study. Table 1 shows the precise hyperparameters and ranges we used.

Figure 1 shows violin plots of the marginal contribution per hyperparameter per dataset. The x-axis shows the hyperparameter under investigation, and each data point represents Vj/V for hyperparameter j. A high value implies that this hyperparameter accounted for a large fraction of the variance, and would therefore cause a large loss in accuracy if not set properly. For two datasets ('tamilnadu-electricity' and 'Mice Protein') we could not perform a functional ANOVA analysis, because the random forest classifier performed very similarly across the various hyperparameter values (almost always perfect accuracy) and functional ANOVA's internal model predicted zero variance.

The results reveal that most of the variance can be attributed to a small set of hyperparameters: the minimal samples per leaf and the maximal number of features considered per split were the most important hyperparameters. It is reasonable to assume that these hyperparameters have some regions that are clearly sub-optimal. For example, at every split point, the random forest should always have at least some features to choose from; if this number is too low, this clearly affects performance. This was also conjectured in earlier work (see Figure 4.8 of [20]).

Figure 2 shows the most important hyperparameter per dataset, plotted against the dataset dimensions (number of data points on the x-axis; number of features on the y-axis). Each data point represents a dataset, the color indicates which of the hyperparameters was most important, and the size indicates the marginal contribution (formally, Vj/V). The results of Figure 2 add to the results presented in Figure 1: minimal samples per leaf and maximal features dominate the other hyperparameters. Only in a few cases was bootstrap ('balance-scale', 'credit-a', 'kc1', 'Australian', 'profb' and 'climate-model-simulation-crashes') or the split criterion ('scene') most important. The imputation strategy was never the most important hyperparameter. This is slightly surprising, since the benchmark suite also contains datasets with many missing values.

Figure 3 shows the result of a Nemenyi test over the average ranks of the hyperparameters (for details, see [6]). A statistically significant difference was measured for every pair of hyperparameters that is not connected by a horizontal black line. The results support the observations made in Figure 1, giving statistical evidence that the minimal samples per leaf and the maximal number of features per split are more important than the other hyperparameters.

3.2 Hyperparameter Importance of Adaboost

Table 2 and Figures 4–6 present the results for the Adaboost case study, structured exactly like the study for random forests. Figure 4 shows that, similarly to the random forest, most of the variance can be explained by a small set of hyperparameters, in this case the maximal depth of the decision trees and, to a lesser degree, the learning rate. Figure 5 shows a similar result: the maximal depth and learning rate were the most important hyperparameters for almost all datasets; there are only a few exceptions where the boosting algorithm ('madelon' and 'LED-display-domain-7digit') or the number of iterations ('steel-plates-fault') was the most important hyperparameter. The results of the Nemenyi test (Figure 6) support the notion that the aforementioned hyperparameters indeed contributed most to the total variance. The maximal depth hyperparameter contributed significantly more than the other hyperparameters, followed by the learning rate. Again, the imputation strategy did not seem to matter.

Table 2. Adaboost Hyperparameters.

hyperparameter | values               | description
algorithm      | {SAMME, SAMME.R}     | Boosting algorithm to use.
learning rate  | [0.01, 2.0]          | Learning rate; shrinks the contribution of each classifier.
max. depth     | [1, 10]              | The maximal depth of the decision trees.
iterations     | [50, 500]            | Number of estimators to build.
imputation     | {mean, median, mode} | Strategy for imputing numeric variables.

Fig. 4. Marginal contribution of Adaboost hyperparameters per dataset. [Violin plots: variance contribution (log scale) per hyperparameter.]

Fig. 5. Most important hyperparameter plotted against dataset dimensions. [Scatter plot: number of data points (x-axis) vs. number of features (y-axis); color indicates the most important hyperparameter.]

Fig. 6. Ranked hyperparameter importance, critical distance based on α = 0.05. [Critical-distance diagram over the average ranks of the five hyperparameters.]
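The critical distances in Figures 3 and 6 follow the Nemenyi test as described by Demšar [6]. The sketch below (our own illustration; the helper name nemenyi_cd and the toy input matrix are hypothetical, and the tabulated q values are the α = 0.05 critical values from that paper) shows how the average ranks and the critical distance can be computed from a datasets × hyperparameters matrix of Vj/V values.

```python
# Sketch (not the authors' code) of the Nemenyi critical-distance computation
# behind Figures 3 and 6, following Demsar [6]: CD = q_alpha * sqrt(k(k+1)/(6N)).
import numpy as np
from scipy.stats import rankdata

# Critical values q_alpha for alpha = 0.05, as tabulated by Demsar [6], keyed by k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def nemenyi_cd(importance):
    """importance: (n_datasets x k) matrix of variance contributions V_j / V."""
    n_datasets, k = importance.shape
    # Rank hyperparameters per dataset; rank 1 = largest variance contribution.
    ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, importance)
    avg_ranks = ranks.mean(axis=0)
    cd = Q_ALPHA_005[k] * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
    return avg_ranks, cd

# Hypothetical toy input: 100 datasets, 6 hyperparameters.
rng = np.random.RandomState(0)
toy_importance = rng.dirichlet(np.ones(6), size=100)
avg_ranks, cd = nemenyi_cd(toy_importance)
print(avg_ranks, cd)
```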
We note that the results presented in this paper by no means imply that it suffices to tune just the aforementioned hyperparameters. On the contrary, when enough budget is available it is still advisable to tune all hyperparameters. However, the results in [12] indicated that focusing on tuning only the most important hyperparameters yields improvements faster; thus, when the tuning budget is small this might be advisable. This could be implemented as a multi-stage process, in which the tuning initially focuses on the most important hyperparameters and gradually incorporates the less important ones as well. We leave it to future work to test under which circumstances focusing on a subset of hyperparameters indeed yields better results.

4 Conclusions and Future Work

We conducted a large-scale study of hyperparameter importance using functional ANOVA on OpenML datasets. Indeed, functional ANOVA and OpenML complement each other quite well: the experimental results available on OpenML can serve as input for functional ANOVA (or any other model-based hyperparameter importance analysis technique) and can be used to assess which hyperparameters are most important. To the best of our knowledge, this is the first large-scale study over many datasets assessing which hyperparameters are commonly important. We did this for two popular tree-based methods, random forests and Adaboost, resulting in quantifiable measures of their hyperparameters' importance across datasets.

In future work, we aim to assess hyperparameter importance in good parts of the space (this is already supported by functional ANOVA by assessing improvements over a baseline [12]), to also use other analysis techniques for hyperparameter importance across datasets, and to extend this research towards other model types. In particular, deep neural networks are known to be very sensitive to some of their hyperparameters, and quantifiable recommendations about which hyperparameters to focus on would be very useful for the community.

Finally, we aim to use the knowledge obtained from hyperparameter importance analysis to prune the hyperparameter search space (see also [26]). Specifically, experiments across datasets could point out that for new datasets, certain hyperparameters or ranges are not worthwhile to explore; exploiting this could potentially lead to large speed-ups for hyperparameter optimization.

Acknowledgements. This work has partly been supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721.

References

1. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012)
2. Biedenkapp, A., Lindauer, M., Eggensperger, K., Fawcett, C., Hoos, H., Hutter, F.: Efficient parameter importance analysis via ablation with surrogates. In: Proc. of AAAI 2017. pp. 773–779 (2017)
3. Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: OpenML Benchmarking Suites and the OpenML100. ArXiv [stat.ML] 1708.03731v1, 6 pages (2017)
4. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
5. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., VanderPlas, J., Joly, A., Holt, B., Varoquaux, G.: API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning. pp. 108–122 (2013)
6. Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. The Journal of Machine Learning Research 7, 1–30 (2006)
7. Fawcett, C., Hoos, H.H.: Analysing differences between algorithm configurations through ablation. Journal of Heuristics 22(4), 431–458 (2016)
8. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems 28, pp. 2962–2970. Curran Associates, Inc. (2015)
9. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: European Conference on Computational Learning Theory. pp. 23–37. Springer (1995)
10. Hooker, G.: Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics 16(3), 709–732 (2007)
11. Huang, J.Z., et al.: Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics 26(1), 242–272 (1998)
12. Hutter, F., Hoos, H., Leyton-Brown, K.: An efficient approach for assessing hyperparameter importance. In: Proc. of ICML 2014. pp. 754–762 (2014)
13. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization. pp. 507–523. Springer (2011)
14. Hutter, F., Hoos, H.H., Leyton-Brown, K.: Identifying key algorithm parameters and instance features using forward selection. In: International Conference on Learning and Intelligent Optimization. pp. 364–381. Springer (2013)
15. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4), 455–492 (1998)
16. Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian optimization of machine learning hyperparameters on large datasets. In: Proc. of AISTATS 2017 (2017)
17. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization. In: Proc. of ICLR 2017 (2017)
18. Loshchilov, I., Hutter, F.: CMA-ES for hyperparameter optimization of deep neural networks. ArXiv [cs.NE] 1604.07269v1, 8 pages (2016)
19. van Rijn, J.N., Bischl, B., Torgo, L., Gao, B., Umaashankar, V., Fischer, S., Winter, P., Wiswedel, B., Berthold, M.R., Vanschoren, J.: OpenML: A Collaborative Science Platform. In: Proc. of ECML/PKDD 2013, pp. 645–649. Springer (2013)
20. van Rijn, J.N.: Massively Collaborative Machine Learning. Ph.D. thesis, Leiden University (2016)
21. van Rijn, J.N., Abdulrahman, S.M., Brazdil, P., Vanschoren, J.: Fast Algorithm Selection using Learning Curves. In: Advances in Intelligent Data Analysis XIV. pp. 298–309. Springer (2015)
22. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems 25. pp. 2951–2959 (2012)
23. Soares, C., Brazdil, P.B., Kuba, P.: A meta-learning method to select the kernel width in support vector regression. Machine Learning 54(3), 195–209 (2004)
24. Sobol, I.M.: Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments 1(4), 407–414 (1993)
25. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15(2), 49–60 (2014)
26. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Hyperparameter search space pruning – a new component for sequential model-based hyperparameter optimization. In: Proc. of ECML/PKDD 2015. pp. 104–119. Springer (2015)