Surrogate Benchmarks for Hyperparameter Optimization

Katharina Eggensperger1 and Frank Hutter1 and Holger H. Hoos2 and Kevin Leyton-Brown2

1 University of Freiburg, email: {eggenspk,fh}@cs.uni-freiburg.de
2 University of British Columbia, email: {hoos,kevinlb}@cs.ubc.ca

Abstract. Since hyperparameter optimization is crucial for achieving peak performance with many machine learning algorithms, an active research community has formed around this problem in the last few years. The evaluation of new hyperparameter optimization techniques against the state of the art requires a set of benchmarks. Because such evaluations can be very expensive, early experiments are often performed using synthetic test functions rather than using real-world hyperparameter optimization problems. However, there can be a wide gap between the two kinds of problems. In this work, we introduce another option: cheap-to-evaluate surrogates of real hyperparameter optimization benchmarks that share the same hyperparameter spaces and feature similar response surfaces. Specifically, we train regression models on data describing a machine learning algorithm's performance under a wide range of hyperparameter configurations, and then cheaply evaluate hyperparameter optimization methods using the model's performance predictions in lieu of the real algorithm. We evaluate the effectiveness of using a wide range of regression techniques to build these surrogate benchmarks, both in terms of how well they predict the performance of new configurations and of how much they affect the overall performance of hyperparameter optimizers. Overall, we found that surrogate benchmarks based on random forests performed best: for benchmarks with few hyperparameters they yielded almost perfect surrogates, and for benchmarks with more complex hyperparameter spaces they still yielded surrogates that were qualitatively similar to the real benchmarks they model.
1 Introduction

The performance of many machine learning methods depends critically on hyperparameter settings and thus on the method used to set such hyperparameters. Recently, sequential model-based Bayesian optimization methods, such as SMAC [16], TPE [2], and Spearmint [29], have been shown to outperform more traditional methods for this problem (such as grid search and random search [3]) and to rival—and in some cases surpass—human domain experts in finding good hyperparameter settings [29, 30, 5]. One obstacle to further progress in this nascent field is a paucity of reproducible experiments and empirical studies. Until recently, a study introducing a new hyperparameter optimizer would typically also introduce a new set of hyperparameter optimization benchmarks, on which the optimizer would be demonstrated to achieve state-of-the-art performance (as compared to, e.g., human domain experts). The introduction of the hyperparameter optimization library (HPOlib [8]), which offers a unified interface to different optimizers and benchmarks, has made it easier to reuse previous benchmarks and to systematically compare different approaches [4].

However, a substantial problem remains: performing a hyperparameter optimization experiment requires running the underlying machine learning algorithm, often at least hundreds of times. This is infeasible in many cases. The first (mundane, but often significant) obstacle is to get someone else's research code working on one's own system—including resolving dependencies and acquiring required software licenses—and to acquire the appropriate input data. Furthermore, some code requires specialized hardware; most notably, general-purpose graphics processing units (GPGPUs) have become a standard requirement for the effective training of modern deep learning architectures [20, 21]. Finally, the computational expense of hyperparameter optimization can be prohibitive for research groups lacking access to large compute clusters. These problems represent a considerable barrier to the evaluation of new hyperparameter optimization algorithms on the most challenging and interesting hyperparameter optimization benchmarks, such as deep belief networks [2], convolutional neural networks [29, 5], and combined model selection and hyperparameter optimization in machine learning frameworks [30].

Given this high overhead for studying complex hyperparameter optimization benchmarks, most researchers have drawn on simple, synthetic test functions from the global continuous optimization community [12]. While these are simple to use, they are often poorly representative of the hyperparameter optimization problem: in contrast to the response surfaces of actual such problems, these synthetic test functions are smooth and often have unrealistic shapes. Furthermore, they only involve real-valued parameters and hence do not incorporate the categorical and conditional parameters typical of actual hyperparameter optimization benchmarks.

In the special case of small, finite hyperparameter spaces, a much better alternative is simply to record the performance of every hyperparameter configuration, thereby speeding future evaluations via a table lookup. The result is a perfect surrogate of an algorithm's true performance that takes time O(1) to compute (using a hash) and that can be used in place of actually running the algorithm and evaluating its performance. This table-based surrogate can trivially be transported to any new system, without the complicating factors involved in running the original algorithm (setup, special hardware requirements, licensing, computational cost, etc.). In fact, several researchers have already applied this approach to simplifying their experiments: for example, Bardenet et al. [1] saved the performances of a parameter grid with 108 points of Adaboost on 29 datasets, and Snoek et al. [29] saved the performance of parameter grids with 1400 and 288 points for a structured SVM [31] and an online LDA [13], respectively. The latter two benchmarks are part of HPOlib and are, in fact, HPOlib's most frequently used benchmarks, due to their simplicity of setup and low computational cost.

Of course, the drawback of this table lookup idea is that it is limited to small, finite hyperparameter spaces. Here, we generalize the idea of machine learning algorithm surrogates to arbitrary, potentially high-dimensional hyperparameter spaces (including, e.g., real-valued, categorical, and conditional hyperparameters). As in the table-lookup strategy, we first evaluate many hyperparameter configurations during an expensive offline phase. We then use the resulting performance data to train a regression model to approximate future evaluations via model predictions. As before, we obtain a surrogate of algorithm performance that is cheap to evaluate and trivially portable. However, model-based surrogates offer only approximate representations of performance. Thus, a key component of our work presented in the following is an investigation of the quality of these approximations.

We are not the first to propose the use of learned surrogate models that stand in for computationally complex functions. In the field of metalearning [6], regression models have been extensively used to predict the performance of algorithms across various datasets based on dataset features [11, 26]. The statistics literature on the design and analysis of computer experiments (DACE) [27, 28] uses similar surrogate models to guide a sequential experimental design strategy aiming to achieve either an overall strong model fit or to identify the minimum of a function. Similarly, the SUrrogate MOdeling (SUMO) Matlab toolkit [10] provides an environment for building regression models to describe the outputs of expensive computer simulations based on active learning. Such an approach for finding the minimum of a blackbox function also underlies the sequential model-based Bayesian optimization framework [7, 16] (SMBO, the framework underlying all hyperparameter optimizers we study here). While all of these lines of work incrementally construct surrogate models of a function in order to inform an active learning criterion that determines new inputs to evaluate, our work differs in its goals: we train surrogates on a set of data gathered offline (by some arbitrary process—in our case the combination of many complete runs of several different SMBO methods plus random search) and use the resulting surrogates as stand-in models for the entire hyperparameter optimization benchmark.

The surrogate benchmarks resulting from our work can be used in several different ways. Firstly, like synthetic test functions and table lookups, they can be used for extensive debugging and unit testing. Since the large computational expense of running hyperparameter optimizers is typically dominated by the cost of evaluating algorithm performance under different selected hyperparameters, our benchmarks can also substantially reduce the time required for running a hyperparameter optimizer, facilitating whitebox tests of an optimizer using exactly the hyperparameter space of the machine learning algorithm whose performance is modelled by the surrogate. This functionality is gained even if the surrogate model only fits algorithm performance quite poorly (e.g., due to a lack of sufficient training data). Finally, a surrogate benchmark whose model fits algorithm performance very well can also facilitate the evaluation of new features inside the hyperparameter optimizer, or even the (meta-)optimization of a hyperparameter optimizer's own hyperparameters (which can be useful even without the use of surrogates, but is typically extremely expensive [17]).

The rest of this paper is laid out as follows. We first provide some background on hyperparameter optimization (Section 2). Then, we discuss our methodology for building surrogate benchmarks (Section 3) using several types of machine learning models. Next, we evaluate the performance of these surrogates in practice (Section 4). We demonstrate that random forest models tend to fit the data better than a broad range of competing models, both in terms of raw predictive model performance and in terms of the usefulness of the resulting surrogate benchmark for comparing hyperparameter optimization procedures.

2 Background: Hyperparameter Optimization

The construction of machine learning models typically gives rise to two optimization problems. The first is internal optimization, such as selecting a neural network's likelihood-maximizing weights; the second is tuning the method's hyperparameters, such as setting a neural network's regularization parameters or number of neurons. The former problem is closely coupled with the machine learning algorithm at hand and is very well studied; here, we consider the latter. Let λ1, . . . , λn denote the hyperparameters of a given machine learning algorithm, and let Λ1, . . . , Λn denote their respective domains. The algorithm's hyperparameter space is then defined as Λ = Λ1 × · · · × Λn. When trained with hyperparameters λ ∈ Λ on data D_train, the algorithm's loss (e.g., misclassification rate) on data D_valid is L(λ, D_train, D_valid). Using k-fold cross-validation, the optimization problem is then to minimize

    f(λ) = (1/k) · Σ_{i=1}^{k} L(λ, D_train^(i), D_valid^(i)).      (1)

A hyperparameter λn can have one of several types, such as continuous, integer-valued or categorical. For example, the learning rate for a neural network is continuous; the random seed given to initialize an algorithm is integer-valued; and the choice between various preprocessing methods is categorical. Furthermore, there can be conditional hyperparameters, which are only active if another hyperparameter takes a certain value; for example, the hyperparameter "number of principal components" only needs to be instantiated when the hyperparameter "preprocessing method" is PCA.

Evaluating f(λ) for a given λ ∈ Λ is computationally costly, and so many techniques have been developed to find good configurations λ with few function evaluations.
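To make Equation (1) concrete, the following minimal sketch (ours, not the authors' code) evaluates f(λ) for a scikit-learn-style classifier with k-fold cross-validation; the logistic-regression estimator and the hyperparameter names C and max_iter are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    def evaluate_f(lam, X, y, k=5):
        """Cross-validated loss f(lambda) from Eq. (1): mean misclassification
        rate of a model trained with hyperparameters `lam`, averaged over k folds."""
        losses = []
        for train_idx, valid_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model = LogisticRegression(C=lam["C"], max_iter=lam["max_iter"])
            model.fit(X[train_idx], y[train_idx])
            # L(lambda, D_train^(i), D_valid^(i)): misclassification rate on fold i
            losses.append(1.0 - model.score(X[valid_idx], y[valid_idx]))
        return float(np.mean(losses))

A single such call already requires k full training runs, which is exactly the cost that makes many evaluations of f(λ) expensive.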
The methods most commonly used in practice are manual search and grid search, but recently, it has been shown that even simple random search can yield much better results [3]. The state of the art in practical optimization of hyperparameters is defined by Bayesian optimization methods [16, 29, 2], which have been successfully applied to problems ranging from deep neural networks to combined model selection and hyperparameter optimization [2, 29, 30, 19, 5].

Bayesian optimization methods use a probabilistic model M to model the relationship between a hyperparameter configuration λ and its performance f(λ). They fit this model using previously gathered data and then use it to select a next point λ_new to evaluate, trading off exploitation and exploration in order to find the minimum of f. They then evaluate f(λ_new), update M with the new data point (λ_new, f(λ_new)), and iterate. Throughout this paper, we will use the following three instantiations of Bayesian optimization:

SPEARMINT [29] is a prototypical Bayesian optimization method that models p_M(f | λ) with Gaussian process (GP) models. It supports continuous and discrete parameters (by rounding), but no conditional parameters.

Sequential Model-based Algorithm Configuration (SMAC) [16] models p_M(f | λ) with random forests. When performing cross-validation, SMAC only evaluates as many folds as necessary to show that a configuration is worse than the best one seen so far (or to replace it). SMAC can handle continuous, categorical, and conditional parameters.

Tree Parzen Estimator (TPE) [2] models p_M(f | λ) indirectly. It models p(f < f*), p(λ | f < f*), and p(λ | f ≥ f*), where f* is defined as a fixed quantile of the function values observed so far, and the latter two probabilities are defined by tree-structured Parzen density estimators. TPE can handle continuous, categorical, and conditional parameters.

An empirical evaluation of the three methods on the HPOlib hyperparameter optimization benchmarks showed that SPEARMINT performed best on benchmarks with few continuous parameters and SMAC performed best on benchmarks with many, categorical, and/or conditional parameters, closely followed by TPE. SMAC also performed best on benchmarks that relied on cross-validation [8].
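For reference, the generic sequential model-based optimization loop that these three methods instantiate (each with its own model class and acquisition criterion) can be sketched as follows; fit_model, propose_next, and config_space are placeholders for illustration, not part of the SMAC, TPE, or SPEARMINT packages.

    def smbo(f, config_space, budget, fit_model, propose_next):
        """Generic SMBO loop: fit a model M on observed (config, loss) pairs,
        propose a new configuration that trades off exploration and exploitation,
        evaluate it, and iterate."""
        history = []                      # observed (lambda, f(lambda)) pairs
        lam = config_space.sample()       # initial design: a random configuration
        for _ in range(budget):
            y = f(lam)                    # expensive evaluation of f(lambda_new)
            history.append((lam, y))      # update data with (lambda_new, f(lambda_new))
            model = fit_model(history)    # refit the probabilistic model M
            lam = propose_next(model, config_space)  # e.g., maximize expected improvement
        return min(history, key=lambda pair: pair[1])  # best configuration found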
3 Methodology

We now discuss our approach, including the algorithm performance data we used, how we preprocessed the data, the types of regression models we evaluated, and how we used them to construct surrogate benchmarks.

Table 1. Overview of evaluated regression algorithms. When we used random search to optimize hyperparameters, we considered 100 samples over the stated hyperparameters (their names refer to the SCIKIT-LEARN implementation [25]); the model was trained on 50% of the data, and the best configuration was chosen based on the performance on the other 50% and then trained on all data.

    Model                     Hyperparameter optimization            Impl.
    Random Forest             None                                   [25]
    Gradient Boosting         None                                   [25]
    Extra Trees               None                                   [25]
    Gaussian Process          MCMC sampling over hyperparameters     [29]
    SVR                       Random search for C and gamma          [25]
    NuSVR                     Random search for C, gamma and nu      [25]
    Bayesian Neural Network   None                                   [24]
    k-nearest-neighbours      Random search for n_neighbors          [25]
    Linear Regression         None                                   [25]
    Least Angle Regression    None                                   [25]
    Ridge Regression          None                                   [25]

3.1 Data collection

In principle, we could construct surrogate benchmarks using algorithm performance data gathered by any means. For example, we could use existing data from a manual exploration of the hyperparameter space, or from an automated approach, such as grid search, random search or one of the more sophisticated hyperparameter optimization methods discussed in Section 2.

It is more important for surrogate benchmarks to exhibit strong predictive quality in some parts of the hyperparameter space than in others. Specifically, our ultimate aim is to ensure that hyperparameter optimizers perform similarly on the surrogate benchmark as on the real benchmark. Since most optimizers spend most of their time in high-performance regions of the hyperparameter space, and since relative differences between the performance of hyperparameter configurations in such high-performance regions tend to impact which hyperparameter configuration will ultimately be returned, accuracy in this part of the space is more important than in regions of poor performance. The training data should therefore densely sample high-performance regions. We thus advocate collecting performance data primarily via runs of existing hyperparameter optimization procedures. As an additional advantage of this strategy, we can obtain this costly performance data as a by-product of executing hyperparameter optimization procedures on the original benchmark.

Of course, it is also important to accurately identify poorly performing parts of the space: if we only trained on performance data for the very best hyperparameter settings, no machine learning model could be expected to infer that performance in the remaining parts of the space is poor. This would typically lead to underpredictions of performance in poor parts of the space. We thus also included performance data gathered by a random search. (An alternative is grid search, which can also cover the entire space. We did not adopt this approach because it cannot deal effectively with large hyperparameter spaces.) To gather the data for each surrogate benchmark in this paper, we therefore executed r = 10 runs of each of the three Bayesian optimization methods described in Section 2 (each time with a different seed), as well as random search, with each run gathering the performance of a fixed number of configurations.

3.2 Data preprocessing

For each benchmark we studied for this paper, after running the hyperparameter optimizers and random search, we preprocessed the data as follows:

1. We extracted all available configuration/performance pairs from the runs. For benchmarks that used cross-validation, we encoded the cross-validation fold of each run as an additional categorical parameter (for benchmarks without cross-validation, that parameter was set to a constant).
2. We removed entries with invalid results caused by algorithm crashes. Since some regression models used in preliminary experiments could not handle duplicated configurations, we also deleted these, keeping the first occurrence.
3. For data from benchmarks featuring conditional parameters, we replaced the values of inactive conditional parameters with a default value.
4. To code categorical parameters, we used a one-hot (aka 1-in-k) encoding, which replaces any single categorical parameter λ with domain Λ = {k1, . . . , kn} by n binary parameters, only the i-th of which is true for data points where λ is set to ki (see the sketch below).
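A rough sketch of these preprocessing steps in pandas is given below; the column names, the crash encoding via missing performance values, and the concrete defaults for inactive conditionals are assumptions made for illustration rather than the exact choices used in the paper.

    import pandas as pd

    def preprocess(runs, categorical_cols, conditional_defaults):
        """Steps 1-4: encode the CV fold, drop crashed/duplicate entries,
        impute inactive conditional parameters, and one-hot encode categoricals."""
        df = runs.copy()
        # 1. Treat the cross-validation fold as an additional categorical parameter
        #    (a constant column for benchmarks without cross-validation).
        if "fold" not in df.columns:
            df["fold"] = 0
        df["fold"] = df["fold"].astype("category")
        # 2. Remove crashed evaluations and duplicated configurations (keep first).
        df = df[df["performance"].notna()]
        config_cols = [c for c in df.columns if c != "performance"]
        df = df.drop_duplicates(subset=config_cols, keep="first")
        # 3. Replace values of inactive conditional parameters with a default value.
        for col, default in conditional_defaults.items():
            df[col] = df[col].fillna(default)
        # 4. One-hot (1-in-k) encode categorical parameters, including the fold.
        return pd.get_dummies(df, columns=list(categorical_cols) + ["fold"])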
3.3 Choice of Regression Models

We considered a broad range of commonly used regression algorithms as candidates for our surrogate benchmarks. To keep the results comparable, all models were trained on data encoded as detailed in the previous section. If necessary for the algorithm, we also normalized the data to have zero mean and unit variance (by subtracting the mean and dividing by the standard deviation). If not stated otherwise for a model, we used the default configuration of its implementation.

Table 1 details the regression models and implementations we used. We evaluated three different tree-based models, because SMAC uses a random forest (RF), and because RFs have been shown to yield high-quality predictions of algorithm performance data [18]. As a specialist for low-dimensional hyperparameter spaces, we used SPEARMINT's Gaussian process (GP) implementation, which performs MCMC to marginalize over hyperparameters. Since SMAC performs particularly well on high-dimensional hyperparameter spaces and SPEARMINT on low-dimensional continuous problems [8], we expected their respective models to mirror that pattern. The remaining prominent model types we experimented with comprised k-nearest-neighbours (kNN), linear regression, least angle regression, ridge regression, SVM methods (all as implemented by scikit-learn [25]), and Bayesian neural networks (BNN) [24].
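For the models tuned by random search in Table 1, the protocol from the table caption (100 random samples, selection on a 50/50 split, refit on all data) could be implemented roughly as follows; the NuSVR search ranges shown here are our own assumption, not the ones used in the paper.

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import ParameterSampler, train_test_split
    from sklearn.svm import NuSVR

    def tune_by_random_search(X, y, n_samples=100, seed=0):
        """Draw random configurations, pick the best on a held-out half,
        then retrain the chosen configuration on all data (cf. Table 1)."""
        space = {"C": np.logspace(-3, 3, 100),
                 "gamma": np.logspace(-4, 1, 100),
                 "nu": np.linspace(0.1, 0.9, 9)}
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
        best, best_err = None, np.inf
        for params in ParameterSampler(space, n_iter=n_samples, random_state=seed):
            model = NuSVR(**params).fit(X_tr, y_tr)
            err = mean_squared_error(y_val, model.predict(X_val))
            if err < best_err:
                best, best_err = params, err
        return NuSVR(**best).fit(X, y)   # final model trained on all data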
3.4 Construction and Use of Surrogate Benchmarks

To construct surrogates for a hyperparameter optimization benchmark X, we trained the previously mentioned models on the performance data gathered on benchmark X. The surrogate benchmark X'_M based on model M is identical to the original benchmark X, except that evaluations of the machine learning algorithm to be optimized in benchmark X are replaced by a performance prediction obtained from model M. In particular, the surrogate's configuration space (including all parameter types and domains) and function evaluation budget are identical to the original benchmark.

Importantly, the wall clock time to run an algorithm on X'_M is much lower than that required on X, since expensive evaluations of the machine learning algorithm underlying X are replaced by cheap model predictions. The model M is simply saved to disk and is queried when needed. We could implement each evaluation in X'_M as loading M from disk and then using it for prediction, but to avoid the repeated cost of loading M, we also allow for storing M in an independent process and communicating with it via a local socket.

To evaluate the performance of a surrogate benchmark scenario X'_M we ran the same optimization experiments as on X, using the same settings and seeds. In addition to evaluating the raw predictive performance of model M, we assessed the quality of surrogate benchmark X'_M by measuring the similarity of hyperparameter optimization performance on X and X'_M.
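The construction just described could be sketched as follows with scikit-learn and joblib (a minimal illustration under our own assumptions, not the HPOlib implementation): the regression model M is trained once on the encoded (configuration, performance) data, persisted, and then wrapped as a benchmark function that shares the configuration space and budget of X.

    import joblib
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def build_surrogate(X_encoded, y, path="surrogate_rf.pkl"):
        """Fit the surrogate model M on encoded configurations and save it to disk."""
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_encoded, y)
        joblib.dump(model, path)
        return model

    class SurrogateBenchmark:
        """Stands in for benchmark X: same configuration space and evaluation
        budget, but each evaluation is a model prediction instead of a training run."""
        def __init__(self, path="surrogate_rf.pkl", encoder=None):
            self.model = joblib.load(path)   # loaded once; could alternatively live in a
            self.encode = encoder            # separate process reached via a local socket
        def __call__(self, configuration):
            x = self.encode(configuration)   # same one-hot encoding as the training data
            return float(self.model.predict(np.asarray(x).reshape(1, -1))[0])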
4 Experiments and Results

In this section, we experimentally evaluate the performance of our surrogates. We describe the data upon which our surrogates are based, evaluate the raw performance of our regression models on this data, and then evaluate the quality of the resulting surrogate benchmarks.

4.1 Experimental Setup

We collected data for four benchmarks from the hyperparameter optimization benchmark library, HPOlib [8]. For each benchmark, we executed 10 runs of SMAC, SPEARMINT, TPE and random search (using the same Hyperopt implementation of random search as for TPE), yielding the data detailed in Table 2. The four benchmarks comprised two low-dimensional and two high-dimensional hyperparameter spaces.

Table 2. Properties of our data sets. "Input dim." is the number of features of the training data; it is greater than the number of hyperparameters because categorical hyperparameters and the cross-validation fold are one-hot-encoded. For each benchmark, before preprocessing the number of data points was 10 × 4 × (#evals. per run).

    Benchmark        #λ   cond.   cat./cont.   Input dim.   #evals. per run   #data
    Branin            2     -        -/2            3             200          7402
    Log.Reg. 5CV      4     -        -/4            9             500         18521
    HP-NNET convex   14     4        7/7           25             200          7750
    HP-DBNET mrbi    36    27       19/17          82             200          7466

The two low-dimensional benchmarks were the synthetic Branin test function and a logistic regression [29] on the MNIST dataset [23]. Both of these have been extensively used before to benchmark hyperparameter optimization methods. While the 2-dimensional Branin test function is trivial to evaluate and therefore does not require a surrogate, we nevertheless included it to study how closely a surrogate can approximate the function. The logistic regression is an actual hyperparameter optimization benchmark with 4 hyperparameters that includes a 5-fold cross-validation. That means for each configuration that the optimizers TPE, SPEARMINT and random search evaluated, there were 5 data points that only differ in which fold they corresponded to. Since SMAC saves time by not evaluating all folds for configurations that appear worse than the optimum, it only evaluated a subset of folds for most of the configurations. The evaluation of a single cross-validation fold required roughly 1 minute on a single core of an Intel Xeon E5-2650 v2 CPU.

The high-dimensional benchmarks comprised a simple and a deep neural network, HP-NNET and HP-DBNET (both taken from [2]), used to classify the convex and MRBI datasets, respectively [22]. Their dimensionalities are 14 and 36, respectively, and many categorical hyperparameters further increase the input dimension to the regression model. Evaluating a single HP-NNET configuration required roughly 12 minutes using 2 cores of an Intel Xeon E5-2650 v2 with OpenBlas. The HP-DBNET required a GPGPU to run efficiently; on a modern Geforce GTX780 GPU, it took roughly 15 minutes to evaluate a single configuration. In contrast, using the surrogate benchmark model we built, one configuration can be evaluated in less than a second on a standard CPU.

For some model types, training with all the data from Table 2 was computationally infeasible, and we had to subsample 2 000 data points (uniformly at random; see Footnote 3) for training. This was the case for nuSVR, SVR, and the Bayesian neural network. For the GP model, we had to limit the dataset even further to 1 500 data points. On this reduced training set, the GP model required 255 minutes to train on the most expensive data set (HP-DBNET MRBI), and the Bayesian neural networks required 36 minutes; all other models required less than one minute for training.

Footnote 3: For a given dataset and fold, all models based on the same number of data points used the same subsampled data set. We note that model performance sometimes was quite noisy with respect to the pseudorandom number seed for this subsampling step and we thus used a fixed seed.

We used HPOlib to run the experiments for all optimizers with a single format, both for the original hyperparameter optimization benchmarks and for our surrogates. To make our results reproducible, we fixed the pseudo-random number seed in each function evaluation to 1. The version of the SPEARMINT package we used crashed for about 1% of all runs due to a numerical problem. In evaluations where we require entire trajectories, for these crashed SPEARMINT runs, we imputed the best function value found before the crash for all evaluations after the crash.

4.2 Evaluation of Raw Model Performance

We first studied the raw predictive performance of the models we considered on our preprocessed data.

4.2.1 Using all data

To evaluate the raw predictive performance of the models listed in Table 1, we used 5-fold cross-validation and computed the cross-validated root mean squared error (RMSE) and Spearman's rank correlation coefficient (CC) between model predictions and the true responses in the test fold. Here, the responses correspond to validation error rate in all benchmarks except for the Branin one (where they correspond to the value of the Branin function).
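These two metrics can be computed per test fold as in the following sketch, which uses current scikit-learn and SciPy APIs rather than the original evaluation scripts:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def cross_validated_rmse_cc(model, X, y, k=5):
        """5-fold cross-validated RMSE and Spearman rank correlation (CC)
        between model predictions and true responses in each test fold."""
        rmses, ccs = [], []
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
            rho, _ = spearmanr(y[test_idx], pred)   # rank correlation on the test fold
            ccs.append(rho)
        return float(np.mean(rmses)), float(np.mean(ccs))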
Table 3 presents these results, showing that the GP and the SVR approaches performed best on the smooth low-dimensional synthetic Branin test function, but that RF-based models are better for predicting the performance of actual machine learning algorithms. This strong performance was to be expected for the higher-dimensional hyperparameter space of the neural networks, since RFs perform automatic feature selection (Footnote 4). The logistic regression example is rather low-dimensional, but the categorical cross-validation fold is likely harder to model for GPs than for RFs (Footnote 5). Extra Trees predicted the performance of the actual machine learning algorithms nearly as well as the RF; gradient boosting was slightly worse. Bayesian neural networks, k-nearest-neighbours and our linear regression models could not achieve comparable performance. Based on these results, we decided to focus the remainder of our study on a diverse subset of models: two tree-based approaches (RFs and gradient boosting), Gaussian processes, nuSVR, and, as an example of a popular, yet poorly-performing model, k-nearest-neighbours. We paid special attention to RFs and Gaussian processes, since these have been used most prominently in Bayesian hyperparameter optimization methods.

Footnote 4: We note that RFs could also handle the categorical hyperparameters in these benchmarks natively. We used the one-hot encoding for comparability with other methods. It is remarkable that even with this encoding, they outperformed all other methods.
Footnote 5: As we will see later (Figure 2), the RF-based optimizer SMAC also performed better on this benchmark than the GP-based optimizer SPEARMINT.

Table 3. Average RMSE and CC for a 5-fold cross validation for different regression models. For each entry, bold face indicates the best performance on this dataset, and underlined values are not statistically significantly different from the best according to a paired t-test (with p = 0.05). Models marked with an * (+) are trained on only a subset of 2000 (1500) data points per fold.

                      Branin          Log.Reg. 5CV    HP-NNET convex   HP-DBNET mrbi
    Model             RMSE    CC      RMSE    CC      RMSE    CC       RMSE    CC
    RF                 1.86   1.00    0.03    0.98    0.04    0.94     0.06    0.90
    Gr.Boost           7.10   0.96    0.07    0.94    0.05    0.89     0.06    0.86
    Ex.Trees           1.10   1.00    0.04    0.98    0.03    0.95     0.06    0.90
    GP+                0.03   1.0     0.13    0.88    0.04    0.92     0.10    0.78
    SVR*               0.06   1.0     0.13    0.87    0.06    0.82     0.08    0.80
    nuSVR*             0.02   1.0     0.18    0.84    0.06    0.85     0.08    0.82
    BNN*               6.84   0.91    0.10    0.91    0.05    0.84     0.10    0.72
    kNN                1.78   1.00    0.14    0.88    0.06    0.85     0.08    0.78
    Lin.Reg.          45.33   0.28    0.23    0.78    0.08    0.60     0.1     0.70
    LeastAngleReg.    45.33   0.28    0.23    0.78    0.08    0.60     0.1     0.7
    Ridge Reg.        46.01   0.30    0.26    0.77    0.09    0.61     0.10    0.67

Figure 1 studies the predictive performance in more detail for the HP-DBNET mrbi benchmark, demonstrating that tree-based models also performed best in a qualitative sense. The figure also shows that the models tended to make the largest mistakes for the worst configurations; especially the non-tree-based models predicted some of these to be better than some of the best configurations. The same patterns also held for the logistic regression 5CV and the HP-NNET convex data (not shown). The models also completely failed to identify neural network configurations which did not converge within the time limit and therefore received an error rate of 1.0. Interestingly, the GP failed almost entirely on the high-dimensional HP-DBNET MRBI benchmark in two cases, predicting all data points around the data mean.

[Figure 1: grid of scatter plots; columns: RF, GB, GP, NuSVR, kNN; rows: SMAC, TPE, SPEARMINT, Random.]
Figure 1. True performance (x-axis) vs. regression model predictions (y-axis) for the HP-DBNET mrbi dataset. All plots have the same axes, showing error rates ranging from 0.4 to 1.1. Each marker represents the performance of one configuration; green and red crosses indicate the 1/3 best and worst true performance, respectively. Configurations on the diagonal are predicted perfectly, error predictions above the diagonal are too high, and predictions for configurations below the diagonal are better than the configuration's actual performance. The first column shows which data was left out for training and used for testing.

4.2.2 Leave one optimizer out

In practice, we will want to use our surrogate models to predict the performance of a machine learning algorithm with hyperparameter configurations selected by some new optimization method. The configurations it evaluates might be quite different from those considered by the optimizers whose data we trained on. Next to the standard cross-validation setting from above, we therefore evaluated our models in the leave-one-optimizer-out setting, which means that the regression model learns from data drawn from all but one optimizer, and its performance is measured on the held-out data.

Table 4 reports RMSE and CC analogous to those of Table 3, but for the leave-one-optimizer-out setting. This setting is more difficult and, consequently, the results were slightly worse, but the best-performing models stayed the same: nuSVR and GP for low-dimensional and RFs for higher-dimensional datasets.

Table 4. Average RMSE and CC of 5 regression algorithms in the leave-one-optimizer-out setting. Bold face indicates the best value across all regression models on this dataset.

                Branin          Log.Reg. 5CV    HP-NNET convex   HP-DBNET mrbi
    Model       RMSE    CC      RMSE    CC      RMSE    CC       RMSE    CC
    RF           2.04   0.95    0.10    0.93    0.04    0.82     0.07    0.85
    Gr.Boost     6.96   0.85    0.12    0.84    0.05    0.81     0.07    0.83
    GP           0.12   0.99    0.16    0.76    0.05    0.84     0.10    0.64
    nuSVR        0.03   0.98    0.16    0.74    0.10    0.61     0.10    0.73
    kNN          2.08   0.96    0.19    0.72    0.07    0.73     0.09    0.67
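Assuming every data point is labelled with the optimizer that generated it, the leave-one-optimizer-out split can be sketched as follows (an illustration, not the original code):

    import numpy as np

    def leave_one_optimizer_out(X, y, optimizer_labels, held_out, model):
        """Train on data from all optimizers except `held_out`
        (e.g. "SMAC", "TPE", "SPEARMINT", "random") and test on the held-out data."""
        labels = np.asarray(optimizer_labels)
        train = labels != held_out
        test = ~train
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
        return rmse, pred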
4.3 Evaluation of Surrogate Benchmarks

We now study the performance of the surrogate benchmarks X'_M obtained for random forest (RF) and Gaussian process (GP) models M. We assess the quality of X'_M by comparing the performance of various hyperparameter optimizers on X'_M and the real benchmark X.

4.3.1 Using all data

We first analyzed the performance of surrogate benchmarks based on models trained on the entire data we have available. We note that in this first experiment, a surrogate that perfectly remembers the training data would achieve perfect performance, because we used the same hyperparameter optimizers for evaluation as we did to gather the training data. However, after the first imperfect prediction, the trajectories of the optimizers will diverge. Thus, since none of our models is perfect on training data, this initial experiment serves as an evaluation of surrogate benchmarks based on training data gathered through the same mechanism as at test time.

We performed experiments for our three actual hyperparameter optimization benchmarks: logistic regression, a simple and a deep neural network. For each of them, we repeated the 10 runs for TPE, SMAC and SPEARMINT we previously conducted on the real benchmarks, but now used surrogate benchmarks based on RFs and GPs, respectively.
Figure 2 shows that the surrogate benchmarks based on Gaussian process models differed substantially from the true benchmarks. The figures show the best function values found by the various optimizers over time. Visually comparing the first column (real benchmark) to the third (surrogate benchmark based on the GP model), the most obvious difference is that the surrogate benchmark fails completely on HP-DBNET MRBI: since the GP model is unable to properly fit the high-dimensional data (predicting all configurations to perform roughly equally, around the data mean), all optimizers basically stay at the same performance level (the data mean). Note in the plot for the true benchmark that the GP-based optimizer SPEARMINT also performed very poorly on this benchmark.

[Figure 2: optimization trajectories (best validation error achieved vs. number of function evaluations, log scale) for SMAC, SPEARMINT and TPE; rows: Log.Reg. 5CV, HP-NNET convex, HP-DBNET MRBI; columns: results on the true benchmark, on the RF surrogate, and on the GP surrogate.]
Figure 2. Median and quartile of best performance over time on the real benchmark (left column) and on surrogates (middle: based on RF models; right: based on GP models). Both types of surrogate benchmarks were trained on all available data. For logistic regression 5CV each fold is plotted as a separate function evaluation.

In the other two cases (logistic regression 5CV and HP-NNET convex), the performance of the optimizers appears visually similar to the true benchmark at first glance. However, for the logistic regression 5CV the GP model predicts some parts of the hyperparameter space to be better than the actual best part of the space, leading the final optimization results on the surrogate benchmark to appear better than optimization results on the true benchmark. Another difference is a zig-zag pattern in the trajectories for logistic regression surrogates: these are also (mildly) present in the real benchmark (mild enough to only be detectable when zooming into the figure) and are due to the slightly different performance in the 5 folds of cross-validation; the impact of the folds is very small, but the GP model predicts it to be large, causing the zig-zag. Interestingly, for the GP model trained on the HP-NNET convex dataset, regions with "better" performance appear hard to find: only SMAC and TPE identified them, causing a larger gap between SPEARMINT and SMAC/TPE than on the real benchmark.

Conversely, the RF surrogates yielded results much closer to those obtained on the real benchmark. Visually, the first column (true benchmark) and second column appear very similar, indicating that the RF captured the overall pattern well. There are some differences in the details. For example, on HP-NNET convex, the surrogate does not capture that TPE finds very good configurations before SMAC and yields the overall best performance. Nevertheless, overall, our results for the RF surrogates qualitatively resemble those for the true benchmarks, and for the logistic regression example, the correspondence is almost perfect.

4.3.2 Leave one optimizer out

Next, we studied the use of a surrogate benchmark to evaluate a new optimizer. For each optimizer o and each of the three hyperparameter optimization benchmarks X, we trained RF and GP models M on the respective leave-one-optimizer-out training data discussed in Section 4.2.2 and compared the performance of optimizer o on X and X'_M. Figure 3 reports the results of this experiment, showing that surrogate benchmarks based on RF models qualitatively resembled the real benchmarks.

The results for the logistic regression 5CV benchmark (top row of Figure 3) show that surrogate benchmarks based on RF models mirrored the performance of each optimizer o on the real benchmark well, even when the training data did not include data gathered with optimizer o. In contrast, surrogates based on Gaussian process models performed poorly: the Gaussian process again underestimated the error, predicting better performance in some regions than possible on the real benchmark (Footnote 6). Again, these regions with "better" performance appear hard to find: only SMAC and SPEARMINT found them, causing their performances on the GP-based surrogate benchmark to differ substantially from their performance on the true benchmark.

Footnote 6: We noticed similar behavior for the nuSVR, which even returned negative values for configurations and caused the optimizer to search completely different areas of the configuration space (data not shown here).

Results for HP-NNET convex were also better for the surrogate benchmark based on RFs (especially for SPEARMINT), but not as much better as for the logistic regression 5CV case. As was already the case when the surrogate was based on all training data, the RF-based surrogate benchmarks only approximately captured the strong performance TPE showed on the real benchmark.

Results on HP-DBNET MRBI show a fairly close correspondence between the real benchmark and the RF-based surrogate benchmarks. In contrast, the GP-based surrogate was dismal, once again due to the GP's near-constant predictions (close to the data mean).

After this qualitative evaluation of the surrogate benchmarks, Table 5 offers a quantitative evaluation. We judge the quality of a surrogate benchmark X'_M by how closely it resembles the real benchmark X it was derived from, in terms of the absolute error between the best found values for our four optimizers (SMAC, TPE, SPEARMINT, and random search) after evaluating i configurations. For logistic regression 5CV, in line with our qualitative results we obtained a very small error for the RF-based surrogate. The GP-based surrogate underestimated the achievable error rates, resulting in larger differences between performances on the true and the surrogate runs. After 50 evaluations the GP-based surrogate trained on all data yielded a quite high error because it underestimated the performance for configurations selected by the optimizer SMAC. Training the GP-surrogate on the leave-one-optimizer-out dataset causes worse performance for the optimizer SPEARMINT and too much variation for SMAC, resulting in a higher error as well. This misprediction decreases with more evaluated configurations.
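The error reported in Table 5 (below) is the absolute difference between the best values found on the real and on the surrogate benchmark after i function evaluations, averaged over the 4 × 10 corresponding runs; a minimal sketch, assuming trajectories keyed by (optimizer, seed):

    import numpy as np

    def surrogate_error_at(i, real_trajectories, surrogate_trajectories):
        """Both arguments map (optimizer, seed) -> list of best-found values after
        each function evaluation. Returns the mean absolute difference after i
        evaluations, averaged over the 4 x 10 = 40 corresponding runs (cf. Table 5)."""
        diffs = [abs(real_trajectories[key][i - 1] - surrogate_trajectories[key][i - 1])
                 for key in real_trajectories]
        return float(np.mean(diffs))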
Table 5. Quantitative evaluation of surrogate benchmarks at three different time steps each. We show the mean difference between the best found values for corresponding runs (having the same seed) of the four optimizers (SMAC, TPE, SPEARMINT, and random search) after i function evaluations on the real and surrogate benchmark. For each experiment and optimizer we conducted 10 runs and report the mean error averaged over 4 × 10 = 40 comparisons. We evaluated RF-based and GP-based surrogates. For each problem we measured the error for surrogates trained on all and the leave-one-optimizer-out (leave-ooo) data; e.g., the TPE trajectories are from optimizing on a surrogate that is trained on all training data except that gathered using TPE. Bold face indicates the best performance for this dataset and i function evaluations. Results are underlined when the one-sigma confidence intervals of the best and this result overlap.

    #Function evaluations             50           200          500
    Surrogate                       RF    GP     RF    GP     RF    GP
    Log.Reg. 5CV  all              0.02  0.06   0.00  0.04   0.00  0.05
    Log.Reg. 5CV  leave-ooo        0.02  0.07   0.01  0.03   0.00  0.03

    #Function evaluations             50           100          200
    Surrogate                       RF    GP     RF    GP     RF    GP
    HP-NNET convex  all            0.01  0.03   0.01  0.03   0.01  0.02
    HP-NNET convex  leave-ooo      0.02  0.03   0.02  0.04   0.02  0.03
    HP-DBNET MRBI  all             0.05  0.13   0.05  0.16   0.05  0.17
    HP-DBNET MRBI  leave-ooo       0.04  0.13   0.04  0.16   0.05  0.17

The results for HP-NNET convex look quite similar, with a somewhat smaller difference between RF-based and GP-based surrogates. Indeed, SMAC and TPE behaved similarly on both RF-based and GP-based surrogates as on the real benchmark; only SPEARMINT behaved very differently on the GP-based surrogate, causing an overall higher error than for the RF-based surrogates.

On the high-dimensional HP-DBNET MRBI the surrogates performed differently. Whereas the RF-based surrogate could still reproduce similar optimizer behavior as on the real benchmark, the GP completely failed to do so. Remarkably, overall quantitative performance was similar for surrogate benchmarks trained on all data and those trained on leave-one-optimizer-out datasets.

Overall, these results confirmed our expectation from previous findings in Section 3.3 and the raw regression model performance results in Table 3: good regression models facilitate good surrogate benchmarks. In our case, RFs performed best for both tasks. We note that using the surrogate benchmarks reduced the time requirements substantially; for example, evaluating a surrogate 100 times instead of running HP-NNET convex or HP-DBNET MRBI took less than 1 minute on a single CPU, compared to roughly 10 hours on two CPUs (HP-NNET convex) and over a day on a modern GPU (HP-DBNET MRBI) (Footnote 7).

Footnote 7: Of course, the overhead due to the used hyperparameter optimizer comes on top of this; e.g., SPEARMINT's overhead for a run with 200 evaluations was roughly one hour, whereas SMAC's overhead was less than one minute.

[Figure 3: optimization trajectories (best validation error achieved vs. number of function evaluations, log scale); rows: Log.Reg. 5CV, HP-NNET convex, HP-DBNET MRBI; columns: SMAC, SPEARMINT, TPE.]
Figure 3. Median and quartile of optimization trajectories for surrogates trained in the leave-one-optimizer-out setting. Black trajectories correspond to true benchmarks, coloured trajectories to optimization runs on a surrogate. The first row names the optimizer used to obtain the trajectories; their data was left out for training the regression models.

5 Conclusion and Future Work

To tackle the high computational cost and overhead of performing hyperparameter optimization benchmarking, we proposed surrogate benchmarks that behave similarly to the actual benchmarks they are derived from, but are far cheaper and simpler to use. The key idea is to collect (configuration, performance) pairs from the actual benchmark and to learn a regression model that can predict the performance of a new configuration and therefore stand in for the expensive-to-evaluate algorithm. These surrogates reduce the algorithm overhead to a minimum, which allows extensive runs and analyses of new hyperparameter optimization techniques. We empirically demonstrated that we can obtain surrogate benchmarks that closely resemble the real benchmarks they were derived from.

In future work, we intend to study the use of surrogates for general algorithm configuration. In particular, we plan to support optimization across a set of problem instances, each of which can be described by a fixed-length vector of characteristics, and to assess the resulting surrogates for several problems that algorithm configuration has tackled successfully, such as propositional satisfiability [14], mixed integer programming [15], and AI planning [9]. Finally, good surrogate benchmarks should enable us to explore the configuration options of the optimizers themselves, and we plan to use surrogate benchmarks to enable efficient meta-optimization of the hyperparameter optimization and algorithm configuration methods themselves.

REFERENCES

[1] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag, 'Collaborative hyperparameter tuning', in Proc. of ICML'13, (2013).
[2] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, 'Algorithms for hyper-parameter optimization', in Proc. of NIPS'11, (2011).
[3] J. Bergstra and Y. Bengio, 'Random search for hyper-parameter optimization', JMLR, 13, 281–305, (2012).
[4] J. Bergstra, B. Komer, C. Eliasmith, and D. Warde-Farley, 'Preliminary evaluation of hyperopt algorithms on HPOLib', in ICML workshop on AutoML, (2014).
[5] J. Bergstra, D. Yamins, and D. D. Cox, 'Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures', in Proc. of ICML'13, (2013).
[6] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining, Springer, 2008.
[7] E. Brochu, V. M. Cora, and N. de Freitas, 'A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning', CoRR, abs/1012.2599, (2010).
[8] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. H. Hoos, and K. Leyton-Brown, 'Towards an empirical foundation for assessing Bayesian optimization of hyperparameters', in NIPS workshop on Bayesian Optimization, (2013).
[9] C. Fawcett, M. Helmert, H. H. Hoos, E. Karpas, G. Röger, and J. Seipp, 'FD-Autotune: Domain-specific configuration using fast-downward', in Proc. of ICAPS-PAL, (2011).
[10] D. Gorissen, I. Couckuyt, P. Demeester, T. Dhaene, and K. Crombecq, 'A surrogate modeling and adaptive sampling toolbox for computer based design', JMLR, 11, 2051–2055, (2010).
[11] S. B. Guerra, R. B. C. Prudêncio, and T. B. Ludermir, 'Predicting the performance of learning algorithms using support vector machines as meta-regressors', in Proc. of ICANN'08, volume 5163, pp. 523–532, (2008).
[12] N. Hansen, A. Auger, S. Finck, R. Ros, et al., Real-parameter black-box optimization benchmarking 2010: Experimental setup, 2010.
[13] M. D. Hoffman, D. M. Blei, and F. R. Bach, 'Online learning for latent Dirichlet allocation', in Proc. of NIPS'10, (2010).
[14] F. Hutter, D. Babić, H. H. Hoos, and A. J. Hu, 'Boosting verification by automatic tuning of decision procedures', in Proc. of FMCAD'07, pp. 27–34, Washington, DC, USA, (2007). IEEE Computer Society.
[15] F. Hutter, H. H. Hoos, and K. Leyton-Brown, 'Automated configuration of mixed integer programming solvers', in Proc. of CPAIOR-10, pp. 186–202, (2010).
[16] F. Hutter, H. H. Hoos, and K. Leyton-Brown, 'Sequential model-based optimization for general algorithm configuration', in Proc. of LION-5, (2011).
[17] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle, 'ParamILS: an automatic algorithm configuration framework', JAIR, 36(1), 267–306, (2009).
[18] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, 'Algorithm runtime prediction: Methods and evaluation', JAIR, 206(0), 79–111, (2014).
[19] B. Komer, J. Bergstra, and C. Eliasmith, 'Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn', in ICML workshop on AutoML, (2014).
[20] A. Krizhevsky, 'Learning multiple layers of features from tiny images', Technical report, University of Toronto, (2009).
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, 'Imagenet classification with deep convolutional neural networks', in Proc. of NIPS'12, pp. 1097–1105, (2012).
[22] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, 'An empirical evaluation of deep architectures on problems with many factors of variation', in Proc. of ICML'07, (2007).
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, 'Gradient-based learning applied to document recognition', Proc. of the IEEE, 86(11), 2278–2324, (1998).
[24] R. M. Neal, Bayesian learning for neural networks, Ph.D. dissertation, University of Toronto, 1995.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', JMLR, 12, 2825–2830, (2011).
[26] M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel, 'Automatic classifier selection for non-experts', PAA, 17(1), 83–96, (2014).
[27] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, 'Design and analysis of computer experiments', Statistical Science, 4(4), 409–423, (November 1989).
[28] T. J. Santner, B. J. Williams, and W. I. Notz, The design and analysis of computer experiments, Springer, 2003.
[29] J. Snoek, H. Larochelle, and R. P. Adams, 'Practical Bayesian optimization of machine learning algorithms', in Proc. of NIPS'12, (2012).
[30] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, 'Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms', in Proc. of KDD'13, (2013).
[31] C. N. J. Yu and T. Joachims, 'Learning structural SVMs with latent variables', in Proc. of ICML'09, pp. 1169–1176, (2009).