Surrogate Benchmarks for Hyperparameter Optimization

Katharina Eggensperger1 and Frank Hutter1 and Holger H. Hoos2 and Kevin Leyton-Brown2

1 University of Freiburg, email: {eggenspk,fh}@cs.uni-freiburg.de
2 University of British Columbia, email: {hoos,kevinlb}@cs.ubc.ca

Abstract. Since hyperparameter optimization is crucial for achieving peak performance with many machine learning algorithms, an active research community has formed around this problem in the last few years. The evaluation of new hyperparameter optimization techniques against the state of the art requires a set of benchmarks. Because such evaluations can be very expensive, early experiments are often performed using synthetic test functions rather than using real-world hyperparameter optimization problems. However, there can be a wide gap between the two kinds of problems. In this work, we introduce another option: cheap-to-evaluate surrogates of real hyperparameter optimization benchmarks that share the same hyperparameter spaces and feature similar response surfaces. Specifically, we train regression models on data describing a machine learning algorithm's performance under a wide range of hyperparameter configurations, and then cheaply evaluate hyperparameter optimization methods using the model's performance predictions in lieu of the real algorithm. We evaluate the effectiveness of using a wide range of regression techniques to build these surrogate benchmarks, both in terms of how well they predict the performance of new configurations and of how much they affect the overall performance of hyperparameter optimizers. Overall, we found that surrogate benchmarks based on random forests performed best: for benchmarks with few hyperparameters they yielded almost perfect surrogates, and for benchmarks with more complex hyperparameter spaces they still yielded surrogates that were qualitatively similar to the real benchmarks they model.
1 Introduction

The performance of many machine learning methods depends critically on hyperparameter settings and thus on the method used to set such hyperparameters. Recently, sequential model-based Bayesian optimization methods, such as SMAC [16], TPE [2], and Spearmint [29], have been shown to outperform more traditional methods for this problem (such as grid search and random search [3]) and to rival—and in some cases surpass—human domain experts in finding good hyperparameter settings [29, 30, 5]. One obstacle to further progress in this nascent field is a paucity of reproducible experiments and empirical studies. Until recently, a study introducing a new hyperparameter optimizer would typically also introduce a new set of hyperparameter optimization benchmarks, on which the optimizer would be demonstrated to achieve state-of-the-art performance (as compared to, e.g., human domain experts). The introduction of the hyperparameter optimization library (HPOlib [8]), which offers a unified interface to different optimizers and benchmarks, has made it easier to reuse previous benchmarks and to systematically compare different approaches [4].

However, a substantial problem remains: performing a hyperparameter optimization experiment requires running the underlying machine learning algorithm, often at least hundreds of times. This is infeasible in many cases. The first (mundane, but often significant) obstacle is to get someone else's research code working on one's own system—including resolving dependencies and acquiring required software licenses—and to acquire the appropriate input data. Furthermore, some code requires specialized hardware; most notably, general-purpose graphics processing units (GPGPUs) have become a standard requirement for the effective training of modern deep learning architectures [20, 21]. Finally, the computational expense of hyperparameter optimization can be prohibitive for research groups lacking access to large compute clusters. These problems represent a considerable barrier to the evaluation of new hyperparameter optimization algorithms on the most challenging and interesting hyperparameter optimization benchmarks, such as deep belief networks [2], convolutional neural networks [29, 5], and combined model selection and hyperparameter optimization in machine learning frameworks [30].

Given this high overhead for studying complex hyperparameter optimization benchmarks, most researchers have drawn on simple, synthetic test functions from the global continuous optimization community [12]. While these are simple to use, they are often poorly representative of the hyperparameter optimization problem: in contrast to the response surfaces of actual such problems, these synthetic test functions are smooth and often have unrealistic shapes. Furthermore, they only involve real-valued parameters and hence do not incorporate the categorical and conditional parameters typical of actual hyperparameter optimization benchmarks.

In the special case of small, finite hyperparameter spaces, a much better alternative is simply to record the performance of every hyperparameter configuration, thereby speeding future evaluations via a table lookup. The result is a perfect surrogate of an algorithm's true performance that takes time O(1) to compute (using a hash) and that can be used in place of actually running the algorithm and evaluating its performance. This table-based surrogate can trivially be transported to any new system, without the complicating factors involved in running the original algorithm (setup, special hardware requirements, licensing, computational cost, etc.). In fact, several researchers have already applied this approach to simplifying their experiments: for example, Bardenet et al. [1] saved the performances of a parameter grid with 108 points of Adaboost on 29 datasets, and Snoek et al. [29] saved the performance of parameter grids with 1400 and 288 points for a structured SVM [31] and an online LDA [13], respectively. The latter two benchmarks are part of HPOlib and are, in fact, HPOlib's most frequently used benchmarks, due to their simplicity of setup and low computational cost.

Of course, the drawback of this table lookup idea is that it is limited to small, finite hyperparameter spaces. Here, we generalize the idea of machine learning algorithm surrogates to arbitrary, potentially high-dimensional hyperparameter spaces (including, e.g., real-valued, categorical, and conditional hyperparameters). As in the table-lookup strategy, we first evaluate many hyperparameter configurations during an expensive offline phase. We then use the resulting performance data to train a regression model to approximate future evaluations via model predictions. As before, we obtain a surrogate of algorithm performance that is cheap to evaluate and trivially portable. However, model-based surrogates offer only approximate representations of performance. Thus, a key component of our work presented in the following is an investigation of the quality of these approximations.

We are not the first to propose the use of learned surrogate models that stand in for computationally complex functions. In the field of metalearning [6], regression models have been extensively used to predict the performance of algorithms across various datasets based on dataset features [11, 26]. The statistics literature on the design and analysis of computer experiments (DACE) [27, 28] uses similar surrogate models to guide a sequential experimental design strategy aiming to achieve either an overall strong model fit or to identify the minimum of a function. Similarly, the SUrrogate MOdeling (SUMO) Matlab toolkit [10] provides an environment for building regression models to describe the outputs of expensive computer simulations based on active learning. Such an approach for finding the minimum of a blackbox function also underlies the sequential model-based Bayesian optimization framework [7, 16] (SMBO, the framework underlying all hyperparameter optimizers we study here). While all of these lines of work incrementally construct surrogate models of a function in order to inform an active learning criterion that determines new inputs to evaluate, our work differs in its goals: we train surrogates on a set of data gathered offline (by some arbitrary process—in our case the combination of many complete runs of several different SMBO methods plus random search) and use the resulting surrogates as stand-in models for the entire hyperparameter optimization benchmark.

The surrogate benchmarks resulting from our work can be used in several different ways. Firstly, like synthetic test functions and table lookups, they can be used for extensive debugging and unit testing. Since the large computational expense of running hyperparameter optimizers is typically dominated by the cost of evaluating algorithm performance under different selected hyperparameters, our benchmarks can also substantially reduce the time required for running a hyperparameter optimizer, facilitating whitebox tests of an optimizer using exactly the hyperparameter space of the machine learning algorithm whose performance is modelled by the surrogate. This functionality is gained even if the surrogate model only fits algorithm performance quite poorly (e.g., due to a lack of sufficient training data). Finally, a surrogate benchmark whose model fits algorithm performance very well can also facilitate the evaluation of new features inside the hyperparameter optimizer, or even the (meta-)optimization of a hyperparameter optimizer's own hyperparameters (which can be useful even without the use of surrogates, but is typically extremely expensive [17]).

The rest of this paper is laid out as follows. We first provide some background on hyperparameter optimization (Section 2). Then, we discuss our methodology for building surrogate benchmarks (Section 3) using several types of machine learning models. Next, we evaluate the performance of these surrogates in practice (Section 4). We demonstrate that random forest models tend to fit the data better than a broad range of competing models, both in terms of raw predictive model performance and in terms of the usefulness of the resulting surrogate benchmark for comparing hyperparameter optimization procedures.

2 Background: Hyperparameter Optimization

The construction of machine learning models typically gives rise to two optimization problems. The first is internal optimization, such as selecting a neural network's likelihood-maximizing weights; the second is tuning the method's hyperparameters, such as setting a neural network's regularization parameters or number of neurons. The former problem is closely coupled with the machine learning algorithm at hand and is very well studied; here, we consider the latter. Let λ1, . . . , λn denote the hyperparameters of a given machine learning algorithm, and let Λ1, . . . , Λn denote their respective domains. The algorithm's hyperparameter space is then defined as Λ = Λ1 × · · · × Λn. When trained with hyperparameters λ ∈ Λ on data D_train, the algorithm's loss (e.g., misclassification rate) on data D_valid is L(λ, D_train, D_valid). Using k-fold cross-validation, the optimization problem is then to minimize

    f(λ) = (1/k) · Σ_{i=1}^{k} L(λ, D_train^(i), D_valid^(i)).      (1)

A hyperparameter λn can have one of several types, such as continuous, integer-valued or categorical. For example, the learning rate for a neural network is continuous; the random seed given to initialize an algorithm is integer-valued; and the choice between various preprocessing methods is categorical. Furthermore, there can be conditional hyperparameters, which are only active if another hyperparameter takes a certain value; for example, the hyperparameter "number of principal components" only needs to be instantiated when the hyperparameter "preprocessing method" is PCA.

Evaluating f(λ) for a given λ ∈ Λ is computationally costly, and so many techniques have been developed to find good configurations λ with few function evaluations.
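To make Equation (1) concrete, the following minimal sketch (ours, not the authors' code) evaluates f(λ) for a scikit-learn-style classifier with k-fold cross-validation; the logistic-regression estimator and the hyperparameter names C and max_iter are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    def evaluate_f(lam, X, y, k=5):
        """Cross-validated loss f(lambda) from Eq. (1): mean misclassification
        rate of a model trained with hyperparameters `lam`, averaged over k folds."""
        losses = []
        for train_idx, valid_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model = LogisticRegression(C=lam["C"], max_iter=lam["max_iter"])
            model.fit(X[train_idx], y[train_idx])
            # L(lambda, D_train^(i), D_valid^(i)): misclassification rate on fold i
            losses.append(1.0 - model.score(X[valid_idx], y[valid_idx]))
        return float(np.mean(losses))

A single such call already requires k full training runs, which is exactly the cost that makes many evaluations of f(λ) expensive.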
The methods most commonly used in practice are manual search and grid search, but recently, it has been shown that even simple random search can yield much better results [3]. The state of the art in practical optimization of hyperparameters is defined by Bayesian optimization methods [16, 29, 2], which have been successfully applied to problems ranging from deep neural networks to combined model selection and hyperparameter optimization [2, 29, 30, 19, 5].

Bayesian optimization methods use a probabilistic model M to model the relationship between a hyperparameter configuration λ and its performance f(λ). They fit this model using previously gathered data and then use it to select a next point λ_new to evaluate, trading off exploitation and exploration in order to find the minimum of f. They then evaluate f(λ_new), update M with the new data point (λ_new, f(λ_new)), and iterate. Throughout this paper, we will use the following three instantiations of Bayesian optimization:

SPEARMINT [29] is a prototypical Bayesian optimization method that models p_M(f | λ) with Gaussian process (GP) models. It supports continuous and discrete parameters (by rounding), but no conditional parameters.

Sequential Model-based Algorithm Configuration (SMAC) [16] models p_M(f | λ) with random forests. When performing cross-validation, SMAC only evaluates as many folds as necessary to show that a configuration is worse than the best one seen so far (or to replace it). SMAC can handle continuous, categorical, and conditional parameters.

Tree Parzen Estimator (TPE) [2] models p_M(f | λ) indirectly. It models p(f < f*), p(λ | f < f*), and p(λ | f ≥ f*), where f* is defined as a fixed quantile of the function values observed so far, and the latter two probabilities are defined by tree-structured Parzen density estimators. TPE can handle continuous, categorical, and conditional parameters.

An empirical evaluation of the three methods on the HPOlib hyperparameter optimization benchmarks showed that SPEARMINT performed best on benchmarks with few continuous parameters and SMAC performed best on benchmarks with many, categorical, and/or conditional parameters, closely followed by TPE. SMAC also performed best on benchmarks that relied on cross-validation [8].
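For reference, the generic sequential model-based optimization loop that these three methods instantiate (each with its own model class and acquisition criterion) can be sketched as follows; fit_model, propose_next, and config_space are placeholders for illustration, not part of the SMAC, TPE, or SPEARMINT packages.

    def smbo(f, config_space, budget, fit_model, propose_next):
        """Generic SMBO loop: fit a model M on observed (config, loss) pairs,
        propose a new configuration that trades off exploration and exploitation,
        evaluate it, and iterate."""
        history = []                      # observed (lambda, f(lambda)) pairs
        lam = config_space.sample()       # initial design: a random configuration
        for _ in range(budget):
            y = f(lam)                    # expensive evaluation of f(lambda_new)
            history.append((lam, y))      # update data with (lambda_new, f(lambda_new))
            model = fit_model(history)    # refit the probabilistic model M
            lam = propose_next(model, config_space)  # e.g., maximize expected improvement
        return min(history, key=lambda pair: pair[1])  # best configuration found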
3 Methodology

We now discuss our approach, including the algorithm performance data we used, how we preprocessed the data, the types of regression models we evaluated, and how we used them to construct surrogate benchmarks.

Table 1. Overview of evaluated regression algorithms. When we used random search to optimize hyperparameters, we considered 100 samples over the stated hyperparameters (their names refer to the SCIKIT-LEARN implementation [25]); the model was trained on 50% of the data, and the best configuration was chosen based on the performance on the other 50% and then trained on all data.

    Model                     Hyperparameter optimization            Impl.
    Random Forest             None                                   [25]
    Gradient Boosting         None                                   [25]
    Extra Trees               None                                   [25]
    Gaussian Process          MCMC sampling over hyperparameters     [29]
    SVR                       Random search for C and gamma          [25]
    NuSVR                     Random search for C, gamma and nu      [25]
    Bayesian Neural Network   None                                   [24]
    k-nearest-neighbours      Random search for n_neighbors          [25]
    Linear Regression         None                                   [25]
    Least Angle Regression    None                                   [25]
    Ridge Regression          None                                   [25]

3.1 Data collection

In principle, we could construct surrogate benchmarks using algorithm performance data gathered by any means. For example, we could use existing data from a manual exploration of the hyperparameter space, or from an automated approach, such as grid search, random search or one of the more sophisticated hyperparameter optimization methods discussed in Section 2.

It is more important for surrogate benchmarks to exhibit strong predictive quality in some parts of the hyperparameter space than in others. Specifically, our ultimate aim is to ensure that hyperparameter optimizers perform similarly on the surrogate benchmark as on the real benchmark. Since most optimizers spend most of their time in high-performance regions of the hyperparameter space, and since relative differences between the performance of hyperparameter configurations in such high-performance regions tend to impact which hyperparameter configuration will ultimately be returned, accuracy in this part of the space is more important than in regions of poor performance. The training data should therefore densely sample high-performance regions. We thus advocate collecting performance data primarily via runs of existing hyperparameter optimization procedures. As an additional advantage of this strategy, we can obtain this costly performance data as a by-product of executing hyperparameter optimization procedures on the original benchmark.

Of course, it is also important to accurately identify poorly performing parts of the space: if we only trained on performance data for the very best hyperparameter settings, no machine learning model could be expected to infer that performance in the remaining parts of the space is poor. This would typically lead to underpredictions of performance in poor parts of the space. We thus also included performance data gathered by a random search. (An alternative is grid search, which can also cover the entire space. We did not adopt this approach because it cannot deal effectively with large hyperparameter spaces.) To gather the data for each surrogate benchmark in this paper, we therefore executed r = 10 runs of each of the three Bayesian optimization methods described in Section 2 (each time with a different seed), as well as random search, with each run gathering the performance of a fixed number of configurations.

3.2 Data preprocessing

For each benchmark we studied for this paper, after running the hyperparameter optimizers and random search, we preprocessed the data as follows:

1. We extracted all available configuration/performance pairs from the runs. For benchmarks that used cross-validation, we encoded the cross-validation fold of each run as an additional categorical parameter (for benchmarks without cross-validation, that parameter was set to a constant).
2. We removed entries with invalid results caused by algorithm crashes. Since some regression models used in preliminary experiments could not handle duplicated configurations, we also deleted these, keeping the first occurrence.
3. For data from benchmarks featuring conditional parameters, we replaced the values of inactive conditional parameters with a default value.
4. To code categorical parameters, we used a one-hot (aka 1-in-k) encoding, which replaces any single categorical parameter λ with domain Λ = {k1, . . . , kn} by n binary parameters, only the i-th of which is true for data points where λ is set to ki (see the sketch below).
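A rough sketch of these preprocessing steps in pandas is given below; the column names, the crash encoding via missing performance values, and the concrete defaults for inactive conditionals are assumptions made for illustration rather than the exact choices used in the paper.

    import pandas as pd

    def preprocess(runs, categorical_cols, conditional_defaults):
        """Steps 1-4: encode the CV fold, drop crashed/duplicate entries,
        impute inactive conditional parameters, and one-hot encode categoricals."""
        df = runs.copy()
        # 1. Treat the cross-validation fold as an additional categorical parameter
        #    (a constant column for benchmarks without cross-validation).
        if "fold" not in df.columns:
            df["fold"] = 0
        df["fold"] = df["fold"].astype("category")
        # 2. Remove crashed evaluations and duplicated configurations (keep first).
        df = df[df["performance"].notna()]
        config_cols = [c for c in df.columns if c != "performance"]
        df = df.drop_duplicates(subset=config_cols, keep="first")
        # 3. Replace values of inactive conditional parameters with a default value.
        for col, default in conditional_defaults.items():
            df[col] = df[col].fillna(default)
        # 4. One-hot (1-in-k) encode categorical parameters, including the fold.
        return pd.get_dummies(df, columns=list(categorical_cols) + ["fold"])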
3.3 Choice of Regression Models

We considered a broad range of commonly used regression algorithms as candidates for our surrogate benchmarks. To keep the results comparable, all models were trained on data encoded as detailed in the previous section. If necessary for the algorithm, we also normalized the data to have zero mean and unit variance (by subtracting the mean and dividing by the standard deviation). If not stated otherwise for a model, we used the default configuration of its implementation.

Table 1 details the regression models and implementations we used. We evaluated three different tree-based models, because SMAC uses a random forest (RF), and because RFs have been shown to yield high-quality predictions of algorithm performance data [18]. As a specialist for low-dimensional hyperparameter spaces, we used SPEARMINT's Gaussian process (GP) implementation, which performs MCMC to marginalize over hyperparameters. Since SMAC performs particularly well on high-dimensional hyperparameter spaces and SPEARMINT on low-dimensional continuous problems [8], we expected their respective models to mirror that pattern. The remaining prominent model types we experimented with comprised k-nearest-neighbours (kNN), linear regression, least angle regression, ridge regression, SVM methods (all as implemented by scikit-learn [25]), and Bayesian neural networks (BNN) [24].
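For the models tuned by random search in Table 1, the protocol from the table caption (100 random samples, selection on a 50/50 split, refit on all data) could be implemented roughly as follows; the NuSVR search ranges shown here are our own assumption, not the ones used in the paper.

    import numpy as np
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import ParameterSampler, train_test_split
    from sklearn.svm import NuSVR

    def tune_by_random_search(X, y, n_samples=100, seed=0):
        """Draw random configurations, pick the best on a held-out half,
        then retrain the chosen configuration on all data (cf. Table 1)."""
        space = {"C": np.logspace(-3, 3, 100),
                 "gamma": np.logspace(-4, 1, 100),
                 "nu": np.linspace(0.1, 0.9, 9)}
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
        best, best_err = None, np.inf
        for params in ParameterSampler(space, n_iter=n_samples, random_state=seed):
            model = NuSVR(**params).fit(X_tr, y_tr)
            err = mean_squared_error(y_val, model.predict(X_val))
            if err < best_err:
                best, best_err = params, err
        return NuSVR(**best).fit(X, y)   # final model trained on all data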
3.4 Construction and Use of Surrogate Benchmarks

To construct surrogates for a hyperparameter optimization benchmark X, we trained the previously mentioned models on the performance data gathered on benchmark X. The surrogate benchmark X'_M based on model M is identical to the original benchmark X, except that evaluations of the machine learning algorithm to be optimized in benchmark X are replaced by a performance prediction obtained from model M. In particular, the surrogate's configuration space (including all parameter types and domains) and function evaluation budget are identical to the original benchmark.

Importantly, the wall clock time to run an algorithm on X'_M is much lower than that required on X, since expensive evaluations of the machine learning algorithm underlying X are replaced by cheap model predictions. The model M is simply saved to disk and is queried when needed. We could implement each evaluation in X'_M as loading M from disk and then using it for prediction, but to avoid the repeated cost of loading M, we also allow for storing M in an independent process and communicating with it via a local socket.

To evaluate the performance of a surrogate benchmark scenario X'_M we ran the same optimization experiments as on X, using the same settings and seeds. In addition to evaluating the raw predictive performance of model M, we assessed the quality of surrogate benchmark X'_M by measuring the similarity of hyperparameter optimization performance on X and X'_M.
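The construction just described could be sketched as follows with scikit-learn and joblib (a minimal illustration under our own assumptions, not the HPOlib implementation): the regression model M is trained once on the encoded (configuration, performance) data, persisted, and then wrapped as a benchmark function that shares the configuration space and budget of X.

    import joblib
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def build_surrogate(X_encoded, y, path="surrogate_rf.pkl"):
        """Fit the surrogate model M on encoded configurations and save it to disk."""
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_encoded, y)
        joblib.dump(model, path)
        return model

    class SurrogateBenchmark:
        """Stands in for benchmark X: same configuration space and evaluation
        budget, but each evaluation is a model prediction instead of a training run."""
        def __init__(self, path="surrogate_rf.pkl", encoder=None):
            self.model = joblib.load(path)   # loaded once; could alternatively live in a
            self.encode = encoder            # separate process reached via a local socket
        def __call__(self, configuration):
            x = self.encode(configuration)   # same one-hot encoding as the training data
            return float(self.model.predict(np.asarray(x).reshape(1, -1))[0])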
4 Experiments and Results

In this section, we experimentally evaluate the performance of our surrogates. We describe the data upon which our surrogates are based, evaluate the raw performance of our regression models on this data, and then evaluate the quality of the resulting surrogate benchmarks.

4.1 Experimental Setup

We collected data for four benchmarks from the hyperparameter optimization benchmark library, HPOlib [8]. For each benchmark, we executed 10 runs of SMAC, SPEARMINT, TPE and random search (using the same Hyperopt implementation of random search as for TPE), yielding the data detailed in Table 2. The four benchmarks comprised two low-dimensional and two high-dimensional hyperparameter spaces.

Table 2. Properties of our data sets. "Input dim." is the number of features of the training data; it is greater than the number of hyperparameters because categorical hyperparameters and the cross-validation fold are one-hot-encoded. For each benchmark, before preprocessing the number of data points was 10 × 4 × (#evals. per run).

    Benchmark        #λ   cond.   cat./cont.   Input dim.   #evals. per run   #data
    Branin            2     -        -/2            3             200          7402
    Log.Reg. 5CV      4     -        -/4            9             500         18521
    HP-NNET convex   14     4        7/7           25             200          7750
    HP-DBNET mrbi    36    27       19/17          82             200          7466

The two low-dimensional benchmarks were the synthetic Branin test function and a logistic regression [29] on the MNIST dataset [23]. Both of these have been extensively used before to benchmark hyperparameter optimization methods. While the 2-dimensional Branin test function is trivial to evaluate and therefore does not require a surrogate, we nevertheless included it to study how closely a surrogate can approximate the function. The logistic regression is an actual hyperparameter optimization benchmark with 4 hyperparameters that includes a 5-fold cross-validation. That means for each configuration that the optimizers TPE, SPEARMINT and random search evaluated, there were 5 data points that only differ in which fold they corresponded to. Since SMAC saves time by not evaluating all folds for configurations that appear worse than the optimum, it only evaluated a subset of folds for most of the configurations. The evaluation of a single cross-validation fold required roughly 1 minute on a single core of an Intel Xeon E5-2650 v2 CPU.

The high-dimensional benchmarks comprised a simple and a deep neural network, HP-NNET and HP-DBNET (both taken from [2]), used to classify the convex and MRBI datasets, respectively [22]. Their dimensionalities are 14 and 36, respectively, and many categorical hyperparameters further increase the input dimension to the regression model. Evaluating a single HP-NNET configuration required roughly 12 minutes using 2 cores of an Intel Xeon E5-2650 v2 with OpenBlas. The HP-DBNET required a GPGPU to run efficiently; on a modern Geforce GTX780 GPU, it took roughly 15 minutes to evaluate a single configuration. In contrast, using the surrogate benchmark model we built, one configuration can be evaluated in less than a second on a standard CPU.

For some model types, training with all the data from Table 2 was computationally infeasible, and we had to subsample 2 000 data points (uniformly at random; see Footnote 3) for training. This was the case for nuSVR, SVR, and the Bayesian neural network. For the GP model, we had to limit the dataset even further to 1 500 data points. On this reduced training set, the GP model required 255 minutes to train on the most expensive data set (HP-DBNET MRBI), and the Bayesian neural networks required 36 minutes; all other models required less than one minute for training.

Footnote 3: For a given dataset and fold, all models based on the same number of data points used the same subsampled data set. We note that model performance sometimes was quite noisy with respect to the pseudorandom number seed for this subsampling step and we thus used a fixed seed.

We used HPOlib to run the experiments for all optimizers with a single format, both for the original hyperparameter optimization benchmarks and for our surrogates. To make our results reproducible, we fixed the pseudo-random number seed in each function evaluation to 1. The version of the SPEARMINT package we used crashed for about 1% of all runs due to a numerical problem. In evaluations where we require entire trajectories, for these crashed SPEARMINT runs, we imputed the best function value found before the crash for all evaluations after the crash.

4.2 Evaluation of Raw Model Performance

We first studied the raw predictive performance of the models we considered on our preprocessed data.

4.2.1 Using all data

To evaluate the raw predictive performance of the models listed in Table 1, we used 5-fold cross-validation and computed the cross-validated root mean squared error (RMSE) and Spearman's rank correlation coefficient (CC) between model predictions and the true responses in the test fold. Here, the responses correspond to validation error rate in all benchmarks except for the Branin one (where they correspond to the value of the Branin function).
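These two metrics can be computed per test fold as in the following sketch, which uses current scikit-learn and SciPy APIs rather than the original evaluation scripts:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def cross_validated_rmse_cc(model, X, y, k=5):
        """5-fold cross-validated RMSE and Spearman rank correlation (CC)
        between model predictions and true responses in each test fold."""
        rmses, ccs = [], []
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
            rho, _ = spearmanr(y[test_idx], pred)   # rank correlation on the test fold
            ccs.append(rho)
        return float(np.mean(rmses)), float(np.mean(ccs))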
Table 3 presents these results, showing that the GP and the SVR approaches performed best on the smooth low-dimensional synthetic Branin test function, but that RF-based models are better for predicting the performance of actual machine learning algorithms. This strong performance was to be expected for the higher-dimensional hyperparameter space of the neural networks, since RFs perform automatic feature selection (Footnote 4). The logistic regression example is rather low-dimensional, but the categorical cross-validation fold is likely harder to model for GPs than for RFs (Footnote 5). Extra Trees predicted the performance of the actual machine learning algorithms nearly as well as the RF; gradient boosting was slightly worse. Bayesian neural networks, k-nearest-neighbours and our linear regression models could not achieve comparable performance. Based on these results, we decided to focus the remainder of our study on a diverse subset of models: two tree-based approaches (RFs and gradient boosting), Gaussian processes, nuSVR, and, as an example of a popular, yet poorly-performing model, k-nearest-neighbours. We paid special attention to RFs and Gaussian processes, since these have been used most prominently in Bayesian hyperparameter optimization methods.

Footnote 4: We note that RFs could also handle the categorical hyperparameters in these benchmarks natively. We used the one-hot encoding for comparability with other methods. It is remarkable that even with this encoding, they outperformed all other methods.
Footnote 5: As we will see later (Figure 2), the RF-based optimizer SMAC also performed better on this benchmark than the GP-based optimizer SPEARMINT.

Table 3. Average RMSE and CC for a 5-fold cross validation for different regression models. For each entry, bold face indicates the best performance on this dataset, and underlined values are not statistically significantly different from the best according to a paired t-test (with p = 0.05). Models marked with an * (+) are trained on only a subset of 2000 (1500) data points per fold.

                      Branin          Log.Reg. 5CV    HP-NNET convex   HP-DBNET mrbi
    Model             RMSE    CC      RMSE    CC      RMSE    CC       RMSE    CC
    RF                 1.86   1.00    0.03    0.98    0.04    0.94     0.06    0.90
    Gr.Boost           7.10   0.96    0.07    0.94    0.05    0.89     0.06    0.86
    Ex.Trees           1.10   1.00    0.04    0.98    0.03    0.95     0.06    0.90
    GP+                0.03   1.0     0.13    0.88    0.04    0.92     0.10    0.78
    SVR*               0.06   1.0     0.13    0.87    0.06    0.82     0.08    0.80
    nuSVR*             0.02   1.0     0.18    0.84    0.06    0.85     0.08    0.82
    BNN*               6.84   0.91    0.10    0.91    0.05    0.84     0.10    0.72
    kNN                1.78   1.00    0.14    0.88    0.06    0.85     0.08    0.78
    Lin.Reg.          45.33   0.28    0.23    0.78    0.08    0.60     0.1     0.70
    LeastAngleReg.    45.33   0.28    0.23    0.78    0.08    0.60     0.1     0.7
    Ridge Reg.        46.01   0.30    0.26    0.77    0.09    0.61     0.10    0.67

Figure 1 studies the predictive performance in more detail for the HP-DBNET mrbi benchmark, demonstrating that tree-based models also performed best in a qualitative sense. The figure also shows that the models tended to make the largest mistakes for the worst configurations; especially the non-tree-based models predicted some of these to be better than some of the best configurations. The same patterns also held for the logistic regression 5CV and the HP-NNET convex data (not shown). The models also completely failed to identify neural network configurations which did not converge within the time limit and therefore received an error rate of 1.0. Interestingly, the GP failed almost entirely on the high-dimensional HP-DBNET MRBI benchmark in two cases, predicting all data points around the data mean.

[Figure 1: grid of scatter plots; columns: RF, GB, GP, NuSVR, kNN; rows: SMAC, TPE, SPEARMINT, Random.]
Figure 1. True performance (x-axis) vs. regression model predictions (y-axis) for the HP-DBNET mrbi dataset. All plots have the same axes, showing error rates ranging from 0.4 to 1.1. Each marker represents the performance of one configuration; green and red crosses indicate the 1/3 best and worst true performance, respectively. Configurations on the diagonal are predicted perfectly, error predictions above the diagonal are too high, and predictions for configurations below the diagonal are better than the configuration's actual performance. The first column shows which data was left out for training and used for testing.

4.2.2 Leave one optimizer out

In practice, we will want to use our surrogate models to predict the performance of a machine learning algorithm with hyperparameter configurations selected by some new optimization method. The configurations it evaluates might be quite different from those considered by the optimizers whose data we trained on. Next to the standard cross-validation setting from above, we therefore evaluated our models in the leave-one-optimizer-out setting, which means that the regression model learns from data drawn from all but one optimizer, and its performance is measured on the held-out data.

Table 4 reports RMSE and CC analogous to those of Table 3, but for the leave-one-optimizer-out setting. This setting is more difficult and, consequently, the results were slightly worse, but the best-performing models stayed the same: nuSVR and GP for low-dimensional and RFs for higher-dimensional datasets.

Table 4. Average RMSE and CC of 5 regression algorithms in the leave-one-optimizer-out setting. Bold face indicates the best value across all regression models on this dataset.

                Branin          Log.Reg. 5CV    HP-NNET convex   HP-DBNET mrbi
    Model       RMSE    CC      RMSE    CC      RMSE    CC       RMSE    CC
    RF           2.04   0.95    0.10    0.93    0.04    0.82     0.07    0.85
    Gr.Boost     6.96   0.85    0.12    0.84    0.05    0.81     0.07    0.83
    GP           0.12   0.99    0.16    0.76    0.05    0.84     0.10    0.64
    nuSVR        0.03   0.98    0.16    0.74    0.10    0.61     0.10    0.73
    kNN          2.08   0.96    0.19    0.72    0.07    0.73     0.09    0.67
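Assuming every data point is labelled with the optimizer that generated it, the leave-one-optimizer-out split can be sketched as follows (an illustration, not the original code):

    import numpy as np

    def leave_one_optimizer_out(X, y, optimizer_labels, held_out, model):
        """Train on data from all optimizers except `held_out`
        (e.g. "SMAC", "TPE", "SPEARMINT", "random") and test on the held-out data."""
        labels = np.asarray(optimizer_labels)
        train = labels != held_out
        test = ~train
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
        return rmse, pred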
4.3 Evaluation of Surrogate Benchmarks

We now study the performance of the surrogate benchmarks X'_M obtained for random forest (RF) and Gaussian process (GP) models M. We assess the quality of X'_M by comparing the performance of various hyperparameter optimizers on X'_M and the real benchmark X.

4.3.1 Using all data

We first analyzed the performance of surrogate benchmarks based on models trained on the entire data we have available. We note that in this first experiment, a surrogate that perfectly remembers the training data would achieve perfect performance, because we used the same hyperparameter optimizers for evaluation as we did to gather the training data. However, after the first imperfect prediction, the trajectories of the optimizers will diverge. Thus, since none of our models is perfect on training data, this initial experiment serves as an evaluation of surrogate benchmarks based on training data gathered through the same mechanism as at test time.

We performed experiments for our three actual hyperparameter optimization benchmarks: logistic regression, a simple and a deep neural network. For each of them, we repeated the 10 runs for TPE, SMAC and SPEARMINT we previously conducted on the real benchmarks, but now used surrogate benchmarks based on RFs and GPs, respectively.
Figure 2 shows that the surrogate benchmarks based on Gaussian process models differed substantially from the true benchmarks. The figures show the best function values found by the various optimizers over time. Visually comparing the first column (real benchmark) to the third (surrogate benchmark based on the GP model), the most obvious difference is that the surrogate benchmark fails completely on HP-DBNET MRBI: since the GP model is unable to properly fit the high-dimensional data (predicting all configurations to perform roughly equally, around the data mean), all optimizers basically stay at the same performance level (the data mean). Note in the plot for the true benchmark that the GP-based optimizer SPEARMINT also performed very poorly on this benchmark.

[Figure 2: optimization trajectories (best validation error achieved vs. number of function evaluations, log scale) for SMAC, SPEARMINT and TPE; rows: Log.Reg. 5CV, HP-NNET convex, HP-DBNET MRBI; columns: results on the true benchmark, on the RF surrogate, and on the GP surrogate.]
Figure 2. Median and quartile of best performance over time on the real benchmark (left column) and on surrogates (middle: based on RF models; right: based on GP models). Both types of surrogate benchmarks were trained on all available data. For logistic regression 5CV each fold is plotted as a separate function evaluation.

In the other two cases (logistic regression 5CV and HP-NNET convex), the performance of the optimizers appears visually similar to the true benchmark at first glance. However, for the logistic regression 5CV the GP model predicts some parts of the hyperparameter space to be better than the actual best part of the space, leading the final optimization results on the surrogate benchmark to appear better than optimization results on the true benchmark. Another difference is a zig-zag pattern in the trajectories for logistic regression surrogates: these are also (mildly) present in the real benchmark (mild enough to only be detectable when zooming into the figure) and are due to the slightly different performance in the 5 folds of cross-validation; the impact of the folds is very small, but the GP model predicts it to be large, causing the zig-zag. Interestingly, for the GP model trained on the HP-NNET convex dataset, regions with "better" performance appear hard to find: only SMAC and TPE identified them, causing a larger gap between SPEARMINT and SMAC/TPE than on the real benchmark.

Conversely, the RF surrogates yielded results much closer to those obtained on the real benchmark. Visually, the first column (true benchmark) and second column appear very similar, indicating that the RF captured the overall pattern well. There are some differences in the details. For example, on HP-NNET convex, the surrogate does not capture that TPE finds very good configurations before SMAC and yields the overall best performance. Nevertheless, overall, our results for the RF surrogates qualitatively resemble those for the true benchmarks, and for the logistic regression example, the correspondence is almost perfect.

4.3.2 Leave one optimizer out

Next, we studied the use of a surrogate benchmark to evaluate a new optimizer. For each optimizer o and each of the three hyperparameter optimization benchmarks X, we trained RF and GP models M on the respective leave-one-optimizer-out training data discussed in Section 4.2.2 and compared the performance of optimizer o on X and X'_M. Figure 3 reports the results of this experiment, showing that surrogate benchmarks based on RF models qualitatively resembled the real benchmarks.

The results for the logistic regression 5CV benchmark (top row of Figure 3) show that surrogate benchmarks based on RF models mirrored the performance of each optimizer o on the real benchmark well, even when the training data did not include data gathered with optimizer o. In contrast, surrogates based on Gaussian process models performed poorly: the Gaussian process again underestimated the error, predicting better performance in some regions than possible on the real benchmark (Footnote 6). Again, these regions with "better" performance appear hard to find: only SMAC and SPEARMINT found them, causing their performances on the GP-based surrogate benchmark to differ substantially from their performance on the true benchmark.

Footnote 6: We noticed similar behavior for the nuSVR, which even returned negative values for configurations and caused the optimizer to search completely different areas of the configuration space (data not shown here).

Results for HP-NNET convex were also better for the surrogate benchmark based on RFs (especially for SPEARMINT), but not as much better as for the logistic regression 5CV case. As was already the case when the surrogate was based on all training data, the RF-based surrogate benchmarks only approximately captured the strong performance TPE showed on the real benchmark.

Results on HP-DBNET MRBI show a fairly close correspondence between the real benchmark and the RF-based surrogate benchmarks. In contrast, the GP-based surrogate was dismal, once again due to the GP's near-constant predictions (close to the data mean).

After this qualitative evaluation of the surrogate benchmarks, Table 5 offers a quantitative evaluation. We judge the quality of a surrogate benchmark X'_M by how closely it resembles the real benchmark X it was derived from, in terms of the absolute error between the best found values for our four optimizers (SMAC, TPE, SPEARMINT, and random search) after evaluating i configurations. For logistic regression 5CV, in line with our qualitative results we obtained a very small error for the RF-based surrogate. The GP-based surrogate underestimated the achievable error rates, resulting in larger differences between performances on the true and the surrogate runs. After 50 evaluations the GP-based surrogate trained on all data yielded a quite high error because it underestimated the performance for configurations selected by the optimizer SMAC. Training the GP-surrogate on the leave-one-optimizer-out dataset causes worse performance for the optimizer SPEARMINT and too much variation for SMAC, resulting in a higher error as well. This misprediction decreases with more evaluated configurations.
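The error reported in Table 5 (below) is the absolute difference between the best values found on the real and on the surrogate benchmark after i function evaluations, averaged over the 4 × 10 corresponding runs; a minimal sketch, assuming trajectories keyed by (optimizer, seed):

    import numpy as np

    def surrogate_error_at(i, real_trajectories, surrogate_trajectories):
        """Both arguments map (optimizer, seed) -> list of best-found values after
        each function evaluation. Returns the mean absolute difference after i
        evaluations, averaged over the 4 x 10 = 40 corresponding runs (cf. Table 5)."""
        diffs = [abs(real_trajectories[key][i - 1] - surrogate_trajectories[key][i - 1])
                 for key in real_trajectories]
        return float(np.mean(diffs))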
Table 5. Quantitative evaluation of surrogate benchmarks at three different time steps each. We show the mean difference between the best found values for corresponding runs (having the same seed) of the four optimizers (SMAC, TPE, SPEARMINT, and random search) after i function evaluations on the real and surrogate benchmark. For each experiment and optimizer we conducted 10 runs and report the mean error averaged over 4 × 10 = 40 comparisons. We evaluated RF-based and GP-based surrogates. For each problem we measured the error for surrogates trained on all and the leave-one-optimizer-out (leave-ooo) data; e.g., the TPE trajectories are from optimizing on a surrogate that is trained on all training data except that gathered using TPE. Bold face indicates the best performance for this dataset and i function evaluations. Results are underlined when the one-sigma confidence intervals of the best and this result overlap.

    #Function evaluations             50           200          500
    Surrogate                       RF    GP     RF    GP     RF    GP
    Log.Reg. 5CV  all              0.02  0.06   0.00  0.04   0.00  0.05
    Log.Reg. 5CV  leave-ooo        0.02  0.07   0.01  0.03   0.00  0.03

    #Function evaluations             50           100          200
    Surrogate                       RF    GP     RF    GP     RF    GP
    HP-NNET convex  all            0.01  0.03   0.01  0.03   0.01  0.02
    HP-NNET convex  leave-ooo      0.02  0.03   0.02  0.04   0.02  0.03
    HP-DBNET MRBI  all             0.05  0.13   0.05  0.16   0.05  0.17
    HP-DBNET MRBI  leave-ooo       0.04  0.13   0.04  0.16   0.05  0.17

The results for HP-NNET convex look quite similar, with a somewhat smaller difference between RF-based and GP-based surrogates. Indeed, SMAC and TPE behaved similarly on both RF-based and GP-based surrogates as on the real benchmark; only SPEARMINT behaved very differently on the GP-based surrogate, causing an overall higher error than for the RF-based surrogates.

On the high-dimensional HP-DBNET MRBI the surrogates performed differently. Whereas the RF-based surrogate could still reproduce similar optimizer behavior as on the real benchmark, the GP completely failed to do so. Remarkably, overall quantitative performance was similar for surrogate benchmarks trained on all data and those trained on leave-one-optimizer-out datasets.

Overall, these results confirmed our expectation from previous findings in Section 3.3 and the raw regression model performance results in Table 3: good regression models facilitate good surrogate benchmarks. In our case, RFs performed best for both tasks. We note that using the surrogate benchmarks reduced the time requirements substantially; for example, evaluating a surrogate 100 times instead of running HP-NNET convex or HP-DBNET MRBI took less than 1 minute on a single CPU, compared to roughly 10 hours on two CPUs (HP-NNET convex) and over a day on a modern GPU (HP-DBNET MRBI) (Footnote 7).

Footnote 7: Of course, the overhead due to the used hyperparameter optimizer comes on top of this; e.g., SPEARMINT's overhead for a run with 200 evaluations was roughly one hour, whereas SMAC's overhead was less than one minute.

[Figure 3: optimization trajectories (best validation error achieved vs. number of function evaluations, log scale); rows: Log.Reg. 5CV, HP-NNET convex, HP-DBNET MRBI; columns: SMAC, SPEARMINT, TPE.]
Figure 3. Median and quartile of optimization trajectories for surrogates trained in the leave-one-optimizer-out setting. Black trajectories correspond to true benchmarks, coloured trajectories to optimization runs on a surrogate. The first row names the optimizer used to obtain the trajectories; their data was left out for training the regression models.

5 Conclusion and Future Work

To tackle the high computational cost and overhead of performing hyperparameter optimization benchmarking, we proposed surrogate benchmarks that behave similarly to the actual benchmarks they are derived from, but are far cheaper and simpler to use. The key idea is to collect (configuration, performance) pairs from the actual benchmark and to learn a regression model that can predict the performance of a new configuration and therefore stand in for the expensive-to-evaluate algorithm. These surrogates reduce the algorithm overhead to a minimum, which allows extensive runs and analyses of new hyperparameter optimization techniques. We empirically demonstrated that we can obtain surrogate benchmarks that closely resemble the real benchmarks they were derived from.

In future work, we intend to study the use of surrogates for general algorithm configuration. In particular, we plan to support optimization across a set of problem instances, each of which can be described by a fixed-length vector of characteristics, and to assess the resulting surrogates for several problems that algorithm configuration has tackled successfully, such as propositional satisfiability [14], mixed integer programming [15], and AI planning [9]. Finally, good surrogate benchmarks should enable us to explore the configuration options of the optimizers themselves, and we plan to use surrogate benchmarks to enable efficient meta-optimization of the hyperparameter optimization and algorithm configuration methods themselves.

REFERENCES

[1] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag, 'Collaborative hyperparameter tuning', in Proc. of ICML'13, (2013).
[2] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, 'Algorithms for hyper-parameter optimization', in Proc. of NIPS'11, (2011).
[3] J. Bergstra and Y. Bengio, 'Random search for hyper-parameter optimization', JMLR, 13, 281–305, (2012).
[4] J. Bergstra, B. Komer, C. Eliasmith, and D. Warde-Farley, 'Preliminary evaluation of hyperopt algorithms on HPOLib', in ICML workshop on AutoML, (2014).
[5] J. Bergstra, D. Yamins, and D. D. Cox, 'Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures', in Proc. of ICML'13, (2013).
[6] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to Data Mining, Springer, 2008.
[7] E. Brochu, V. M. Cora, and N. de Freitas, 'A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning', CoRR, abs/1012.2599, (2010).
[8] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. H. Hoos, and K. Leyton-Brown, 'Towards an empirical foundation for assessing Bayesian optimization of hyperparameters', in NIPS workshop on Bayesian Optimization, (2013).
[9] C. Fawcett, M. Helmert, H. H. Hoos, E. Karpas, G. Röger, and J. Seipp, 'FD-Autotune: Domain-specific configuration using fast-downward', in Proc. of ICAPS-PAL, (2011).
[10] D. Gorissen, I. Couckuyt, P. Demeester, T. Dhaene, and K. Crombecq, 'A surrogate modeling and adaptive sampling toolbox for computer based design', JMLR, 11, 2051–2055, (2010).
[11] S. B. Guerra, R. B. C. Prudêncio, and T. B. Ludermir, 'Predicting the performance of learning algorithms using support vector machines as meta-regressors', in Proc. of ICANN'08, volume 5163, pp. 523–532, (2008).
[12] N. Hansen, A. Auger, S. Finck, R. Ros, et al., Real-parameter black-box optimization benchmarking 2010: Experimental setup, 2010.
[13] M. D. Hoffman, D. M. Blei, and F. R. Bach, 'Online learning for latent Dirichlet allocation', in Proc. of NIPS'10, (2010).
[14] F. Hutter, D. Babić, H. H. Hoos, and A. J. Hu, 'Boosting verification by automatic tuning of decision procedures', in Proc. of FMCAD'07, pp. 27–34, Washington, DC, USA, (2007). IEEE Computer Society.
[15] F. Hutter, H. H. Hoos, and K. Leyton-Brown, 'Automated configuration of mixed integer programming solvers', in Proc. of CPAIOR-10, pp. 186–202, (2010).
[16] F. Hutter, H. H. Hoos, and K. Leyton-Brown, 'Sequential model-based optimization for general algorithm configuration', in Proc. of LION-5, (2011).
[17] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle, 'ParamILS: an automatic algorithm configuration framework', JAIR, 36(1), 267–306, (2009).
[18] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, 'Algorithm runtime prediction: Methods and evaluation', JAIR, 206(0), 79–111, (2014).
[19] B. Komer, J. Bergstra, and C. Eliasmith, 'Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn', in ICML workshop on AutoML, (2014).
[20] A. Krizhevsky, 'Learning multiple layers of features from tiny images', Technical report, University of Toronto, (2009).
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, 'Imagenet classification with deep convolutional neural networks', in Proc. of NIPS'12, pp. 1097–1105, (2012).
[22] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, 'An empirical evaluation of deep architectures on problems with many factors of variation', in Proc. of ICML'07, (2007).
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, 'Gradient-based learning applied to document recognition', Proc. of the IEEE, 86(11), 2278–2324, (1998).
[24] R. M. Neal, Bayesian learning for neural networks, Ph.D. dissertation, University of Toronto, 1995.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', JMLR, 12, 2825–2830, (2011).
[26] M. Reif, F. Shafait, M. Goldstein, T. Breuel, and A. Dengel, 'Automatic classifier selection for non-experts', PAA, 17(1), 83–96, (2014).
[27] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, 'Design and analysis of computer experiments', Statistical Science, 4(4), 409–423, (November 1989).
[28] T. J. Santner, B. J. Williams, and W. I. Notz, The design and analysis of computer experiments, Springer, 2003.
[29] J. Snoek, H. Larochelle, and R. P. Adams, 'Practical Bayesian optimization of machine learning algorithms', in Proc. of NIPS'12, (2012).
[30] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, 'Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms', in Proc. of KDD'13, (2013).
[31] C. N. J. Yu and T. Joachims, 'Learning structural SVMs with latent variables', in Proc. of ICML'09, pp. 1169–1176, (2009).