<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Data Set Similarities for Hyperparameter Optimization Initializations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Wistuba</string-name>
          <email>wistuba@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Schilling</string-name>
          <email>schilling@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Schmidt-Thieme</string-name>
          <email>schmidt-thieme@ismll.uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems and Machine Learning Lab, Universitätsplatz 1</institution>
          ,
          <addr-line>31141 Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent research has introduced new automatic hyperparameter optimization strategies that accelerate the optimization process and outperform manual, grid and random search in terms of time and prediction accuracy. Currently, meta-learning methods that transfer knowledge from previous experiments to a new experiment are of particular interest to researchers because they further improve hyperparameter optimization. In this work we improve the initialization techniques for sequential model-based optimization, the current state of the art hyperparameter optimization framework. Instead of using a static similarity prediction between data sets, we use the few evaluations on the new data set to create new features. These features allow a better prediction of the data set similarity. Furthermore, we propose a technique that is inspired by active learning. In contrast to the current state of the art, it does not greedily choose the best hyperparameter configuration but takes into account that a time budget is available. Therefore, the first evaluations on the new data set are used for learning a better prediction function for the similarity between data sets, such that we are able to profit from this in future evaluations. We empirically compare the distance functions by applying them in the scenario of the initialization of SMBO by meta-learning. Our two proposed approaches are compared against three competitor methods on one meta-data set with respect to the average rank between these methods and are shown to outperform them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Most machine learning algorithms depend on hyperparameters, and their
optimization is an important part of machine learning techniques applied in practice.
Automatic hyperparameter tuning is attracting more and more attention in the
machine learning community for two simple but important reasons. Firstly, the
omnipresence of hyperparameters concerns the whole community:
everyone is affected by the time-consuming task of optimizing hyperparameters, either
by tuning them manually or by applying a grid search. Secondly, in many cases
the final hyperparameter configuration decides whether an algorithm is state of
the art or just moderate, such that the task of hyperparameter optimization is as
important as developing new models [
        <xref ref-type="bibr" rid="ref13 ref17 ref2 ref20 ref4">2,4,13,17,20</xref>
        ]. Furthermore,
hyperparameter optimization has been shown to be able to also automatically perform algorithm
and preprocessing selection by considering this as a further hyperparameter [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Sequential model-based optimization (SMBO) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is the current state of the
art for hyperparameter optimization and has proven to outperform alternatives
such as grid search or random search [
        <xref ref-type="bibr" rid="ref17 ref2 ref20">2,17,20</xref>
        ]. Recent research tries to improve
the SMBO framework by applying meta-learning to the hyperparameter
optimization problem. The key concept of meta-learning is to transfer knowledge
gained about an algorithm in past experiments on different data sets to new
experiments. Currently, two different, orthogonal ways of transferring this knowledge
exist. One possibility is to initialize SMBO with hyperparameter
configurations that have been best in previous experiments [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ]. Another possibility is
to use surrogate models that are able to learn across data sets [
        <xref ref-type="bibr" rid="ref1 ref19 ref23">1,19,23</xref>
        ].
      </p>
      <p>We improve the former strategy by using an adaptive initialization strategy.
We predict the similarity between data sets by using meta-features as well as features
that express the knowledge gathered about the new data set so far. Having a
more accurate approximation of the similarity between data sets, we are able
to provide a better initialization. We provide empirical evidence that the new
features yield better initializations in two different experiments.</p>
      <p>Furthermore, we propose an initialization strategy that is based on the active
learning idea. We try to evaluate hyperparameter configurations that are
non-optimal in the short term but promise better results than greedily choosing the
hyperparameter configuration that provides the best result in expectation.
To the best of our knowledge, we are the first to propose this idea in the context
of hyperparameter optimization for SMBO.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Initializing hyperparameter optimization through meta-learning has proven to
be effective [
        <xref ref-type="bibr" rid="ref15 ref5 ref6 ref7">5,6,7,15</xref>
        ]. Reif et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] suggest choosing those hyperparameter
configurations for a new data set that were best on a similar data set, in the
context of evolutionary parameter optimization. Here, the similarity was
defined through the distance among meta-features, descriptive data set features.
Recently, Feurer et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] followed their lead and proposed the same
initialization for sequential model-based optimization (SMBO), the current state of the
art hyperparameter optimization framework. Later, they extended their work
by learning a regression model on the meta-features that predicts the similarity
between data sets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Learning a surrogate model, the component in the SMBO framework that
tries to predict the performance of a specific hyperparameter configuration on
a data set, not only from knowledge of the new data set but
additionally from knowledge of experiments on other data sets [
        <xref ref-type="bibr" rid="ref1 ref16 ref19 ref23">1,16,19,23</xref>
        ] is
another option for transferring knowledge, as is pruning [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. This idea is related
but orthogonal to our work: it can benefit from a good initialization, but it is no
replacement for a good initialization strategy.
      </p>
      <p>
        We propose to add features based on the performance of an algorithm on a
data set for a specific hyperparameter configuration. Pfahringer et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
propose to use landmark features. These features are estimated by evaluating simple
algorithms on the data sets of the past and new experiments. In comparison to
our features, these features are no by-product of the optimization process but
have to be computed and hence need additional time. Even though these are
simple algorithms, these features are problem-dependent (classification, regression,
ranking and structured prediction all need their own landmark features).
Furthermore, simple classifiers such as nearest neighbors, as proposed by the authors,
can become very time-consuming for large data sets, which are those data sets
we are interested in.
      </p>
      <p>
        Relative landmarks proposed by Leite et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are not and cannot be used
as meta-features. They are used within a hyperparameter optimization
strategy that is used instead of SMBO, but they are similar in that they are
also given as a by-product and are computed using the relationship between
hyperparameter configurations on each data set.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <p>In this section the hyperparameter optimization problem is formally defined. For
the sake of completeness, the sequential model-based optimization
framework is also presented.</p>
      <sec id="sec-3-1">
        <title>Hyperparameter Optimization Problem Setup</title>
        <p>A machine learning algorithm $A_{\lambda}$ is a mapping $A_{\lambda} : \mathcal{D} \rightarrow \mathcal{M}$ where $\mathcal{D}$ is the
set of all data sets, $\mathcal{M}$ is the space of all models and $\lambda \in \Lambda$ is the chosen
hyperparameter configuration with $\Lambda = \Lambda_{1} \times \ldots \times \Lambda_{P}$ being the $P$-dimensional
hyperparameter space. The learning algorithm estimates a model $M_{\lambda} \in \mathcal{M}$ that
minimizes the objective function that linearly combines the loss function $\mathcal{L}$ and
the regularization term $\mathcal{R}$:</p>
        <p>$$A_{\lambda}\left(D^{(\mathrm{train})}\right) := \arg\min_{M_{\lambda} \in \mathcal{M}} \mathcal{L}\left(M_{\lambda}, D^{(\mathrm{train})}\right) + \mathcal{R}\left(M_{\lambda}\right) \qquad (1)$$</p>
        <p>Then, the task of hyperparameter optimization is finding the hyperparameter
configuration $\lambda^{*}$ that minimizes the loss function on the validation data set, i.e.</p>
        <p>$$\lambda^{*} := \arg\min_{\lambda \in \Lambda} \mathcal{L}\left(A_{\lambda}\left(D^{(\mathrm{train})}\right), D^{(\mathrm{valid})}\right) =: \arg\min_{\lambda \in \Lambda} f_{D}(\lambda) \qquad (2)$$</p>
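        <p>For concreteness, the following is a minimal sketch of what evaluating the black-box function $f_{D}$ of Equation (2) amounts to for an SVM. scikit-learn is assumed, and the data set, the split and the hyperparameters are purely illustrative, not the experimental setup of this paper.</p>
        <preformat>
# Minimal sketch of the black-box objective f_D(lambda) of Equation (2):
# train a model for one hyperparameter configuration and return the
# validation loss. scikit-learn is assumed; all names are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

def f(lam):
    """Validation error of an SVM for the configuration lam = (C, gamma)."""
    C, gamma = lam
    model = SVC(C=C, gamma=gamma).fit(X_train, y_train)  # A_lambda(D_train)
    return 1.0 - model.score(X_valid, y_valid)           # loss on D_valid

print(f((1.0, 0.1)))  # one (expensive) evaluation of f_D
        </preformat>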
      </sec>
      <sec id="sec-3-2">
        <title>D(train) ; D(valid)</title>
        <p>=: arg min fD ( ) :
2
3.2</p>
        <p>
          Exhaustive hyperparameter search methods such as grid search are becoming
more and more expensive: data sets are growing, and models are getting more
complex and have high-dimensional hyperparameter spaces. Sequential model-based
optimization (SMBO) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a black-box optimization framework that replaces
the time-consuming function $f$ to evaluate with a cheap-to-evaluate surrogate
function that approximates $f$. With the help of an acquisition function such
as expected improvement [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], new points are chosen sequentially such that a
balance between exploitation and exploration is met and $f$ is optimized. In our
scenario, evaluating $f$ is equivalent to training a machine learning algorithm on
some training data for a given hyperparameter configuration and estimating the
model's performance on a hold-out data set.
        </p>
        <p>
          Algorithm 1 outlines the SMBO framework. It starts with an observation
history $\mathcal{H}$ that equals the empty set in cases where no knowledge from past
experiments is used [
          <xref ref-type="bibr" rid="ref17 ref2 ref8">2,8,17</xref>
          ], or that is non-empty in cases where past experiments
are used [
          <xref ref-type="bibr" rid="ref1 ref19 ref23">1,19,23</xref>
          ] or SMBO has been initialized [
          <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
          ]. First, the surrogate model
$\Psi$ is fitted to $\mathcal{H}$, where $\Psi$ can be any regression model. Since the acquisition
function $a$ usually needs some certainty about the prediction, common choices
are Gaussian processes [
          <xref ref-type="bibr" rid="ref1 ref17 ref19 ref23">1,17,19,23</xref>
          ] or ensembles such as random forests [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The
acquisition function chooses the next candidate to evaluate. A common choice
for the acquisition function is expected improvement [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], but further acquisition
functions exist, such as probability of improvement [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the conditional entropy
of the minimizer [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] or a multi-armed bandit based criterion [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The
evaluated candidate is finally added to the set of observations. After $T$-many SMBO
iterations, the best hyperparameter configuration found so far is returned.
        </p>
        <sec id="sec-3-2-1">
          <title>Algorithm 1 Sequential Model-based Optimization</title>
          <p>Input: Hyperparameter space $\Lambda$, observation history $\mathcal{H}$, number of iterations $T$,
acquisition function $a$, surrogate model $\Psi$.</p>
          <p>Output: Best hyperparameter configuration found.
1: for $t = 1$ to $T$ do
2:   Fit $\Psi$ to $\mathcal{H}$
3:   $\lambda \leftarrow \arg\max_{\lambda \in \Lambda} a(\lambda, \Psi)$
4:   Evaluate $f(\lambda)$
5:   $\mathcal{H} \leftarrow \mathcal{H} \cup \{(\lambda, f(\lambda))\}$
6: return $\arg\min_{(\lambda, f(\lambda)) \in \mathcal{H}} f(\lambda)$</p>
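          <p>As an illustration, here is a compact sketch of Algorithm 1 with a Gaussian process surrogate and the expected improvement acquisition function, assuming scikit-learn and SciPy. The objective and the finite candidate set are placeholders, not the setup used in our experiments.</p>
          <preformat>
# Sketch of Algorithm 1: SMBO with a Gaussian process surrogate and the
# expected-improvement acquisition function, for minimization.
# scikit-learn and SciPy are assumed; f and the candidates are placeholders.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def f(lam):
    # placeholder for the expensive objective f_D (cf. Equation (2))
    return (lam[0] - 0.3) ** 2 + (lam[1] - 0.7) ** 2

Lambda = np.random.RandomState(0).rand(200, 2)  # finite candidate set
H_x, H_y = [Lambda[0]], [f(Lambda[0])]          # observation history H

def expected_improvement(mu, sigma, y_best):
    sigma = np.maximum(sigma, 1e-9)             # avoid division by zero
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

T = 20
for t in range(T):
    psi = GaussianProcessRegressor().fit(np.array(H_x), np.array(H_y))
    mu, sigma = psi.predict(Lambda, return_std=True)   # surrogate prediction
    lam = Lambda[int(np.argmax(expected_improvement(mu, sigma, min(H_y))))]
    H_x.append(lam)                                    # lines 4-5: evaluate
    H_y.append(f(lam))                                 # and record (lam, f(lam))

best = H_x[int(np.argmin(H_y))]                        # line 6
print(best, min(H_y))
          </preformat>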
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Adaptive Initialization</title>
      <p>
        Recent initialization techniques for sequential model-based optimization
compute a static sequence of initial hyperparameter configurations [
        <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
        ]. The
advantage of this idea is that during the initialization there is no time overhead
for computing the next hyperparameter configuration. The disadvantage is that
knowledge about the new data set gained during the initialization is not used for
further initialization queries. Hence, we propose to use an adaptive initialization
technique. Firstly, we propose to add some additional meta-features generated
from this knowledge, which follows the idea of landmark features [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
relative landmarks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], respectively. Secondly, we apply the idea of active learning
and try to choose the hyperparameter configurations that allow learning a
precise ranking of data sets with respect to their similarity to the new data set.
      </p>
      <sec id="sec-4-1">
        <title>Adaptive Initialization Using Additional Meta-Features</title>
        <p>
          Feurer et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
          ] propose to improve the SMBO framework by an initialization
strategy as shown in Algorithm 2. The idea is to use the hyperparameter
configurations that have been best on other data sets. Those hyperparameter
configurations are ranked with respect to the predicted similarity to the new data set
$D_{\mathrm{new}}$ for which the best hyperparameter configuration needs to be found. The
true distance function $d : \mathcal{D} \times \mathcal{D} \rightarrow \mathbb{R}$ between data sets is unknown, such that
Feurer et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] propose to approximate it by $\hat{d}(m_{i}, m_{j}) = \|m_{i} - m_{j}\|_{p}$ where
$m_{i}$ is the vector of meta-features of data set $D_{i}$. In their extended work [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], they
propose to use a random forest to learn $\hat{d}$ using training instances of the form
$\left((m_{i}, m_{j})^{T}, d(D_{i}, D_{j})\right)$. This initialization does not consider the performance
of the hyperparameter configurations already evaluated on the new data set.
        </p>
        <p>
          We propose to keep Feurer's initialization strategy [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] untouched and only
add additional meta-features that capture the information gained on the new
data set. Meta-features as defined in Equation 3 are added to the set of
meta-features for all hyperparameter configurations $\lambda_{k}, \lambda_{l}$ that are evaluated on the
new data set. The symbol $\oplus$ denotes an exclusive or. An additional difference is
that now, after each step, the meta-features and the model $\hat{d}$ need to be updated
(before Line 2 of Algorithm 2).
        </p>
        <p>$$m_{D_{i},D_{j},\lambda_{k},\lambda_{l}} = \mathbb{I}\left( f_{D_{i}}(\lambda_{k}) &gt; f_{D_{i}}(\lambda_{l}) \oplus f_{D_{j}}(\lambda_{k}) &gt; f_{D_{j}}(\lambda_{l}) \right) \qquad (3)$$</p>
        <p>We make here the assumption that the same set of hyperparameter
configurations was evaluated on all training data sets. If this is not the case, the
problem can be overcome by approximating the respective values by learning
surrogate models for the training data sets as well. Since for these data sets much
information is available, the prediction will be reliable enough. For simplicity,
we assume that the former is the case.</p>
        <p>
          The target $d$ can be any similarity measure that reflects the true similarity
between data sets. Feurer et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] propose to use the Spearman correlation
coefficient, while we are using the number of discordant pairs in our experiments,
i.e.
        </p>
        <p>$$d(D_{i}, D_{j}) := \frac{\sum_{\lambda_{k},\lambda_{l} \in \Lambda} \mathbb{I}\left( f_{D_{i}}(\lambda_{k}) &gt; f_{D_{i}}(\lambda_{l}) \oplus f_{D_{j}}(\lambda_{k}) &gt; f_{D_{j}}(\lambda_{l}) \right)}{|\Lambda| \, (|\Lambda| - 1)} \qquad (4)$$</p>
        <p>where $\Lambda$ is the set of hyperparameter configurations observed on the training
data sets. This change will have no influence on the prediction quality for the
traditional meta-features but is better suited for the proposed landmark features
in Equation 3.</p>
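        <p>To make Equations 3 and 4 concrete, the following is a small sketch, assuming the precomputed performances $f_{D_{i}}(\lambda_{k})$ are stored row-wise in an array; the numbers are purely illustrative.</p>
        <preformat>
# Sketch of the landmark features of Eq. (3) and the discordant-pairs
# distance of Eq. (4). perf[i, k] is assumed to hold f_{D_i}(lambda_k),
# the loss of configuration k on data set i; the numbers are illustrative.
import numpy as np

perf = np.array([[0.10, 0.30, 0.20],   # data set D_1
                 [0.15, 0.25, 0.35],   # data set D_2
                 [0.40, 0.10, 0.20]])  # data set D_3

def pairwise_feature(i, j, k, l):
    """Eq. (3): 1 iff exactly one of the two data sets ranks
    configuration k worse than configuration l (exclusive or)."""
    return float((perf[i, k] > perf[i, l]) ^ (perf[j, k] > perf[j, l]))

def distance(i, j):
    """Eq. (4): fraction of discordant configuration pairs."""
    n = perf.shape[1]
    disc = sum(pairwise_feature(i, j, k, l)
               for k in range(n) for l in range(n) if k != l)
    return disc / (n * (n - 1))

print(distance(0, 1), distance(0, 2))  # D_1 is closer to D_2 than to D_3
        </preformat>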
        <p>Algorithm 2 Sequential Model-based Optimization with Initialization
Input: Hyperparameter space $\Lambda$, observation history $\mathcal{H}$, number of iterations $T$,
acquisition function $a$, surrogate model $\Psi$, set of data sets $\mathcal{D}$, number of initial
hyperparameter configurations $I$, prediction function $\hat{d}$ for distances between data sets.</p>
        <p>Output: Best hyperparameter configuration found.
1: for $i = 1$ to $I$ do
2:   Predict the distances $\hat{d}(D_{j}, D_{\mathrm{new}})$ for all $D_{j} \in \mathcal{D}$.
3:   $\lambda \leftarrow$ best hyperparameter configuration on the $i$-th closest data set.
4:   Evaluate $f(\lambda)$
5:   $\mathcal{H} \leftarrow \mathcal{H} \cup \{(\lambda, f(\lambda))\}$
6: return SMBO($\Lambda$, $\mathcal{H}$, $T - I$, $a$, $\Psi$)</p>
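        <p>A minimal sketch of this initialization loop follows, assuming a trained distance regressor d_hat over concatenated meta-feature vectors, a table best_config of per-data-set best configurations, the meta-feature vectors meta and m_new, and the objective f; all of these names are illustrative.</p>
        <preformat>
# Sketch of Algorithm 2: rank the training data sets by predicted distance
# to D_new and evaluate the best configuration of the i-th closest one.
# meta (meta-feature vectors), m_new, best_config, f and the regressor
# d_hat are assumed to be given; all names are illustrative.
import numpy as np

def initialize(meta, m_new, best_config, f, d_hat, I):
    H = []
    for i in range(I):
        # line 2: predict d_hat(D_j, D_new) for every training data set D_j
        dists = [d_hat(np.concatenate([m_j, m_new])) for m_j in meta]
        # line 3: best configuration on the i-th closest data set
        lam = best_config[int(np.argsort(dists)[i])]
        # lines 4-5: evaluate f and record the observation
        H.append((lam, f(lam)))
        # in the adaptive variant of Section 4.1, the meta-features and
        # d_hat would be updated here, before the next distance prediction
    return H
        </preformat>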
      </sec>
      <sec id="sec-4-2">
        <title>Adaptive Initialization Using Active Learning</title>
        <p>We propose to extend the method from the last section by investing a few
initialization steps in carefully selecting hyperparameter configurations that will lead
to good additional meta-features and provide a better prediction function $\hat{d}$. An
additional meta-feature is useful if the resulting regression model $\hat{d}$ predicts the
distances of the training data sets to the new data set such that the ordering
with respect to the predicted distances reflects the ordering with respect to the
true distances. If $I$ is the number of initialization steps and $K &lt; I$ is the number
of steps used to choose additional meta-features, then the $K$ hyperparameter
configurations need to be chosen such that the precision at $I - K$ with respect to the
ordering is optimized. The precision at $n$ is defined as</p>
        <p>$$\mathrm{prec@}n := \frac{\left| \{ n \text{ closest data sets to } D_{\mathrm{new}} \text{ wrt. } d \} \cap \{ n \text{ closest data sets to } D_{\mathrm{new}} \text{ wrt. } \hat{d} \} \right|}{n} \qquad (5)$$</p>
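        <p>A short sketch of Equation 5 follows, assuming the true and predicted distances of the training data sets to $D_{\mathrm{new}}$ are given as arrays; the values are illustrative.</p>
        <preformat>
# Sketch of Eq. (5): precision at n between the true ordering and the
# predicted ordering of the training data sets by distance to D_new.
import numpy as np

def prec_at_n(d_true, d_pred, n):
    closest_true = set(np.argsort(d_true)[:n])
    closest_pred = set(np.argsort(d_pred)[:n])
    return len(closest_true.intersection(closest_pred)) / n

print(prec_at_n([0.1, 0.4, 0.2, 0.9], [0.2, 0.1, 0.3, 0.8], 2))  # 0.5
        </preformat>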
        <p>Algorithm 3 presents the method we used to find the best first $K$
hyperparameter configurations. In a leave-one-out cross-validation over all training data sets
$\mathcal{D}$, the pair of hyperparameter configurations $(\lambda_{j}, \lambda_{k})$ is sought that achieves the
best precision at $I - K$ on average (Lines 1 to 7). Since testing all different
combinations of $K$ different hyperparameter configurations is too expensive, only the
best pair is searched. The remaining $K - 2$ hyperparameter configurations are
greedily added to the final set of initial hyperparameter configurations $\Lambda_{\mathrm{active}}$ as
described in Lines 8 to 15: the hyperparameter configuration that, together with all
hyperparameter configurations chosen so far, performs best on average is added to
$\Lambda_{\mathrm{active}}$. After choosing $K$ hyperparameter configurations as described in Algorithm
3, the remaining $I - K$ hyperparameter configurations are chosen as described in
Section 4.1. The time needed for computing the first $K$ hyperparameter configurations highly
depends on the size of $\Lambda$. To speed up the process, we reduced $\Lambda$ to those
hyperparameter configurations that have been best on at least one training data set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Evaluation</title>
      <sec id="sec-5-1">
        <title>Initialization Strategies</title>
        <p>
          The following initialization strategies will be considered in our experiments.
Random Best Initialization (RBI) This is a very simple initialization strategy:
$I$ data sets are chosen at random from the training data sets $\mathcal{D}$,
and their best hyperparameter configurations are used for the initialization.
Nearest Best Initialization (NBI) This is the initialization strategy proposed by
Reif et al. and Feurer et al. [
          <xref ref-type="bibr" rid="ref15 ref5">5,15</xref>
          ]. Instead of choosing the $I$ training data sets at
random, they are chosen with respect to the similarity between the meta-features
listed in Table 1. Then, as for RBI, the best hyperparameter configurations on
these data sets are chosen for initialization.
        </p>
        <p>
          Predictive Best Initialization (PBI) Feurer et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] propose to learn the distance
between data sets using a regression model based on the meta-features. The
predicted distances are used to find the most similar data sets to the new data
set and, as before, the best hyperparameter configurations on these data sets
are chosen for initialization.
        </p>
        <p>Adaptive Predictive Best Initialization (aPBI) This is our extension to PBI
presented in Section 4.1 that adapts to the new data set during initialization by
including the features defined in Equation 3.</p>
        <p>
          Active Adaptive Predictive Best Initialization (aaPBI) Active Adaptive
Predictive Best Initialization is described in Section 4.2 and extends aPBI by using
the first $K$ steps to choose hyperparameter configurations that will result in
promising meta-features. After the first $K$ iterations, it behaves equivalently to
aPBI.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Meta-Features</title>
        <p>
          Meta-features are supposed to be discriminative for a data set and can be
estimated without evaluating $f$. Many surrogate models [
          <xref ref-type="bibr" rid="ref1 ref19 ref23">1,19,23</xref>
          ] and initialization
strategies [
          <xref ref-type="bibr" rid="ref15 ref5">5,15</xref>
          ] use them to predict the similarity between data sets. For the
experiments, the meta-features listed in Table 1 are used. For an in-depth
explanation we refer the reader to [
          <xref ref-type="bibr" rid="ref1 ref11">1,11</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>Meta-Data Set</title>
        <p>
          For creating the meta-data set, 50 classification data sets from the UCI
repository were chosen at random. All instances were merged in cases where there were
already train/test splits, shuffled and split into 80% train and 20% test. A
support vector machine (SVM) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] was used to create the meta-data set. The
SVM was trained using three different kernels (linear, polynomial and
Gaussian), and the labels of the meta-instances were estimated by evaluating the
trained model on the test split. The total hyperparameter dimension is six:
three dimensions for indicator variables that indicate which kernel was
chosen, one for the trade-off parameter $C$, one for the width $\gamma$ of the Gaussian
kernel and one for the degree $d$ of the polynomial kernel. If a
hyperparameter is not involved, e.g. the degree if the Gaussian kernel was used, it is set
to 0. The test accuracy is precomputed on a grid $C \in \{2^{-5}, \ldots, 2^{6}\}$,
$\gamma \in \{10^{-4}, 10^{-3}, 10^{-2}, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 10^{2}, 10^{3}\}$ and $d \in \{2, \ldots, 10\}$,
resulting in 288 meta-instances per data set. The meta-data set is extended
by the meta-features listed in Table 1.
        </p>
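        <p>Written out, the grid above is small enough to enumerate directly. The following sketch shows the 6-dimensional encoding (three kernel indicators plus $C$, $\gamma$ and $d$) and confirms the count of 288 configurations; the order of the encoding is an assumption for illustration.</p>
        <preformat>
# Sketch of the meta-data set grid: a 6-dimensional encoding per
# configuration (three kernel indicators, C, gamma, degree), with unused
# hyperparameters set to 0, as described above.
from itertools import product

Cs = [2.0 ** e for e in range(-5, 7)]            # 12 values: 2^-5 ... 2^6
gammas = [1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 1e2, 1e3]
degrees = range(2, 11)                           # d in {2, ..., 10}

grid = []
for C in Cs:
    grid.append((1, 0, 0, C, 0, 0))              # linear kernel
for C, d in product(Cs, degrees):
    grid.append((0, 1, 0, C, 0, d))              # polynomial kernel
for C, g in product(Cs, gammas):
    grid.append((0, 0, 1, C, g, 0))              # Gaussian kernel

print(len(grid))  # 12 + 12*9 + 12*14 = 288 meta-instances per data set
        </preformat>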
        <p>
          Two different experiments are conducted. First, state of the art initialization
strategies are compared with respect to the average rank after $I$ initial
hyperparameter configurations. Second, the long term effect on the hyperparameter
optimization is compared. Even though the initial hyperparameter configurations
lead to good results after $I$ evaluations, the ultimate aim is to have good results at
the end of the hyperparameter optimization after $T$ iterations.
        </p>
        <p>We evaluated all methods in a leave-one-out cross-validation per data set.
All data sets but one are used for training and the data set not used for training
is the new data set. The results reported are the average over 100 repetitions.
Due to the randomness, we used 1,000 repetitions whenever RBI was used.</p>
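        <p>The average rank used for the comparison can be computed in a few lines; a sketch assuming SciPy, with illustrative numbers:</p>
        <preformat>
# Sketch of the average-rank comparison: at each iteration, rank the
# methods by their best result so far on every data set and average
# the ranks over the data sets. SciPy is assumed; numbers are illustrative.
import numpy as np
from scipy.stats import rankdata

# best misclassification rate so far, shape: (data sets, methods);
# lower is better, so the lowest value receives rank 1
results = np.array([[0.10, 0.12, 0.11],
                    [0.20, 0.18, 0.18],
                    [0.05, 0.07, 0.06]])

ranks = np.vstack([rankdata(row) for row in results])  # ties share ranks
print(ranks.mean(axis=0))  # average rank of each method
        </preformat>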
        <p>
          Comparison to Other Initialization Strategies For all our experiments,
10 initialization steps are used, i.e. $I = 10$. In the left plot of Figure 1, the different initialization
strategies are compared with respect to the average rank. Our
initialization strategy aPBI benefits from the new, adaptive features and is able
to outperform the other initialization strategies. Our second strategy aaPBI uses
three active learning steps, $K = 3$. This explains its behavior in the beginning.
After these initial steps it is able to catch up and finally surpass PBI, but it
is not as good as aPBI. Furthermore, a distance function learned on the
meta-features (PBI) provides better results than a fixed distance function (NBI), which
confirms the results by Feurer et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. All initialization strategies are able to
outperform RBI, which confirms that the meta-features contain information
about the similarity between data sets. The right plot
shows the average misclassification rate (MCR), where the MCR is scaled to the interval between 0
and 1 for each data set.
        </p>
        <p>
          Comparison with Respect to the Long Term Effect To look at
the long term effect of the initialization strategies, we compare the different
initialization strategies within the SMBO framework using two common surrogate
models. One is a Gaussian process with a squared exponential kernel with
automatic relevance determination [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The kernel parameters are estimated by
maximizing the marginal likelihood on the meta-training set [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The second
is a random forest [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Again, aPBI provides strictly better results than PBI
for both surrogate models and outperforms every other initialization strategy for
the Gaussian process. Our alternative strategy aaPBI performs mediocrely for the
Gaussian process but well for the random forest.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        Predicting the similarity between data sets is an important topic for
hyperparameter optimization since it allows knowledge to be transferred successfully from past
experiments to a new experiment. We have presented an easy way of achieving
an adaptive initialization strategy by adding a new kind of landmark features.
We have shown for two popular surrogate models that these new features
improve over the same strategy without these features. Finally, we introduced a
new idea that, in contrast to the current methods, considers that there is a limit
on the number of evaluations. It tries to exploit this knowledge by applying a guided
exploration at first that leads to worse decisions in the short term but delivers
better results when the end of the initialization is reached. Unfortunately, the
results for this method are not fully convincing, but we believe that it can be a
good idea to choose hyperparameter configurations in a smarter way than always
assuming that the next hyperparameter configuration chosen is the last one.</p>
        <p>
          In this work we were able to provide a more accurate prediction of the
similarity between data sets and used this knowledge to improve the initialization.
Since not only initialization strategies but also surrogate models rely on an exact
similarity prediction, we plan to investigate the impact on these models. For
example, Yogatama and Mann [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] use a kernel that measures the distance between
data sets by using the p-norm between meta-features of data sets only. A better
distance function may help to improve the prediction and will help to improve
the SMBO beyond the initialization.
        </p>
      <p>Acknowledgments. The authors gratefully acknowledge the co-funding of their
work by the German Research Foundation (Deutsche Forschungsgemeinschaft)
under grant SCHM 2583/6-1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bardenet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brendel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kégl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Collaborative hyperparameter tuning</article-title>
          . In: Dasgupta,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Mcallester</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of the 30th International Conference on Machine Learning (ICML-13)</source>
          . vol.
          <volume>28</volume>
          , pp.
          <fpage>199</fpage>
          <lpage>207</lpage>
          . JMLR Workshop and Conference Proceedings (May
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bardenet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kégl</surname>
          </string-name>
          , B.:
          <article-title>Algorithms for hyper-parameter optimization</article-title>
          . In:
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K</given-names>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pp.
          <fpage>2546</fpage>
          <lpage>2554</lpage>
          . Curran Associates, Inc. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <fpage>27:1</fpage>
          -
          <lpage>27:27</lpage>
          (
          <year>2011</year>
          ), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Coates</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An analysis of single-layer networks in unsupervised feature learning</article-title>
          . In: Gordon,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Dunson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Dudík</surname>
          </string-name>
          , M. (eds.)
          <source>Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings</source>
          , vol.
          <volume>15</volume>
          , pp.
          <fpage>215</fpage>
          <lpage>223</lpage>
          . JMLR W&amp;CP
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Using meta-learning to initialize Bayesian optimization of hyperparameters</article-title>
          .
          <source>In: ECAI workshop on Metalearning and Algorithm Selection (MetaSel)</source>
          . pp.
          <fpage>3</fpage>
          <lpage>10</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Initializing Bayesian hyperparameter optimization via meta-learning</article-title>
          .
          <source>In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30</source>
          ,
          <year>2015</year>
          , Austin, Texas, USA. pp.
          <fpage>1128</fpage>
          <lpage>1135</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gomes</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Prudêncio</given-names>
            , R.B.,
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.L.</given-names>
            ,
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>Combining meta-learning and search techniques to select parameters for support vector machines</article-title>
          .
          <source>Neurocomputing</source>
          <volume>75</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          <lpage>13</lpage>
          (
          <year>2012</year>
          ),
          <source>Brazilian Symposium on Neural Networks (SBRN 2010), International Conference on Hybrid Artificial Intelligence Systems (HAIS</source>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
          </string-name>
          , K.:
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          .
          <source>In: Proceedings of the 5th International Conference on Learning and Intelligent Optimization</source>
          . pp.
          <fpage>507</fpage>
          <lpage>523</lpage>
          . LION'05, Springer-Verlag
          , Berlin, Heidelberg (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schonlau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welch</surname>
          </string-name>
          , W.J.:
          <article-title>Efficient global optimization of expensive black-box functions</article-title>
          .
          <source>J. of Global Optimization</source>
          <volume>13</volume>
          (
          <issue>4</issue>
          ),
          <fpage>455</fpage>
          <lpage>492</lpage>
          (Dec
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Leite</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanschoren</surname>
          </string-name>
          , J.:
          <article-title>Selecting classification algorithms with active testing</article-title>
          . In: Perner,
          <string-name>
            <surname>P</surname>
          </string-name>
          . (ed.)
          <source>Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science</source>
          , vol.
          <volume>7376</volume>
          , pp.
          <fpage>117</fpage>
          <lpage>131</lpage>
          . Springer Berlin Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Michie</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spiegelhalter</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , C.C.,
          <string-name>
            <surname>Campbell</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          . (eds.):
          <source>Machine Learning, Neural and Statistical Classification. Ellis Horwood</source>
          , Upper Saddle River, NJ, USA (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bensusan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giraud-Carrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Meta-learning by landmarking various learning algorithms</article-title>
          .
          <source>In: Proceedings of the Seventeenth International Conference on Machine Learning</source>
          . pp.
          <fpage>743</fpage>
          <lpage>750</lpage>
          . Morgan Kaufmann (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doukhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , DiCarlo,
          <string-name>
            <given-names>J.J.</given-names>
            ,
            <surname>Cox</surname>
          </string-name>
          , D.D.:
          <article-title>A high-throughput screening approach to discovering good forms of biologically inspired visual representation</article-title>
          .
          <source>PLoS Computational Biology</source>
          <volume>5</volume>
          (
          <issue>11</issue>
          ),
          <year>e1000579</year>
          (
          <year>2009</year>
          ),
          <source>PMID: 19956750</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.I.</given-names>
          </string-name>
          :
          <article-title>Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)</article-title>
          . The MIT Press (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Reif</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafait</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Meta-learning for evolutionary parameter optimization of classifiers</article-title>
          .
          <source>Machine Learning</source>
          <volume>87</volume>
          (
          <issue>3</issue>
          ),
          <fpage>357</fpage>
          <lpage>380</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Schilling</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wistuba</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drumond</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Hyperparameter Optimization with Factorized Multilayer Perceptrons</article-title>
          .
          <source>In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD</source>
          <year>2015</year>
          , Porto, Portugal, September 7-
          <issue>11</issue>
          ,
          <year>2015</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          :
          <article-title>Practical Bayesian optimization of machine learning algorithms</article-title>
          . In: Pereira,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K</surname>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , pp.
          <fpage>2951</fpage>
          <lpage>2959</lpage>
          . Curran Associates, Inc. (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Srinivas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seeger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kakade</surname>
            ,
            <given-names>S.M.:</given-names>
          </string-name>
          <article-title>Gaussian process optimization in the bandit setting: No regret and experimental design</article-title>
          . In: Fürnkranz,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Joachims</surname>
          </string-name>
          , T. (eds.)
          <source>Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          . pp.
          <fpage>1015</fpage>
          <lpage>1022</lpage>
          .
          Omnipress
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Swersky</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>R.P.:</given-names>
          </string-name>
          <article-title>Multi-task Bayesian optimization</article-title>
          . In: Burges,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Welling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ghahramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K</surname>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , pp.
          <fpage>2004</fpage>
          <lpage>2012</lpage>
          . Curran Associates, Inc. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Thornton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
          </string-name>
          , K.:
          <article-title>Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms</article-title>
          .
          <source>In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>847</fpage>
          <lpage>855</lpage>
          . KDD '13,
          ACM
          , New York, NY, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Villemonteix</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vazquez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walter</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>An informational approach to the global optimization of expensive-to-evaluate functions</article-title>
          .
          <source>Journal of Global Optimization</source>
          <volume>44</volume>
          (
          <issue>4</issue>
          ),
          <fpage>509</fpage>
          <lpage>534</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Wistuba</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schilling</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Hyperparameter Search Space Pruning - A New Component for Sequential Model-Based Hyperparameter Optimization</article-title>
          .
          <source>In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD</source>
          <year>2015</year>
          , Porto, Portugal, September 7-
          <issue>11</issue>
          ,
          <year>2015</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Yogatama</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mann</surname>
          </string-name>
          , G.:
          <article-title>Efficient transfer learning method for automatic hyperparameter tuning</article-title>
          .
          <source>In: International Conference on Artificial Intelligence and Statistics (AISTATS 2014)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>