<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning Data Set Similarities for Hyperparameter Optimization Initializations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martin</forename><surname>Wistuba</surname></persName>
							<email>wistuba@ismll.uni-hildesheim.de</email>
							<affiliation key="aff0">
								<orgName type="department">Information Systems and Machine Learning Lab Universitätsplatz 1</orgName>
								<address>
									<postCode>31141</postCode>
									<settlement>Hildesheim</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicolas</forename><surname>Schilling</surname></persName>
							<email>schilling@ismll.uni-hildesheim.de</email>
							<affiliation key="aff0">
								<orgName type="department">Information Systems and Machine Learning Lab Universitätsplatz 1</orgName>
								<address>
									<postCode>31141</postCode>
									<settlement>Hildesheim</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lars</forename><surname>Schmidt-Thieme</surname></persName>
							<email>schmidt-thieme@ismll.uni-hildesheim.de</email>
							<affiliation key="aff0">
								<orgName type="department">Information Systems and Machine Learning Lab Universitätsplatz 1</orgName>
								<address>
									<postCode>31141</postCode>
									<settlement>Hildesheim</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Learning Data Set Similarities for Hyperparameter Optimization Initializations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4BB53E6C0EFFF67A635D1C7A24E1764C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T02:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Current research has introduced new automatic hyperparameter optimization strategies that are able to accelerate this optimization process and outperform manual and grid or random search in terms of time and prediction accuracy. Currently, meta-learning methods that transfer knowledge from previous experiments to a new experiment arouse particular interest among researchers because they improve hyperparameter optimization. In this work we further improve the initialization techniques for sequential model-based optimization, the current state-of-the-art hyperparameter optimization framework. Instead of using a static similarity prediction between data sets, we use the few evaluations on the new data set to create new features. These features allow a better prediction of the data set similarity. Furthermore, we propose a technique that is inspired by active learning. In contrast to the current state of the art, it does not greedily choose the best hyperparameter configuration but considers that a time budget is available. Therefore, the first evaluations on the new data set are used for learning a better prediction function for the similarity between data sets such that we are able to profit from this in future evaluations. We empirically compare the distance function by applying it in the scenario of the initialization of SMBO by meta-learning. Our two proposed approaches are compared against three competitor methods on one meta-data set with respect to the average rank between these methods and are shown to outperform them.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Most machine learning algorithms depend on hyperparameters and their optimization is an important part of machine learning techniques applied in practice.</p><p>Automatic hyperparameter tuning is attracting more and more attention from the machine learning community for two simple but important reasons. Firstly, the omnipresence of hyperparameters affects the whole community such that everyone is affected by the time-consuming task of optimizing hyperparameters either by manually tuning them or by applying a grid search. Secondly, in many cases the final hyperparameter configuration decides whether an algorithm is state of the art or just moderate such that the task of hyperparameter optimization is as important as developing new models <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b19">20]</ref>. Furthermore, hyperparameter optimization has been shown to be able to also automatically perform algorithm and preprocessing selection by considering this as a further hyperparameter <ref type="bibr" target="#b19">[20]</ref>.</p><p>Sequential model-based optimization (SMBO) <ref type="bibr" target="#b8">[9]</ref> is the current state of the art for hyperparameter optimization and has proven to outperform alternatives such as grid search or random search <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b1">2]</ref>. Recent research tries to improve the SMBO framework by applying meta-learning to the hyperparameter optimization problem. The key concept of meta-learning is to transfer knowledge gained for an algorithm in past experiments on different data sets to new experiments. Currently, two different, orthogonal ways of transferring this knowledge exist. 
One possibility is to initialize SMBO by using hyperparameter configurations that have been best on previous experiments <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. Another possibility is to use surrogate models that are able to learn across data sets <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b22">23]</ref>.</p><p>We improve the former strategy by using an adaptive initialization strategy.</p><p>We predict the similarity between data sets by using meta-features and features that express the knowledge gathered about the new data set so far. Having a more accurate approximation of the similarity between data sets, we are able to provide a better initialization. We provide empirical evidence that the new features provide better initializations in two different experiments. Furthermore, we propose an initialization strategy that is based on the active learning idea. We try to evaluate hyperparameter configurations that are non-optimal in the short term but promise better results than greedily choosing the hyperparameter configuration that will provide the best result in expectation.</p><p>To the best of our knowledge, we are the first to propose this idea in the context of hyperparameter optimization for SMBO.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Initializing hyperparameter optimization through meta-learning has proven to be effective <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b5">6]</ref>. Reif et al. <ref type="bibr" target="#b14">[15]</ref> suggest choosing those hyperparameter configurations for a new data set that were best on a similar data set in the context of evolutionary parameter optimization. Here, the similarity was defined through the distance among meta-features, descriptive data set features.</p><p>Recently, Feurer et al. <ref type="bibr" target="#b4">[5]</ref> followed their lead and proposed the same initialization for sequential model-based optimization (SMBO), the current state-of-the-art hyperparameter optimization framework. Later, they extended their work by learning a regression model on the meta-features that predicts the similarity between data sets <ref type="bibr" target="#b5">[6]</ref>.</p><p>Another option to transfer knowledge is to learn a surrogate model, the component in the SMBO framework that tries to predict the performance of a specific hyperparameter configuration on a data set, not only on knowledge of the new data set but additionally on knowledge from experiments on other data sets <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b15">16]</ref>; pruning <ref type="bibr" target="#b21">[22]</ref> is a further option. This idea is related but orthogonal to our work: it can benefit from a good initialization and is no replacement for a good initialization strategy.</p><p>We propose to add features based on the performance of an algorithm on a data set for a specific hyperparameter configuration. Pfahringer et al. <ref type="bibr" target="#b11">[12]</ref> propose to use landmark features. 
These features are estimated by evaluating simple algorithms on the data sets of the past and new experiments. In comparison to our features, these features are no by-product of the optimization process but have to be computed and hence need additional time. Even though these are simple algorithms, these features are problem-dependent (classification, regression, ranking and structured prediction all need their own landmark features). Furthermore, simple classifiers such as nearest neighbors as proposed by the authors can become very time-consuming for large data sets, which are the data sets we are interested in.</p><p>Relative landmarks proposed by Leite et al. <ref type="bibr" target="#b9">[10]</ref> are not and cannot be used as meta-features. They are used within a hyperparameter optimization strategy that is used instead of SMBO but are similar in that they are also obtained as a by-product and are computed using the relationship between hyperparameter configurations on each data set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Background</head><p>In this section the hyperparameter optimization problem is formally defined. For the sake of completeness, the sequential model-based optimization framework is also presented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Hyperparameter Optimization Problem Setup</head><p>A machine learning algorithm A_λ is a mapping A_λ : D → M where D is the set of all data sets, M is the space of all models and λ ∈ Λ is the chosen hyperparameter configuration with Λ = Λ_1 × … × Λ_P being the P-dimensional hyperparameter space. The learning algorithm estimates a model M_λ ∈ M that minimizes the objective function that linearly combines the loss function L and the regularization term R:</p><formula xml:id="formula_0">A_{\lambda}\left(D^{(\mathrm{train})}\right) := \operatorname*{arg\,min}_{M_{\lambda} \in \mathcal{M}} \mathcal{L}\left(M_{\lambda}, D^{(\mathrm{train})}\right) + \mathcal{R}\left(M_{\lambda}\right).<label>(1)</label></formula><p>Then, the task of hyperparameter optimization is finding the hyperparameter configuration λ* that minimizes the loss function on the validation data set, i.e.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Sequential Model-based Optimization</head><p>Exhaustive hyperparameter search methods such as grid search are becoming more and more expensive. Data sets are growing, models are getting more complex and have high-dimensional hyperparameter spaces. Sequential model-based optimization (SMBO) <ref type="bibr" target="#b8">[9]</ref> is a black-box optimization framework that replaces the time-consuming function f with a cheap-to-evaluate surrogate function Ψ that approximates f. With the help of an acquisition function such as expected improvement <ref type="bibr" target="#b8">[9]</ref>, new points are chosen sequentially such that a balance between exploitation and exploration is met and f is optimized. In our scenario, evaluating f is equivalent to training a machine learning algorithm on some training data for a given hyperparameter configuration and estimating the model's performance on a hold-out data set.</p><p>Algorithm 1 outlines the SMBO framework. It starts with an observation history H that equals the empty set in cases where no knowledge from past experiments is used <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b16">17]</ref> or is non-empty in cases where past experiments are used <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b22">23]</ref> or SMBO has been initialized <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. First, the surrogate model Ψ is fitted to H where Ψ can be any regression model. Since the acquisition function a usually needs an estimate of the uncertainty of the prediction, common choices are Gaussian processes <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b22">23]</ref> or ensembles such as random forests <ref type="bibr" target="#b7">[8]</ref>. 
The acquisition function chooses the next candidate to evaluate. A common choice for the acquisition function is expected improvement <ref type="bibr" target="#b8">[9]</ref> but further acquisition functions exist such as probability of improvement <ref type="bibr" target="#b8">[9]</ref>, the conditional entropy of the minimizer <ref type="bibr" target="#b20">[21]</ref> or a multi-armed bandit based criterion <ref type="bibr" target="#b17">[18]</ref>. The evaluated candidate is finally added to the set of observations. After T SMBO iterations, the best hyperparameter configuration found so far is returned.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Sequential Model-based Optimization</head><p>Input: Hyperparameter space Λ, observation history H, number of iterations T, acquisition function a, surrogate model Ψ. Output: Best hyperparameter configuration found.</p><formula xml:id="formula_1">1: for t = 1 to T do
2:   Fit Ψ to H
3:   λ* ← arg max_{λ* ∈ Λ} a(λ*, Ψ)
4:   Evaluate f(λ*)
5:   H ← H ∪ {(λ*, f(λ*))}
6: return arg min_{(λ*, f(λ*)) ∈ H} f(λ*)</formula></div>
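<div xmlns="http://www.tei-c.org/ns/1.0"><p>The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the hyperparameter space is a finite list of tuples, the surrogate is a toy nearest-neighbour predictor standing in for a Gaussian process or random forest, the acquisition function is a negated lower confidence bound standing in for expected improvement, and already evaluated candidates are skipped. The names fit_surrogate, acquisition and smbo are hypothetical.</p>

```python
import math


def fit_surrogate(history):
    """Toy surrogate (assumption): 1-nearest-neighbour prediction whose
    uncertainty grows with the distance to the closest observation."""
    def predict(lam):
        if not history:
            return 0.0, 1.0  # no data yet: flat mean, unit uncertainty
        dist, value = min((math.dist(lam, seen), y) for seen, y in history)
        return value, dist
    return predict


def acquisition(lam, surrogate, kappa=1.0):
    """Toy acquisition (assumption): negated lower confidence bound, so
    larger values mark more promising candidates when f is minimized."""
    mean, uncertainty = surrogate(lam)
    return -(mean - kappa * uncertainty)


def smbo(candidates, history, num_iterations, f):
    """Algorithm 1 over a finite candidate set; returns (lambda, f(lambda))."""
    history = list(history)
    for _ in range(num_iterations):
        surrogate = fit_surrogate(history)                       # line 2
        evaluated = {lam for lam, _ in history}
        # Assumption: do not re-evaluate candidates unless none are left.
        pool = [lam for lam in candidates if lam not in evaluated] or candidates
        best = max(pool, key=lambda lam: acquisition(lam, surrogate))  # line 3
        history.append((best, f(best)))                          # lines 4-5
    return min(history, key=lambda pair: pair[1])                # line 6


# Usage: minimize a toy response surface on a small one-dimensional grid.
grid = [(0.0,), (1.0,), (2.0,), (3.0,)]
best_lam, best_value = smbo(grid, [], 4, lambda lam: (lam[0] - 2.0) ** 2)
```

<p>Passing a non-empty history corresponds to an initialized SMBO run, which is the setting studied in the following sections.</p></div>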
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Adaptive Initialization</head><p>Recent initialization techniques for sequential model-based optimization compute a static sequence of initial hyperparameter configurations <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. The advantage of this idea is that during the initialization there is no time overhead for computing the next hyperparameter configuration. The disadvantage is that knowledge gained during the initialization about the new data set is not used for further initialization queries. Hence, we propose to use an adaptive initialization technique. Firstly, we propose to add some additional meta-features generated from this knowledge, which follows the idea of landmark features <ref type="bibr" target="#b11">[12]</ref> and relative landmarks <ref type="bibr" target="#b9">[10]</ref>, respectively. Secondly, we apply the idea of active learning and try to choose the hyperparameter configurations that will allow us to learn a precise ranking of data sets with respect to their similarity to the new data set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Adaptive Initialization Using Additional Meta-Features</head><p>Feurer et al. <ref type="bibr" target="#b4">[5]</ref> propose to improve the SMBO framework by an initialization strategy as shown in Algorithm 2. The idea is to use the hyperparameter configurations that have been best on other data sets. Those hyperparameter configurations are ranked with respect to the predicted similarity to the new data set D_new for which the best hyperparameter configuration needs to be found. The true distance function d : D × D → ℝ between data sets is unknown, so Feurer et al. <ref type="bibr" target="#b4">[5]</ref> propose to approximate it by d̂(m_i, m_j) = ‖m_i − m_j‖_p where m_i is the vector of meta-features of data set D_i. In their extended work <ref type="bibr" target="#b5">[6]</ref>, they propose to use a random forest to learn d̂ using training instances of the form ((m_i, m_j)^T, d(D_i, D_j)). This initialization does not consider the performance of the hyperparameter configurations already evaluated on the new data set.</p><p>We propose to keep Feurer's initialization strategy <ref type="bibr" target="#b5">[6]</ref> untouched and only add additional meta-features that capture the information gained on the new data set. Meta-features as defined in Equation 3 are added to the set of meta-features for all hyperparameter configurations λ_k, λ_l that are evaluated on the new data set. The symbol ⊕ denotes an exclusive or. An additional difference is that now after each step the meta-features and the model d̂ need to be updated (before Line 2).</p><formula xml:id="formula_2">m_{D_i, D_j, \lambda_k, \lambda_l} = \mathbb{I}\left( f_{D_i}(\lambda_k) &gt; f_{D_i}(\lambda_l) \oplus f_{D_j}(\lambda_k) &gt; f_{D_j}(\lambda_l) \right)<label>(3)</label></formula><p>We make here the assumption that the same set of hyperparameter configurations was evaluated across all training data sets. 
If this is not the case, this problem can be overcome by approximating the respective value by learning surrogate models for the training data sets as well. Since for these data sets much information is available, the prediction will be reliable enough. For simplicity, we assume that the former is the case.</p><p>The target d can be any similarity measure that reflects the true similarity between data sets. Feurer et al. <ref type="bibr" target="#b5">[6]</ref> propose to use the Spearman correlation coefficient while we are using the number of discordant pairs in our experiments, i.e.</p><formula xml:id="formula_3">d(D_i, D_j) := \frac{\sum_{\lambda_k, \lambda_l \in \Lambda} \mathbb{I}\left( f_{D_i}(\lambda_k) &gt; f_{D_i}(\lambda_l) \oplus f_{D_j}(\lambda_k) &gt; f_{D_j}(\lambda_l) \right)}{|\Lambda| \left( |\Lambda| - 1 \right)}<label>(4)</label></formula><p>where Λ is the set of hyperparameter configurations observed on the training data sets. This change will have no influence on the prediction quality for the traditional meta-features but is better suited for the proposed landmark features in Equation <ref type="formula" target="#formula_2">3</ref>.</p></div>
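<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make Equations 3 and 4 concrete, the following Python sketch computes the pairwise feature and the discordant-pair distance. It is a minimal illustration under the assumption stated above that both data sets were evaluated on the same set of hyperparameter configurations; representing each f_D as a dictionary from configuration to loss, and the function names, are choices made for this example only.</p>

```python
from itertools import permutations


def discordant(f_i, f_j, lam_k, lam_l):
    """Eq. 3: 1 if the two data sets disagree on whether lam_k has a
    larger loss than lam_l, otherwise 0 (exclusive or of two comparisons)."""
    return int((f_i[lam_k] > f_i[lam_l]) != (f_j[lam_k] > f_j[lam_l]))


def discordant_pair_distance(f_i, f_j):
    """Eq. 4: fraction of ordered configuration pairs that are discordant."""
    configs = list(f_i)
    n = len(configs)
    if not n > 1:
        raise ValueError("need at least two hyperparameter configurations")
    # Pairs with k == l never contribute, so ordered pairs k != l suffice.
    total = sum(discordant(f_i, f_j, k, l) for k, l in permutations(configs, 2))
    return total / (n * (n - 1))


# Usage with made-up losses for three configurations on two data sets.
f_a = {"c1": 0.10, "c2": 0.20, "c3": 0.30}
f_b = {"c1": 0.15, "c2": 0.40, "c3": 0.25}
distance = discordant_pair_distance(f_a, f_b)  # only (c2, c3) is swapped: 2 of 6 ordered pairs
```

<p>Identical rankings give a distance of 0 and fully reversed rankings give a distance of 1, which is the behavior expected from a count of discordant pairs.</p></div>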
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2 Sequential Model-based Optimization with Initialization</head><p>Input: Hyperparameter space Λ, observation history H, number of iterations </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Adaptive Initialization Using Active Learning</head><p>We propose to extend the method from the last section by investing a few initialization steps in carefully selecting hyperparameter configurations that will lead to good additional meta-features and provide a better prediction function d̂. An additional meta-feature is useful if the resulting regression model d̂ predicts the distances of the training data sets to the new data set such that the ordering with respect to the predicted distances reflects the ordering with respect to the true distances. If I is the number of initialization steps and K &lt; I is the number of steps to choose additional meta-features, then the K hyperparameter configurations need to be chosen such that the precision at I − K with respect to the ordering is optimized. With d denoting the true and d̂ the predicted distance, the precision at n is defined as prec@n := |{n closest data sets to D_new wrt. d} ∩ {n closest data sets to D_new wrt. d̂}| / n.</p><p>Algorithm 3 presents the method we used to find the best first K hyperparameter configurations. In a leave-one-out cross-validation over all training data sets D the pair of hyperparameter configurations (λ_j, λ_k) is sought that achieves the best precision at I − K on average (Lines 1 to 7). Since testing all different combinations of K different hyperparameter configurations is too expensive, only the best pair is searched. The remaining K − 2 hyperparameter configurations are greedily added to the final set of initial hyperparameter configurations Λ_active as described in Lines 8 to 15. The hyperparameter configuration is added to Λ_active that performs best on average together with all hyperparameter configurations chosen so far. After choosing K hyperparameter configurations as described in Algorithm 3, the remaining I − K hyperparameter configurations are chosen as described in Section 4.1. The time needed for computing the first K hyperparameter configurations highly depends on the size of Λ.  
<ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b14">15]</ref>. Instead of choosing I training data sets at random, they are chosen with respect to the similarity between the meta-features listed in Table <ref type="table" target="#tab_2">1</ref>. Then, like for RBI, the best hyperparameter configurations on these data sets are chosen for initialization.</p><p>Predictive Best Initialization (PBI) Feurer et al. <ref type="bibr" target="#b5">[6]</ref> propose to learn the distance between data sets using a regression model based on the meta-features. These predicted distances are used to find the most similar data sets to the new data set and like before, the best hyperparameter configurations on these data sets are chosen for initialization.</p><p>Adaptive Predictive Best Initialization (aPBI) This is our extension to PBI presented in Section 4.1 that adapts to the new data set during initialization by including the features defined in Equation <ref type="formula" target="#formula_2">3</ref>.</p><p>Active Adaptive Predictive Best Initialization (aaPBI) Active Adaptive Predictive Best Initialization is described in Section 4.2 and extends aPBI by using the first K steps to choose hyperparameter configurations that will result in promising meta-features. After the first K iterations, it behaves equivalently to aPBI. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Meta-Features</head><p>Meta-features are supposed to be discriminative for a data set and can be estimated without evaluating f . Many surrogate models <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b22">23]</ref> and initialization strategies <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b14">15]</ref> use them to predict the similarity between data sets. For the experiments, the meta-features listed in Table <ref type="table" target="#tab_2">1</ref> are used. For an in-depth explanation we refer the reader to <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b10">11]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Meta-Data Set</head><p>For creating the two meta-data sets, 50 classification data sets from the UCI repository are chosen at random. All instances are merged in cases where there were already train/test splits, shuffled and split into 80% train and 20% test. A support vector machine (SVM) <ref type="bibr" target="#b2">[3]</ref> was used to create the meta-data set. The SVM was trained using three different kernels (linear, polynomial and Gaussian) and the labels of the meta-instances were estimated by evaluating the trained model on the test split. The total hyperparameter dimension is six: three dimensions for indicator variables that indicate which kernel was chosen, one for the trade-off parameter C, one for the width γ of the Gaussian kernel and one for the degree d of the polynomial kernel. We evaluated all methods in a leave-one-out cross-validation per data set.</p><p>All data sets but one are used for training and the data set not used for training is the new data set. The results reported are the average over 100 repetitions.</p><p>Due to the randomness, we used 1,000 repetitions whenever RBI was used.</p><p>Fig. <ref type="figure">1</ref>. Five different tuning strategies (RBI, NBI, PBI, aPBI, aaPBI) are compared over the number of initial hyperparameter configurations with respect to the average rank (left) and the average normalized MCR (right). Our strategy aPBI is able to outperform the state of the art. Our second strategy aaPBI is executed under the assumption that it has 10 initial steps. The first steps are used for active learning and hence the bad results in the beginning are not surprising.</p><p>Comparison to Other Initialization Strategies For all our experiments 10 initialization steps are used, I = 10. In Figure <ref type="figure">1</ref> the different initialization strategies are compared with respect to the average rank in the left plot. 
Our initialization strategy aPBI benefits from the new, adaptive features and is able to outperform the other initialization strategies. Our second strategy aaPBI uses three active learning steps, K = 3. This explains the behavior in the beginning.</p><p>After these initial steps it is able to catch up and finally surpass PBI but it is not as good as aPBI. Furthermore, a distance function learned on the meta-features (PBI) provides better results than a fixed distance function (NBI), which confirms the results by Feurer et al. <ref type="bibr" target="#b5">[6]</ref>. All other initialization strategies are able to outperform RBI, which confirms that the meta-features contain information about the similarity between data sets. The right plot shows the average misclassification rate (MCR) where the MCR is scaled to lie between 0 and 1 for each data set.</p><p>Comparison with Respect to the Long Term Effect To have a look at the long term effect of the initialization strategies, we compare the different initialization strategies in the SMBO framework using two common surrogate models. One is a Gaussian process with a squared exponential kernel with automatic relevance determination <ref type="bibr" target="#b16">[17]</ref>. The kernel parameters are estimated by maximizing the marginal likelihood on the meta-training set <ref type="bibr" target="#b13">[14]</ref>. The second is a random forest <ref type="bibr" target="#b7">[8]</ref>. Again, aPBI provides strictly better results than PBI for both surrogate models and outperforms any other initialization strategy for the Gaussian process. Our alternative strategy aaPBI performs only moderately for the Gaussian process but well for the random forest.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Random Forest</head><p>Fig. <ref type="figure">2</ref>. The effect of the initialization strategies on the further optimization process in the SMBO framework using a Gaussian process (left) and a random forest (right) as a surrogate model is investigated. Our initialization strategy aPBI seems to be strictly better than PBI while aaPBI performs especially well for the random forest in the later phase.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future Work</head><p>Predicting the similarity between data sets is an important topic for hyperparameter optimization since it allows knowledge to be transferred successfully from past experiments to a new experiment. We have presented an easy way of achieving an adaptive initialization strategy by adding a new kind of landmark features.</p><p>We have shown for two popular surrogate models that these new features improve over the same strategy without these features. Finally, we introduced a new idea that in contrast to the current methods considers that there is a limit on the number of evaluations. It tries to exploit this knowledge by applying a guided exploitation at first that will lead to worse decisions in the short term but will deliver better results when the end of the initialization is reached. Unfortunately, the results for this method are not fully convincing but we believe that it can be a good idea to choose hyperparameter configurations in a smarter way rather than always assuming that the next hyperparameter configuration chosen is the last one.</p><p>In this work we were able to provide a more accurate prediction of the similarity between data sets and used this knowledge to improve the initialization.</p><p>Since not only initialization strategies but also surrogate models rely on an exact similarity prediction, we plan to investigate the impact on these models. For example, Yogatama and Mann <ref type="bibr" target="#b22">[23]</ref> use a kernel that measures the distance between data sets by using the p-norm between meta-features of data sets only. A better distance function may help to improve the prediction and will help to improve SMBO beyond the initialization.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label>(2)</label><figDesc>\lambda^{*} := \operatorname*{arg\,min}_{\lambda \in \Lambda} \mathcal{L}\left(A_{\lambda}\left(D^{(\mathrm{train})}\right), D^{(\mathrm{valid})}\right) =: \operatorname*{arg\,min}_{\lambda \in \Lambda} f_{D}(\lambda).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>T, acquisition function a, surrogate model Ψ, set of data sets D, number of initial hyperparameter configurations I, prediction function for distances between data sets d̂.</figDesc><table><row><cell cols="2">Output: Best hyperparameter configuration found.</cell></row><row><cell cols="2">1: for i = 1 to I do</cell></row><row><cell>2:</cell><cell>Predict the distances d̂(Dj, Dnew) for all Dj ∈ D.</cell></row><row><cell>3:</cell><cell>λ* ← Select best hyperparameter configuration on the i-th closest data set.</cell></row><row><cell>4:</cell><cell>Evaluate f(λ*)</cell></row><row><cell>5:</cell><cell>H ← H ∪ {(λ*, f(λ*))}</cell></row><row><cell cols="2">6: return SMBO(Λ, H, T − I, a, Ψ)</cell></row></table></figure>
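<div xmlns="http://www.tei-c.org/ns/1.0"><p>Lines 1 to 5 of Algorithm 2 can be illustrated with the short Python sketch below. It is a hedged example rather than the implementation evaluated in this work: the distance predictor defaults to a fixed p-norm on the meta-features (the NBI-style choice), the training data sets are held in a dictionary mapping a name to a pair of meta-feature vector and per-configuration losses, and the names p_norm_distance and initialize are hypothetical. A learned predictor, as in PBI or aPBI, could be passed in through predict_distance.</p>

```python
def p_norm_distance(m_i, m_j, p=2):
    """Fixed distance between two meta-feature vectors (assumed default)."""
    return sum(abs(a - b) ** p for a, b in zip(m_i, m_j)) ** (1.0 / p)


def initialize(train_sets, new_meta, f_new, num_init, predict_distance=p_norm_distance):
    """Return the observation history after num_init initialization steps.

    train_sets maps a data set name to (meta_features, losses), where
    losses maps each hyperparameter configuration to its loss.
    num_init must not exceed the number of training data sets.
    """
    history = []
    for i in range(num_init):
        # Line 2: predict the distance of every training data set to the new one.
        ranked = sorted(
            train_sets,
            key=lambda name: predict_distance(train_sets[name][0], new_meta),
        )
        # Line 3: best (lowest-loss) configuration on the i-th closest data set.
        losses = train_sets[ranked[i]][1]
        lam = min(losses, key=losses.get)
        # Lines 4-5: evaluate on the new data set and record the observation.
        history.append((lam, f_new(lam)))
    return history


# Usage with made-up meta-features and losses for three training data sets.
train_sets = {
    "A": ((0.0, 0.0), {"x": 0.3, "y": 0.1}),
    "B": ((1.0, 1.0), {"x": 0.2, "y": 0.5}),
    "C": ((5.0, 5.0), {"x": 0.9, "y": 0.8}),
}
new_losses = {"x": 0.25, "y": 0.35}
history = initialize(train_sets, (0.9, 0.9), new_losses.get, 2)
```

<p>The returned history is what Line 6 hands over to SMBO. Recomputing the ranking in every step costs nothing for a fixed distance but leaves room for a predictor that is updated after each evaluation.</p></div>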
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To speed up the process, we reduced Λ to those hyperparameter configurations that have been best on at least one training data set.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Algorithm 3 Active Initialization</head><label></label><figDesc>Input: Set of training data sets D, set of observed hyperparameter configurations on D Λ, number of active learning steps K, number of initial hyperparameter configurations I. Output: List of hyperparameter configurations that will lead to good predictors for d̂.</figDesc><table><row><cell cols="2">1: for i = 1 to |D| do</cell></row><row><cell>2:</cell><cell>Dtest ← {Di}</cell></row><row><cell>3:</cell><cell>Dtrain ← D \ {Di}</cell></row><row><cell>4:</cell><cell>for all λj, λk ∈ Λ, j ≠ k do</cell></row><row><cell>5:</cell><cell>  Train d̂ on Dtrain using the additional feature m_{•,•,λj,λk} (Eq. 3)</cell></row><row><cell>6:</cell><cell>  Estimate the precision at I − K of d̂ on Dtest.</cell></row><row><cell cols="2">7: Λactive ← {λj, λk} with highest average precision.</cell></row><row><cell cols="2">8: for k = 1 to K − 2 do</cell></row><row><cell>9:</cell><cell>for i = 1 … |D| do</cell></row><row><cell>10:</cell><cell>  Dtest ← {Di}</cell></row><row><cell>11:</cell><cell>  Dtrain ← D \ {Di}</cell></row><row><cell>12:</cell><cell>  for all λ ∈ Λ \ Λactive do</cell></row><row><cell>13:</cell><cell>    Train d̂ on Dtrain using the additional features (Eq. 3) implied by Λactive ∪ {λ}.</cell></row><row><cell>14:</cell><cell>    Estimate the precision at I − K of d̂ on Dtest.</cell></row><row><cell>15:</cell><cell>Add λ with highest average precision to Λactive.</cell></row><row><cell cols="2">16: return Λactive</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Initialization Strategies</head><p>The following initialization strategies will be considered in our experiments.</p><p>Random Best Initialization (RBI) This initialization is a very simple initialization. I training data sets from the training data sets D are chosen at random and their best hyperparameter configurations are used for the initialization.</p><p>Nearest Best Initialization (NBI) This is the initialization strategy proposed by Reif et al. and Feurer et al.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1.</head><label>1</label><figDesc>List of all meta-features used.</figDesc><table><row><cell>Number of Classes</cell><cell>Class Probability Max</cell></row><row><cell>Number of Instances</cell><cell>Class Probability Mean</cell></row><row><cell>Log Number of Instances</cell><cell>Class Probability Standard Deviation</cell></row><row><cell>Number of Features</cell><cell>Kurtosis Min</cell></row><row><cell>Log Number of Features</cell><cell>Kurtosis Max</cell></row><row><cell>Data Set Dimensionality</cell><cell>Kurtosis Mean</cell></row><row><cell>Log Data Set Dimensionality</cell><cell>Kurtosis Standard Deviation</cell></row><row><cell>Inverse Data Set Dimensionality</cell><cell>Skewness Min</cell></row><row><cell>Log Inverse Data Set Dimensionality</cell><cell>Skewness Max</cell></row><row><cell>Class Cross Entropy</cell><cell>Skewness Mean</cell></row><row><cell>Class Probability Min</cell><cell>Skewness Standard Deviation</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>If a hyperparameter is not involved, e.g. the degree when the Gaussian kernel is used, it is set to 0. The test accuracy is precomputed on a grid C ∈ {2^{-5}, ..., 2^{6}}, γ ∈ {10^{-4}, 10^{-3}, 10^{-2}, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 10^{2}, 10^{3}} and d ∈ {2, ..., 10}, resulting in 288 meta-instances per data set. The meta-data set is extended by the meta-features listed in Table 1. 5.4 Experiments. Two different experiments are conducted. First, state-of-the-art initialization strategies are compared with respect to the average rank after I initial hyperparameter configurations. Second, the long-term effect on the hyperparameter optimization is compared. Even though the initial hyperparameter configurations lead to good results after I evaluations, the ultimate aim is to have good results at the end of the hyperparameter optimization after T iterations.</figDesc><table /></figure>
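<div xmlns="http://www.tei-c.org/ns/1.0"><p>The figure of 288 meta-instances per data set can be checked by enumerating the grid. The sketch below rests on assumptions not spelled out in the passage above: three kernels (named "linear", "polynomial" and "gaussian" here for illustration), where the linear kernel uses only C, the polynomial kernel uses C and the degree d, and the Gaussian kernel uses C and γ, and where the exponents of C are the consecutive integers from -5 to 6. Under this reading the count matches the stated total.</p>
<p>
```python
from itertools import product

# Grid values as listed in the text; consecutive integer exponents for C are an assumption.
C_VALUES = [2.0 ** e for e in range(-5, 7)]
GAMMA_VALUES = [1e-4, 1e-3, 1e-2, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 1e2, 1e3]
DEGREE_VALUES = list(range(2, 11))


def build_grid():
    """Return (kernel, C, gamma, degree) tuples; hyperparameters a kernel does not use are 0."""
    # Assumed kernel-to-hyperparameter mapping (kernel names are hypothetical labels).
    grid = [("linear", c, 0, 0) for c in C_VALUES]
    grid += [("polynomial", c, 0, d) for c, d in product(C_VALUES, DEGREE_VALUES)]
    grid += [("gaussian", c, g, 0) for c, g in product(C_VALUES, GAMMA_VALUES)]
    return grid


grid = build_grid()
print(len(grid))  # 12 + 108 + 168 = 288
```
</p></div>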
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgments. The authors gratefully acknowledge the co-funding of their work by the German Research Foundation (Deutsche Forschungsgemeinschaft) under grant SCHM 2583/6-1.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Collaborative hyperparameter tuning</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bardenet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brendel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kégl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sebag</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th International Conference on Machine Learning (ICML-13)</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Dasgupta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>McAllester</surname></persName>
		</editor>
		<meeting>the 30th International Conference on Machine Learning (ICML-13)</meeting>
		<imprint>
			<date type="published" when="2013">May 2013</date>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="199" to="207" />
		</imprint>
	</monogr>
	<note>JMLR Workshop and Conference Proceedings</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Algorithms for hyper-parameter optimization</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Bergstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bardenet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kégl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Bartlett</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="2546" to="2554" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">LIBSVM: A library for support vector machines</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="http://www.csie.ntu.edu.tw/~cjlin/libsvm" />
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page">27</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An analysis of single-layer networks in unsupervised feature learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Coates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Gordon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Dunson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Dudík</surname></persName>
		</editor>
		<meeting>the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings</meeting>
		<imprint>
			<publisher>JMLR W&amp;CP</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="215" to="223" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Using meta-learning to initialize bayesian optimization of hyperparameters</title>
		<author>
			<persName><forename type="first">M</forename><surname>Feurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECAI workshop on Metalearning and Algorithm Selection (MetaSel)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Initializing bayesian hyperparameter optimization via meta-learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Feurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Springenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</title>
				<meeting>the Twenty-Ninth AAAI Conference on Artificial Intelligence<address><addrLine>Austin, Texas, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">January 25-30, 2015</date>
			<biblScope unit="page" from="1128" to="1135" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Combining meta-learning and search techniques to select parameters for support vector machines</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">A</forename><surname>Gomes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">B</forename><surname>Prudêncio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Soares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Carvalho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Brazilian Symposium on Neural Networks (SBRN 2010) International Conference on Hybrid Artificial Intelligence Systems</title>
				<meeting><address><addrLine>HAIS</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2010">2012. 2010</date>
			<biblScope unit="volume">75</biblScope>
			<biblScope unit="page" from="3" to="13" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Sequential model-based optimization for general algorithm configuration</title>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Hoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Leyton-Brown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th International Conference on Learning and Intelligent Optimization</title>
				<meeting>the 5th International Conference on Learning and Intelligent Optimization<address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="507" to="523" />
		</imprint>
	</monogr>
	<note>LION&apos;05</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Efficient global optimization of expensive black-box functions</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schonlau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Welch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. of Global Optimization</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="455" to="492" />
			<date type="published" when="1998-12">Dec 1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Selecting classification algorithms with active testing</title>
		<author>
			<persName><forename type="first">R</forename><surname>Leite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Brazdil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanschoren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Data Mining in Pattern Recognition</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">P</forename><surname>Perner</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">7376</biblScope>
			<biblScope unit="page" from="117" to="131" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m">Machine Learning, Neural and Statistical Classification</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Michie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Spiegelhalter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Taylor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Campbell</surname></persName>
		</editor>
		<meeting><address><addrLine>Upper Saddle River, NJ, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Ellis Horwood</publisher>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Meta-learning by landmarking various learning algorithms</title>
		<author>
			<persName><forename type="first">B</forename><surname>Pfahringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bensusan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Giraud-Carrier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventeenth International Conference on Machine Learning</title>
				<meeting>the Seventeenth International Conference on Machine Learning</meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="743" to="750" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A high-throughput screening approach to discovering good forms of biologically inspired visual representation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Doukhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>DiCarlo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Cox</surname></persName>
		</author>
		<idno type="PMID">19956750</idno>
	</analytic>
	<monogr>
		<title level="j">PLoS Computational Biology</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page">e1000579</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">E</forename><surname>Rasmussen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2005">2005</date>
			<publisher>The MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Meta-learning for evolutionary parameter optimization of classifiers</title>
		<author>
			<persName><forename type="first">M</forename><surname>Reif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shafait</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dengel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">87</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="357" to="380" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Hyperparameter Optimization with Factorized Multilayer Perceptrons</title>
		<author>
			<persName><forename type="first">N</forename><surname>Schilling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wistuba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Drumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt-Thieme</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015</title>
				<meeting><address><addrLine>Porto, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">September 7-11, 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Practical bayesian optimization of machine learning algorithms</title>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Adams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Burges</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="2951" to="2959" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Gaussian process optimization in the bandit setting: No regret and experimental design</title>
		<author>
			<persName><forename type="first">N</forename><surname>Srinivas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seeger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Kakade</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th International Conference on Machine Learning (ICML-10)</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Fürnkranz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</editor>
		<meeting>the 27th International Conference on Machine Learning (ICML-10)</meeting>
		<imprint>
			<publisher>Omnipress</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1015" to="1022" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Multi-task bayesian optimization</title>
		<author>
			<persName><forename type="first">K</forename><surname>Swersky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">P</forename><surname>Adams</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Burges</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Ghahramani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="2004" to="2012" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms</title>
		<author>
			<persName><forename type="first">C</forename><surname>Thornton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Hoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Leyton-Brown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="847" to="855" />
		</imprint>
	</monogr>
	<note>KDD &apos;13</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">An informational approach to the global optimization of expensive-to-evaluate functions</title>
		<author>
			<persName><forename type="first">J</forename><surname>Villemonteix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Vazquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Walter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Global Optimization</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="509" to="534" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Hyperparameter Search Space Pruning - A New Component for Sequential Model-Based Hyperparameter Optimization</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wistuba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schilling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt-Thieme</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015</title>
				<meeting><address><addrLine>Porto, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">September 7-11, 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Efficient transfer learning method for automatic hyperparameter tuning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yogatama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Statistics (AISTATS)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
