<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bayesian Optimization Combined with Incremental Evaluation for Neural Network Architecture Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Wistuba</string-name>
          <email>martin.wistuba@ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The choice of hyperparameters and the selection of algorithms are a crucial part of machine learning. Bayesian optimization methods and successive halving have been applied successfully to optimize hyperparameters automatically. We propose to combine both methods by estimating the initial population of incremental evaluation, our variation of successive halving, by means of Bayesian optimization. We apply the proposed methodology to the challenging problem of optimizing neural network architectures automatically and investigate how state-of-the-art hyperparameter optimization methods perform for this task. In our evaluation of these automatic methods, we are able to match human expert performance on the MNIST data set, but we are not able to achieve similarly good results for the CIFAR-10 data set. However, the automated methods find shallow convolutional neural networks that outperform human-crafted shallow neural networks with respect to classification error and training time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deep neural networks are powerful models that are able to extract features
automatically from complex data such as images, audio, and text. They have enabled
improvements to the state of the art in many domains such as image
classification [
        <xref ref-type="bibr" rid="ref10 ref18">10,18,30,39</xref>
        ], object detection [
        <xref ref-type="bibr" rid="ref8">8,27</xref>
        ], speech recognition [
        <xref ref-type="bibr" rid="ref4">4,23</xref>
        ] and machine
translation [35]. Rather than manually extracting features from the data, the
human task has shifted to creating neural network architectures and components
that are able to learn how to extract useful features. However, the creation of
neural networks requires a lot of domain expertise. Considering the domain of
image classification alone, hundreds of researchers have created numerous
components and architectures before reaching top performance on all the well-known
benchmark data sets. Therefore, an automatic way to find the optimal neural
network architecture would be extremely useful.
      </p>
      <p>Snoek et al. proposed to apply Bayesian optimization to tackle the
hyperparameter optimization problem automatically [31]. On numerous occasions
Bayesian optimization has provided better hyperparameter configurations than
human data scientists. Snoek et al. pushed the state of the art in deep learning to
new levels on the CIFAR-10 data set in 2012 [31] and again in 2015 for CIFAR-10
and CIFAR-100 [32]. However, a fixed, well-performing neural network
architecture was chosen and only its hyperparameters were optimized.</p>
      <p>
        Since deep learning is becoming more and more important in various
domains, it is not surprising that there are attempts to automate the search
for suitable neural network architectures [
        <xref ref-type="bibr" rid="ref1">1, 26, 46</xref>
        ]. Reinforcement learning and
evolutionary algorithms are currently the dominant methods and are able to
achieve state-of-the-art results.
      </p>
      <p>
        In this paper, we conduct a study to investigate whether the state of the
art for hyperparameter optimization can be applied to the task of neural
network architecture optimization. We propose incremental evaluation, a variation
of successive halving [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], with two different initialization methods: Bayesian
optimization and random search. We compare Bayesian optimization, incremental
evaluation and random search [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We evaluate all approaches on the two popular
image classification benchmark data sets MNIST and CIFAR-10.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Neural network architecture optimization has recently received a lot of attention
thanks to new automatic optimizers which achieve human-level performance. Zoph and
Le proposed to use a reinforcement learning method to learn optimal network
architectures [46]. They propose to learn a controller based on a recurrent neural
network which generates a sequence that encodes a neural network architecture.
Real et al. propose the use of evolutionary algorithms [26]. In contrast to the
typical operations of evolutionary algorithms, they do not use cross-over but
rely on mutations only. Both approaches achieve impressive results on CIFAR-10
but at the cost of high computation time. Zoph and Le use 800 GPUs in parallel
and evaluate 12,800 different architectures before they reach their best performing
neural network architecture. Real et al. use 250 parallel workers, each of them
for more than 256 hours.</p>
      <p>
        One of the current state-of-the-art methods for hyperparameter
optimization is Bayesian optimization [31]. Various extensions have been proposed to
address different important aspects such as new surrogate models [
        <xref ref-type="bibr" rid="ref11">11, 32</xref>
        ],
initializations [
        <xref ref-type="bibr" rid="ref7">7,42,43</xref>
        ], pruning [41], learning from samples [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], learning curves [
        <xref ref-type="bibr" rid="ref6">6,38</xref>
        ]
as well as related tasks [
        <xref ref-type="bibr" rid="ref2">2, 28, 37, 45</xref>
        ], or addressing particular problems [
        <xref ref-type="bibr" rid="ref22">22, 44</xref>
        ].
Recently, successive halving [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] has been proposed for the same task and was
further extended to Hyperband [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Klein et al. propose a similar extension to
Hyperband. Using a special Bayesian neural network for learning curve
prediction, they determine the initial population after a few random trials [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Swersky et al. propose to use a specific kernel for conditional parameter spaces
and apply it in Bayesian optimization for neural network optimization [36].
Mendoza et al. also apply Bayesian optimization for this task but replace the
surrogate model with a random forest [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem Definition</title>
      <p>We define the problem of finding the optimal neural network architecture as
a black-box optimization problem. Considering an arbitrary machine learning
task, the black-box function f is defined as
f : X → ℝ , (1)
which maps a network architecture x ∈ X to its loss on the validation data set
D_valid. Evaluating the function f at a point x involves training the neural network
encoded by x on the training data set D_train and evaluating it on the validation
data set D_valid. Hence, the network architecture
x* = arg min_{x ∈ X} f(x) (2)
is the optimal neural network architecture for our machine learning task.</p>
      <p>In the following, we will assume that we want to find the optimal network
architecture for an image classification task. Hence, we want to find the
convolutional neural network architecture which minimizes the classification error.</p>
    </sec>
    <sec id="sec-3a">
      <title>Combining Bayesian Optimization and Incremental Evaluation</title>
      <p>
        Bayesian optimization is an optimization method for black-box functions. It
consists of two components, a surrogate model and an acquisition function. The
surrogate model tries to approximate the function to be minimized and
additionally provides uncertainty estimations about its predictions. The acquisition
function uses the prediction of the surrogate model in order to estimate the
expected utility of a possible candidate. The candidate which maximizes the
expected utility is selected to be evaluated. In our experiments, we use a
Gaussian process with Matérn 5/2 kernel or a Random Forest [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] combined with the
expected improvement acquisition function [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Kernel hyperparameters are
optimized by maximizing the log-likelihood [25]. For more information and related
work, we refer the interested reader to the recent review by Shahriari et al. [29].
      </p>
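      <p>
        As an illustration, a minimal one-dimensional Bayesian optimization loop with a Matérn 5/2 Gaussian process surrogate and the expected improvement acquisition function might look as follows. This is a sketch under our own simplifications, not the paper's implementation: the kernel length scale is fixed rather than tuned by maximizing the log-likelihood, and the candidate pool is a random sample of the unit interval.

```python
import numpy as np
from math import erf, sqrt

def matern52(a, b, length_scale=0.3):
    """Matérn 5/2 kernel between two sets of 1-D points."""
    s = sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_predict(X, y, Xq, noise=1e-5):
    """Gaussian process posterior mean and standard deviation at query points Xq."""
    K = matern52(X, X) + noise * np.eye(len(X))
    Ks = matern52(X, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # prior variance is 1
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, y_best):
    """Expected improvement over the best observed loss (minimization)."""
    z = (y_best - mu) / sigma
    cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
    pdf = np.exp(-0.5 * z**2) / sqrt(2.0 * np.pi)
    return (y_best - mu) * cdf + sigma * pdf

def bayes_opt(f, n_init=4, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)            # random initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        cand = rng.uniform(0.0, 1.0, 256)        # random candidate pool on [0, 1]
        mu, sigma = gp_predict(X, y, cand)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

# toy black-box function with its minimum at x = 0.3
x_star, y_star = bayes_opt(lambda x: (x - 0.3) ** 2)
```

        In architecture search, the black-box f would train and evaluate the network encoded by x instead of evaluating a toy function.
      </p>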
      <p>
        Successive halving [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is a simple yet effective bandit-based approach
to optimize hyperparameters for algorithms with iterative learning procedures. It
starts with a set of random hyperparameter configurations. Every iteration has
the same budget, which is uniformly distributed among the remaining candidates in
order to continue the learning process. At the end of an iteration the population is
reduced by a factor r, keeping only the candidates with the smallest current loss.
Since the budget per iteration is fixed, the budget per remaining candidate keeps
increasing by a factor of 1/r.
      </p>
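      <p>
        The successive halving procedure described above can be sketched in Python. The function and variable names are our own, and the toy training routine stands in for an actual learning process:

```python
def successive_halving(configs, train_step, budget_per_round, r=0.5):
    """Each round splits the same fixed budget uniformly over the surviving
    candidates, then keeps only the best fraction r of them."""
    losses = {}
    population = list(configs)
    while len(population) > 1:
        steps = budget_per_round // len(population)  # uniform share of the fixed budget
        for c in population:
            losses[c] = train_step(c, steps)  # continue training c, return its current loss
        population.sort(key=lambda c: losses[c])
        population = population[:max(1, int(r * len(population)))]
    return population[0]

# toy stand-in: config c converges toward loss c as it receives more training steps
steps_done = {}
def toy_train(c, steps):
    steps_done[c] = steps_done.get(c, 0) + steps
    return c + 1.0 / steps_done[c]

best = successive_halving([0.1, 0.2, 0.5, 0.9], toy_train, budget_per_round=64)
```

        Because the per-round budget is constant while the population shrinks, each surviving candidate receives more and more training before the next cut.
      </p>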
      <p>We propose incremental evaluation, which is motivated by our observations
when training neural networks. We observed that it is difficult to distinguish
between good and very good architectures after a small training time because
deeper architectures need more time than shallow ones before providing good
predictions. Therefore, in contrast to successive halving, we propose to spend
more budget in the beginning and less later before deciding which candidates
to discard. This algorithm is outlined in Algorithm 1. Given a total budget B,
we choose a set of candidates based on an initialization strategy using an initial
budget b_init for each candidate. Then, we keep training all remaining
candidates, each with a budget of b, before reducing the set of candidates by a factor
of r. Thus, the budget spent per iteration reduces exponentially. We propose
two different initialization strategies: random search and Bayesian optimization.
Random search picks the network depth at random and then chooses the
hyperparameters of all layers at random. Each candidate network is trained for b_init
epochs. The initialization with Bayesian optimization is simply running Bayesian
optimization but limiting the number of epochs to b_init. We set b_init = 2, b = 1
and the reduction factor to r = 0.25.</p>
      <p>Algorithm 1 Incremental Evaluation
Input: Budget B, reduction factor r, budget per candidate for the initialization b_init
and for the later phase b, current loss l_i of candidate i. The function “initPopulation”
denotes the selection strategy for the initial population. Its first argument is the
population size and the second the budget spent for each candidate.</p>
      <p>Output: Best neural network architecture found.
1: P_0 ← initPopulation(⌊B / (b_init + b(1/(1−r) − 1))⌋, b_init)
2: for k = 0, 1, …, ⌈log_{1/r}(|P_0|)⌉ − 1 do
3:   Let π_k be a bijection on P_k such that l_{π_k(1)} ≤ … ≤ l_{π_k(|P_k|)}.
4:   P_{k+1} ← {i ∈ P_k | l_i ≤ l_{π_k(⌊r·|P_k|⌋)}}
5:   for i ∈ P_{k+1} do
6:     Continue training i for a budget of b.
7: return the remaining element in P_{⌈log_{1/r}(|P_0|)⌉}.</p>
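      <p>
        Under the assumption that the initialization strategy returns the candidates together with their current losses, Algorithm 1 can be sketched in Python as follows; the helper names and the toy stand-ins are our own, not the paper's:

```python
import math

def incremental_evaluation(B, init_population, continue_training, b_init=2, b=1, r=0.25):
    """Sketch of Algorithm 1: spend a large initial budget b_init per candidate,
    then repeatedly keep only the best fraction r and train each survivor b more epochs."""
    # largest population size n whose total cost n * (b_init + b * (1/(1-r) - 1)) fits into B
    n = int(B / (b_init + b * (1.0 / (1.0 - r) - 1.0)))
    population, losses = init_population(n, b_init)  # candidates plus their current losses
    for _ in range(math.ceil(math.log(n, 1.0 / r))):
        ranked = sorted(population, key=lambda i: losses[i])
        population = ranked[:max(1, int(r * len(population)))]  # discard all but the best
        for i in population:
            losses[i] = continue_training(i, b)  # each survivor gets b further epochs
    return min(population, key=lambda i: losses[i])

# toy stand-ins: candidate i has a fixed loss of i, so candidate 0 should win
def toy_init(n, b_init):
    return list(range(n)), {i: float(i) for i in range(n)}

best = incremental_evaluation(B=40, init_population=toy_init,
                              continue_training=lambda i, b: float(i))
```

        In contrast to successive halving, the total budget spent per round shrinks by the factor r, so most of the budget goes into the initial evaluation of the full population.
      </p>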
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Search Space</title>
        <p>
          We defined the search space X of neural network architectures as follows. Each
network has between one and five convolutional layers. For each convolutional
layer the search algorithm needs to choose a quadratic filter with dimensions
between three and ten and the number of filters within the interval [2, 128]. The
activation functions of the convolutional layers are rectified linear units [24].
Each convolutional layer is followed by a dropout layer [33] with dropout rates
between 0 and 0.9. There is an optional batch normalization layer [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] after
each convolutional layer. The stride is fixed to one and we use zero padding to
ensure the dimensionality of the feature maps is not reduced. Finally, there are
optional max pooling layers after each convolutional layers. Again, zero padding
is applied but for the max pooling layer after the first and last convolutional
layer in the neural network. All networks are flattened after the convolutional
part and a fully connected layer with number of nodes equal to the number of
classes is added. We follow the lead by Zoph and Le [46] and do not include
the hyperparameters of the optimizer into our search. We use the momentum
optimizer with a learning rate of 0.1, weight decay of 10 4, momentum of 0.9 and
Nesterov momentum [34] with a batch size of 50. The reason for this decision is
that we want to focus on finding the optimal architecture and avoid to increase
the search space even more. However, learning these hyperparameters is possible
and we will consider it for future work. Each network is trained for up to 20
epochs, early stopping is applied to stop a learning process early in case the loss
on the validation set did not improve for two epochs.
        </p>
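        <p>
          The random-search initialization over this search space can be sketched as follows; the dictionary field names are our own illustration, since the paper does not specify a concrete representation:

```python
import random

def sample_architecture(rng=random):
    """Draw one random CNN description from the search space described above."""
    layers = []
    for _ in range(rng.randint(1, 5)):  # one to five convolutional layers
        layers.append({
            "filter_size": rng.randint(3, 10),        # quadratic filter, 3x3 up to 10x10
            "num_filters": rng.randint(2, 128),       # number of filters in [2, 128]
            "dropout_rate": rng.uniform(0.0, 0.9),    # dropout after each conv layer
            "batch_norm": rng.choice([True, False]),  # optional batch normalization
            "max_pool": rng.choice([True, False]),    # optional max pooling
        })
    return layers

arch = sample_architecture()
```

          Stride, padding, activation functions, and the final fully connected layer are fixed as described above, so they need not appear in the sampled description.
        </p>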
        <p>
          Bayesian optimization expects as input a vector of constant size, even if the
network depth changes. We used the following encoding, similar to the one proposed
by Diaz et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We force all entries for layers that are unused to zero and add
another entry to the vector that encodes the depth of the network.
        </p>
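        <p>
          This zero-padded, fixed-length encoding might look as follows in Python; the field names and their ordering are our own assumptions, not the exact encoding of Diaz et al.:

```python
MAX_DEPTH = 5
N_FIELDS = 5  # filter size, number of filters, dropout rate, batch norm, max pooling

def encode(layers):
    """Flatten a variable-depth architecture into a constant-size vector:
    entries of unused layer slots are forced to zero, and the depth is appended."""
    vec = []
    for slot in range(MAX_DEPTH):
        if len(layers) > slot:
            layer = layers[slot]
            vec += [float(layer["filter_size"]), float(layer["num_filters"]),
                    layer["dropout_rate"], float(layer["batch_norm"]),
                    float(layer["max_pool"])]
        else:
            vec += [0.0] * N_FIELDS  # unused slot: all entries zero
    vec.append(float(len(layers)))   # extra entry encoding the network depth
    return vec

two_layer_net = [
    {"filter_size": 3, "num_filters": 32, "dropout_rate": 0.1, "batch_norm": True, "max_pool": False},
    {"filter_size": 5, "num_filters": 64, "dropout_rate": 0.2, "batch_norm": False, "max_pool": True},
]
vec = encode(two_layer_net)
```

          Every architecture thus maps to a vector of the same length, which the Gaussian process or random forest surrogate can consume directly.
        </p>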
        <p>
          We conduct experiments on the famous benchmark data sets MNIST [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]
and CIFAR-10 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Each optimizer has one hour for the MNIST data set and
six hours for the CIFAR-10 data set. We use the original train and test split
and create our own train/validation split using the original train split. Our train
split is used to train the networks and the validation split to evaluate their
performance. We report the mean test error of five repetitions.
The test error is estimated with the neural network that performed best on our
validation split. We do not use any data augmentation techniques.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Neural Network Architecture Optimizers</title>
        <p>
          In this work we compare six different neural network architecture optimizers.
Random search [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] selects neural network architectures at random. It may seem
to be a simplistic baseline; however, random search has provided good results,
especially in the domain of neural networks, where the number of hyperparameters
is large but many of them are insensitive to small changes. Furthermore, we
investigate the performance of the methods we discussed in Section 4: Bayesian
optimization with Gaussian process [31] or Random Forest [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and incremental
evaluation initialized with random search [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or Bayesian optimization.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Neural Network Architecture Search for MNIST</title>
        <p>
          MNIST is a 20-year-old digit recognition benchmark data set which has been
used by hundreds of researchers to evaluate their image classification methods.
The data set contains 60,000 28x28 grayscale training images of the 10 digits 0-9, along
with a test set of 10,000 images. The results are reported in Table 1. We see
no significant differences between the methods. It seems that one GPU hour is
already enough to find top performing networks and the solution is probably close
to the optimum within our search space. To set our results in context, the best
reported result on MNIST is an error of 0.21% [40]. This result is achieved using
a neural network architecture which is not an element of our search space. During
the automatic search we found an architecture with only two convolutional layers
which is capable of achieving a classification error of 0.59%. This convolutional
neural network can be trained in only 150 seconds. This result is comparable to
the fast-learning shallow convolutional neural network proposed by McDonnell
and Vladusich, which achieves an error of 0.5% in four minutes [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Our
training time is significantly shorter, at the cost of a slightly higher error.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Neural Network Architecture Search for CIFAR-10</title>
        <p>
          The CIFAR-10 data set is another popular image classification benchmark data
set which contains 50,000 32x32 color training images and 10,000 test images.
The results achieved by our investigated methods are reported in Table 1. The
best result achieved by a human is as low as 3.47% [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], clearly better than the
results we are able to achieve with the automatic methods. Actually, results
below 10% error are very common in recently published papers for this data
set. We identified two reasons for this result. First, our search space is very
restrictive, only allowing networks with a depth up to 5 convolutional layers and
without skip connections [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Second, we limited the search duration to only 6
GPU hours. Since we learn only shallow networks, we draw a comparison to the
work of McDonnell and Vladusich [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] again. Their best network achieved an
error of 24.14% in less than an hour, while our best network achieves an error of
22.81% with less than 25 minutes of training time. Thus, our training time is not
only shorter by a factor of two, but the network also achieves a smaller classification error.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We investigated how well state-of-the-art black-box optimization methods used
for hyperparameter optimization perform for the task of neural network
architecture search. We show that they are capable of reaching human-level
performance on the MNIST data set. The results on CIFAR-10 are currently
less promising, but we identified possible problems which we are going to
investigate in the future. Comparing the shallow networks found during our search to
human-crafted shallow neural networks, we see that the automatic search is able
to find neural network architectures which need less training time to achieve
the same or even better performance.</p>
      <p>We proposed incremental evaluation, a variation of successive halving, and
various initialization methods for it. While this idea is theoretically sound, our
experiments do not show any significant improvements.</p>
      <p>As future work, we plan to increase the search space significantly in order
to create deep architectures. Furthermore, we will increase the search time and
use multiple GPUs in parallel. Finally, we want to test our variation of successive
halving for the task of hyperparameter optimization in order to see whether a
significant improvement can be achieved for this task as well.</p>
      <p>23. Mohamed, A., Dahl, G.E., Hinton, G.E.: Acoustic modeling using deep belief
networks. IEEE Trans. Audio, Speech &amp; Language Processing 20(1), 14–22 (2012),
https://doi.org/10.1109/TASL.2011.2109382
24. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann
machines. In: Proceedings of the 27th International Conference on Machine
Learning (ICML-10), June 21-24, 2010, Haifa, Israel. pp. 807–814 (2010), http://www.
icml2010.org/papers/432.pdf
25. Rasmussen, C.E., Williams, C.K.I.: Gaussian processes for machine learning.
Adaptive computation and machine learning, MIT Press (2006), http://www.worldcat.
org/oclc/61285753
26. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Le, Q.V., Kurakin,
A.: Large-scale evolution of image classifiers. CoRR abs/1703.01041 (2017), http:
//arxiv.org/abs/1703.01041
27. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.
39(6), 1137–1149 (2017), https://doi.org/10.1109/TPAMI.2016.2577031
28. Schilling, N., Wistuba, M., Drumond, L., Schmidt-Thieme, L.: Hyperparameter
optimization with factorized multilayer perceptrons. In: Machine Learning and
Knowledge Discovery in Databases - European Conference, ECML PKDD 2015,
Porto, Portugal, September 7-11, 2015, Proceedings, Part II. pp. 87–103 (2015),
https://doi.org/10.1007/978-3-319-23525-7_6
29. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N.: Taking the
human out of the loop: A review of bayesian optimization. Proceedings of the
IEEE 104(1), 148–175 (2016), https://doi.org/10.1109/JPROC.2015.2494218
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2014), http://arxiv.org/abs/1409.1556
31. Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of
machine learning algorithms. In: Advances in Neural Information Processing
Systems 25: 26th Annual Conference on Neural Information Processing
Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe,
Nevada, United States. pp. 2960–2968 (2012), http://papers.nips.cc/paper/
4522-practical-bayesian-optimization-of-machine-learning-algorithms
32. Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N.,
Patwary, M.M.A., Prabhat, Adams, R.P.: Scalable bayesian optimization using deep
neural networks. In: Proceedings of the 32nd International Conference on
Machine Learning, ICML 2015, Lille, France, 6-11 July 2015. pp. 2171–2180 (2015),
http://jmlr.org/proceedings/papers/v37/snoek15.html
33. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research 15(1), 1929–1958 (2014), http://dl.acm.org/citation.
cfm?id=2670313
34. Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of
initialization and momentum in deep learning. In: Proceedings of the 30th International
Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013.
pp. 1139–1147 (2013), http://jmlr.org/proceedings/papers/v28/sutskever13.
html
35. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: Advances in Neural Information Processing Systems 27: Annual
Conference on Neural Information Processing Systems 2014, December 8-13 2014,
Montreal, Quebec, Canada. pp. 3104–3112 (2014), http://papers.nips.cc/paper/
5346-sequence-to-sequence-learning-with-neural-networks
36. Swersky, K., Duvenaud, D., Snoek, J., Hutter, F., Osborne, M.A.: Raiders of the
lost architecture: Kernels for bayesian optimization in conditional parameter spaces
(2014)
37. Swersky, K., Snoek, J., Adams, R.P.: Multi-task bayesian optimization. In:
Advances in Neural Information Processing Systems 26: 27th Annual Conference
on Neural Information Processing Systems 2013. Proceedings of a meeting held
December 5-8, 2013, Lake Tahoe, Nevada, United States. pp. 2004–2012 (2013),
http://papers.nips.cc/paper/5086-multi-task-bayesian-optimization
38. Swersky, K., Snoek, J., Adams, R.P.: Freeze-thaw bayesian optimization. CoRR
abs/1406.3896 (2014), http://arxiv.org/abs/1406.3896
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA,
June 7-12, 2015. pp. 1–9 (2015), https://doi.org/10.1109/CVPR.2015.7298594
40. Wan, L., Zeiler, M.D., Zhang, S., LeCun, Y., Fergus, R.: Regularization of neural
networks using dropconnect. In: Proceedings of the 30th International Conference
on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013. pp. 1058–
1066 (2013), http://jmlr.org/proceedings/papers/v28/wan13.html
41. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Hyperparameter search space
pruning - A new component for sequential model-based hyperparameter
optimization. In: Machine Learning and Knowledge Discovery in Databases - European
Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings,
Part II. pp. 104–119 (2015), https://doi.org/10.1007/978-3-319-23525-7_7
42. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Learning data set similarities for
hyperparameter optimization initializations. In: Proceedings of the 2015
International Workshop on Meta-Learning and Algorithm Selection co-located with
European Conference on Machine Learning and Principles and Practice of Knowledge
Discovery in Databases 2015 (ECMLPKDD 2015), Porto, Portugal, September 7th,
2015. pp. 15–26 (2015), http://ceur-ws.org/Vol-1455/paper-04.pdf
43. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Learning hyperparameter
optimization initializations. In: 2015 IEEE International Conference on Data Science
and Advanced Analytics, DSAA 2015, Campus des Cordeliers, Paris, France,
October 19-21, 2015. pp. 1–10 (2015), https://doi.org/10.1109/DSAA.2015.7344817
44. Wistuba, M., Schilling, N., Schmidt-Thieme, L.: Automatic frankensteining:
Creating complex ensembles autonomously. In: Proceedings of the 2017 SIAM
International Conference on Data Mining. pp. 741–749. Society for Industrial and Applied
Mathematics (2017)
45. Yogatama, D., Mann, G.: Efficient transfer learning method for automatic
hyperparameter tuning. In: Proceedings of the Seventeenth International Conference
on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April
22-25, 2014. pp. 1077–1085 (2014), http://jmlr.org/proceedings/papers/v33/
yogatama14.html
46. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. CoRR
abs/1611.01578 (2016), http://arxiv.org/abs/1611.01578</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naik</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raskar</surname>
          </string-name>
          , R.:
          <article-title>Designing neural network architectures using reinforcement learning</article-title>
          .
          <source>CoRR abs/1611</source>
          .02167 (
          <year>2016</year>
          ), http://arxiv.org/ abs/1611.02167
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bardenet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brendel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kégl</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Collaborative hyperparameter tuning</article-title>
          .
          <source>In: Proceedings of the 30th International Conference on Machine Learning</source>
          ,
          <string-name>
            <surname>ICML</surname>
          </string-name>
          <year>2013</year>
          ,
          <article-title>Atlanta</article-title>
          ,
          <string-name>
            <surname>GA</surname>
          </string-name>
          , USA,
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          June 2013. pp.
          <fpage>199</fpage>
          -
          <lpage>207</lpage>
          (
          <year>2013</year>
          ), http:// jmlr.org/proceedings/papers/v28/bardenet13.html
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>13</volume>
          ,
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          (
          <year>2012</year>
          ), http://dl.acm.org/citation. cfm?id=
          <fpage>2188395</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition</article-title>
          .
          <source>IEEE Trans. Audio, Speech &amp; Language Processing</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <fpage>30</fpage>
          -
          <lpage>42</lpage>
          (
          <year>2012</year>
          ), https://doi.org/10.1109/TASL.
          <year>2011</year>
          . 2134090
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fokoue</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nannicini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samulowitz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An effective algorithm for hyperparameter optimization of neural networks</article-title>
          .
          <source>CoRR abs/1705.08520</source>
          (
          <year>2017</year>
          ), http://arxiv.org/abs/1705.08520
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Domhan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves</article-title>
          .
          <source>In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI</source>
          <year>2015</year>
          , Buenos Aires, Argentina, July 25-31,
          <year>2015</year>
          . pp.
          <fpage>3460</fpage>
          -
          <lpage>3468</lpage>
          (
          <year>2015</year>
          ), http://ijcai.org/Abstract/15/487
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Initializing Bayesian hyperparameter optimization via meta-learning</article-title>
          .
          <source>In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30</source>
          ,
          <year>2015</year>
          , Austin, Texas, USA. pp.
          <fpage>1128</fpage>
          -
          <lpage>1135</lpage>
          (
          <year>2015</year>
          ), http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/10029
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR</source>
          <year>2014</year>
          , Columbus, OH, USA, June 23-28,
          <year>2014</year>
          . pp.
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          (
          <year>2014</year>
          ), https://doi.org/10.1109/CVPR.2014.81
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Fractional max-pooling</article-title>
          .
          <source>CoRR abs/1412.6071</source>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1412.6071
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR</source>
          <year>2016</year>
          , Las Vegas, NV, USA, June 27-30,
          <year>2016</year>
          . pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          (
          <year>2016</year>
          ), https://doi.org/10.1109/CVPR.2016.90
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Sequential model-based optimization for general algorithm configuration</article-title>
          .
          <source>In: Learning and Intelligent Optimization - 5th International Conference, LION 5</source>
          , Rome, Italy, January 17-21,
          <year>2011</year>
          . Selected Papers. pp.
          <fpage>507</fpage>
          -
          <lpage>523</lpage>
          (
          <year>2011</year>
          ), https://doi.org/10.1007/978-3-642-25566-3_40
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>In: Proceedings of the 32nd International Conference on Machine Learning</source>
          , ICML
          <year>2015</year>
          , Lille, France, 6-11 July 2015
          . pp.
          <fpage>448</fpage>
          -
          <lpage>456</lpage>
          (
          <year>2015</year>
          ), http://jmlr.org/proceedings/papers/v37/ioffe15.html
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jamieson</surname>
            ,
            <given-names>K.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwalkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Non-stochastic best arm identification and hyperparameter optimization</article-title>
          .
          <source>In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics</source>
          , AISTATS
          <year>2016</year>
          , Cadiz, Spain, May 9-11,
          <year>2016</year>
          . pp.
          <fpage>240</fpage>
          -
          <lpage>248</lpage>
          (
          <year>2016</year>
          ), http://jmlr.org/proceedings/papers/v51/jamieson16.html
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schonlau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welch</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          :
          <article-title>Efficient global optimization of expensive black-box functions</article-title>
          .
          <source>J. Global Optimization</source>
          <volume>13</volume>
          (
          <issue>4</issue>
          ),
          <fpage>455</fpage>
          -
          <lpage>492</lpage>
          (
          <year>1998</year>
          ), https://doi.org/10.1023/A:1008306431147
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falkner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartels</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hennig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Fast Bayesian optimization of machine learning hyperparameters on large datasets</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics</source>
          , AISTATS
          <year>2017</year>
          , 20-22 April 2017, Fort Lauderdale, FL, USA. pp.
          <fpage>528</fpage>
          -
          <lpage>536</lpage>
          (
          <year>2017</year>
          ), http://proceedings.mlr.press/v54/klein17a.html
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falkner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Learning curve prediction with Bayesian neural networks</article-title>
          .
          <source>In: Proceedings of the International Conference on Learning Representations (ICLR)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning multiple layers of features from tiny images</article-title>
          .
          <source>Tech. rep.</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>Commun. ACM</source>
          <volume>60</volume>
          (
          <issue>6</issue>
          ),
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          (
          <year>2017</year>
          ), http://doi.acm.org/10.1145/3065386
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haffner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Gradient-based learning applied to document recognition</article-title>
          .
          <source>In: Proceedings of the IEEE</source>
          . pp.
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jamieson</surname>
            ,
            <given-names>K.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeSalvo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rostamizadeh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwalkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Hyperband: A novel bandit-based approach to hyperparameter optimization</article-title>
          .
          <source>CoRR abs/1603.06560</source>
          (
          <year>2016</year>
          ), http://arxiv.org/abs/1603.06560
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>McDonnell</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vladusich</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enhanced image classification with a fast-learning shallow convolutional neural network</article-title>
          .
          <source>In: 2015 International Joint Conference on Neural Networks, IJCNN</source>
          <year>2015</year>
          , Killarney, Ireland, July 12-17,
          <year>2015</year>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2015</year>
          ), https://doi.org/10.1109/IJCNN.2015.7280796
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mendoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feurer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Towards automatically-tuned neural networks</article-title>
          .
          <source>In: Proceedings of the 2016 Workshop on Automatic Machine Learning</source>
          , AutoML
          <year>2016</year>
          , co-located with 33rd International Conference on Machine Learning (ICML 2016), New York City, NY, USA, June 24,
          <year>2016</year>
          . pp.
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          (
          <year>2016</year>
          ), http://jmlr.org/proceedings/papers/v64/mendoza_towards_2016.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>