<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automated Selection of Covariance Function for Gaussian Process Surrogate Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jakub Repický</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zbyněk Pitra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <email>martin@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics, Charles University in Prague</institution>
          <addr-line>Malostranské nám. 25, 118 00 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Nuclear Sciences and Physical Engineering</institution>
          ,
          <addr-line>CTU in Prague, Břehová 7, 115 19 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Pod Vodárenskou věží 2, 182 07 Prague 8</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2203</volume>
      <fpage>64</fpage>
      <lpage>71</lpage>
      <abstract>
        <p>Gaussian processes have a long tradition in model-based algorithms for black-box optimization, where a limited number of objective function evaluations are available. A principal choice in specifying a Gaussian process model is the choice of the covariance function, which largely embodies the prior assumptions about the modeled function. Several methods for learning the form of covariance function have been proposed. We report a work in progress in which the covariance function is selected from a fixed set. The goal of covariance function selection is to capture non-local properties of the objective function and derive a more accurate surrogate model. The model-selection algorithm is evaluated in connection with Doubly Trained Surrogate Covariance Matrix Adaptation Evolution Strategy on the Comparing Continuous Optimizers framework. Several estimates of predictive performance, including cross-validation and information criteria, are discussed. Focus is placed on information criteria suitable for nonparametric methods, and two of them are compared experimentally.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Continuous black-box optimization is concerned with
finding extrema of a real-parameter objective function
whose analytical definition is not known. Such
functions, often arising, e. g., in engineering design
optimization or materials science, can only be evaluated empirically
or through simulations. Moreover, obtaining function
values may be expensive and affected by noise. The goal
of finding a global optimum is usually relaxed in favor
of finding a good enough solution within as few objective
function evaluations as possible.</p>
      <p>
        Evolution strategies, stochastic population-based
algorithms inspired by the process of natural evolution, present
a popular approach to continuous black-box
optimization. The Covariance Matrix Adaptation Evolution
Strategy (CMA-ES) [
        <xref ref-type="bibr" rid="ref11 ref14">10, 13</xref>
        ] is based on adaptation of the key
component of the mutation operator (the covariance
matrix) according to the historical search steps. The CMA-ES
is considered a state-of-the-art continuous black-box
optimizer. Nevertheless, considerable improvements in terms
of the number of fitness evaluations can be achieved by
use of surrogate models, i. e., statistical or machine
learning models of the fitness trained on data gathered during
the optimization.
      </p>
      <p>
        A variety of models for the CMA-ES has been
investigated, including but not limited to quadratic
approximations [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ], ranking support vector machines [
        <xref ref-type="bibr" rid="ref17">16</xref>
        ], random
forests [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Gaussian processes (GPs) [
        <xref ref-type="bibr" rid="ref21 ref27 ref4">4, 19, 25</xref>
        ].1
      </p>
      <p>
        Gaussian process (GP) regression is a nonparametric
method, meaning the data are assumed to be generated
from an infinite-dimensional distribution, i. e., a
distribution of functions. In black-box optimization, the
distribution of function values conditioned on observed data can
be used to derive a criterion for selecting the most
promising points for evaluation with the (expensive) fitness. As
far as we know, the first optimization method utilizing
uncertainty modeled by GPs is Bayesian optimization [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ].
In this paper, we build upon the more recent
Adaptive Doubly Trained Surrogate CMA-ES
(aDTS-CMA-ES), which uses a Gaussian process surrogate
model for the CMA-ES, although our approach is directly
applicable to Bayesian optimization as well.
      </p>
      <p>A Gaussian process is fully specified by a mean
function and a covariance function, each parametrized by a small
number of parameters. In order to distinguish the parameters
of the mean and covariance functions from the
infinite-dimensional parameter vector – the vector of function
values – they are referred to as hyperparameters. In
statistical works, the mean and covariance functions are chosen
by the statistician in a cycle of model building and model
checking.</p>
      <p>The goal of this work is to lay out a suitable method
for learning the form of the covariance function for Gaussian
processes in black-box optimization, with a focus on criteria
for evaluating candidate covariance functions. The main
hypothesis behind this paper is that a GP with a composite
form of its covariance function may result in a more
accurate approximation of the objective function and,
consequently, better performance of the model-assisted
optimization algorithm.</p>
      <p>
        1 An experimental comparison of selected surrogate-assisted variants
of the CMA-ES can be found in [
        <xref ref-type="bibr" rid="ref22 ref3">3, 20</xref>
        ].
      </p>
      <p>
        Related Work. Learning a composite expression of kernel
functions for support vector machines by genetic
programming was explored in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Hierarchical kernel learning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Additive Gaussian
processes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are algorithms for determining kernels
composed of lower-dimensional kernels.
      </p>
      <p>
        The goal of Automatic Statistician project [
        <xref ref-type="bibr" rid="ref16">15</xref>
        ] is
automatic statistical analysis of given data with output in
natural language. The algorithm of structure discovery in GP
models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a greedy search in the space of composite
covariance functions generated by operators of addition
and multiplication recursively applied to basis covariance
functions.
      </p>
      <p>To our knowledge, structure discovery in GP
surrogate models for continuous black-box optimization has not
yet been investigated. As a first step towards this goal,
we perform selection of the best GP model from a model
population that we tried to design large enough to
capture the structure of typical continuous black-box functions, but
still small enough for model selection to be
computationally feasible.</p>
      <p>The paper is organized as follows. Section 2 presents
the ideas behind surrogate models in evolutionary
optimization and the aDTS-CMA-ES algorithm. Section 3 describes
inference and learning in Gaussian process regression
models. Section 4 presents the algorithm for selecting the
best GP surrogate model. First results from an early stage
of experimental evaluation are presented in Section 5.
Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Surrogate-Assisted Evolutionary Optimization</title>
      <p>Evolution strategies are stochastic search algorithms
based on maintaining a population of candidate solutions,
usually encoded as real vectors. In each iteration
(generation), a population of λ offspring is generated from a
population of μ parents by the operators of recombination and
mutation. The new population of parents is selected either
from the union of offspring and parents (plus selection),
or, provided that μ ≤ λ, from the offspring exclusively
(comma selection).</p>
      <sec id="sec-3-0">
        <title>2.1 CMA-ES</title>
        <p>Mutation in evolution strategies is usually implemented
by sampling from a Gaussian distribution whose parameters
play a crucial role in the algorithm's convergence. The
main idea behind the CMA-ES is self-adaptation of the
mutation parameters, especially of the covariance matrix. The
CMA-ES repeatedly samples from N(m, σ²C) and
updates the parameters σ² (overall step-size), m (the mean) and
C (the covariance matrix) so that the likelihood of successful
mutation steps increases under the new parametrization.</p>
        <p>Algorithm 1 aDTS-CMA-ES
Input: λ (population size), y_target (target value),
f (original fitness function), α (ratio of
original-evaluated points), C (uncertainty criterion)
1: σ, m, C ← CMA-ES initialize
2: A ← ∅ {archive initialization}
3: while stopping conditions not met do
4:   {x_k}_{k=1}^λ ∼ N(m, σ²C) {CMA-ES sampling}
5:   f_M1 ← trainModel(A, σ, m, C) {model training}
6:   (ŷ, s²) ← f_M1([x_1, . . . , x_λ]) {model evaluation}
7:   X_orig ← select ⌈αλ⌉ best points according to C(ŷ, s²)
8:   y_orig ← f(X_orig) {original fitness evaluation}
9:   A ← A ∪ {(X_orig, y_orig)} {archive update}
10:  f_M2 ← trainModel(A, σ, m, C) {model retraining}
11:  y ← f_M2([x_1, . . . , x_λ]) {2nd model prediction}
12:  (y)_i ← y_orig,i for all original-evaluated y_orig,i ∈ y_orig
13:  α ← selfAdaptation(y, ŷ)
14:  σ, m, C ← CMA-ES update
15: end while
16: x_res ← x_k from A where y_k is minimal
Output: x_res (point with minimal y)</p>
      </sec>
      <sec id="sec-3-1">
        <title>2.2 aDTS-CMA-ES</title>
        <p>
          The aDTS-CMA-ES [
          <xref ref-type="bibr" rid="ref21 ref23 ref3">3, 19, 21</xref>
          ] utilizes a GP surrogate
model to estimate the fitness of a fraction of the
population. A pseudocode is given in Algorithm 1. The
algorithm expects an uncertainty criterion C for choosing
solutions for re-evaluation. In optimization based on
Gaussian processes, such criteria are conveniently defined on
the marginal GP posterior, which is a univariate
Gaussian distribution. One of the most prominent uncertainty
criteria is the probability of improvement, CPOI(x; T ) =
Pr( f (x) ≤ T ), i. e., the posterior probability that the
function value at a candidate solution x improves on a chosen
target T , typically set to the historically best fitness value.
        </p>
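        <p>For illustration, the probability of improvement is just a Gaussian CDF evaluated at the target; a minimal sketch (ours, not the paper's MATLAB implementation; the function name is our choice):</p>

```python
import math

def probability_of_improvement(mean, std, target):
    """CPOI(x; T) = Pr(f(x) ≤ T) under the marginal posterior N(mean, std^2)."""
    z = (target - mean) / std
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A candidate whose posterior mean equals the target improves with probability 1/2.
print(probability_of_improvement(mean=1.0, std=0.5, target=1.0))  # 0.5
```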
        <p>The sampling in aDTS-CMA-ES is identical to that of
CMA-ES. The surrogate model is trained twice per
generation. The first model is trained on a data set, which
naturally cannot contain any individuals from the current
population. A fraction α of the population is selected
according to C , evaluated with the (expensive) fitness function
and included into the archive of individuals with known
fitness values. The model is retrained and used to predict
the remainder of the population. The fraction α is adapted
according to the surrogate model's performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Gaussian Processes</title>
      <p>Let X be some input space of dimensionality D. A
Gaussian process with a mean function μ : X → R and a
covariance function k : X × X → R is a collection of
random variables (f(x))x∈X such that every finite-dimensional
marginal (f(x_i))_{i=1}^N follows a multivariate Gaussian
distribution N(μ(X), K(X, X)), where μ(X) = (μ(x_i))_{i=1}^N and
K(X, X) = (k(x_i, x_j))_{i,j=1}^N. Both μ and k are
parameterized, but we omit their parameters for the sake of
readability.</p>
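      <p>The defining property – every finite marginal is a multivariate Gaussian – can be made concrete by sampling one such marginal. A minimal Python sketch (our illustration, using a squared exponential covariance with unit parameters):</p>

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0, sf2=1.0):
    """Squared exponential covariance k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

# A finite marginal of a zero-mean GP at inputs X is N(0, K(X, X)).
rng = np.random.default_rng(0)
X = np.linspace(-5.0, 5.0, 50).reshape(-1, 1)          # N = 50 inputs, D = 1
K = sq_exp_kernel(X, X)
f = rng.multivariate_normal(np.zeros(50), K + 1e-10 * np.eye(50))  # one sample path
```

      <p>The small jitter term 1e-10 · I keeps the covariance numerically positive definite.</p>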
      <sec id="sec-4-1">
        <title>3.1 Inference</title>
        <p>
          Let y = {y_1, . . . , y_N} be N i. i. d. observations at inputs
X = {x_1, . . . , x_N}. A model with Gaussian likelihood and
GP prior is given by the distributions y | f ∼ N(f, σn²IN) and
f | X ∼ N(μ(X), K(X, X)). From now on, we assume μ =
0; deterministic non-zero mean functions can be used by
simply subtracting them from y (see [
          <xref ref-type="bibr" rid="ref24">22</xref>
          ] for more on this). Let
us denote by θ the vector of hyperparameters, consisting
of the parameters of k and the noise variance σn².
        </p>
        <p>
          The marginal likelihood of hyperparameters θ is (see
[
          <xref ref-type="bibr" rid="ref24">22</xref>
          ])
        </p>
        <p>p(y | X, θ) = ∫ p(y | X, f, θ) p(f | θ) df = φ(y | 0, K(X, X) + σn²IN),   (1)
where φ denotes the normalized multivariate Gaussian
density.</p>
        <p>In the regression problem, we are interested in the
conditional distribution f∗ | y, X, X∗, θ for X∗ a set of N∗ test
inputs. Since [yᵀ f∗ᵀ]ᵀ | X, X∗, θ follows a multivariate
Gaussian distribution (2), the distribution of f∗ | y, X, X∗, θ is also a
multivariate Gaussian, in particular
f∗ ∼ N(f̂∗, cov(f∗)),   (3)
where
f̂∗ = K(X∗, X)[K(X, X) + σn²IN]⁻¹ y,   (4)
cov(f∗) = K(X∗, X∗) − K(X∗, X)[K(X, X) + σn²IN]⁻¹ K(X, X∗).   (5)</p>
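        <p>The predictive mean and covariance above amount to a few lines of linear algebra. A hedged sketch using a Cholesky factorization (ours, not the paper's MATLAB code; the helper kernel and all names are illustrative):</p>

```python
import numpy as np

def gp_posterior(K, K_star, K_ss, y, sigma2_n):
    """Posterior mean K*(K+σn²I)⁻¹y and covariance K** − K*(K+σn²I)⁻¹K*ᵀ
    of the latent values f* at test inputs."""
    N = K.shape[0]
    L = np.linalg.cholesky(K + sigma2_n * np.eye(N))   # the O(N³) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, K_star.T)
    return K_star @ alpha, K_ss - V.T @ V

# Illustrative check: with a tiny noise variance, predicting back at the
# training inputs essentially interpolates the observations.
def k_se(A, B):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.0])
mean, cov = gp_posterior(k_se(X, X), k_se(X, X), k_se(X, X), y, 1e-8)
```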
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Hierarchical Model</title>
        <p>When the covariance function family is given, model
selection for GP regression is usually performed
by the maximum marginal likelihood estimate θ̂ML =
arg maxθ log p(y | X, θ), which is a non-convex
optimization problem. Computation of the log marginal likelihood
takes O(N³) time due to a Cholesky decomposition of the
covariance matrix K(X, X).</p>
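        <p>Hyperparameter fitting by maximum marginal likelihood needs the quantity log p(y | X, θ). A sketch of the standard Cholesky-based computation (ours, not the paper's code):</p>

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, sigma2_n):
    """log p(y | X, θ) = −½ yᵀ(K+σn²I)⁻¹y − ½ log det(K+σn²I) − (N/2) log 2π."""
    N = len(y)
    L = np.linalg.cholesky(K + sigma2_n * np.eye(N))   # the O(N³) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2.0 * np.log(np.diag(L)).sum()           # log det via the factor L
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2.0 * np.pi)
```

        <p>For K = I, σn² = 0 and y = 0 this reduces to the log density of a standard multivariate normal at the origin.</p>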
        <p>From a Bayesian perspective, especially if the number
of hyperparameters is large or if N is small, it might be
more appropriate to do inference with the marginal
posterior distribution of the hyperparameters
p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X),   (6)
where p(y | X, θ) is the marginal likelihood (1), now
playing the role of the likelihood, and p(θ) is a hyper-prior.
Simulations from p(θ | X, y) can be obtained by Bayesian
computation methods, such as Markov chain Monte Carlo.</p>
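        <p>Simulation from p(θ | X, y) only needs the unnormalized log posterior log p(y | X, θ) + log p(θ). A plain random-walk Metropolis sketch (ours; the paper uses an adaptive-proposal variant [9], and the toy target below is purely illustrative):</p>

```python
import numpy as np

def metropolis(log_post, theta0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis sampler for p(θ | X, y) ∝ p(y | X, θ) p(θ)."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    draws = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if lp_prop - lp > np.log(rng.random()):
            theta, lp = prop, lp_prop
        draws.append(theta.copy())
    return np.array(draws)

# Toy target: a standard normal posterior over a single hyperparameter.
draws = metropolis(lambda t: -0.5 * (t ** 2).sum(), theta0=[0.0], n_samples=4000)
theta_bayes = np.median(draws, axis=0)   # the componentwise posterior median
```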
        <p>Uncertainty criteria in Algorithm 1 can thus
incorporate uncertainty of hyperparameter estimation in addition
to uncertainty about functions. In the current stage of
research, we compute the prediction conditioned on a Bayes
estimate θ̂Bayes = median({θs : s = 1, . . . , S}), i. e., the
median of the posterior sample.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.2 Performance Criteria</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Model Selection</title>
      <p>
        If the probability of the true fitness function under GP
prior is low, the performance of the model will be poor.
For example, a GP with a neural network covariance fits
data from a jump function better compared to a GP with a
squared exponential [
        <xref ref-type="bibr" rid="ref24">22</xref>
        ] (more on covariance functions in
Subsection 4.1). Searching over GP models with different
covariances thus can be viewed as an automated
construction of suitable priors. We select the model from a finite set
according to a criterion of predictive performance, since
this approach can easily be embedded into a combinatorial
search algorithm, such as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. GPs can represent
random functions. The finite population of models included
in our approach is described in Subsection 4.1. Some
important classes of functions, such as linear and quadratic
functions, neural networks and additive functions, are
represented.
      </p>
      <sec id="sec-5-1">
        <title>4.1 Model Population</title>
        <p>The set of candidate GP models is shown in Figure 1. All
models have zero mean.</p>
        <p>
          A covariance function k(x, x′) is stationary if it is a
function of the distance ‖x − x′‖. The squared exponential
(SE) [
          <xref ref-type="bibr" rid="ref24">22</xref>
          ] is a stationary covariance function that leads to
smooth processes [
          <xref ref-type="bibr" rid="ref24">22</xref>
          ].
        </p>
        <p>
          A neural network (NN) covariance is a covariance of a
GP induced by a Gaussian prior on weights of an infinitely
wide neural network [
          <xref ref-type="bibr" rid="ref19">18</xref>
          ].
        </p>
        <p>A dot product with a constant bias term models linear
functions. The quadratic covariance is the square of such a linear
covariance. GPs with these covariances lead to Bayesian
variants of linear and quadratic regression, respectively.</p>
        <p>
          Additive covariance functions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] are sums of
lower-dimensional components. We include an additive
covariance function with a single degree of interaction – a
superposition of one-dimensional squared exponentials.
        </p>
        <p>Finally, we consider two cases of composite covariance
functions: a sum of a squared exponential and a neural
network; and a sum of a squared exponential and a quadratic.
We would like to select the surrogate model based on an
estimation of out-of-sample predictive accuracy.</p>
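        <p>Since a sum of covariance functions is again a covariance function, the composite candidates can be assembled mechanically. An illustrative sketch (ours; parametrizations simplified, e.g., unit length-scales):</p>

```python
import numpy as np

def k_se(A, B):
    """Squared exponential (unit length-scale)."""
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def k_lin(A, B, bias=1.0):
    """Dot product with a constant bias term (linear functions)."""
    return A @ B.T + bias

def k_quad(A, B, bias=1.0):
    """The linear covariance squared (quadratic functions)."""
    return k_lin(A, B, bias) ** 2

def k_sum(*kernels):
    """A sum of covariance functions is itself a valid covariance function."""
    return lambda A, B: sum(k(A, B) for k in kernels)

se_plus_quad = k_sum(k_se, k_quad)   # one of the two composite candidates
X = np.random.default_rng(1).normal(size=(6, 2))
K = se_plus_quad(X, X)
```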
        <p>An attractive estimate of the out-of-sample predictive
accuracy is cross-validation based on some partitioning of
the data set into multiple data sets called folds. However,
choosing among multiple GP models by cross-validation
in each generation of the evolutionary optimization can
be considered prohibitive from the computational
perspective.</p>
        <p>[Figure 1: samples from the candidate GP priors with the SE, NN, LIN, QUAD, ADD, SE+NN and SE+QUAD covariance functions.]</p>
        <p>
          In the remainder of this subsection we follow the
exposition of model comparison from Bayesian perspective
given in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We denote by q the true distribution from
which data y are sampled and we suppress conditioning
on X for simplicity.
        </p>
        <p>A general measure of fit of a probabilistic model
to data y is the log likelihood or log predictive
density log p(y | θ) = log ∏_{i=1}^N p(y_i | θ). The quantity
−2 log p(y | θ) is called the deviance.</p>
        <p>
          Akaike information criterion (AIC) [
          <xref ref-type="bibr" rid="ref1 ref9">1</xref>
          ] and related
Bayes information criterion (BIC) [
          <xref ref-type="bibr" rid="ref25">23</xref>
          ] are based on the
expected log predictive density conditioned on a
maximum likelihood estimate θ̂ML,
elpd_θ̂ = E_q(log p(ỹ | θ̂ML)),   (7)
where the expectation is taken over all possible data sets
ỹ. Since the expectation (7) cannot be computed exactly, it is
estimated from the sample y. The AIC and BIC compensate
for the bias towards overfitting by subtracting a correction
term: the number of parameters nθ and ½ nθ log N,
respectively.
        </p>
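          <p>On the usual deviance scale, the two criteria read AIC = −2 log p(y | θ̂ML) + 2 nθ and BIC = −2 log p(y | θ̂ML) + nθ log N; a trivial sketch (ours):</p>

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion on the deviance scale."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_data):
    """Bayes information criterion; the penalty grows with the sample size N."""
    return -2.0 * log_lik + n_params * math.log(n_data)
```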
        <p>
          For hierarchical Bayesian models, such as (6), it is not
always entirely clear what the parameters of the model
are, since the likelihood can factorize in different ways.
The deviance information criterion (DIC) [
          <xref ref-type="bibr" rid="ref26">24</xref>
          ] is still
based on deviance, conditioned on a Bayes estimate θˆBayes,
but the effective number of parameters pDIC depends on
data. We define the DIC for the marginal likelihood (1),
focusing on hyperparameters θ , although it could be defined
for the likelihood p(y | f , θ ), focusing on both f and θ .
        </p>
        <p>
          We use the following definition of the effective number
of parameters (see [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]):
        </p>
        <p>pDIC = 2 varpost(log p(y | θ)),
which can be estimated by the sample variance of a
posterior sample. Using the effective number of parameters, the
DIC is</p>
        <p>DIC = −2 log p(y | θ̂Bayes) + 2 pDIC.</p>
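        <p>Given a posterior sample θ1, . . . , θS, both quantities are one-liners; a sketch (ours):</p>

```python
import numpy as np

def dic(log_lik_at_bayes, log_lik_draws):
    """DIC = −2 log p(y | θ̂Bayes) + 2 pDIC, with pDIC = 2 var_post(log p(y | θ))
    estimated by the sample variance over the posterior draws."""
    p_dic = 2.0 * np.var(log_lik_draws, ddof=1)
    return -2.0 * log_lik_at_bayes + 2.0 * p_dic
```

        <p>With a degenerate posterior sample the penalty vanishes and the DIC reduces to the deviance at θ̂Bayes.</p>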
        <p>
          A probabilistic model is called regular if its
parameters are identifiable and its Fisher information matrix is
positive definite for all parameter values. The model is
called singular otherwise. The information criteria defined
above assume regularity. The widely applicable
information criterion (WAIC) [
          <xref ref-type="bibr" rid="ref28">26</xref>
          ] also works for singular models. The
WAIC is based on estimation of the expected log
pointwise predictive density
        </p>
        <p>elppd = ∑_{i=1}^N E_q(log ppost(ỹ_i)) = ∑_{i=1}^N E_q(log ∫ p(ỹ_i | y, θ) p(θ | y) dθ).</p>
        <p>
          The estimation of elppd from the sample is biased, so
again, an effective number of parameters must be added as
a correction. We use the following definition of the WAIC
(see [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]):
        </p>
        <p>WAIC = −∑_{i=1}^N log ppost(y_i) + ∑_{i=1}^N varpost(log p(y_i | θ)),
that is, the negative log pointwise predictive density
corrected for bias by the pointwise posterior variance of the log
predictive density.</p>
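        <p>With an S × N matrix of pointwise log predictive densities log p(y_i | θ_s) obtained from the posterior sample, the WAIC can be computed as follows (a sketch, ours):</p>

```python
import numpy as np

def waic(log_lik):
    """WAIC from an S×N matrix of log p(y_i | θ_s) over S posterior draws:
    −Σ_i log((1/S) Σ_s p(y_i | θ_s)) + Σ_i var_s(log p(y_i | θ_s))."""
    m = log_lik.max(axis=0)
    lppd_i = m + np.log(np.exp(log_lik - m).mean(axis=0))  # stable log-mean-exp
    p_waic_i = log_lik.var(axis=0, ddof=1)                 # pointwise penalty
    return -lppd_i.sum() + p_waic_i.sum()
```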
        <p>The pointwise predictive density ppost(y_i | y, θ) for the
GP model (1) is computed by integrating the Gaussian
likelihood over the marginal posterior GP at the i-th training point:
p(y_i | y, θ) = ∫ p(y_i | y, f_i, θ) p(f_i | y, θ) df_i = φ(y_i | f̂_i, σn² + var(f_i)),
where φ denotes the Gaussian density and f̂_i, var(f_i) are
as in (3).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Experimental Evaluation</title>
      <p>
        In this section, we describe preliminary experimental
evaluation procedure of aDTS-CMA-ES that uses a GP model
with an automated selection of covariance function. Since
GPs are a nonparametric model, we opt for the WAIC,
which requires a sample from distribution (6). We use
Metropolis-Hastings MCMC with an adaptive proposal
distribution [
        <xref ref-type="bibr" rid="ref10">9</xref>
          ].2
      </p>
      <p>Algorithm 1 is updated in the following way:3
1. In steps (5) and (10), all GPs from Figure 1 are trained.
2. The predictive accuracy of all models is evaluated using the WAIC (Subsection 4.2). The DIC (Subsection 4.2) is also computed for information, but not taken into account.
3. The model with the lowest WAIC is used for prediction (steps (6) and (11)).</p>
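      <p>The selection step itself is a simple argmin over the candidate set; a schematic sketch (ours – <code>fit</code> and <code>waic_of</code> stand in for the actual training and criterion code):</p>

```python
def select_model(candidates, fit, waic_of):
    """Train every candidate GP and keep the one with the lowest WAIC."""
    fitted = [fit(c) for c in candidates]
    scores = [waic_of(m) for m in fitted]
    return fitted[scores.index(min(scores))]

# Toy stand-ins: three "models" scored by their distance from 2.
best = select_model([1, 2, 3], fit=lambda c: c, waic_of=lambda m: abs(m - 2))
```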
      <p>The hyper-priors are chosen as follows: log-normal
with mean log(0.01) and variance 2 for σn²; and log-t with
ν = 4 and mean 0 for all other hyperparameters.</p>
      <sec id="sec-6-1">
        <title>5.1 Setup</title>
        <p>
          The proposed algorithm implemented in MATLAB is
evaluated on the noiseless testbed of the COCO/BBOB
(Comparing Continuous Optimizers / Black-Box Optimization
Benchmarking) framework [
          <xref ref-type="bibr" rid="ref12 ref13">11,12</xref>
          ] and compared with the
GP-based aDTS-CMA-ES and the CMA-ES itself.
        </p>
        <p>2Using MATLAB implementation available at http://helios.
fmi.fi/~lainema/dram/</p>
        <p>3The sources are available at https://github.com/repjak/
surrogate-cmaes/tree/modelsel</p>
        <p>
          The testbed consists of 24 functions, each defined
everywhere on R^D with the optimum in [−5, 5]^D for all
dimensionalities D ≥ 2. Each test function has multiple
instances which are derived by various transformations of
input space or f -space. We run the algorithm on 5
instances (1, . . . , 5) as opposed to the 15 recommended instances
for the reason of increased computational demands of the
modified algorithm. For the same reason, only functions
of 10 variables (10D) are considered.
        </p>
        <p>
          If not stated otherwise, all settings of the
aDTS-CMA-ES are as recommended in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>The CMA-ES results in BBOB format were
downloaded from the BBOB 2010 workshop archive.4</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2 Results</title>
        <p>Figure 2 gives the scaled best-achieved logarithms Δf^log
of median distances to the function's optimum for the
respective number of function evaluations per dimension
(FE/D). Medians and the 1st and 3rd quartiles are
calculated from 5 independent instances in the case of the algorithm
with covariance selection according to the WAIC, and from
15 independent instances otherwise. We observe that in
most cases, the WAIC-based algorithm barely
outperforms the pure CMA-ES, which suggests the chosen
model is generally weak and the adaptivity mechanism
basically turns off using the surrogate model. The functions
where the WAIC variant outperforms the aDTS-CMA-ES
(f21 and f22) are multi-modal and the interquartile range
is large.</p>
        <p>In order to compare the considered information
criteria, we calculate the rank of each model under both WAIC
and DIC. Table 1 summarizes the average ranks over all
model selections performed on each benchmark function.
We observe that the DIC often prefers the additive model,
while the WAIC is more balanced in this respect.
Surprisingly, the linear kernel has been selected very rarely, even
on the linear function (f5), under both information criteria.
A similar observation holds for the quadratic kernel and
the quadratic functions (f1, f2).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6 Conclusion &amp; Further Work</title>
      <p>In this paper, we presented an algorithm for selecting a
GP kernel using Bayesian model comparison techniques.
Preliminary experiments for the model selection plugged
into the aDTS-CMA-ES algorithm were conducted on the
COCO/BBOB testbed. Due to the small number of
experiments performed so far, it is difficult to draw any serious
conclusions. The first obtained results may indicate
improper convergence of the MCMC sampler or that more
sophisticated covariance functions may be needed.</p>
      <p>
        One direction of future research, beside analyzing and
repairing the aforementioned deficiencies, is an extension of
the proposed algorithm into a combinatorial search over
kernels in the flavor of [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ], which is challenging due to the
computational costs related to the need for repeated surrogate
model retraining.
      </p>
      <p>4 http://coco.gforge.inria.fr/data-archive/bbob/2010/</p>
      <p>
        One possible direction of research is a co-evolution of
a population of covariance functions alongside the
population of candidate solutions to the black-box objective
function. Another related research area is applying surrogate
modeling to high-dimensional problems using algorithms
for variable selection via multiple kernel learning [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This research was supported by SVV project
number 260 453 and the Czech Science Foundation grant
No. 17-01251. Further, access to computing and
storage facilities owned by parties and projects
contributing to the National Grid Infrastructure MetaCentrum,
provided under the programme "Projects of Large Research,
Development, and Innovations Infrastructures" (CESNET
LM2015042), is greatly appreciated.</p>
      <p>[Figure 2: aDTS-CMA-ES, the WAIC variant and CMA-ES on the BBOB testbed in 10D (e.g., f4 Bueche-Rastrigin); Table 1: model ranks under the WAIC and DIC.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Akaike</surname>
          </string-name>
          .
          <source>Information Theory and an Extension of the Maximum Likelihood Principle</source>
          , pages
          <fpage>199</fpage>
          -
          <lpage>213</lpage>
          . Springer New York,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bach</surname>
          </string-name>
          .
          <article-title>High-dimensional non-linear variable selection through hierarchical kernel learning</article-title>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          .
          <article-title>Model-based evolutionary optimization methods</article-title>
          .
          <source>PhD thesis</source>
          ,
          <source>Faculty of Mathematics and Physics</source>
          , Charles University in Prague, Prague,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed</article-title>
          .
          <source>In Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference - GECCO Companion '15 . Association for Computing Machinery (ACM)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Structure discovery in nonparametric regression through compositional kernel search</article-title>
          . In S. Dasgupta and D. McAllester, editors,
          <source>Proceedings of the 30th International Conference on Machine Learning</source>
          , volume
          <volume>28</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , pages
          <fpage>1166</fpage>
          -
          <lpage>1174</lpage>
          , Atlanta, Georgia, USA, 17-19 Jun
          <year>2013</year>
          . PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nickisch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          .
          <article-title>Additive Gaussian processes</article-title>
          . In
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pages
          <fpage>226</fpage>
          -
          <lpage>234</lpage>
          . Curran Associates, Inc.,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gagné</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoenauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tomassini</surname>
          </string-name>
          .
          <article-title>Genetic programming for kernel-based learning with coevolving subsets selection</article-title>
          . In
          <source>Parallel Problem Solving from Nature</source>
          , Reykjavik, LNCS, pages
          <fpage>1008</fpage>
          -
          <lpage>1017</lpage>
          . Springer Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Vehtari</surname>
          </string-name>
          .
          <article-title>Understanding predictive information criteria for Bayesian models</article-title>
          .
          <source>Statistics and Computing</source>
          ,
          <volume>24</volume>
          (
          <issue>6</issue>
          ):
          <fpage>997</fpage>
          -
          <lpage>1016</lpage>
          , Nov.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Haario</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Saksman</surname>
          </string-name>
          .
          <article-title>DRAM: Efficient adaptive MCMC</article-title>
          .
          <source>Statistics and Computing</source>
          ,
          <volume>16</volume>
          (
          <issue>4</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>354</lpage>
          , Dec.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          .
          <article-title>The CMA evolution strategy: a comparing review</article-title>
          , pages
          <fpage>75</fpage>
          -
          <lpage>102</lpage>
          . Springer, Berlin, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Finck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Auger</surname>
          </string-name>
          .
          <article-title>Real-parameter Black-Box Optimization Benchmarking 2009: Noiseless functions definitions</article-title>
          .
          <source>Technical report, INRIA</source>
          ,
          <year>2009</year>
          , updated
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Finck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Auger</surname>
          </string-name>
          .
          <article-title>Real-parameter Black-Box Optimization Benchmarking 2012: Experimental setup</article-title>
          .
          <source>Technical report, INRIA</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ostermeier</surname>
          </string-name>
          .
          <article-title>Completely Derandomized Self-Adaptation in Evolution Strategies</article-title>
          .
          <source>Evolutionary Computation</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>195</lpage>
          , June
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Koumoutsakos</surname>
          </string-name>
          .
          <article-title>Local Metamodels for Optimization Using Evolution Strategies</article-title>
          , pages
          <fpage>939</fpage>
          -
          <lpage>948</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Automatic construction and natural-language description of nonparametric regression models</article-title>
          .
          <source>CoRR</source>
          , abs/1402.4304, Apr.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoenauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          .
          <article-title>Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES)</article-title>
          . In
          <source>Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation - GECCO '13</source>
          . ACM Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Močkus</surname>
          </string-name>
          .
          <article-title>On Bayesian methods for seeking the extremum</article-title>
          . In
          <source>Proceedings of the IFIP Technical Conference</source>
          , London, UK,
          <year>1974</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Neal</surname>
          </string-name>
          .
          Springer-Verlag New York, Inc., Secaucus, NJ, USA,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Doubly trained evolution control for the surrogate CMA-ES</article-title>
          . In
          <source>Parallel Problem Solving from Nature - PPSN XIV</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          . Springer International Publishing,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Repický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Overview of surrogate-model versions of covariance matrix adaptation evolution strategy</article-title>
          . In
          <source>Proceedings of the Genetic and Evolutionary Computation Conference 2017 (GECCO '17)</source>
          , Berlin, Germany, July 15-19. ACM, July
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Repický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Adaptive Doubly Trained Evolution Control for the Covariance Matrix Adaptation Evolution Strategy</article-title>
          . In
          <source>ITAT 2017: Information Technologies - Applications and Theory</source>
          , Martin, Sept.
          <year>2017</year>
          . CreateSpace Independent Publishing Platform.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Gaussian processes for machine learning</article-title>
          .
          <source>Adaptive computation and machine learning series</source>
          . MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Estimating the dimension of a model</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>461</fpage>
          -
          <lpage>464</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Spiegelhalter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Best</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carlin</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. Van Der</given-names>
            <surname>Linde</surname>
          </string-name>
          .
          <article-title>Bayesian measures of model complexity and fit</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B: Statistical Methodology</source>
          ,
          <volume>64</volume>
          (
          <issue>4</issue>
          ):
          <fpage>583</fpage>
          -
          <lpage>616</lpage>
          , Dec.
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ulmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Streichert</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zell</surname>
          </string-name>
          .
          <article-title>Evolution strategies assisted by Gaussian processes with improved pre-selection criterion</article-title>
          . In
          <source>The 2003 Congress on Evolutionary Computation, CEC '03</source>
          , pages
          <fpage>692</fpage>
          -
          <lpage>699</lpage>
          . Institute of Electrical and Electronics Engineers (IEEE),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          .
          <article-title>Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>11</volume>
          :
          <fpage>3571</fpage>
          -
          <lpage>3594</lpage>
          , Dec.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>