<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automated Selection of Covariance Function for Gaussian Process Surrogate Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jakub Repický</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zbyněk Pitra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <email>martin@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics, Charles University in Prague</institution>
          <addr-line>Malostranské nám. 25, 118 00 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Nuclear Sciences and Physical Engineering</institution>
          ,
          <addr-line>CTU in Prague, Břehová 7, 115 19 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Pod Vodárenskou věží 2, 182 07 Prague 8</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2203</volume>
      <fpage>64</fpage>
      <lpage>71</lpage>
      <abstract>
        <p>Gaussian processes have a long tradition in model-based algorithms for black-box optimization, where a limited number of objective function evaluations are available. A principal choice in specifying a Gaussian process model is the choice of the covariance function, which largely embodies the prior assumptions about the modeled function. Several methods for learning the form of covariance function have been proposed. We report a work in progress in which the covariance function is selected from a fixed set. The goal of covariance function selection is to capture non-local properties of the objective function and derive a more accurate surrogate model. The model-selection algorithm is evaluated in connection with Doubly Trained Surrogate Covariance Matrix Adaptation Evolution Strategy on the Comparing Continuous Optimizers framework. Several estimates of predictive performance, including cross-validation and information criteria, are discussed. Focus is placed on information criteria suitable for nonparametric methods, and two of them are compared experimentally.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Continuous black-box optimization is concerned with
finding extrema of a real-parameter objective function
whose analytical definition is not known. Such
functions, often arising, e. g., in engineering design
optimization or materials science, can only be evaluated empirically
or through simulations. Moreover, obtaining function
values may be expensive and affected by noise. The goal
of finding a global optimum is usually relaxed in favor
of finding a good enough solution within as few objective
function evaluations as possible.</p>
      <p>
        Evolution strategies, stochastic population-based
algorithms inspired by the process of natural evolution, present
a popular approach to continuous black-box
optimization. The Covariance Matrix Adaptation Evolution
Strategy (CMA-ES) [
        <xref ref-type="bibr" rid="ref11 ref14">10, 13</xref>
        ] is based on adaptation of the key
component of the mutation operator (the covariance
matrix) according to the historical search steps. The CMA-ES
is considered a state-of-the-art continuous black-box
optimizer. Nevertheless, considerable improvements in terms
of the number of fitness evaluations can be achieved by
use of surrogate models, i. e., statistical or machine
learning models of the fitness trained on data gathered during
the optimization.
      </p>
      <p>
        A variety of models for the CMA-ES has been
investigated, including but not limited to quadratic
approximations [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ], ranking support vector machines [
        <xref ref-type="bibr" rid="ref17">16</xref>
        ], random
forests [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Gaussian processes (GPs) [
        <xref ref-type="bibr" rid="ref21 ref27 ref4">4, 19, 25</xref>
        ].1
      </p>
      <p>
        Gaussian process (GP) regression is a nonparametric
method, meaning the data are assumed to be generated
from an infinite-dimensional distribution, i. e., a
distribution of functions. In black-box optimization, the
distribution of function values conditioned on observed data can
be used to derive a criterion for selecting the most
promising points for evaluation with the (expensive) fitness. As
far as we know, the first optimization method utilizing
uncertainty modeled by GPs is Bayesian optimization [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ].
In this paper, we build upon the more recent
Adaptive Doubly Trained Surrogate CMA-ES
(aDTS-CMA-ES), which uses a Gaussian process surrogate
model for the CMA-ES, although our approach is directly
applicable to Bayesian optimization as well.
      </p>
      <p>A Gaussian process is fully specified by a mean
function and a covariance function, each parametrized by a small
number of parameters. In order to distinguish the parameters
of the mean and covariance functions from the
infinite-dimensional parameter vector – the vector of function
values – they are referred to as hyperparameters. In
statistical works, the mean and covariance functions are chosen
by the statistician in a cycle of model building and model
checking.</p>
      <p>The goal of this work is to lay out a suitable method
for learning the form of the covariance function for Gaussian
processes in black-box optimization, with a focus on criteria
for evaluating candidate covariance functions. The main
hypothesis behind this paper is that a GP with a composite
form of its covariance function may result in a more
accurate approximation of the objective function and,
consequently, better performance of the model-assisted
optimization algorithm.</p>
      <p>
        1 An experimental comparison of selected surrogate-assisted variants
of the CMA-ES can be found in [
        <xref ref-type="bibr" rid="ref22 ref3">3, 20</xref>
        ].
      </p>
      <p>
        Related Work. Learning a composite expression of kernel
functions for support vector machines by genetic
programming was explored in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Hierarchical kernel learning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Additive Gaussian
processes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are algorithms for determining kernels
composed of lower-dimensional kernels.
      </p>
      <p>
        The goal of Automatic Statistician project [
        <xref ref-type="bibr" rid="ref16">15</xref>
        ] is
automatic statistical analysis of given data with output in
natural language. The algorithm of structure discovery in GP
models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a greedy search in the space of composite
covariance functions generated by operators of addition
and multiplication recursively applied to basis covariance
functions.
      </p>
      <p>To our knowledge, structure discovery in GP
surrogate models for continuous black-box optimization has not
yet been investigated. As a first step towards this goal,
we perform selection of the best GP model from a model
population that we tried to design large enough to
capture the structure of typical continuous black-box functions, but
still small enough for model selection to be
computationally feasible.</p>
      <p>The paper is organized as follows. Section 2 presents
the ideas behind surrogate models in evolutionary
optimization and the aDTS-CMA-ES algorithm. Section 3 describes
inference and learning in Gaussian process regression
models. Section 4 presents the algorithm for selecting the
best GP surrogate model. First results from an early stage
of experimental evaluation are presented in Section 5.
Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Surrogate-Assisted Evolutionary Optimization</title>
      <p>Evolution strategies are stochastic search algorithms
based on maintaining a population of candidate solutions,
usually encoded as real vectors. In each iteration
(generation), a population of λ offspring is generated from a
population of μ parents by the operators of recombination and
mutation. The new population of parents is selected either
from the union of offspring and parents (plus selection),
or, provided that μ ≤ λ, from the offspring exclusively
(comma selection).</p>
      <sec id="sec-3-0">
        <title>2.1 CMA-ES</title>
        <p>Mutation in evolution strategies is usually implemented
by sampling from a Gaussian distribution whose parameters
play a crucial role in the algorithm's convergence. The
main idea behind the CMA-ES is self-adaptation of the
mutation parameters, especially of the covariance matrix. The
CMA-ES repeatedly samples from N(m, σ²C) and
updates the parameters σ² (overall step-size), m (the mean) and
C (the covariance matrix) so that the likelihood of successful
mutation steps increases under the new parametrization.</p>
        <p>Algorithm 1 aDTS-CMA-ES
Input: λ (population size), y_target (target value),
f (original fitness function), α (ratio of
original-evaluated points), C (uncertainty criterion)
1: σ, m, C ← CMA-ES initialize
2: A ← ∅ {archive initialization}
3: while stopping conditions not met do
4:   {x_k}_{k=1}^λ ∼ N(m, σ²C) {CMA-ES sampling}
5:   f_M1 ← trainModel(A, σ, m, C) {model training}
6:   (ŷ, s²) ← f_M1([x_1, . . . , x_λ]) {model evaluation}
7:   X_orig ← select ⌈αλ⌉ best points according to C(ŷ, s²)
8:   y_orig ← f(X_orig) {original fitness evaluation}
9:   A ← A ∪ {(X_orig, y_orig)} {archive update}
10:  f_M2 ← trainModel(A, σ, m, C) {model retraining}
11:  y ← f_M2([x_1, . . . , x_λ]) {2nd model prediction}
12:  (y)_i ← y_orig,i for all original-evaluated y_orig,i ∈ y_orig
13:  α ← selfAdaptation(y, ŷ)
14:  σ, m, C ← CMA-ES update
15: end while
16: x_res ← x_k from A where y_k is minimal
Output: x_res (point with minimal y)</p>
      </sec>
      <sec id="sec-3-1">
        <title>2.2 aDTS-CMA-ES</title>
        <p>
          The aDTS-CMA-ES [
          <xref ref-type="bibr" rid="ref21 ref23 ref3">3, 19, 21</xref>
          ] utilizes a GP surrogate
model to estimate the fitness of a fraction of the
population. A pseudocode is given in Algorithm 1. The
algorithm expects an uncertainty criterion C for choosing
solutions for re-evaluation. In optimization based on
Gaussian processes, such criteria are conveniently defined on
the marginal GP posterior, which is a univariate
Gaussian distribution. One of the most prominent uncertainty
criteria is the probability of improvement, CPOI(x; T ) =
Pr( f (x) ≤ T ), i. e., the posterior probability that the
function value at a candidate solution x improves on a chosen
target T , typically set to the historically best fitness value.
        </p>
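        <p>For illustration, the probability of improvement is just a Gaussian CDF evaluated at the target; a minimal sketch (ours, not the paper's MATLAB implementation; the function name is our choice):</p>

```python
import math

def probability_of_improvement(mean, std, target):
    """CPOI(x; T) = Pr(f(x) ≤ T) under the marginal posterior N(mean, std^2)."""
    z = (target - mean) / std
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A candidate whose posterior mean equals the target improves with probability 1/2.
print(probability_of_improvement(mean=1.0, std=0.5, target=1.0))  # 0.5
```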
        <p>The sampling in aDTS-CMA-ES is identical to that of
CMA-ES. The surrogate model is trained twice per
generation. The first model is trained on a data set, which
naturally cannot contain any individuals from the current
population. A fraction α of the population is selected
according to C , evaluated with the (expensive) fitness function
and included into the archive of individuals with known
fitness values. The model is retrained and used to predict
the remainder of the population. The fraction α is adapted
according to the surrogate model's performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Gaussian Processes</title>
      <p>Let X be some input space of dimensionality D. A
Gaussian process with a mean function μ : X → R and a
covariance function k : X × X → R is a collection of
random variables (f(x))x∈X such that every finite-dimensional
marginal (f(x_i))_{i=1}^N follows a multivariate Gaussian
distribution N(μ(X), K(X, X)), where μ(X) = (μ(x_i))_{i=1}^N and
K(X, X) = (k(x_i, x_j))_{i,j=1}^N. Both μ and k are
parameterized, but we omit their parameters for the sake of
readability.</p>
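      <p>The defining property – every finite marginal is a multivariate Gaussian – can be made concrete by sampling one such marginal. A minimal Python sketch (our illustration, using a squared exponential covariance with unit parameters):</p>

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0, sf2=1.0):
    """Squared exponential covariance k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

# A finite marginal of a zero-mean GP at inputs X is N(0, K(X, X)).
rng = np.random.default_rng(0)
X = np.linspace(-5.0, 5.0, 50).reshape(-1, 1)          # N = 50 inputs, D = 1
K = sq_exp_kernel(X, X)
f = rng.multivariate_normal(np.zeros(50), K + 1e-10 * np.eye(50))  # one sample path
```

      <p>The small jitter term 1e-10 · I keeps the covariance numerically positive definite.</p>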
      <sec id="sec-4-1">
        <title>3.1 Inference</title>
        <p>
          Let y = {y_1, . . . , y_N} be N i. i. d. observations at inputs
X = {x_1, . . . , x_N}. A model with Gaussian likelihood and
GP prior is given by the distributions y | f ∼ N(f, σn²IN) and
f | X ∼ N(μ(X), K(X, X)). From now on, we assume μ =
0; deterministic non-zero mean functions can be used by
simply subtracting them from y (see [
          <xref ref-type="bibr" rid="ref24">22</xref>
          ] for more on this). Let
us denote by θ the vector of hyperparameters, consisting
of the parameters of k and the noise variance σn².
        </p>
        <p>
          The marginal likelihood of hyperparameters θ is (see
[
          <xref ref-type="bibr" rid="ref24">22</xref>
          ])
        </p>
        <p>p(y | X, θ) = ∫ p(y | X, f, θ) p(f | θ) df = φ(y | 0, K(X, X) + σn²IN),   (1)
where φ denotes the normalized multivariate Gaussian
density.</p>
        <p>In the regression problem, we are interested in the
conditional distribution f∗ | y, X, X∗, θ for X∗ a set of N∗ test
inputs. Since [yᵀ f∗ᵀ]ᵀ | X, X∗, θ follows a multivariate
Gaussian distribution (2), the distribution of f∗ | y, X, X∗, θ is also a
multivariate Gaussian, in particular
f∗ ∼ N(f̂∗, cov(f∗)),   (3)
where
f̂∗ = K(X∗, X)[K(X, X) + σn²IN]⁻¹ y,   (4)
cov(f∗) = K(X∗, X∗) − K(X∗, X)[K(X, X) + σn²IN]⁻¹ K(X, X∗).   (5)</p>
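        <p>The predictive mean and covariance above amount to a few lines of linear algebra. A hedged sketch using a Cholesky factorization (ours, not the paper's MATLAB code; the helper kernel and all names are illustrative):</p>

```python
import numpy as np

def gp_posterior(K, K_star, K_ss, y, sigma2_n):
    """Posterior mean K*(K+σn²I)⁻¹y and covariance K** − K*(K+σn²I)⁻¹K*ᵀ
    of the latent values f* at test inputs."""
    N = K.shape[0]
    L = np.linalg.cholesky(K + sigma2_n * np.eye(N))   # the O(N³) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, K_star.T)
    return K_star @ alpha, K_ss - V.T @ V

# Illustrative check: with a tiny noise variance, predicting back at the
# training inputs essentially interpolates the observations.
def k_se(A, B):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.0])
mean, cov = gp_posterior(k_se(X, X), k_se(X, X), k_se(X, X), y, 1e-8)
```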
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Hierarchical Model</title>
        <p>When the covariance function family is given, model
selection for GP regression is usually performed
by the maximum marginal likelihood estimate θ̂ML =
arg maxθ log p(y | X, θ), which is a non-convex
optimization problem. Computation of the log marginal likelihood
takes O(N³) time due to a Cholesky decomposition of the
covariance matrix K(X, X).</p>
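        <p>Hyperparameter fitting by maximum marginal likelihood needs the quantity log p(y | X, θ). A sketch of the standard Cholesky-based computation (ours, not the paper's code):</p>

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, sigma2_n):
    """log p(y | X, θ) = −½ yᵀ(K+σn²I)⁻¹y − ½ log det(K+σn²I) − (N/2) log 2π."""
    N = len(y)
    L = np.linalg.cholesky(K + sigma2_n * np.eye(N))   # the O(N³) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2.0 * np.log(np.diag(L)).sum()           # log det via the factor L
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2.0 * np.pi)
```

        <p>For K = I, σn² = 0 and y = 0 this reduces to the log density of a standard multivariate normal at the origin.</p>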
        <p>From a Bayesian perspective, especially if the number
of hyperparameters is large or if N is small, it might be
more appropriate to do inference with the marginal
posterior distribution of the hyperparameters
p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X),   (6)
where p(y | X, θ) is the marginal likelihood (1), now
playing the role of the likelihood, and p(θ) is a hyper-prior.
Simulations from p(θ | X, y) can be obtained by Bayesian
computation methods, such as Markov chain Monte Carlo.</p>
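        <p>Simulation from p(θ | X, y) only needs the unnormalized log posterior log p(y | X, θ) + log p(θ). A plain random-walk Metropolis sketch (ours; the paper uses an adaptive-proposal variant [9], and the toy target below is purely illustrative):</p>

```python
import numpy as np

def metropolis(log_post, theta0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis sampler for p(θ | X, y) ∝ p(y | X, θ) p(θ)."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    draws = []
    for _ in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if lp_prop - lp > np.log(rng.random()):
            theta, lp = prop, lp_prop
        draws.append(theta.copy())
    return np.array(draws)

# Toy target: a standard normal posterior over a single hyperparameter.
draws = metropolis(lambda t: -0.5 * (t ** 2).sum(), theta0=[0.0], n_samples=4000)
theta_bayes = np.median(draws, axis=0)   # the componentwise posterior median
```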
        <p>Uncertainty criteria in Algorithm 1 can thus
incorporate uncertainty of hyperparameter estimation in addition
to uncertainty about functions. In the current stage of
research, we compute the prediction conditioned on a Bayes
estimate θ̂Bayes = median({θs : s = 1, . . . , S}), i. e., the
median of the posterior sample.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.2 Performance Criteria</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Model Selection</title>
      <p>
        If the probability of the true fitness function under GP
prior is low, the performance of the model will be poor.
For example, a GP with a neural network covariance fits
data from a jump function better compared to a GP with a
squared exponential [
        <xref ref-type="bibr" rid="ref24">22</xref>
        ] (more on covariance functions in
Subsection 4.1). Searching over GP models with different
covariances thus can be viewed as an automated
construction of suitable priors. We select the model from a finite set
according to a criterion of predictive performance, since
this approach can easily be embedded into a combinatorial
search algorithm, such as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. GPs can represent
random functions. The finite population of models included
in our approach is described in Subsection 4.1. Some
important classes of functions, such as linear and quadratic
functions, neural networks and additive functions, are
represented.
      </p>
      <sec id="sec-5-1">
        <title>4.1 Model Population</title>
        <p>The set of candidate GP models is shown in Figure 1. All
models have zero mean.</p>
        <p>
          A covariance function k(x, x′) is stationary if it is a
function of the distance ‖x − x′‖. The squared exponential
(SE) [
          <xref ref-type="bibr" rid="ref24">22</xref>
          ] is a stationary covariance function that leads to
smooth processes [
          <xref ref-type="bibr" rid="ref24">22</xref>
          ].
        </p>
        <p>
          A neural network (NN) covariance is a covariance of a
GP induced by a Gaussian prior on weights of an infinitely
wide neural network [
          <xref ref-type="bibr" rid="ref19">18</xref>
          ].
        </p>
        <p>A dot product with a constant bias term models linear
functions. The quadratic covariance is the square of such a linear
covariance. GPs with these covariances lead to Bayesian
variants of linear and quadratic regression, respectively.</p>
        <p>
          Additive covariance functions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] are sums of
lower-dimensional components. We include an additive
covariance function with a single degree of interaction – a
superposition of one-dimensional squared exponentials.
        </p>
        <p>Finally, we consider two cases of composite covariance
functions: a sum of a squared exponential and a neural
network; and a sum of a squared exponential and a quadratic.
We would like to select the surrogate model based on an
estimation of out-of-sample predictive accuracy.</p>
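        <p>Since a sum of covariance functions is again a covariance function, the composite candidates can be assembled mechanically. An illustrative sketch (ours; parametrizations simplified, e.g., unit length-scales):</p>

```python
import numpy as np

def k_se(A, B):
    """Squared exponential (unit length-scale)."""
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def k_lin(A, B, bias=1.0):
    """Dot product with a constant bias term (linear functions)."""
    return A @ B.T + bias

def k_quad(A, B, bias=1.0):
    """The linear covariance squared (quadratic functions)."""
    return k_lin(A, B, bias) ** 2

def k_sum(*kernels):
    """A sum of covariance functions is itself a valid covariance function."""
    return lambda A, B: sum(k(A, B) for k in kernels)

se_plus_quad = k_sum(k_se, k_quad)   # one of the two composite candidates
X = np.random.default_rng(1).normal(size=(6, 2))
K = se_plus_quad(X, X)
```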
        <p>An attractive estimate of the out-of-sample predictive
accuracy is cross-validation based on some partitioning of
the data set into multiple data sets called folds. However,
choosing among multiple GP models by cross-validation
in each generation of the evolutionary optimization can
be considered prohibitive from the computational
perspective.</p>
        <p>[Figure 1: samples from the candidate GP priors with the SE, NN, LIN, QUAD, ADD, SE+NN and SE+QUAD covariance functions.]</p>
        <p>
          In the remainder of this subsection we follow the
exposition of model comparison from Bayesian perspective
given in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We denote by q the true distribution from
which data y are sampled and we suppress conditioning
on X for simplicity.
        </p>
        <p>A general measure of fit of a probabilistic model
to data y is the log likelihood or log predictive
density log p(y | θ) = log ∏_{i=1}^N p(y_i | θ). The quantity
−2 log p(y | θ) is called the deviance.</p>
        <p>
          Akaike information criterion (AIC) [
          <xref ref-type="bibr" rid="ref1 ref9">1</xref>
          ] and related
Bayes information criterion (BIC) [
          <xref ref-type="bibr" rid="ref25">23</xref>
          ] are based on the
expected log predictive density conditioned on a
maximum likelihood estimate θ̂ML,
elpd_θ̂ = E_q(log p(ỹ | θ̂ML)),   (7)
where the expectation is taken over all possible data sets
ỹ. Since the expectation (7) cannot be computed exactly, it is
estimated from the sample y. The AIC and BIC compensate
for the bias towards overfitting by subtracting a correction
term: the number of parameters nθ and ½ nθ log N,
respectively.
        </p>
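          <p>On the usual deviance scale, the two criteria read AIC = −2 log p(y | θ̂ML) + 2 nθ and BIC = −2 log p(y | θ̂ML) + nθ log N; a trivial sketch (ours):</p>

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion on the deviance scale."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_data):
    """Bayes information criterion; the penalty grows with the sample size N."""
    return -2.0 * log_lik + n_params * math.log(n_data)
```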
        <p>
          For hierarchical Bayesian models, such as (6), it is not
always entirely clear what the parameters of the model
are, since the likelihood can factorize in different ways.
The deviance information criterion (DIC) [
          <xref ref-type="bibr" rid="ref26">24</xref>
          ] is still
based on deviance, conditioned on a Bayes estimate θˆBayes,
but the effective number of parameters pDIC depends on
data. We define the DIC for the marginal likelihood (1),
focusing on hyperparameters θ , although it could be defined
for the likelihood p(y | f , θ ), focusing on both f and θ .
        </p>
        <p>
          We use the following definition of the effective number
of parameters (see [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]):
        </p>
        <p>pDIC = 2 varpost(log p(y | θ)),
which can be estimated by the sample variance of a
posterior sample. Using the effective number of parameters, the
DIC is</p>
        <p>DIC = −2 log p(y | θ̂Bayes) + 2 pDIC.</p>
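        <p>Given a posterior sample θ1, . . . , θS, both quantities are one-liners; a sketch (ours):</p>

```python
import numpy as np

def dic(log_lik_at_bayes, log_lik_draws):
    """DIC = −2 log p(y | θ̂Bayes) + 2 pDIC, with pDIC = 2 var_post(log p(y | θ))
    estimated by the sample variance over the posterior draws."""
    p_dic = 2.0 * np.var(log_lik_draws, ddof=1)
    return -2.0 * log_lik_at_bayes + 2.0 * p_dic
```

        <p>With a degenerate posterior sample the penalty vanishes and the DIC reduces to the deviance at θ̂Bayes.</p>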
        <p>
          A probabilistic model is called regular if its
parameters are identifiable and its Fisher information matrix is
positive definite for all parameter values. The model is
called singular otherwise. The information criteria defined
above assume regularity. The widely applicable
information criterion (WAIC) [
          <xref ref-type="bibr" rid="ref28">26</xref>
          ] also works for singular models. The
WAIC is based on estimation of the expected log
pointwise predictive density
        </p>
        <p>elppd = ∑_{i=1}^N E_q(log ppost(ỹ_i)) = ∑_{i=1}^N E_q(log ∫ p(ỹ_i | y, θ) p(θ | y) dθ).</p>
        <p>
          The estimation of elppd from the sample is biased, so
again, an effective number of parameters must be added as
a correction. We use the following definition of the WAIC
(see [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]):
        </p>
        <p>WAIC = −∑_{i=1}^N log ppost(y_i) + ∑_{i=1}^N varpost(log p(y_i | θ)),
that is, the negative log pointwise predictive density
corrected for bias by the pointwise posterior variance of the log
predictive density.</p>
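        <p>With an S × N matrix of pointwise log predictive densities log p(y_i | θ_s) obtained from the posterior sample, the WAIC can be computed as follows (a sketch, ours):</p>

```python
import numpy as np

def waic(log_lik):
    """WAIC from an S×N matrix of log p(y_i | θ_s) over S posterior draws:
    −Σ_i log((1/S) Σ_s p(y_i | θ_s)) + Σ_i var_s(log p(y_i | θ_s))."""
    m = log_lik.max(axis=0)
    lppd_i = m + np.log(np.exp(log_lik - m).mean(axis=0))  # stable log-mean-exp
    p_waic_i = log_lik.var(axis=0, ddof=1)                 # pointwise penalty
    return -lppd_i.sum() + p_waic_i.sum()
```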
        <p>The pointwise predictive density ppost(y_i | y, θ) for the
GP model (1) is computed by integrating the Gaussian
likelihood over the marginal posterior GP at the i-th training point:
p(y_i | y, θ) = ∫ p(y_i | y, f_i, θ) p(f_i | y, θ) df_i = φ(y_i | f̂_i, σn² + var(f_i)),
where φ denotes the Gaussian density and f̂_i, var(f_i) are
as in (3).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Experimental Evaluation</title>
      <p>
        In this section, we describe preliminary experimental
evaluation procedure of aDTS-CMA-ES that uses a GP model
with an automated selection of covariance function. Since
GPs are a nonparametric model, we opt for the WAIC,
which requires a sample from distribution (6). We use
Metropolis-Hastings MCMC with an adaptive proposal
distribution [
        <xref ref-type="bibr" rid="ref10">9</xref>
          ].2
      </p>
      <p>Algorithm 1 is updated in the following way:3
1. In steps (5) and (10), all GPs from Figure 1 are trained.
2. The predictive accuracy of all models is evaluated using the WAIC (Subsection 4.2). The DIC (Subsection 4.2) is also computed for information, but not taken into account.
3. The model with the lowest WAIC is used for prediction (steps (6) and (11)).</p>
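      <p>The selection step itself is a simple argmin over the candidate set; a schematic sketch (ours – <code>fit</code> and <code>waic_of</code> stand in for the actual training and criterion code):</p>

```python
def select_model(candidates, fit, waic_of):
    """Train every candidate GP and keep the one with the lowest WAIC."""
    fitted = [fit(c) for c in candidates]
    scores = [waic_of(m) for m in fitted]
    return fitted[scores.index(min(scores))]

# Toy stand-ins: three "models" scored by their distance from 2.
best = select_model([1, 2, 3], fit=lambda c: c, waic_of=lambda m: abs(m - 2))
```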
      <p>The hyper-priors are chosen as follows: log-normal
with mean log(0.01) and variance 2 for σn²; and log-t with
ν = 4 and mean 0 for all other hyperparameters.</p>
      <sec id="sec-6-1">
        <title>5.1 Setup</title>
        <p>
          The proposed algorithm implemented in MATLAB is
evaluated on the noiseless testbed of the COCO/BBOB
(Comparing Continuous Optimizers / Black-Box Optimization
Benchmarking) framework [
          <xref ref-type="bibr" rid="ref12 ref13">11,12</xref>
          ] and compared with the
GP-based aDTS-CMA-ES and the CMA-ES itself.
        </p>
        <p>2Using MATLAB implementation available at http://helios.
fmi.fi/~lainema/dram/</p>
        <p>3The sources are available at https://github.com/repjak/
surrogate-cmaes/tree/modelsel</p>
        <p>
          The testbed consists of 24 functions, each defined
everywhere on R^D with the optimum in [−5, 5]^D for all
dimensionalities D ≥ 2. Each test function has multiple
instances which are derived by various transformations of
input space or f -space. We run the algorithm on 5
instances (1, . . . , 5) as opposed to the 15 recommended instances
for the reason of increased computational demands of the
modified algorithm. For the same reason, only functions
of 10 variables (10D) are considered.
        </p>
        <p>
          If not stated otherwise, all settings of the
aDTS-CMA-ES are as recommended in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>The CMA-ES results in BBOB format were
downloaded from the BBOB 2010 workshop archive.4</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2 Results</title>
        <p>Figure 2 gives the scaled best-achieved logarithms Δf^log
of median distances to the function's optimum for the
respective number of function evaluations per dimension
(FE/D). Medians and the 1st and 3rd quartiles are
calculated from 5 independent instances in the case of the algorithm
with covariance selection according to the WAIC, and from
15 independent instances otherwise. We observe that in
most cases, the WAIC-based algorithm barely
outperforms the pure CMA-ES, which suggests the chosen
model is generally weak and the adaptivity mechanism
basically turns off using the surrogate model. The functions
where the WAIC variant outperforms the aDTS-CMA-ES
(f21 and f22) are multi-modal and the interquartile range
is large.</p>
        <p>In order to compare the considered information
criteria, we calculate the rank of each model under both WAIC
and DIC. Table 1 summarizes the average ranks over all
model selections performed on each benchmark function.
We observe that the DIC often prefers the additive model,
while the WAIC is more balanced in this respect.
Surprisingly, the linear kernel has been selected very rarely, even
on the linear function (f5), under both information criteria.
A similar observation holds for the quadratic kernel and
the quadratic functions (f1, f2).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6 Conclusion &amp; Further Work</title>
      <p>In this paper, we presented an algorithm for selecting a
GP kernel using Bayesian model comparison techniques.
Preliminary experiments for the model selection plugged
into the aDTS-CMA-ES algorithm were conducted on the
COCO/BBOB testbed. Due to the small number of
experiments performed so far, it is difficult to draw any serious
conclusions. The first obtained results may indicate
improper convergence of the MCMC sampler or that more
sophisticated covariance functions may be needed.</p>
      <p>
        One direction of future research, beside analyzing and
repairing the aforementioned deficiencies, is an extension of
the proposed algorithm into a combinatorial search over
kernels in the flavor of [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ], which is challenging due to the
computational costs related to the need for repeated surrogate
model retraining.
      </p>
      <p>4 http://coco.gforge.inria.fr/data-archive/bbob/2010/</p>
      <p>
        One possible direction of research is a co-evolution of
a population of covariance functions alongside the
population of candidate solutions to the black-box objective
function. Another related research area is applying surrogate
modeling to high-dimensional problems using algorithms
for variable selection via multiple kernel learning [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This research was supported by SVV project
number 260 453 and the Czech Science Foundation grant
No. 17-01251. Further, access to computing and
storage facilities owned by parties and projects
contributing to the National Grid Infrastructure MetaCentrum,
provided under the programme "Projects of Large Research,
Development, and Innovations Infrastructures" (CESNET
LM2015042), is greatly appreciated.</p>
      <p>[Figure 2: aDTS-CMA-ES, the WAIC variant and CMA-ES on the BBOB testbed in 10D (e.g., f4 Bueche-Rastrigin); Table 1: model ranks under the WAIC and DIC.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Akaike</surname>
          </string-name>
          .
          <source>Information Theory and an Extension of the Maximum Likelihood Principle</source>
          , pages
          <fpage>199</fpage>
          -
          <lpage>213</lpage>
          . Springer New York,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bach</surname>
          </string-name>
          .
          <article-title>High-dimensional non-linear variable selection through hierarchical kernel learning</article-title>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          .
          <article-title>Model-based evolutionary optimization methods</article-title>
          .
          <source>PhD thesis</source>
          ,
          <source>Faculty of Mathematics and Physics</source>
          , Charles University in Prague, Prague,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed</article-title>
          .
          <source>In Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference - GECCO Companion '15 . Association for Computing Machinery (ACM)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Structure discovery in nonparametric regression through compositional kernel search</article-title>
          . In S. Dasgupta and D. McAllester, editors,
          <source>Proceedings of the 30th International Conference on Machine Learning</source>
          , volume
          <volume>28</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , pages
          <fpage>1166</fpage>
          -
          <lpage>1174</lpage>
          , Atlanta, Georgia, USA, 17-19 Jun
          <year>2013</year>
          . PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nickisch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          .
          <article-title>Additive Gaussian processes</article-title>
          . In
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pages
          <fpage>226</fpage>
          -
          <lpage>234</lpage>
          . Curran Associates, Inc.,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gagné</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoenauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tomassini</surname>
          </string-name>
          .
          <article-title>Genetic programming for kernel-based learning with coevolving subsets selection</article-title>
          . In
          <source>Parallel Problem Solving from Nature</source>
          , Reykjavik, LNCS, pages
          <fpage>1008</fpage>
          -
          <lpage>1017</lpage>
          . Springer Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Vehtari</surname>
          </string-name>
          .
          <article-title>Understanding predictive information criteria for Bayesian models</article-title>
          .
          <source>Statistics and Computing</source>
          ,
          <volume>24</volume>
          (
          <issue>6</issue>
          ):
          <fpage>997</fpage>
          -
          <lpage>1016</lpage>
          , Nov.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Haario</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Saksman</surname>
          </string-name>
          .
          <article-title>DRAM: Efficient adaptive MCMC</article-title>
          .
          <source>Statistics and Computing</source>
          ,
          <volume>16</volume>
          (
          <issue>4</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>354</lpage>
          , Dec.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          .
          <article-title>The CMA evolution strategy: a comparing review</article-title>
          , pages
          <fpage>75</fpage>
          -
          <lpage>102</lpage>
          . Springer, Berlin, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Finck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Auger</surname>
          </string-name>
          .
          <article-title>Real-parameter Black-Box Optimization Benchmarking 2009: Noiseless functions definitions</article-title>
          .
          <source>Technical report, INRIA</source>
          ,
          <year>2009</year>
          , updated
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Finck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ros</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Auger</surname>
          </string-name>
          .
          <article-title>Real-parameter Black-Box Optimization Benchmarking 2012: Experimental setup</article-title>
          .
          <source>Technical report, INRIA</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ostermeier</surname>
          </string-name>
          .
          <article-title>Completely Derandomized Self-Adaptation in Evolution Strategies</article-title>
          .
          <source>Evolutionary Computation</source>
          ,
          <volume>9</volume>
          (
          <issue>2</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>195</lpage>
          , June
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Koumoutsakos</surname>
          </string-name>
          .
          <article-title>Local Metamodels for Optimization Using Evolution Strategies</article-title>
          , pages
          <fpage>939</fpage>
          -
          <lpage>948</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Duvenaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ghahramani</surname>
          </string-name>
          .
          <article-title>Automatic construction and natural-language description of nonparametric regression models</article-title>
          .
          <source>CoRR</source>
          , abs/1402.4304, Apr.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schoenauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          .
          <article-title>Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES)</article-title>
          . In
          <source>Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation - GECCO '13</source>
          . ACM Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Močkus</surname>
          </string-name>
          .
          <article-title>On Bayesian methods for seeking the extremum</article-title>
          . In
          <source>Proceedings of the IFIP Technical Conference</source>
          , London, UK,
          <year>1974</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Neal</surname>
          </string-name>
          .
          Springer-Verlag New York, Inc., Secaucus, NJ, USA,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Doubly trained evolution control for the surrogate CMA-ES</article-title>
          . In
          <source>Parallel Problem Solving from Nature - PPSN XIV</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          . Springer International Publishing,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Repický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Overview of surrogate-model versions of covariance matrix adaptation evolution strategy</article-title>
          . In
          <source>Proceedings of the Genetic and Evolutionary Computation Conference 2017 (GECCO '17)</source>
          , Berlin, Germany, July 15-19. ACM, July
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Repický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Holeňa</surname>
          </string-name>
          .
          <article-title>Adaptive Doubly Trained Evolution Control for the Covariance Matrix Adaptation Evolution Strategy</article-title>
          . In
          <source>ITAT 2017: Information Technologies - Applications and Theory</source>
          , Martin, Sept.
          <year>2017</year>
          . CreateSpace Independent Publishing Platform.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <article-title>Gaussian processes for machine learning</article-title>
          .
          <source>Adaptive computation and machine learning series</source>
          . MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          .
          <article-title>Estimating the dimension of a model</article-title>
          .
          <source>The Annals of Statistics</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>461</fpage>
          -
          <lpage>464</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Spiegelhalter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Best</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carlin</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. Van Der</given-names>
            <surname>Linde</surname>
          </string-name>
          .
          <article-title>Bayesian measures of model complexity and fit</article-title>
          .
          <source>Journal of the Royal Statistical Society. Series B: Statistical Methodology</source>
          ,
          <volume>64</volume>
          (
          <issue>4</issue>
          ):
          <fpage>583</fpage>
          -
          <lpage>616</lpage>
          , Dec.
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ulmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Streichert</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zell</surname>
          </string-name>
          .
          <article-title>Evolution strategies assisted by Gaussian processes with improved pre-selection criterion</article-title>
          . In
          <source>The 2003 Congress on Evolutionary Computation, CEC '03</source>
          , pages
          <fpage>692</fpage>
          -
          <lpage>699</lpage>
          . Institute of Electrical and Electronics Engineers (IEEE),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          .
          <article-title>Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>11</volume>
          :
          <fpage>3571</fpage>
          -
          <lpage>3594</lpage>
          , Dec.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>