Combining Gaussian Processes and Neural Networks in Surrogate Modeling for Covariance Matrix Adaptation Evolution Strategy

Jan Koza¹, Jiří Tumpach², Zbyněk Pitra³, and Martin Holeňa⁴

¹ Czech Technical University, Faculty of Information Technology, Prague, kozajan@fit.cvut.cz
² Charles University, Faculty of Mathematics and Physics, Prague, tumpach@cs.cas.cz
³ Czech Technical University, Faculty of Nuclear Sciences and Physical Engineering, Prague, z.pitra@gmail.com
⁴ Academy of Sciences, Institute of Computer Science, Prague, martin@cs.cas.cz

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: This paper focuses on surrogate models for the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) in continuous black-box optimization. Surrogate modeling has proven able to decrease the number of evaluations of the objective function, an important requirement in real-world applications where the evaluation can be costly or time-demanding. Surrogate models achieve this by providing an approximation instead of an evaluation of the true objective function. One of the state-of-the-art models for this task is the Gaussian process. We present an approach to combining Gaussian processes with artificial neural networks, which was previously successfully applied in other machine learning domains. The experimental part employs data recorded from previous CMA-ES runs, allowing us to assess different settings of surrogate models without running the whole CMA-ES algorithm. The data were collected using 24 noiseless benchmark functions of the platform for comparing continuous optimizers COCO in 5 different dimensions. Overall, we used data samples from over 2.8 million generations of CMA-ES runs. The results examine and statistically compare six covariance functions of Gaussian processes with the neural network extension. So far, the combined model has not outperformed the Gaussian process alone. We therefore conclude by discussing possible reasons for this and ideas for future research.

Keywords: black-box optimization, surrogate modeling, artificial neural networks, Gaussian processes, covariance functions

1 Introduction

In evolutionary black-box optimization, increasing attention is paid to tasks with expensive evaluation of the black-box objective function. It is immaterial whether that expensiveness is due to time-consuming computation, as in long simulations [17], or due to evaluation through costly experiments, as in some areas of science [3]. To deal with such expensive evaluations, black-box optimization adopted in the late 1990s and early 2000s an approach called surrogate modeling or metamodeling [7, 13, 15, 34, 42, 45, 48].

Basically, a surrogate model in continuous black-box optimization is any regression model that approximates the true black-box objective function with sufficient fidelity and replaces it in some of its evaluations. Among the regression models most frequently used to this end, we are most interested in Gaussian processes (GPs) [47], because they estimate not only the expected value of the true objective function but the whole distribution of its values. Apart from regression, they are also encountered in classification, and they play a key role in Bayesian optimization [22, 28], where the distribution of values provides selection criteria that are alternatives to the objective function value, such as the probability of improvement or the expected improvement.

The importance of GPs in machine learning incited attempts to integrate them with the leading learning paradigm of the last decades – neural learning, including deep learning. The attractiveness of this research direction is further supported by recent theoretical results concerning relationships between asymptotic properties of important kinds of artificial neural networks (ANNs) and properties of GPs [35, 39, 41]. The integration of GPs with neural learning has been proposed on two different levels:

(i) proper integration of an ANN with a GP, in which the GP forms the final output layer of the ANN [9, 52];
(ii) only a transfer of the layered structure, which is a crucial feature, to the GP context, leading to the concept of deep GPs (DGPs) [8, 12, 22, 23].

The recalled research into the integration of GPs with neural learning has used regression data [8, 9, 12, 23, 52], mostly from the UCI Machine Learning Repository [50], but also data concerning locomotion of walking bipedal robots [9], and face patches and handwritten digits [52]. In [12, 23], classification data were used as well. However, we are aware of only one application of such an integration, in particular of DGPs, to optimization, namely to two very easy 1- and 2-dimensional problems [22]. Hence, there is a gap between the importance of GPs in Bayesian optimization and the missing investigation of the suitability of integrated GP-ANN models for surrogate modeling in black-box optimization. That gap motivated the research reported in this paper.

We have performed this novel kind of investigation of GP-ANN integration using data from more than 5000 runs using GPs as Bayesian optimizers in black-box optimization, i.e. optimization of pointwise observable functions for which no analytic expression is available and the function values have to be obtained either empirically, e.g. through measuring or experiments, or through numerical simulations. Based on those data, our investigation addressed two research questions:

1. Does the integration of a GP with neural learning have an added value compared with employing a GP alone or an ANN alone?
2. What is the impact of each of the considered GP covariance functions on the result of GP-ANN integration?

The next section introduces the principles of surrogate modeling as well as of the evolutionary optimization method Covariance Matrix Adaptation Evolution Strategy (CMA-ES), in the context of which our research has been performed. In Section 3, the fundamentals of GPs are recalled and the method we have used for their integration with neural networks is explained. The core part of the paper is Section 4, which attempts to provide experimental answers to the above research questions.
2 Surrogate Modeling in Black-Box Optimization

Basically, the purpose of surrogate modeling – to approximate an unknown functional dependence – coincides with the purpose of response surface modeling in the design of experiments [24, 40]. It is therefore not surprising that typical response surface models, i.e. low-order polynomials, also belong to the most traditional and most successful surrogate models [1, 2, 20, 29, 45]. Other frequently used kinds of surrogate models are artificial neural networks of the multilayer perceptron (MLP) or radial basis function network kind [18, 25–27, 42, 53], and Gaussian processes, in surrogate modeling also known as kriging [6, 7, 13–15, 33, 34, 37, 49, 51]. Occasionally encountered are support vector regression [10, 36] and random forests [5, 44].

2.1 CMA-ES and Its Surrogate-Assisted Variant DTS-CMA-ES

The CMA-ES algorithm performs unconstrained optimization on ℝ^d by means of iterative sampling of populations of size λ from a d-dimensional Gaussian distribution N(m, σ²C), and uses a given parent number µ of the sampled points corresponding to the lowest objective function values to update the parameters of that distribution. Hence, it updates the expected value m, which is used as the current point estimate of the function optimum, the matrix C, and the step-size σ. The CMA-ES is invariant with respect to strictly increasing transformations of the objective function. Hence, to make use of the evaluations of the objective function in a set of points, it needs to know only the ordering of those evaluations. Details of the algorithm can be found in [19, 21].

During the more than 20 years of CMA-ES existence, a number of surrogate-assisted variants of this algorithm have been proposed; a survey can be found in [6, 43]. Here, we pay attention only to the most recent GP-based variant among them, the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES) [6]. It employs two GPs f₁ and f₂, trained consecutively, to find an evaluation of the population x₁, ..., x_λ, with f₁ used for active learning of training data for f₂. Due to the CMA-ES invariance with respect to strictly increasing transformations, it evaluates the difference between predictions only according to the difference in the ordering of those predictions, using the ranking difference error (RDE). The RDE of y ∈ ℝ^λ with respect to y′ ∈ ℝ^λ, considering the k best components, is defined as

\[
\mathrm{RDE}_{\leq k}(y, y') = \frac{\sum_{i:\,(\rho(y'))_i \leq k} \left|(\rho(y'))_i - (\rho(y))_i\right|}{\max_{\pi \in \Pi(\lambda)} \sum_{i=1}^{k} \left|i - \pi^{-1}(i)\right|}, \tag{1}
\]

where Π(λ) denotes the set of permutations of {1, ..., λ} and ρ(y) denotes the ordering of the components of y, i.e. (∀y ∈ ℝ^λ) ρ(y) ∈ Π(λ) & (ρ(y))_i < (ρ(y))_j ⇒ y_i ≤ y_j. Because the CMA-ES algorithm uses the surrogate model to select the most promising candidates for true evaluation, the metric considers only the k best samples. The range of the RDE metric is [0, 1]; it equals 0 for the exact ordering and 1 for the reverse order. A minimal computational sketch of (1) is given after the notation list below.

The algorithm DTS-CMA-ES is described in Algorithm 1, using the following notation:

• A for an archive – a set of points that have already been evaluated by the true black-box objective function BB;
• d_{σ²C} for the Mahalanobis distance given by σ²C:
\[
d_{\sigma^2 C}(x, x') = \sqrt{(x - x')^\top \sigma^{-2} C^{-1} (x - x')}; \tag{2}
\]
• N_k(x; A) for the set of the k d_{σ²C}-nearest neighbours of x ∈ ℝ^d with respect to the archive A;
• f_i(x₁, ..., x_λ) = (f_i(x₁), ..., f_i(x_λ)) for i = 1, 2;
• T_h for the training set selection:
\[
T_h = \bigcup_{j=1}^{\lambda} \{x \in N_h(x_j; A) \mid d_{\sigma^2 C}(x, x_j) < r_{\max}\} \tag{3}
\]
with r_max > 0, for h = 1, ..., |A|;
• k(A) = max{h : |T_h| ≤ N_max}, with N_max ∈ ℕ;
• ρ_PoI for the decreasing ordering of f₁(x₁), ..., f₁(x_λ) according to the probability of improvement with respect to the lowest BB value found so far,
\[
i < j \;\Rightarrow\; \mathrm{PoI}\big((\rho_{\mathrm{PoI}}(f_1(x_1, \ldots, x_\lambda)))_i; V\big) \geq \mathrm{PoI}\big((\rho_{\mathrm{PoI}}(f_1(x_1, \ldots, x_\lambda)))_j; V\big), \tag{4}
\]
where V = min_{x∈A} BB(x).
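To make the definition (1) concrete, the following minimal Python sketch evaluates the RDE of a vector of surrogate predictions with respect to the true values. It is only an illustration of the formula, not the DTS-CMA-ES implementation: the normalizing maximum in the denominator is found by brute force over all permutations of {1, ..., λ}, which is exponential in λ and therefore usable only for small populations, and the function names and toy population are ours.

```python
import itertools
import numpy as np


def ranking(y):
    """(rho(y))_i = rank of y_i within y (1 = smallest component)."""
    y = np.asarray(y)
    return np.argsort(np.argsort(y)) + 1


def rde(y, y_ref, k):
    """Ranking difference error of y with respect to y_ref, Eq. (1).

    Only the k best-ranked components of y_ref enter the error; the result
    lies in [0, 1]: 0 for identical ordering, 1 for the worst possible one.
    The normalizing maximum is computed by brute force over all permutations,
    which is exponential in lambda and meant for small illustrative cases only.
    """
    lam = len(y)
    rho, rho_ref = ranking(y), ranking(y_ref)
    numerator = sum(abs(int(rho_ref[i]) - int(rho[i]))
                    for i in range(lam) if rho_ref[i] <= k)
    denominator = max(
        sum(abs(i - (pi.index(i) + 1)) for i in range(1, k + 1))
        for pi in itertools.permutations(range(1, lam + 1))
    )
    return numerator / denominator


# Example: surrogate predictions vs. true values for a population of 6 points
y_true = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 2.0])
y_pred = np.array([2.9, 1.2, 3.7, 1.1, 5.3, 2.5])
print(rde(y_pred, y_true, k=3))    # small error, the two orderings almost agree
print(rde(-y_true, y_true, k=3))   # fully reversed ordering gives 1.0
```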
Algorithm 1: DTS-CMA-ES

Require: x₁, ..., x_λ ∈ ℝ^d, µ, A, σ and C – step size and matrix from the CMA-ES distribution, N_max ∈ ℕ such that N_max ≥ λ, r_max > 0, β, ε_min, ε_max, ε_initial, α_min, α_max, α_initial ∈ (0, 1)
1: if this is the 1st call of the algorithm in the current CMA-ES run then
2:   α = α_initial, ε = ε_initial
3: else
4:   take over the returned values of α, ε from its previous call in the run
5: end if
6: Train a Gaussian process f₁ on T_{k(A)}, estimating m_GP, σ_n, σ_f, ℓ through maximization of the likelihood (7)
7: Evaluate BB(x_j) for x_j not yet BB-evaluated and such that (ρ_PoI(f₁(x₁, ..., x_λ)))_j ≤ ⌈αλ⌉
8: Update A to A ∪ {x_j | (ρ_PoI(f₁(x₁), ..., f₁(x_λ)))_j ≤ ⌈αλ⌉}
9: Train a Gaussian process f₂ on T_{k(A)}, estimating m_GP, σ_n, σ_f, ℓ through maximization of the likelihood (7)
10: For x_j such that (ρ_PoI(f₁(x₁, ..., x_λ)))_j ≤ ⌈αλ⌉, update f₂(x_j) = BB(x_j)
11: Update ε to β RDE_µ(f₁(x₁, ..., x_λ), f₂(x₁, ..., x_λ)) + (1 − β)ε and α to α_min + max(0, min(1, (ε − ε_min)/(ε_max − ε_min)))
12: For j fulfilling (ρ_PoI(f₁(x₁, ..., x_λ)))_j > ⌈αλ⌉, update f₂(x_j) to f₂(x_j) − min{f₂(x_{j′}) | (ρ_PoI(f₁(x₁, ..., x_λ)))_{j′} > ⌈αλ⌉} + min{f₂(x_{j′}) | (ρ_PoI(f₁(x₁, ..., x_λ)))_{j′} ≤ ⌈αλ⌉}
13: Return f₂(x₁), ..., f₂(x_λ), ε, α

3 Gaussian Processes and Their Integration with Neural Networks

3.1 Gaussian Processes

A Gaussian process on a set X ⊂ ℝ^d, d ∈ ℕ, is a collection of random variables (f(x))_{x∈X}, any finite number of which has a joint Gaussian distribution [47]. It is completely specified by a mean function µ: X → ℝ, typically assumed constant, and by a covariance function κ: X × X → ℝ such that for x, x′ ∈ X,

\[
\mathbb{E} f(x) = \mu, \qquad \mathrm{cov}(f(x), f(x')) = \kappa(x, x'). \tag{5}
\]

Therefore, a GP is often denoted GP(µ, κ(x, x′)) or GP(µ, κ).

The value of f(x) is typically accessible only as a noisy observation y = f(x) + ε, where ε is zero-mean Gaussian noise with a variance σ_n > 0. Then

\[
\mathrm{cov}(y, y') = \kappa(x, x') + \sigma_n^2 I(x = x'), \tag{6}
\]

where I(proposition) equals 1 for a true proposition and 0 for a false one.

Consider now the prediction of the random variable f(x⋆) in a point x⋆ ∈ X if we already know the observations y₁, ..., y_n in points x₁, ..., x_n. Introduce the vectors x = (x₁, ..., x_n)ᵀ, y = (y₁, ..., y_n)ᵀ = (f(x₁) + ε, ..., f(x_n) + ε)ᵀ, k⋆ = (κ(x₁, x⋆), ..., κ(x_n, x⋆))ᵀ and the matrix K ∈ ℝ^{n×n} such that (K)_{i,j} = κ(x_i, x_j) + σ_n² I(i = j). Then the probability density of the vector y of observations is

\[
p(y; \mu, \kappa, \sigma_n^2) = \frac{\exp\left(-\tfrac{1}{2}(y - \mu)^\top K^{-1} (y - \mu)\right)}{\sqrt{(2\pi)^n \det(K)}}, \tag{7}
\]

where det(M) denotes the determinant of a matrix M. Furthermore, as a consequence of the assumption of a Gaussian joint distribution, also the conditional distribution of f(x⋆) conditioned on y is Gaussian, namely

\[
N\left(\mu(x_\star) + k_\star^\top K^{-1}(y - \mu),\; \kappa(x_\star, x_\star) - k_\star^\top K^{-1} k_\star\right). \tag{8}
\]

According to (6), the relationship between the observations y and y′ is determined by the covariance function κ. In the reported research, we have considered 6 kinds of covariance functions, listed below; a small computational sketch follows the list. In their definitions, the notation r = ‖x′ − x‖ is used, and among the parameters of κ, aka hyperparameters of the GP, frequently encountered are σ_f², ℓ > 0, called signal variance and characteristic length scale, respectively. Other parameters are introduced for each covariance function separately.

(i) Linear: κ_LIN(x, x′) = σ₀² + xᵀx′, with a bias σ₀².
(ii) Quadratic, the square of the linear covariance: κ_Q(x, x′) = (σ₀² + xᵀx′)².
(iii) Rational quadratic: κ_RQ(x, x′) = σ_f² (1 + r²/(2αℓ²))^{−α}, with α > 0.
(iv) Squared exponential: κ_SE(x, x′) = σ_f² exp(−r²/(2ℓ²)).
(v) Matérn 5/2: κ_Matérn5/2(x, x′) = σ_f² (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ).
(vi) One composite covariance function, namely the sum of κ_SE and κ_Q: κ_SE+Q(x, x′) = κ_SE(x, x′) + κ_Q(x, x′).
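The following self-contained NumPy sketch makes Eqs. (7)–(8) and two of the listed covariance functions concrete: it builds K with the noise term on the diagonal and returns the posterior mean and variance of Eq. (8) for a constant mean. The hyperparameter values, the toy sphere-like data, and the function names are illustrative assumptions of ours; they do not reproduce the fitted hyperparameters or the GPML/GPyTorch implementations used in the experiments.

```python
import numpy as np


def kernel_se(x, x_prime, sigma_f=1.0, ell=1.0):
    """Squared exponential covariance, item (iv): sigma_f^2 exp(-r^2 / (2 l^2))."""
    r2 = float(np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2))
    return sigma_f ** 2 * np.exp(-r2 / (2.0 * ell ** 2))


def kernel_lin(x, x_prime, sigma_0=1.0):
    """Linear covariance, item (i): sigma_0^2 + x^T x'."""
    return sigma_0 ** 2 + float(np.dot(x, x_prime))


def gp_posterior(X, y, x_star, kernel, mu=0.0, sigma_n=1e-2):
    """Posterior mean and variance of f(x_star) given noisy observations y,
    following Eqs. (7)-(8) with a constant mean mu."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    K += sigma_n ** 2 * np.eye(n)                     # noise on the diagonal
    k_star = np.array([kernel(X[i], x_star) for i in range(n)])
    alpha = np.linalg.solve(K, y - mu)
    mean = mu + k_star @ alpha
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mean, var


# Toy 2-D example: noisy sphere function observed in 10 random points
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(10, 2))
y = np.sum(X ** 2, axis=1) + 0.01 * rng.standard_normal(10)
m, v = gp_posterior(X, y, np.array([0.5, -0.5]),
                    lambda a, b: kernel_se(a, b, sigma_f=10.0, ell=3.0),
                    mu=y.mean())
print(f"predicted mean {m:.3f}, variance {v:.3f}")
```

Replacing the squared exponential by `kernel_lin` (or any other item of the list) changes only the covariance structure; the posterior formulas (7)–(8) stay the same.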
3.2 GP as the Output Layer of a Neural Network

The approach integrating a GP into an ANN as its output layer has been independently proposed in [9] and [52]. It relies on the following two assumptions:

1. If n_I denotes the number of the ANN input neurons, then the ANN computes a mapping net of n_I-dimensional input values into the set X on which the GP is defined. Consequently, the number of neurons in the last hidden layer equals the dimension d, and the ANN maps an input v into a point x = net(v) ∈ X, corresponding to an observation f(x) + ε governed by the GP (Figure 1). From the point of view of the ANN inputs, the GP is now GP(µ(net(v)), κ(net(v), net(v′))).

2. The GP mean µ is assumed to be a known constant, thus not contributing to the GP hyperparameters and independent of net.

[Figure 1: Schema of the integration of a GP into an ANN as its output layer – from bottom to top: input layer (input values v), 1st hidden layer, ..., last hidden layer (x), output GP layer (y, together with the distribution of y). Taken over from [32].]

Due to assumption 2, the only hyperparameters of the GP are the parameters θ^κ of the covariance function. As to the ANN, it depends on the one hand on the vector θ^W of its weights and biases, and on the other hand on the parameters θ^A of its architecture (such as the number of hidden layers, the numbers of neurons in each of them, and the activation functions). Altogether, the integrated GP-ANN model depends on the vector (θ^κ, θ^W, θ^A), which we, in accordance with [9, 52] and with the terminology common for GPs, also call hyperparameters, although in the context of ANNs this term is typically used only for θ^A.

Consider now n inputs to the neural network, v₁, ..., v_n, mapped to the inputs x₁ = net(v₁), ..., x_n = net(v_n) of the GP, corresponding to the observations y = (y₁, ..., y_n)ᵀ. Then the log-likelihood of θ is

\[
L(\theta) = \ln p(y; \mu, \kappa, \sigma_n^2), \tag{9}
\]

where µ is the constant assumed in Assumption 2, and

\[
(K)_{i,j} = \kappa(\mathrm{net}(v_i), \mathrm{net}(v_j)). \tag{10}
\]

To find the combination of values of θ with the maximal likelihood, methods for smooth optimization can be applied only with respect to θ^κ and θ^W, with respect to which L is continuous. With respect to θ^A, we have two possibilities. First, to fix θ^A in advance, thus effectively restricting the hyperparameters of the integrated model to (θ^κ, θ^W). Second, to perform the optimization with respect to θ^A in an outer loop (using, e.g., grid search combined with cross-validation), and for each considered combination of values of θ^A to perform the optimization with respect to (θ^κ, θ^W) in an inner loop.

Finally, let the smooth optimization be performed in the simplest but, in the context of neural networks, also most frequent way – as gradient descent. The partial derivatives forming ∇_{(θ^κ, θ^W)} L can be computed as

\[
\frac{\partial L}{\partial \theta^\kappa_\ell} = \sum_{i,j=1}^{n} \frac{\partial L}{\partial K_{i,j}} \frac{\partial K_{i,j}}{\partial \theta^\kappa_\ell}, \qquad
\frac{\partial L}{\partial \theta^W_\ell} = \sum_{i,j,k=1}^{n} \frac{\partial L}{\partial K_{i,j}} \frac{\partial K_{i,j}}{\partial x_k} \frac{\partial\, \mathrm{net}(v_k)}{\partial \theta^W_\ell}. \tag{11}
\]

In (11), the partial derivatives ∂L/∂K_{i,j}, i, j = 1, ..., n, are components of the matrix derivative ∂L/∂K, for which the calculations of matrix differential calculus [38], together with (7) and (9), yield

\[
\frac{\partial L}{\partial K} = \frac{1}{2}\left(K^{-1} y y^\top K^{-1} - K^{-1}\right). \tag{12}
\]
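The sketch below shows one way the integration of this subsection can be realized in GPyTorch, the library used for our implementation [31]: an MLP `net` maps the inputs into the space on which an exact GP with a constant mean is defined, and (θ^κ, θ^W) are optimized jointly by maximizing the log-likelihood (9), whose gradients (11)–(12) are obtained by automatic differentiation. The class names, the choice of the squared exponential (RBF) kernel, the toy data, the layer sizes, and the iteration count are illustrative assumptions; they do not reproduce the exact configuration of the experiments in Section 4.

```python
import torch
import gpytorch


class FeatureNet(torch.nn.Module):
    """MLP `net' mapping n_I-dimensional inputs into the d-dimensional set X."""
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(n_in, n_hidden), torch.nn.Sigmoid(),
            torch.nn.Linear(n_hidden, n_out), torch.nn.Sigmoid(),
        )

    def forward(self, v):
        return self.layers(v)


class GPANN(gpytorch.models.ExactGP):
    """GP evaluated on net(v), i.e. GP(mu(net(v)), kappa(net(v), net(v')))."""
    def __init__(self, train_v, train_y, likelihood, net):
        super().__init__(train_v, train_y, likelihood)
        self.net = net
        self.mean_module = gpytorch.means.ConstantMean()        # cf. Assumption 2
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, v):
        x = self.net(v)
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))


# Joint gradient-based training of (theta_kappa, theta_W) by maximizing Eq. (9)
v_train = torch.randn(40, 5)                      # hypothetical 5-D inputs
y_train = v_train.pow(2).sum(dim=1)               # sphere-like toy targets
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPANN(v_train, y_train, likelihood, FeatureNet(5, 3, 3))
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(v_train), y_train)          # negative log-likelihood (9)
    loss.backward()                               # autograd realizes Eqs. (11)-(12)
    optimizer.step()
```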
4 Experiments

For the evaluation of the aforementioned models, we used the open-source Python library GPyTorch; our implementation is available at [31].

4.1 Employed Data

As a dataset for comparing different configurations of the combined GP-ANN model, we used DTS-CMA-ES runs recorded in previous experiments. This allowed us to evaluate the surrogate model on its own, without having to perform the whole optimization. The open-source Matlab implementation of DTS-CMA-ES used to obtain the data is available at [4]. The underlying surrogate models were Gaussian processes with 8 different covariance functions implemented using the GPML Toolbox [46]. The optimization runs were collected on the platform for comparing continuous optimizers COCO. The employed black-box optimization benchmark set contains 24 noiseless functions scalable to any input dimension. We used dimensions 2, 3, 5, 10, and 20 in 5 different instances, which consist in random rotation and translation of the input space. Therefore, for each combination of benchmark function and dimension, 40 runs are available. More important than the number of runs, however, is the number of their generations, because data from each generation apart from the first can be used for testing all those surrogate models that could be trained with data from the previous generations. The number of generations in a particular run of the CMA-ES algorithm is unknown before the run and depends on the objective function and its particular instance. The benchmarks and their total numbers of available generations for the individual dimensions are listed in Table 1. We skipped the linear function (benchmark function number 5), which is easy to optimize and for which the recorded runs did not provide enough data samples to train and evaluate the surrogate models.

Table 1: Noiseless benchmark functions of the platform for comparing continuous optimizers (COCO) [11, 16] and the number of available generations of DTS-CMA-ES runs for each of them in each considered dimension.

Benchmark function                          Available generations in dimension
                                                2      3      5     10     20
Separable
 1. Sphere                                   4689   6910  11463  17385  25296
 2. Separable Ellipsoid                      6609   9613  15171  25994  55714
 3. Separable Rastrigin                      7688  11308  17382  27482  42660
 4. Büche-Rastrigin                          8855  13447  22203  31483  49673
Moderately Ill-Conditioned
 6. Attractive Sector                       16577  25200  38150  45119  72795
 7. Step Ellipsoid                           7103   9816  24112  34090  56340
 8. Rosenbrock                               7306  11916  21191  32730  71754
 9. Rotated Rosenbrock                       7687  12716  24084  35299  71017
Highly Ill-Conditioned
10. Ellipsoid with High Conditioning         6691   9548  15867  25327  59469
11. Discus                                   6999   9657  15877  25478  45181
12. Bent Cigar                              10369  18059  28651  34605  56528
13. Sharp Ridge                              7760  11129  20346  30581  48154
14. Different Powers                         6653  10273  17693  31590  61960
Multi-modal with Global Structure
15. Non-separable Rastrigin                  7855  11476  19374  28986  44446
16. Weierstrass                              9294  13617  24158  27628  40969
17. Schaffers F7                             9031  13960  24244  34514  56247
18. Ill-Conditioned Schaffers F7             9598  13404  25802  31609  53836
19. Composite Griewank-Rosenbrock            9147  16268  24456  34171  53536
Multi-modal Weakly Structured
20. Schwefel                                 9081  13676  24219  33753  53104
21. Gallagher's Gaussian 101-me Points       7645  12199  18208  25366  43186
22. Gallagher's Gaussian 21-hi Points        7629  11086  17881  26626  44971
23. Katsuura                                 8751  11233  17435  25030  37366
24. Lunacek bi-Rastrigin                     8983  13966  19405  29762  44420

4.2 Experimental Setup

We compare the combinations of neural networks with Gaussian processes using the six different covariance functions listed in Subsection 3.1. All models were trained on the same set of training data T_{k(A)} as was used in steps 6 and 9 of Algorithm 1. Due to the condition (3), this set is rather restricted and allows training only a restricted ANN if overfitting is to be prevented. Therefore, we decided to use a multilayer perceptron with a single hidden layer, thus a topology (n_I, n_H, n_O), where n_I is the dimension of the training data, i.e. n_I ∈ {2, 3, 5, 10, 20}, and

\[
n_H = n_O = \begin{cases} 2 & \text{if } n_I = 2, \\ 3 & \text{if } n_I \in \{3, 5\}, \\ 5 & \text{if } n_I \in \{10, 20\}. \end{cases} \tag{13}
\]

As the activation function for both the hidden and the output layer, we chose the logistic sigmoid. We trained the weights and biases of the neural network together with the parameters of the Gaussian process, as proposed in [52] and outlined in Subsection 3.2. As a loss function, we used the Gaussian log-likelihood and optimized the parameters with Adam [30] for a maximum of 1000 iterations. We also kept a 10 % validation set out of the training data to monitor overfitting, and we selected the model with the lowest L2 validation error during the training.
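As a small illustration of this setup, the sketch below encodes the topology rule (13) and the random 10 % hold-out used for model selection. The function names and the seed are ours; the actual training loop (Adam on the negative log-likelihood, keeping the parameters with the lowest validation L2 error) is assumed to be built around these helpers, e.g. as in the GPyTorch sketch of Subsection 3.2.

```python
import numpy as np


def hidden_layer_size(n_in):
    """Number of hidden (= output) neurons of the MLP, Eq. (13)."""
    if n_in == 2:
        return 2
    if n_in in (3, 5):
        return 3
    if n_in in (10, 20):
        return 5
    raise ValueError("dimension not used in the experiments")


def validation_split(n_samples, fraction=0.1, seed=0):
    """Random hold-out used to monitor overfitting during training;
    returns (training indices, validation indices)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = max(1, int(round(fraction * n_samples)))
    return perm[n_val:], perm[:n_val]


train_idx, val_idx = validation_split(120)
print(hidden_layer_size(10), len(train_idx), len(val_idx))   # -> 5 108 12
```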
4.3 Results

To evaluate the different covariance functions, we used the recorded data in a similar way as they would be used by the DTS-CMA-ES algorithm. We took each generation of points except the first one and used it as testing data. Every point evaluated by the true objective function in all previous generations is then available as a training sample. We filtered these samples using the training set selection method T_{k(A)} and trained the surrogate model on the result.

This way, we evaluated six different models on every generation of samples listed in Table 1. The results are presented first in a function-method view in Table 4, then in a dimension-method view in Table 5. The metric used in both tables is the RDE, which shows how precise the model is in the ordering of the predicted values; therefore, the lower the error value, the better. The first table contains results for every benchmark function separately, averaged over every input dimension, function instance and DTS-CMA-ES generation, as well as the average over the whole group of functions. The second table provides a view of how the dimension of the input space affects the approximation error; the results for a particular dimension are averaged over all functions in a specific group, instances, and generations. We also visualized the summarized RDE values for the function groups with boxplots in Figure 2.

Moreover, we verified the results by performing multiple comparison tests. A non-parametric Friedman test was conducted on RDEs across all results for particular functions and function types in Table 4, and for particular combinations of dimensions and function types in Table 5. If the null hypothesis of the equality of all six considered methods was rejected, the Wilcoxon signed-rank test was performed for all pairs of covariance functions, and its results were corrected for multiple hypotheses testing using the Holm method. We summarized the results of statistical testing in Tables 2 and 3.
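The testing protocol can be sketched with SciPy as follows. The synthetic RDE values, the 0.05 significance level, and the helper names are assumptions of ours standing in for the recorded per-function (or per-dimension and function-group) RDE averages; the Holm step-down correction is implemented explicitly to keep the sketch self-contained.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def holm_correction(p_values):
    """Holm's step-down adjustment of a family of p-values."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# rde[kernel] = vector of RDE values over the compared cases (one per function,
# or per dimension x function-group combination); random numbers stand in here
rng = np.random.default_rng(1)
kernels = ["LIN", "Q", "RQ", "SE", "Mat", "SE+Q"]
rde = {k: rng.uniform(0.1, 0.6, size=23) for k in kernels}

# Friedman test of the equality of all six covariance functions
stat, p_friedman = stats.friedmanchisquare(*(rde[k] for k in kernels))

if p_friedman < 0.05:
    pairs = list(combinations(kernels, 2))
    p_raw = [stats.wilcoxon(rde[a], rde[b]).pvalue for a, b in pairs]
    p_adj = holm_correction(p_raw)
    for (a, b), p in zip(pairs, p_adj):
        print(f"{a} vs {b}: adjusted p = {p:.3f}")
```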
Table 2: Statistical comparison of the six covariance functions from the function-method view in Table 4. Each entry shows for how many functions (23 in total) the kernel in the row was significantly better than the one in the column.

        κLIN  κQ  κRQ  κSE  κMat  κSE+Q    Σ
κLIN      –   20   19   21    20     20  100
κQ        0    –    3    2     3     16   24
κRQ       1    5    –    1     1     13   21
κSE       2    6    1    –     1     14   24
κMat      2    6    2    0     –     13   23
κSE+Q     2    1    2    0     1      –    6

Table 3: Statistical comparison of the six covariance functions from the dimension-method view in Table 5. Each entry shows for how many combinations of input dimension and function group (25 in total) the kernel in the row was significantly better than the one in the column.

        κLIN  κQ  κRQ  κSE  κMat  κSE+Q    Σ
κLIN      –   25   23   21    20     24  113
κQ        0    –    4    0     3     16   23
κRQ       0    4    –    2     1     13   20
κSE       0    4    5    –     5     13   27
κMat      0    2    3    1     –      9   15
κSE+Q     0    0    0    0     0      –    0

Table 4: Comparison of average RDE values for the six covariance functions depending on a particular benchmark function. The values are averaged over dimensions and instances; the "all" rows average over the functions of the group.

Function group / function              κLIN    κQ     κRQ    κSE    κMat   κSE+Q
Separable Functions
   1                                   0.242  0.528  0.486  0.469  0.476  0.591
   2                                   0.128  0.386  0.475  0.486  0.503  0.491
   3                                   0.300  0.452  0.449  0.462  0.443  0.546
   4                                   0.352  0.480  0.459  0.462  0.462  0.548
   all                                 0.255  0.461  0.467  0.469  0.471  0.544
Moderate conditioning
   6                                   0.351  0.446  0.410  0.451  0.451  0.421
   7                                   0.202  0.461  0.485  0.476  0.499  0.529
   8                                   0.330  0.463  0.410  0.406  0.460  0.527
   9                                   0.326  0.465  0.463  0.435  0.430  0.530
   all                                 0.302  0.458  0.441  0.442  0.459  0.501
High conditioning and unimodal
  10                                   0.142  0.458  0.505  0.457  0.484  0.487
  11                                   0.147  0.417  0.537  0.527  0.525  0.488
  12                                   0.284  0.436  0.506  0.498  0.510  0.451
  13                                   0.216  0.434  0.455  0.418  0.433  0.532
  14                                   0.289  0.498  0.452  0.405  0.416  0.551
   all                                 0.215  0.448  0.491  0.461  0.473  0.501
Multi-modal, adequate global structure
  15                                   0.321  0.462  0.467  0.446  0.427  0.545
  16                                   0.551  0.530  0.516  0.486  0.489  0.504
  17                                   0.391  0.450  0.487  0.474  0.483  0.550
  18                                   0.380  0.486  0.489  0.472  0.480  0.553
  19                                   0.400  0.540  0.524  0.521  0.526  0.536
   all                                 0.408  0.493  0.496  0.479  0.480  0.537
Multi-modal, weak global structure
  20                                   0.213  0.508  0.467  0.456  0.472  0.547
  21                                   0.387  0.470  0.467  0.476  0.468  0.532
  22                                   0.347  0.480  0.484  0.467  0.475  0.532
  23                                   0.589  0.589  0.537  0.544  0.547  0.549
  24                                   0.448  0.492  0.509  0.516  0.506  0.506
   all                                 0.396  0.507  0.492  0.491  0.493  0.533

Table 5: Comparison of average RDE values for the six covariance functions depending on the type of benchmark function and the input space dimension. The values are averaged over the functions in the particular group and over instances (SEP – separable, MOD – moderately ill-conditioned, HC – highly ill-conditioned, MMA – multi-modal with adequate global structure, MMW – multi-modal weakly structured).

Dimension / group    κLIN    κQ     κRQ    κSE    κMat   κSE+Q
Dimension 2
  SEP                0.252  0.452  0.448  0.445  0.418  0.581
  MOD                0.341  0.450  0.450  0.483  0.455  0.471
  HC                 0.204  0.449  0.459  0.474  0.441  0.502
  MMA                0.416  0.485  0.472  0.475  0.484  0.550
  MMW                0.443  0.517  0.419  0.471  0.431  0.522
  all                0.334  0.473  0.450  0.470  0.447  0.525
Dimension 3
  SEP                0.194  0.439  0.414  0.455  0.447  0.527
  MOD                0.262  0.449  0.383  0.407  0.429  0.512
  HC                 0.174  0.425  0.425  0.451  0.464  0.489
  MMA                0.353  0.486  0.449  0.461  0.453  0.536
  MMW                0.387  0.512  0.467  0.485  0.464  0.527
  all                0.278  0.464  0.430  0.454  0.453  0.518
Dimension 5
  SEP                0.227  0.445  0.473  0.479  0.478  0.526
  MOD                0.260  0.458  0.444  0.458  0.468  0.502
  HC                 0.202  0.439  0.513  0.489  0.486  0.492
  MMA                0.373  0.481  0.491  0.500  0.482  0.517
  MMW                0.363  0.513  0.496  0.504  0.499  0.532
  all                0.289  0.468  0.486  0.487  0.484  0.514
Dimension 10
  SEP                0.278  0.479  0.476  0.500  0.494  0.536
  MOD                0.285  0.460  0.453  0.417  0.443  0.502
  HC                 0.206  0.464  0.515  0.438  0.478  0.507
  MMA                0.422  0.498  0.525  0.499  0.478  0.549
  MMW                0.363  0.500  0.530  0.525  0.530  0.549
  all                0.313  0.481  0.503  0.477  0.486  0.529
Dimension 20
  SEP                0.327  0.493  0.525  0.469  0.518  0.551
  MOD                0.362  0.475  0.479  0.447  0.504  0.522
  HC                 0.293  0.467  0.543  0.454  0.500  0.519
  MMA                0.478  0.518  0.546  0.464  0.508  0.537
  MMW                0.429  0.496  0.551  0.475  0.544  0.536
  all                0.381  0.490  0.531  0.462  0.515  0.533

[Figure 2: Visualization of the distribution of RDE values for each group of functions (panels: Separable, Moderately Ill-Conditioned, Highly Ill-Conditioned, Multi-modal with Global Structure, Multi-modal Weakly Structured; one boxplot per covariance function κLIN, κQ, κRQ, κSE, κMat, κSE+Q). Each boxplot corresponds to the aggregated mean value in the grey rows of Table 4.]

4.4 Discussion and Further Research

The results show that the combined model performs best with the simplest, linear kernel. In paper [32], we compared the GP-ANN with a linear kernel to pure Gaussian processes. We found that if we compare the combined GP-ANN and the GP, both with linear kernels, the neural extension can bring better results in some cases. However, using more complex covariance functions with the GP alone is still better. Therefore, we tried to apply them also to the GP-ANN combination. Unfortunately, the results with the other kernels ended up much worse.

In our future research, we would like to try to overcome this. We want to systematically investigate different ANN topologies, including the direction of deep Gaussian processes [8, 12, 22, 23], in which only the topology is taken over from an ANN, but all neurons are replaced by GPs. Moreover, in addition to the selection of the training set used in DTS-CMA-ES and described in Algorithm 1, we want to consider alternative ways of training set selection, allowing larger networks to be trained. Finally, we intend to perform research into transfer learning of surrogate models: an ANN-GP model with a deep neural network will be trained on data from many optimization runs, such as those employed in this paper, and the model used in a new run of the same optimizer will then be obtained through additional fine-tuning restricted only to the GP and the last 1–2 layers of the ANN.

We would also like to try to change the way the model is trained. In paper [52], the parameters of the Gaussian process are learned together with the parameters of the neural network. We think that this might not work well in the domain of surrogate modeling in black-box optimization. Therefore, we will try to train the network and the Gaussian process separately.

5 Conclusion

In this paper, we examined an extension of Gaussian processes used in surrogate modeling. At the beginning, we described the CMA-ES algorithm for black-box optimization and its surrogate-assisted variant DTS-CMA-ES. Then we outlined Gaussian processes, various covariance functions, and their neural network extension. The presented research is our first attempt at applying the combination of Gaussian processes with neural networks as a surrogate model in black-box optimization. We implemented the combined model and compared the results obtained with six different covariance functions. The results presented in the previous section lead to the conclusion that the combined model yields the best results with the simplest, linear kernel. The other covariance functions produce much larger errors, and the worst performing one seems to be the composite kernel κSE+Q. Unfortunately, the results turned out to be worse than using Gaussian processes alone; possible ideas for further improvement are therefore discussed in Subsection 4.4.
Acknowledgement

The research reported in this paper has been supported by SVV project number 260 575 and was also partially supported by the Czech Science Foundation (GAČR) grant 18-18080S. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
References

[1] A. Auger, D. Brockhoff, and N. Hansen. Benchmarking the local metamodel CMA-ES on the noiseless BBOB'2013 test bed. In GECCO'13, pages 1225–1232, 2013.
[2] A. Auger, M. Schoenauer, and N. Vanhaecke. LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. In Parallel Problem Solving from Nature – PPSN VIII, pages 182–191, 2004.
[3] M. Baerns and M. Holeňa. Combinatorial Development of Solid Catalytic Materials. Design of High-Throughput Experiments, Data Analysis, Data Mining. Imperial College Press / World Scientific, London, 2009.
[4] L. Bajer and Z. Pitra. Surrogate CMA-ES. https://github.com/bajeluk/surrogate-cmaes, 2021.
[5] L. Bajer, Z. Pitra, and M. Holeňa. Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In GECCO'15 Companion, pages 1143–1150, 2015.
[6] L. Bajer, Z. Pitra, J. Repický, and M. Holeňa. Gaussian process surrogate models for the CMA evolution strategy. Evolutionary Computation, 27:665–697, 2019.
[7] A.J. Booker, J. Dennis, P.D. Frank, D.B. Serafini, V. Torczon, and M. Trosset. A rigorous framework for optimization by surrogates. Structural and Multidisciplinary Optimization, 17:1–13, 1999.
[8] T. Bui, D. Hernandez-Lobato, J. Hernandez-Lobato, Y. Li, and R. Turner. Deep Gaussian processes for regression using approximate expectation propagation. In ICML, pages 1472–1481, 2016.
[9] R. Calandra, J. Peters, C.E. Rasmussen, and M.P. Deisenroth. Manifold Gaussian processes for regression. In IJCNN, pages 3338–3345, 2016.
[10] S.M. Clarke, J.H. Griebisch, and T.W. Simpson. Analysis of support vector regression for approximation of complex engineering analyses. Journal of Mechanical Design, 127:1077–1087, 2005.
[11] The COCO platform, 2016. http://coco.gforge.inria.fr.
[12] K. Cutajar, E.V. Bonilla, P. Michiardi, and M. Filippone. Random feature expansions for deep Gaussian processes. In ICML, pages 884–893, 2017.
[13] M.A. El-Beltagy, P.B. Nair, and A.J. Keane. Metamodeling techniques for evolutionary optimization of computationally expensive problems: Promises and limitations. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 196–203. Morgan Kaufmann Publishers, 1999.
[14] M. Emmerich, K. Giannakoglou, and B. Naujoks. Single- and multi-objective evolutionary optimization assisted by Gaussian random field metamodels. IEEE Transactions on Evolutionary Computation, 10:421–439, 2006.
[15] M. Emmerich, A. Giotis, M. Özdemir, T. Bäck, and K. Giannakoglou. Metamodel-assisted evolution strategies. In PPSN VII, pages 361–370. ACM, 2002.
[16] S. Finck, N. Hansen, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2010: Presentation of the noisy functions. Technical report, INRIA, Paris Saclay, 2010.
[17] A. Forrester, A. Sobester, and A. Keane. Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley and Sons, New York, 2008.
[18] H.M. Gutmann. A radial basis function method for global optimization. Journal of Global Optimization, 19:201–227, 2001.
[19] N. Hansen. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
[20] N. Hansen. A global surrogate assisted CMA-ES. In GECCO'19, pages 664–672, 2019.
[21] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9:159–195, 2001.
[22] A. Hebbal, L. Brevault, M. Balesdent, E.G. Talbi, and N. Melab. Efficient global optimization using deep Gaussian processes. In IEEE CEC, pages 1–12, 2018.
[23] G. Hernández-Muñoz, C. Villacampa-Calvo, and D. Hernández-Lobato. Deep Gaussian processes using expectation propagation and Monte Carlo methods. In ECML PKDD, pages 1–17, paper no. 128, 2020.
[24] S. Hosder, L. Watson, and B. Grossman. Polynomial response surface approximations for the multidisciplinary design optimization of a high speed civil transport. Optimization and Engineering, 2:431–452, 2001.
[25] Y. Jin, M. Hüsken, M. Olhofer, and B. Sendhoff. Neural networks for fitness approximation in evolutionary optimization. In Y. Jin, editor, Knowledge Incorporation in Evolutionary Computation, pages 281–306. Springer, 2005.
[26] Y. Jin, M. Olhofer, and B. Sendhoff. Managing approximate models in evolutionary aerodynamic design optimization. In CEC 2001, pages 592–599, 2001.
[27] Y. Jin, M. Olhofer, and B. Sendhoff. A framework for evolutionary optimization with approximate fitness functions. IEEE Transactions on Evolutionary Computation, 6:481–494, 2002.
[28] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
[29] S. Kern, N. Hansen, and P. Koumoutsakos. Local metamodels for optimization using evolution strategies. In PPSN IX, pages 939–948, 2006.
[30] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
[31] J. Koza and J. Tumpach. Surrogate networks. https://github.com/c0zzy/surrogate-networks, 2021.
[32] J. Koza, J. Tumpach, Z. Pitra, and M. Holeňa. Using past experience for configuration of Gaussian processes in black-box optimization. In 15th Learning and Intelligent Optimization Conference, accepted for publication, 2021.
[33] J.W. Kruisselbrink, M.T.M. Emmerich, A.H. Deutz, and T. Bäck. A robust optimization approach using kriging metamodels for robustness approximation in the CMA-ES. In IEEE CEC, pages 1–8, 2010.
[34] S.J. Leary, A. Bhaskar, and A.J. Keane. A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. Journal of Global Optimization, 30:39–58, 2004.
[35] J. Lee, Y. Bahri, R. Novak, S.S. Schoenholz, J. Pennington, et al. Deep neural networks as Gaussian processes. In ICLR, pages 1–17, 2018.
[36] I. Loshchilov, M. Schoenauer, and M. Sebag. Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In GECCO'13, pages 439–446, 2013.
[37] J. Lu, B. Li, and Y. Jin. An evolution strategy assisted by an ensemble of local Gaussian process models. In GECCO'13, pages 447–454, 2013.
[38] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, Chichester, 2007.
[39] A.G.G. Matthews, J. Hron, M. Rowland, and R.E. Turner. Gaussian process behaviour in wide deep neural networks. In ICLR, pages 1–15, 2019.
[40] R.H. Myers, D.C. Montgomery, and C.M. Anderson-Cook. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley and Sons, Hoboken, 2009.
[41] R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, et al. Bayesian deep convolutional networks with many channels are Gaussian processes. In ICLR, pages 1–35, 2019.
[42] Y.S. Ong, P.B. Nair, A.J. Keane, and K.W. Wong. Surrogate-assisted evolutionary optimization frameworks for high-fidelity engineering design problems. In Y. Jin, editor, Knowledge Incorporation in Evolutionary Computation, pages 307–331. Springer, 2005.
[43] Z. Pitra, M. Hanuš, J. Koza, J. Tumpach, and M. Holeňa. Interaction between model and its evolution control in surrogate-assisted CMA evolution strategy. In GECCO'21, paper no. 358, 2021.
[44] Z. Pitra, J. Repický, and M. Holeňa. Boosted regression forest for the doubly trained surrogate covariance matrix adaptation evolution strategy. In ITAT 2018, pages 72–79, 2018.
[45] K. Rasheed, X. Ni, and S. Vattam. Methods for using surrogate models to speed up genetic algorithm optimization: Informed operators and genetic engineering. In Y. Jin, editor, Knowledge Incorporation in Evolutionary Computation, pages 103–123. Springer, 2005.
[46] C.E. Rasmussen and N. Hansen. GPML 4.0 Matlab toolbox. http://www.gaussianprocess.org/gpml/code/matlab/doc/.
[47] C.E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, 2006.
[48] A. Ratle. Kriging as a surrogate fitness landscape in evolutionary optimization. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 15:37–49, 2001.
[49] H. Ulmer, F. Streichert, and A. Zell. Evolution strategies assisted by Gaussian processes with improved pre-selection criterion. In IEEE CEC, pages 692–699, 2003.
[50] University of California in Irvine. Repository of machine learning databases. http://www.ics.uci.edu/~mlearn, 2016.
[51] V. Volz, G. Rudolph, and B. Naujoks. Investigating uncertainty propagation in surrogate-assisted evolutionary algorithms. In GECCO'17, pages 881–888, 2017.
[52] A.G. Wilson, Z. Hu, R. Salakhutdinov, and E.P. Xing. Deep kernel learning. In ICAIS, pages 370–378, 2016.
[53] Z.Z. Zhou, Y.S. Ong, P.B. Nair, A.J. Keane, and K.Y. Lum. Combining global and local surrogate models to accelerate evolutionary optimization. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 37:66–76, 2007.