Combining Gaussian Processes and Neural Networks in Surrogate Modeling for Covariance Matrix Adaptation Evolution Strategy

Jan Koza¹, Jiří Tumpach², Zbyněk Pitra³, and Martin Holeňa⁴

¹ Czech Technical University, Faculty of Information Technology, Prague, kozajan@fit.cvut.cz
² Charles University, Faculty of Mathematics and Physics, Prague, tumpach@cs.cas.cz
³ Czech Technical University, Faculty of Nuclear Sciences and Physical Engineering, Prague, z.pitra@gmail.com
⁴ Academy of Sciences, Institute of Computer Science, Prague, martin@cs.cas.cz

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: This paper focuses on surrogate models for the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) in continuous black-box optimization. Surrogate modeling has proven able to decrease the number of evaluations of the objective function, an important requirement in real-world applications where the evaluation can be costly or time-demanding. Surrogate models achieve this by providing an approximation instead of an evaluation of the true objective function. One of the state-of-the-art models for this task is the Gaussian process. We present an approach to combining Gaussian processes with artificial neural networks, which was previously successfully applied in other machine learning domains. The experimental part employs data recorded from previous CMA-ES runs, allowing us to assess different settings of surrogate models without running the whole CMA-ES algorithm. The data were collected using 24 noiseless benchmark functions of the platform for comparing continuous optimizers COCO in 5 different dimensions. Overall, we used data samples from over 2.8 million generations of CMA-ES runs. The results examine and statistically compare six covariance functions of Gaussian processes with the neural network extension. So far, the combined model has not outperformed the Gaussian process alone. We therefore conclude by discussing possible reasons for this and ideas for future research.

Keywords: black-box optimization, surrogate modeling, artificial neural networks, Gaussian processes, covariance functions

1 Introduction

In evolutionary black-box optimization, increasing attention is paid to tasks with expensive evaluation of the black-box objective function. It is immaterial whether that expensiveness is due to time-consuming computation, as in long simulations [17], or due to evaluation through costly experiments, as in some areas of science [3]. To deal with such expensive evaluations, black-box optimization adopted in the late 1990s and early 2000s an approach called surrogate modeling or metamodeling [7, 13, 15, 34, 42, 45, 48].

Basically, a surrogate model in continuous black-box optimization is any regression model that approximates the true black-box objective function with sufficient fidelity and replaces it in some of its evaluations. Among the regression models most frequently used to this end, we are most interested in Gaussian processes (GPs) [47], because they estimate not only the expected value of the true objective function but the whole distribution of its values. Apart from regression, they are also encountered in classification, and they play a key role in Bayesian optimization [22, 28], where the distribution of values provides selection criteria that are alternatives to the objective function value, such as the probability of improvement or the expected improvement.

The importance of GPs in machine learning incited attempts to integrate them with the leading learning paradigm of the last decades – neural learning, including deep learning. The attractiveness of this research direction is further supported by recent theoretical results concerning relationships between asymptotic properties of important kinds of artificial neural networks (ANNs) and properties of GPs [35, 39, 41]. The integration of GPs with neural learning has been proposed on two different levels:

(i) proper integration of an ANN with a GP, in which the GP forms the final output layer of the ANN [9, 52];
(ii) only a transfer of the layered structure, which is a crucial feature, to the GP context, leading to the concept of deep GPs (DGPs) [8, 12, 22, 23].

The recalled research into the integration of GPs with neural learning has used regression data [8, 9, 12, 23, 52], mostly from the UCI Machine Learning Repository [50], but also data concerning locomotion of walking bipedal robots [9], and face patches and handwritten digits [52]. In [12, 23], classification data were used as well. However, we are aware of only one application of such an integration, in particular of DGPs, to optimization, namely to two very easy 1- and 2-dimensional problems [22]. Hence, there is a gap between the importance of GPs in Bayesian optimization and the missing investigation of the suitability of integrated GP-ANN models for surrogate modeling in black-box optimization. That gap motivated the research reported in this paper.

We have performed this novel kind of investigation of GP-ANN integration using data from more than 5000 runs using GPs as Bayesian optimizers in black-box optimization, i.e. optimization of pointwise observable functions for which no analytic expression is available and the function values have to be obtained either empirically, e.g. through measuring or experiments, or through numerical simulations. Based on those data, our investigation addressed two research questions:

1. Does the integration of a GP with neural learning have an added value compared with employing a GP alone or an ANN alone?
2. What is the impact of each of the considered GP covariance functions on the result of GP-ANN integration?

The next section introduces the principles of surrogate modeling as well as of the evolutionary optimization method Covariance Matrix Adaptation Evolution Strategy (CMA-ES), in the context of which our research has been performed. In Section 3, the fundamentals of GPs are recalled and the method we have used for their integration with neural networks is explained. The core part of the paper is Section 4, which attempts to provide experimental answers to the above research questions.
2 Surrogate Modeling in Black-Box Optimization

Basically, the purpose of surrogate modeling – to approximate an unknown functional dependence – coincides with the purpose of response surface modeling in the design of experiments [24, 40]. It is therefore not surprising that typical response surface models, i.e. low-order polynomials, also belong to the most traditional and most successful surrogate models [1, 2, 20, 29, 45]. Other frequently used kinds of surrogate models are artificial neural networks of the multilayer perceptron (MLP) or radial basis function network kind [18, 25–27, 42, 53], and Gaussian processes, in surrogate modeling also known as kriging [6, 7, 13–15, 33, 34, 37, 49, 51]. Occasionally encountered are support vector regression [10, 36] and random forests [5, 44].

2.1 CMA-ES and Its Surrogate-Assisted Variant DTS-CMA-ES

The CMA-ES algorithm performs unconstrained optimization on ℝ^d by means of iterative sampling of populations of size λ from a d-dimensional Gaussian distribution N(m, σ²C), and uses a given parent number µ of the sampled points corresponding to the lowest objective function values to update the parameters of that distribution. Hence, it updates the expected value m, which is used as the current point estimate of the function optimum, the matrix C, and the step-size σ. The CMA-ES is invariant with respect to strictly increasing transformations of the objective function. Hence, to make use of the evaluations of the objective function in a set of points, it needs to know only the ordering of those evaluations. Details of the algorithm can be found in [19, 21].

During the more than 20 years of CMA-ES existence, a number of surrogate-assisted variants of this algorithm have been proposed; a survey can be found in [6, 43]. Here, we pay attention only to the most recent GP-based variant among them, the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES) [6]. It employs two GPs f₁ and f₂, trained consecutively, to find an evaluation of the population x₁, ..., x_λ, with f₁ used for active learning of training data for f₂. Due to the CMA-ES invariance with respect to strictly increasing transformations, it evaluates the difference between predictions only according to the difference in the ordering of those predictions, using the ranking difference error (RDE). The RDE of y ∈ ℝ^λ with respect to y′ ∈ ℝ^λ, considering the k best components, is defined as

\[
\mathrm{RDE}_{\leq k}(y, y') = \frac{\sum_{i:\,(\rho(y'))_i \leq k} \left|(\rho(y'))_i - (\rho(y))_i\right|}{\max_{\pi \in \Pi(\lambda)} \sum_{i=1}^{k} \left|i - \pi^{-1}(i)\right|}, \tag{1}
\]

where Π(λ) denotes the set of permutations of {1, ..., λ} and ρ(y) denotes the ordering of the components of y, i.e. (∀y ∈ ℝ^λ) ρ(y) ∈ Π(λ) & (ρ(y))_i < (ρ(y))_j ⇒ y_i ≤ y_j. Because the CMA-ES algorithm uses the surrogate model to select the most promising candidates for true evaluation, the metric considers only the k best samples. The range of the RDE metric is [0, 1]; it equals 0 for the exact ordering and 1 for the reverse order. A minimal computational sketch of (1) is given after the notation list below.

The algorithm DTS-CMA-ES is described in Algorithm 1, using the following notation:

• A for an archive – a set of points that have already been evaluated by the true black-box objective function BB;
• d_{σ²C} for the Mahalanobis distance given by σ²C:
\[
d_{\sigma^2 C}(x, x') = \sqrt{(x - x')^\top \sigma^{-2} C^{-1} (x - x')}; \tag{2}
\]
• N_k(x; A) for the set of the k d_{σ²C}-nearest neighbours of x ∈ ℝ^d with respect to the archive A;
• f_i(x₁, ..., x_λ) = (f_i(x₁), ..., f_i(x_λ)) for i = 1, 2;
• T_h for the training set selection:
\[
T_h = \bigcup_{j=1}^{\lambda} \{x \in N_h(x_j; A) \mid d_{\sigma^2 C}(x, x_j) < r_{\max}\} \tag{3}
\]
with r_max > 0, for h = 1, ..., |A|;
• k(A) = max{h : |T_h| ≤ N_max}, with N_max ∈ ℕ;
• ρ_PoI for the decreasing ordering of f₁(x₁), ..., f₁(x_λ) according to the probability of improvement with respect to the lowest BB value found so far,
\[
i < j \;\Rightarrow\; \mathrm{PoI}\big((\rho_{\mathrm{PoI}}(f_1(x_1, \ldots, x_\lambda)))_i; V\big) \geq \mathrm{PoI}\big((\rho_{\mathrm{PoI}}(f_1(x_1, \ldots, x_\lambda)))_j; V\big), \tag{4}
\]
where V = min_{x∈A} BB(x).
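To make the definition (1) concrete, the following minimal Python sketch evaluates the RDE of a vector of surrogate predictions with respect to the true values. It is only an illustration of the formula, not the DTS-CMA-ES implementation: the normalizing maximum in the denominator is found by brute force over all permutations of {1, ..., λ}, which is exponential in λ and therefore usable only for small populations, and the function names and toy population are ours.

```python
import itertools
import numpy as np


def ranking(y):
    """(rho(y))_i = rank of y_i within y (1 = smallest component)."""
    y = np.asarray(y)
    return np.argsort(np.argsort(y)) + 1


def rde(y, y_ref, k):
    """Ranking difference error of y with respect to y_ref, Eq. (1).

    Only the k best-ranked components of y_ref enter the error; the result
    lies in [0, 1]: 0 for identical ordering, 1 for the worst possible one.
    The normalizing maximum is computed by brute force over all permutations,
    which is exponential in lambda and meant for small illustrative cases only.
    """
    lam = len(y)
    rho, rho_ref = ranking(y), ranking(y_ref)
    numerator = sum(abs(int(rho_ref[i]) - int(rho[i]))
                    for i in range(lam) if rho_ref[i] <= k)
    denominator = max(
        sum(abs(i - (pi.index(i) + 1)) for i in range(1, k + 1))
        for pi in itertools.permutations(range(1, lam + 1))
    )
    return numerator / denominator


# Example: surrogate predictions vs. true values for a population of 6 points
y_true = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 2.0])
y_pred = np.array([2.9, 1.2, 3.7, 1.1, 5.3, 2.5])
print(rde(y_pred, y_true, k=3))    # small error, the two orderings almost agree
print(rde(-y_true, y_true, k=3))   # fully reversed ordering gives 1.0
```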
Algorithm 1: DTS-CMA-ES

Require: x₁, ..., x_λ ∈ ℝ^d, µ, A, σ and C – step size and matrix from the CMA-ES distribution, N_max ∈ ℕ such that N_max ≥ λ, r_max > 0, β, ε_min, ε_max, ε_initial, α_min, α_max, α_initial ∈ (0, 1)
1: if this is the 1st call of the algorithm in the current CMA-ES run then
2:   α = α_initial, ε = ε_initial
3: else
4:   take over the returned values of α, ε from its previous call in the run
5: end if
6: Train a Gaussian process f₁ on T_{k(A)}, estimating m_GP, σ_n, σ_f, ℓ through maximization of the likelihood (7)
7: Evaluate BB(x_j) for x_j not yet BB-evaluated and such that (ρ_PoI(f₁(x₁, ..., x_λ)))_j ≤ ⌈αλ⌉
8: Update A to A ∪ {x_j | (ρ_PoI(f₁(x₁), ..., f₁(x_λ)))_j ≤ ⌈αλ⌉}
9: Train a Gaussian process f₂ on T_{k(A)}, estimating m_GP, σ_n, σ_f, ℓ through maximization of the likelihood (7)
10: For x_j such that (ρ_PoI(f₁(x₁, ..., x_λ)))_j ≤ ⌈αλ⌉, update f₂(x_j) = BB(x_j)
11: Update ε to β RDE_µ(f₁(x₁, ..., x_λ), f₂(x₁, ..., x_λ)) + (1 − β)ε and α to α_min + max(0, min(1, (ε − ε_min)/(ε_max − ε_min)))
12: For j fulfilling (ρ_PoI(f₁(x₁, ..., x_λ)))_j > ⌈αλ⌉, update f₂(x_j) to f₂(x_j) − min{f₂(x_{j′}) | (ρ_PoI(f₁(x₁, ..., x_λ)))_{j′} > ⌈αλ⌉} + min{f₂(x_{j′}) | (ρ_PoI(f₁(x₁, ..., x_λ)))_{j′} ≤ ⌈αλ⌉}
13: Return f₂(x₁), ..., f₂(x_λ), ε, α

3 Gaussian Processes and Their Integration with Neural Networks

3.1 Gaussian Processes

A Gaussian process on a set X ⊂ ℝ^d, d ∈ ℕ, is a collection of random variables (f(x))_{x∈X}, any finite number of which has a joint Gaussian distribution [47]. It is completely specified by a mean function µ: X → ℝ, typically assumed constant, and by a covariance function κ: X × X → ℝ such that for x, x′ ∈ X,

\[
\mathbb{E} f(x) = \mu, \qquad \mathrm{cov}(f(x), f(x')) = \kappa(x, x'). \tag{5}
\]

Therefore, a GP is often denoted GP(µ, κ(x, x′)) or GP(µ, κ).

The value of f(x) is typically accessible only as a noisy observation y = f(x) + ε, where ε is zero-mean Gaussian noise with a variance σ_n > 0. Then

\[
\mathrm{cov}(y, y') = \kappa(x, x') + \sigma_n^2 I(x = x'), \tag{6}
\]

where I(proposition) equals 1 for a true proposition and 0 for a false one.

Consider now the prediction of the random variable f(x⋆) in a point x⋆ ∈ X if we already know the observations y₁, ..., y_n in points x₁, ..., x_n. Introduce the vectors x = (x₁, ..., x_n)ᵀ, y = (y₁, ..., y_n)ᵀ = (f(x₁) + ε, ..., f(x_n) + ε)ᵀ, k⋆ = (κ(x₁, x⋆), ..., κ(x_n, x⋆))ᵀ and the matrix K ∈ ℝ^{n×n} such that (K)_{i,j} = κ(x_i, x_j) + σ_n² I(i = j). Then the probability density of the vector y of observations is

\[
p(y; \mu, \kappa, \sigma_n^2) = \frac{\exp\left(-\tfrac{1}{2}(y - \mu)^\top K^{-1} (y - \mu)\right)}{\sqrt{(2\pi)^n \det(K)}}, \tag{7}
\]

where det(M) denotes the determinant of a matrix M. Furthermore, as a consequence of the assumption of a Gaussian joint distribution, also the conditional distribution of f(x⋆) conditioned on y is Gaussian, namely

\[
N\left(\mu(x_\star) + k_\star^\top K^{-1}(y - \mu),\; \kappa(x_\star, x_\star) - k_\star^\top K^{-1} k_\star\right). \tag{8}
\]

According to (6), the relationship between the observations y and y′ is determined by the covariance function κ. In the reported research, we have considered 6 kinds of covariance functions, listed below; a small computational sketch follows the list. In their definitions, the notation r = ‖x′ − x‖ is used, and among the parameters of κ, aka hyperparameters of the GP, frequently encountered are σ_f², ℓ > 0, called signal variance and characteristic length scale, respectively. Other parameters are introduced for each covariance function separately.

(i) Linear: κ_LIN(x, x′) = σ₀² + xᵀx′, with a bias σ₀².
(ii) Quadratic, the square of the linear covariance: κ_Q(x, x′) = (σ₀² + xᵀx′)².
(iii) Rational quadratic: κ_RQ(x, x′) = σ_f² (1 + r²/(2αℓ²))^{−α}, with α > 0.
(iv) Squared exponential: κ_SE(x, x′) = σ_f² exp(−r²/(2ℓ²)).
(v) Matérn 5/2: κ_Matérn5/2(x, x′) = σ_f² (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ).
(vi) One composite covariance function, namely the sum of κ_SE and κ_Q: κ_SE+Q(x, x′) = κ_SE(x, x′) + κ_Q(x, x′).
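The following self-contained NumPy sketch makes Eqs. (7)–(8) and two of the listed covariance functions concrete: it builds K with the noise term on the diagonal and returns the posterior mean and variance of Eq. (8) for a constant mean. The hyperparameter values, the toy sphere-like data, and the function names are illustrative assumptions of ours; they do not reproduce the fitted hyperparameters or the GPML/GPyTorch implementations used in the experiments.

```python
import numpy as np


def kernel_se(x, x_prime, sigma_f=1.0, ell=1.0):
    """Squared exponential covariance, item (iv): sigma_f^2 exp(-r^2 / (2 l^2))."""
    r2 = float(np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2))
    return sigma_f ** 2 * np.exp(-r2 / (2.0 * ell ** 2))


def kernel_lin(x, x_prime, sigma_0=1.0):
    """Linear covariance, item (i): sigma_0^2 + x^T x'."""
    return sigma_0 ** 2 + float(np.dot(x, x_prime))


def gp_posterior(X, y, x_star, kernel, mu=0.0, sigma_n=1e-2):
    """Posterior mean and variance of f(x_star) given noisy observations y,
    following Eqs. (7)-(8) with a constant mean mu."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    K += sigma_n ** 2 * np.eye(n)                     # noise on the diagonal
    k_star = np.array([kernel(X[i], x_star) for i in range(n)])
    alpha = np.linalg.solve(K, y - mu)
    mean = mu + k_star @ alpha
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mean, var


# Toy 2-D example: noisy sphere function observed in 10 random points
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(10, 2))
y = np.sum(X ** 2, axis=1) + 0.01 * rng.standard_normal(10)
m, v = gp_posterior(X, y, np.array([0.5, -0.5]),
                    lambda a, b: kernel_se(a, b, sigma_f=10.0, ell=3.0),
                    mu=y.mean())
print(f"predicted mean {m:.3f}, variance {v:.3f}")
```

Replacing the squared exponential by `kernel_lin` (or any other item of the list) changes only the covariance structure; the posterior formulas (7)–(8) stay the same.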
3.2 GP as the Output Layer of a Neural Network

The approach integrating a GP into an ANN as its output layer has been independently proposed in [9] and [52]. It relies on the following two assumptions:

1. If n_I denotes the number of the ANN input neurons, then the ANN computes a mapping net of n_I-dimensional input values into the set X on which the GP is defined. Consequently, the number of neurons in the last hidden layer equals the dimension d, and the ANN maps an input v into a point x = net(v) ∈ X, corresponding to an observation f(x) + ε governed by the GP (Figure 1). From the point of view of the ANN inputs, the GP is now GP(µ(net(v)), κ(net(v), net(v′))).

2. The GP mean µ is assumed to be a known constant, thus not contributing to the GP hyperparameters and independent of net.

[Figure 1: Schema of the integration of a GP into an ANN as its output layer – from bottom to top: input layer (input values v), 1st hidden layer, ..., last hidden layer (x), output GP layer (y, together with the distribution of y). Taken over from [32].]

Due to assumption 2, the only hyperparameters of the GP are the parameters θ^κ of the covariance function. As to the ANN, it depends on the one hand on the vector θ^W of its weights and biases, and on the other hand on the parameters θ^A of its architecture (such as the number of hidden layers, the numbers of neurons in each of them, and the activation functions). Altogether, the integrated GP-ANN model depends on the vector (θ^κ, θ^W, θ^A), which we, in accordance with [9, 52] and with the terminology common for GPs, also call hyperparameters, although in the context of ANNs this term is typically used only for θ^A.

Consider now n inputs to the neural network, v₁, ..., v_n, mapped to the inputs x₁ = net(v₁), ..., x_n = net(v_n) of the GP, corresponding to the observations y = (y₁, ..., y_n)ᵀ. Then the log-likelihood of θ is

\[
L(\theta) = \ln p(y; \mu, \kappa, \sigma_n^2), \tag{9}
\]

where µ is the constant assumed in Assumption 2, and

\[
(K)_{i,j} = \kappa(\mathrm{net}(v_i), \mathrm{net}(v_j)). \tag{10}
\]

To find the combination of values of θ with the maximal likelihood, methods for smooth optimization can be applied only with respect to θ^κ and θ^W, with respect to which L is continuous. With respect to θ^A, we have two possibilities. First, to fix θ^A in advance, thus effectively restricting the hyperparameters of the integrated model to (θ^κ, θ^W). Second, to perform the optimization with respect to θ^A in an outer loop (using, e.g., grid search combined with cross-validation), and for each considered combination of values of θ^A to perform the optimization with respect to (θ^κ, θ^W) in an inner loop.

Finally, let the smooth optimization be performed in the simplest but, in the context of neural networks, also most frequent way – as gradient descent. The partial derivatives forming ∇_{(θ^κ, θ^W)} L can be computed as

\[
\frac{\partial L}{\partial \theta^\kappa_\ell} = \sum_{i,j=1}^{n} \frac{\partial L}{\partial K_{i,j}} \frac{\partial K_{i,j}}{\partial \theta^\kappa_\ell}, \qquad
\frac{\partial L}{\partial \theta^W_\ell} = \sum_{i,j,k=1}^{n} \frac{\partial L}{\partial K_{i,j}} \frac{\partial K_{i,j}}{\partial x_k} \frac{\partial\, \mathrm{net}(v_k)}{\partial \theta^W_\ell}. \tag{11}
\]

In (11), the partial derivatives ∂L/∂K_{i,j}, i, j = 1, ..., n, are components of the matrix derivative ∂L/∂K, for which the calculations of matrix differential calculus [38], together with (7) and (9), yield

\[
\frac{\partial L}{\partial K} = \frac{1}{2}\left(K^{-1} y y^\top K^{-1} - K^{-1}\right). \tag{12}
\]
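The sketch below shows one way the integration of this subsection can be realized in GPyTorch, the library used for our implementation [31]: an MLP `net` maps the inputs into the space on which an exact GP with a constant mean is defined, and (θ^κ, θ^W) are optimized jointly by maximizing the log-likelihood (9), whose gradients (11)–(12) are obtained by automatic differentiation. The class names, the choice of the squared exponential (RBF) kernel, the toy data, the layer sizes, and the iteration count are illustrative assumptions; they do not reproduce the exact configuration of the experiments in Section 4.

```python
import torch
import gpytorch


class FeatureNet(torch.nn.Module):
    """MLP `net' mapping n_I-dimensional inputs into the d-dimensional set X."""
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(n_in, n_hidden), torch.nn.Sigmoid(),
            torch.nn.Linear(n_hidden, n_out), torch.nn.Sigmoid(),
        )

    def forward(self, v):
        return self.layers(v)


class GPANN(gpytorch.models.ExactGP):
    """GP evaluated on net(v), i.e. GP(mu(net(v)), kappa(net(v), net(v')))."""
    def __init__(self, train_v, train_y, likelihood, net):
        super().__init__(train_v, train_y, likelihood)
        self.net = net
        self.mean_module = gpytorch.means.ConstantMean()        # cf. Assumption 2
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, v):
        x = self.net(v)
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))


# Joint gradient-based training of (theta_kappa, theta_W) by maximizing Eq. (9)
v_train = torch.randn(40, 5)                      # hypothetical 5-D inputs
y_train = v_train.pow(2).sum(dim=1)               # sphere-like toy targets
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPANN(v_train, y_train, likelihood, FeatureNet(5, 3, 3))
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(v_train), y_train)          # negative log-likelihood (9)
    loss.backward()                               # autograd realizes Eqs. (11)-(12)
    optimizer.step()
```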
4 Experiments

For the evaluation of the aforementioned models, we used the open-source Python library GPyTorch; our implementation is available at [31].

4.1 Employed Data

As a dataset for comparing different configurations of the combined GP-ANN model, we used DTS-CMA-ES runs recorded in previous experiments. This allowed us to evaluate the surrogate model on its own, without having to perform the whole optimization. The open-source Matlab implementation of DTS-CMA-ES used to obtain the data is available at [4]. The underlying surrogate models were Gaussian processes with 8 different covariance functions implemented using the GPML Toolbox [46]. The optimization runs were collected on the platform for comparing continuous optimizers COCO. The employed black-box optimization benchmark set contains 24 noiseless functions scalable to any input dimension. We used dimensions 2, 3, 5, 10, and 20 in 5 different instances, which consist in random rotation and translation of the input space. Therefore, for each combination of benchmark function and dimension, 40 runs are available. More important than the number of runs, however, is the number of their generations, because data from each generation apart from the first can be used for testing all those surrogate models that could be trained with data from the previous generations. The number of generations in a particular run of the CMA-ES algorithm is unknown before the run and depends on the objective function and its particular instance. The benchmarks and their total numbers of available generations for the individual dimensions are listed in Table 1. We skipped the linear function (benchmark function number 5), which is easy to optimize and for which the recorded runs did not provide enough data samples to train and evaluate the surrogate models.

Table 1: Noiseless benchmark functions of the platform for comparing continuous optimizers (COCO) [11, 16] and the number of available generations of DTS-CMA-ES runs for each of them in each considered dimension.

Benchmark function                          Available generations in dimension
                                                2      3      5     10     20
Separable
 1. Sphere                                   4689   6910  11463  17385  25296
 2. Separable Ellipsoid                      6609   9613  15171  25994  55714
 3. Separable Rastrigin                      7688  11308  17382  27482  42660
 4. Büche-Rastrigin                          8855  13447  22203  31483  49673
Moderately Ill-Conditioned
 6. Attractive Sector                       16577  25200  38150  45119  72795
 7. Step Ellipsoid                           7103   9816  24112  34090  56340
 8. Rosenbrock                               7306  11916  21191  32730  71754
 9. Rotated Rosenbrock                       7687  12716  24084  35299  71017
Highly Ill-Conditioned
10. Ellipsoid with High Conditioning         6691   9548  15867  25327  59469
11. Discus                                   6999   9657  15877  25478  45181
12. Bent Cigar                              10369  18059  28651  34605  56528
13. Sharp Ridge                              7760  11129  20346  30581  48154
14. Different Powers                         6653  10273  17693  31590  61960
Multi-modal with Global Structure
15. Non-separable Rastrigin                  7855  11476  19374  28986  44446
16. Weierstrass                              9294  13617  24158  27628  40969
17. Schaffers F7                             9031  13960  24244  34514  56247
18. Ill-Conditioned Schaffers F7             9598  13404  25802  31609  53836
19. Composite Griewank-Rosenbrock            9147  16268  24456  34171  53536
Multi-modal Weakly Structured
20. Schwefel                                 9081  13676  24219  33753  53104
21. Gallagher's Gaussian 101-me Points       7645  12199  18208  25366  43186
22. Gallagher's Gaussian 21-hi Points        7629  11086  17881  26626  44971
23. Katsuura                                 8751  11233  17435  25030  37366
24. Lunacek bi-Rastrigin                     8983  13966  19405  29762  44420

4.2 Experimental Setup

We compare the combinations of neural networks with Gaussian processes using the six different covariance functions listed in Subsection 3.1. All models were trained on the same set of training data T_{k(A)} as was used in steps 6 and 9 of Algorithm 1. Due to the condition (3), this set is rather restricted and allows training only a restricted ANN if overfitting is to be prevented. Therefore, we decided to use a multilayer perceptron with a single hidden layer, thus a topology (n_I, n_H, n_O), where n_I is the dimension of the training data, i.e. n_I ∈ {2, 3, 5, 10, 20}, and

\[
n_H = n_O = \begin{cases} 2 & \text{if } n_I = 2, \\ 3 & \text{if } n_I \in \{3, 5\}, \\ 5 & \text{if } n_I \in \{10, 20\}. \end{cases} \tag{13}
\]

As the activation function for both the hidden and the output layer, we chose the logistic sigmoid. We trained the weights and biases of the neural network together with the parameters of the Gaussian process, as proposed in [52] and outlined in Subsection 3.2. As a loss function, we used the Gaussian log-likelihood and optimized the parameters with Adam [30] for a maximum of 1000 iterations. We also kept a 10 % validation set out of the training data to monitor overfitting, and we selected the model with the lowest L2 validation error during the training.
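As a small illustration of this setup, the sketch below encodes the topology rule (13) and the random 10 % hold-out used for model selection. The function names and the seed are ours; the actual training loop (Adam on the negative log-likelihood, keeping the parameters with the lowest validation L2 error) is assumed to be built around these helpers, e.g. as in the GPyTorch sketch of Subsection 3.2.

```python
import numpy as np


def hidden_layer_size(n_in):
    """Number of hidden (= output) neurons of the MLP, Eq. (13)."""
    if n_in == 2:
        return 2
    if n_in in (3, 5):
        return 3
    if n_in in (10, 20):
        return 5
    raise ValueError("dimension not used in the experiments")


def validation_split(n_samples, fraction=0.1, seed=0):
    """Random hold-out used to monitor overfitting during training;
    returns (training indices, validation indices)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_val = max(1, int(round(fraction * n_samples)))
    return perm[n_val:], perm[:n_val]


train_idx, val_idx = validation_split(120)
print(hidden_layer_size(10), len(train_idx), len(val_idx))   # -> 5 108 12
```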
4.3 Results

To evaluate the different covariance functions, we used the recorded data in a similar way as they would be used by the DTS-CMA-ES algorithm. We took each generation of points except the first one and used it as testing data. Every point evaluated by the true objective function in all previous generations is then available as a training sample. We filtered these samples using the training set selection method T_{k(A)} and trained the surrogate model on the result.

This way, we evaluated six different models on every generation of samples listed in Table 1. The results are presented first in a function-method view in Table 4, then in a dimension-method view in Table 5. The metric used in both tables is the RDE, which shows how precise the model is in the ordering of the predicted values; therefore, the lower the error value, the better. The first table contains results for every benchmark function separately, averaged over every input dimension, function instance and DTS-CMA-ES generation, as well as the average over the whole group of functions. The second table provides a view of how the dimension of the input space affects the approximation error; the results for a particular dimension are averaged over all functions in a specific group, instances, and generations. We also visualized the summarized RDE values for the function groups with boxplots in Figure 2.

Moreover, we verified the results by performing multiple comparison tests. A non-parametric Friedman test was conducted on RDEs across all results for particular functions and function types in Table 4, and for particular combinations of dimensions and function types in Table 5. If the null hypothesis of the equality of all six considered methods was rejected, the Wilcoxon signed-rank test was performed for all pairs of covariance functions, and its results were corrected for multiple hypotheses testing using the Holm method. We summarized the results of statistical testing in Tables 2 and 3.
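The testing protocol can be sketched with SciPy as follows. The synthetic RDE values, the 0.05 significance level, and the helper names are assumptions of ours standing in for the recorded per-function (or per-dimension and function-group) RDE averages; the Holm step-down correction is implemented explicitly to keep the sketch self-contained.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def holm_correction(p_values):
    """Holm's step-down adjustment of a family of p-values."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# rde[kernel] = vector of RDE values over the compared cases (one per function,
# or per dimension x function-group combination); random numbers stand in here
rng = np.random.default_rng(1)
kernels = ["LIN", "Q", "RQ", "SE", "Mat", "SE+Q"]
rde = {k: rng.uniform(0.1, 0.6, size=23) for k in kernels}

# Friedman test of the equality of all six covariance functions
stat, p_friedman = stats.friedmanchisquare(*(rde[k] for k in kernels))

if p_friedman < 0.05:
    pairs = list(combinations(kernels, 2))
    p_raw = [stats.wilcoxon(rde[a], rde[b]).pvalue for a, b in pairs]
    p_adj = holm_correction(p_raw)
    for (a, b), p in zip(pairs, p_adj):
        print(f"{a} vs {b}: adjusted p = {p:.3f}")
```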
Table 2: Statistical comparison of the six covariance functions from the function-method view in Table 4. Each entry shows for how many functions (23 in total) the kernel in the row was significantly better than the one in the column.

        κLIN  κQ  κRQ  κSE  κMat  κSE+Q    Σ
κLIN      –   20   19   21    20     20  100
κQ        0    –    3    2     3     16   24
κRQ       1    5    –    1     1     13   21
κSE       2    6    1    –     1     14   24
κMat      2    6    2    0     –     13   23
κSE+Q     2    1    2    0     1      –    6

Table 3: Statistical comparison of the six covariance functions from the dimension-method view in Table 5. Each entry shows for how many combinations of input dimension and function group (25 in total) the kernel in the row was significantly better than the one in the column.

        κLIN  κQ  κRQ  κSE  κMat  κSE+Q    Σ
κLIN      –   25   23   21    20     24  113
κQ        0    –    4    0     3     16   23
κRQ       0    4    –    2     1     13   20
κSE       0    4    5    –     5     13   27
κMat      0    2    3    1     –      9   15
κSE+Q     0    0    0    0     0      –    0

Table 4: Comparison of average RDE values for the six covariance functions depending on a particular benchmark function. The values are averaged over dimensions and instances; the "all" rows average over the functions of the group.

Function group / function              κLIN    κQ     κRQ    κSE    κMat   κSE+Q
Separable Functions
   1                                   0.242  0.528  0.486  0.469  0.476  0.591
   2                                   0.128  0.386  0.475  0.486  0.503  0.491
   3                                   0.300  0.452  0.449  0.462  0.443  0.546
   4                                   0.352  0.480  0.459  0.462  0.462  0.548
   all                                 0.255  0.461  0.467  0.469  0.471  0.544
Moderate conditioning
   6                                   0.351  0.446  0.410  0.451  0.451  0.421
   7                                   0.202  0.461  0.485  0.476  0.499  0.529
   8                                   0.330  0.463  0.410  0.406  0.460  0.527
   9                                   0.326  0.465  0.463  0.435  0.430  0.530
   all                                 0.302  0.458  0.441  0.442  0.459  0.501
High conditioning and unimodal
  10                                   0.142  0.458  0.505  0.457  0.484  0.487
  11                                   0.147  0.417  0.537  0.527  0.525  0.488
  12                                   0.284  0.436  0.506  0.498  0.510  0.451
  13                                   0.216  0.434  0.455  0.418  0.433  0.532
  14                                   0.289  0.498  0.452  0.405  0.416  0.551
   all                                 0.215  0.448  0.491  0.461  0.473  0.501
Multi-modal, adequate global structure
  15                                   0.321  0.462  0.467  0.446  0.427  0.545
  16                                   0.551  0.530  0.516  0.486  0.489  0.504
  17                                   0.391  0.450  0.487  0.474  0.483  0.550
  18                                   0.380  0.486  0.489  0.472  0.480  0.553
  19                                   0.400  0.540  0.524  0.521  0.526  0.536
   all                                 0.408  0.493  0.496  0.479  0.480  0.537
Multi-modal, weak global structure
  20                                   0.213  0.508  0.467  0.456  0.472  0.547
  21                                   0.387  0.470  0.467  0.476  0.468  0.532
  22                                   0.347  0.480  0.484  0.467  0.475  0.532
  23                                   0.589  0.589  0.537  0.544  0.547  0.549
  24                                   0.448  0.492  0.509  0.516  0.506  0.506
   all                                 0.396  0.507  0.492  0.491  0.493  0.533

Table 5: Comparison of average RDE values for the six covariance functions depending on the type of benchmark function and the input space dimension. The values are averaged over the functions in the particular group and over instances (SEP – separable, MOD – moderately ill-conditioned, HC – highly ill-conditioned, MMA – multi-modal with adequate global structure, MMW – multi-modal weakly structured).

Dimension / group    κLIN    κQ     κRQ    κSE    κMat   κSE+Q
Dimension 2
  SEP                0.252  0.452  0.448  0.445  0.418  0.581
  MOD                0.341  0.450  0.450  0.483  0.455  0.471
  HC                 0.204  0.449  0.459  0.474  0.441  0.502
  MMA                0.416  0.485  0.472  0.475  0.484  0.550
  MMW                0.443  0.517  0.419  0.471  0.431  0.522
  all                0.334  0.473  0.450  0.470  0.447  0.525
Dimension 3
  SEP                0.194  0.439  0.414  0.455  0.447  0.527
  MOD                0.262  0.449  0.383  0.407  0.429  0.512
  HC                 0.174  0.425  0.425  0.451  0.464  0.489
  MMA                0.353  0.486  0.449  0.461  0.453  0.536
  MMW                0.387  0.512  0.467  0.485  0.464  0.527
  all                0.278  0.464  0.430  0.454  0.453  0.518
Dimension 5
  SEP                0.227  0.445  0.473  0.479  0.478  0.526
  MOD                0.260  0.458  0.444  0.458  0.468  0.502
  HC                 0.202  0.439  0.513  0.489  0.486  0.492
  MMA                0.373  0.481  0.491  0.500  0.482  0.517
  MMW                0.363  0.513  0.496  0.504  0.499  0.532
  all                0.289  0.468  0.486  0.487  0.484  0.514
Dimension 10
  SEP                0.278  0.479  0.476  0.500  0.494  0.536
  MOD                0.285  0.460  0.453  0.417  0.443  0.502
  HC                 0.206  0.464  0.515  0.438  0.478  0.507
  MMA                0.422  0.498  0.525  0.499  0.478  0.549
  MMW                0.363  0.500  0.530  0.525  0.530  0.549
  all                0.313  0.481  0.503  0.477  0.486  0.529
Dimension 20
  SEP                0.327  0.493  0.525  0.469  0.518  0.551
  MOD                0.362  0.475  0.479  0.447  0.504  0.522
  HC                 0.293  0.467  0.543  0.454  0.500  0.519
  MMA                0.478  0.518  0.546  0.464  0.508  0.537
  MMW                0.429  0.496  0.551  0.475  0.544  0.536
  all                0.381  0.490  0.531  0.462  0.515  0.533

[Figure 2: Visualization of the distribution of RDE values for each group of functions (panels: Separable, Moderately Ill-Conditioned, Highly Ill-Conditioned, Multi-modal with Global Structure, Multi-modal Weakly Structured; one boxplot per covariance function κLIN, κQ, κRQ, κSE, κMat, κSE+Q). Each boxplot corresponds to the aggregated mean value in the grey rows of Table 4.]

4.4 Discussion and Further Research

The results show that the combined model performs best with the simplest, linear kernel. In paper [32], we compared the GP-ANN with a linear kernel to pure Gaussian processes. We found that if we compare the combined GP-ANN and the GP, both with linear kernels, the neural extension can bring better results in some cases. However, using more complex covariance functions with the GP alone is still better. Therefore, we tried to apply them also to the GP-ANN combination. Unfortunately, the results with the other kernels ended up much worse.

In our future research, we would like to try to overcome this. We want to systematically investigate different ANN topologies, including the direction of deep Gaussian processes [8, 12, 22, 23], in which only the topology is taken over from an ANN, but all neurons are replaced by GPs. Moreover, in addition to the selection of the training set used in DTS-CMA-ES and described in Algorithm 1, we want to consider alternative ways of training set selection, allowing larger networks to be trained. Finally, we intend to perform research into transfer learning of surrogate models: an ANN-GP model with a deep neural network will be trained on data from many optimization runs, such as those employed in this paper, and the model used in a new run of the same optimizer will then be obtained through additional fine-tuning restricted only to the GP and the last 1–2 layers of the ANN.

We would also like to try to change the way the model is trained. In paper [52], the parameters of the Gaussian process are learned together with the parameters of the neural network. We think that this might not work well in the domain of surrogate modeling in black-box optimization. Therefore, we will try to train the network and the Gaussian process separately.

5 Conclusion

In this paper, we examined an extension of Gaussian processes used in surrogate modeling. At the beginning, we described the CMA-ES algorithm for black-box optimization and its surrogate-assisted variant DTS-CMA-ES. Then we outlined Gaussian processes, various covariance functions, and their neural network extension. The presented research is our first attempt at applying the combination of Gaussian processes with neural networks as a surrogate model in black-box optimization. We implemented the combined model and compared the results obtained with six different covariance functions. The results presented in the previous section lead to the conclusion that the combined model yields the best results with the simplest, linear kernel. The other covariance functions produce much larger errors, and the worst performing one seems to be the composite kernel κSE+Q. Unfortunately, the results turned out to be worse than using Gaussian processes alone; possible ideas for further improvement are therefore discussed in Subsection 4.4.
Acknowledgement

The research reported in this paper has been supported by SVV project number 260 575 and was also partially supported by the Czech Science Foundation (GAČR) grant 18-18080S. Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
References

[1] A. Auger, D. Brockhoff, and N. Hansen. Benchmarking the local metamodel CMA-ES on the noiseless BBOB'2013 test bed. In GECCO'13, pages 1225–1232, 2013.
[2] A. Auger, M. Schoenauer, and N. Vanhaecke. LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. In Parallel Problem Solving from Nature – PPSN VIII, pages 182–191, 2004.
[3] M. Baerns and M. Holeňa. Combinatorial Development of Solid Catalytic Materials. Design of High-Throughput Experiments, Data Analysis, Data Mining. Imperial College Press / World Scientific, London, 2009.
[4] L. Bajer and Z. Pitra. Surrogate CMA-ES. https://github.com/bajeluk/surrogate-cmaes, 2021.
[5] L. Bajer, Z. Pitra, and M. Holeňa. Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In GECCO'15 Companion, pages 1143–1150, 2015.
[6] L. Bajer, Z. Pitra, J. Repický, and M. Holeňa. Gaussian process surrogate models for the CMA evolution strategy. Evolutionary Computation, 27:665–697, 2019.
[7] A.J. Booker, J. Dennis, P.D. Frank, D.B. Serafini, V. Torczon, and M. Trosset. A rigorous framework for optimization by surrogates. Structural and Multidisciplinary Optimization, 17:1–13, 1999.
[8] T. Bui, D. Hernandez-Lobato, J. Hernandez-Lobato, Y. Li, and R. Turner. Deep Gaussian processes for regression using approximate expectation propagation. In ICML, pages 1472–1481, 2016.
[9] R. Calandra, J. Peters, C.E. Rasmussen, and M.P. Deisenroth. Manifold Gaussian processes for regression. In IJCNN, pages 3338–3345, 2016.
[10] S.M. Clarke, J.H. Griebisch, and T.W. Simpson. Analysis of support vector regression for approximation of complex engineering analyses. Journal of Mechanical Design, 127:1077–1087, 2005.
[11] The COCO platform, 2016. http://coco.gforge.inria.fr.
[12] K. Cutajar, E.V. Bonilla, P. Michiardi, and M. Filippone. Random feature expansions for deep Gaussian processes. In ICML, pages 884–893, 2017.
[13] M.A. El-Beltagy, P.B. Nair, and A.J. Keane. Metamodeling techniques for evolutionary optimization of computationally expensive problems: Promises and limitations. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 196–203. Morgan Kaufmann Publishers, 1999.
[14] M. Emmerich, K. Giannakoglou, and B. Naujoks. Single- and multi-objective evolutionary optimization assisted by Gaussian random field metamodels. IEEE Transactions on Evolutionary Computation, 10:421–439, 2006.
[15] M. Emmerich, A. Giotis, M. Özdemir, T. Bäck, and K. Giannakoglou. Metamodel-assisted evolution strategies. In PPSN VII, pages 361–370. ACM, 2002.
[16] S. Finck, N. Hansen, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2010: Presentation of the noisy functions. Technical report, INRIA, Paris Saclay, 2010.
[17] A. Forrester, A. Sobester, and A. Keane. Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley and Sons, New York, 2008.
[18] H.M. Gutmann. A radial basis function method for global optimization. Journal of Global Optimization, 19:201–227, 2001.
[19] N. Hansen. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
[20] N. Hansen. A global surrogate assisted CMA-ES. In GECCO'19, pages 664–672, 2019.
[21] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9:159–195, 2001.
[22] A. Hebbal, L. Brevault, M. Balesdent, E.G. Talbi, and N. Melab. Efficient global optimization using deep Gaussian processes. In IEEE CEC, pages 1–12, 2018.
[23] G. Hernández-Muñoz, C. Villacampa-Calvo, and D. Hernández-Lobato. Deep Gaussian processes using expectation propagation and Monte Carlo methods. In ECML PKDD, pages 1–17, paper no. 128, 2020.
[24] S. Hosder, L. Watson, and B. Grossman. Polynomial response surface approximations for the multidisciplinary design optimization of a high speed civil transport. Optimization and Engineering, 2:431–452, 2001.
[25] Y. Jin, M. Hüsken, M. Olhofer, and B. Sendhoff. Neural networks for fitness approximation in evolutionary optimization. In Y. Jin, editor, Knowledge Incorporation in Evolutionary Computation, pages 281–306. Springer, 2005.
[26] Y. Jin, M. Olhofer, and B. Sendhoff. Managing approximate models in evolutionary aerodynamic design optimization. In CEC 2001, pages 592–599, 2001.
[27] Y. Jin, M. Olhofer, and B. Sendhoff. A framework for evolutionary optimization with approximate fitness functions. IEEE Transactions on Evolutionary Computation, 6:481–494, 2002.
[28] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
[29] S. Kern, N. Hansen, and P. Koumoutsakos. Local metamodels for optimization using evolution strategies. In PPSN IX, pages 939–948, 2006.
[30] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
[31] J. Koza and J. Tumpach. Surrogate networks. https://github.com/c0zzy/surrogate-networks, 2021.
[32] J. Koza, J. Tumpach, Z. Pitra, and M. Holeňa. Using past experience for configuration of Gaussian processes in black-box optimization. In 15th Learning and Intelligent Optimization Conference, accepted for publication, 2021.
[33] J.W. Kruisselbrink, M.T.M. Emmerich, A.H. Deutz, and T. Bäck. A robust optimization approach using kriging metamodels for robustness approximation in the CMA-ES. In IEEE CEC, pages 1–8, 2010.
[34] S.J. Leary, A. Bhaskar, and A.J. Keane. A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. Journal of Global Optimization, 30:39–58, 2004.
[35] J. Lee, Y. Bahri, R. Novak, S.S. Schoenholz, J. Pennington, et al. Deep neural networks as Gaussian processes. In ICLR, pages 1–17, 2018.
[36] I. Loshchilov, M. Schoenauer, and M. Sebag. Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In GECCO'13, pages 439–446, 2013.
[37] J. Lu, B. Li, and Y. Jin. An evolution strategy assisted by an ensemble of local Gaussian process models. In GECCO'13, pages 447–454, 2013.
[38] J.R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, Chichester, 2007.
[39] A.G.G. Matthews, J. Hron, M. Rowland, and R.E. Turner. Gaussian process behaviour in wide deep neural networks. In ICLR, pages 1–15, 2019.
[40] R.H. Myers, D.C. Montgomery, and C.M. Anderson-Cook. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley and Sons, Hoboken, 2009.
[41] R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, et al. Bayesian deep convolutional networks with many channels are Gaussian processes. In ICLR, pages 1–35, 2019.
[42] Y.S. Ong, P.B. Nair, A.J. Keane, and K.W. Wong. Surrogate-assisted evolutionary optimization frameworks for high-fidelity engineering design problems. In Y. Jin, editor, Knowledge Incorporation in Evolutionary Computation, pages 307–331. Springer, 2005.
[43] Z. Pitra, M. Hanuš, J. Koza, J. Tumpach, and M. Holeňa. Interaction between model and its evolution control in surrogate-assisted CMA evolution strategy. In GECCO'21, paper no. 358, 2021.
[44] Z. Pitra, J. Repický, and M. Holeňa. Boosted regression forest for the doubly trained surrogate covariance matrix adaptation evolution strategy. In ITAT 2018, pages 72–79, 2018.
[45] K. Rasheed, X. Ni, and S. Vattam. Methods for using surrogate models to speed up genetic algorithm optimization: Informed operators and genetic engineering. In Y. Jin, editor, Knowledge Incorporation in Evolutionary Computation, pages 103–123. Springer, 2005.
[46] C.E. Rasmussen and N. Hansen. GPML 4.0 Matlab toolbox. http://www.gaussianprocess.org/gpml/code/matlab/doc/.
[47] C.E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, 2006.
[48] A. Ratle. Kriging as a surrogate fitness landscape in evolutionary optimization. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 15:37–49, 2001.
[49] H. Ulmer, F. Streichert, and A. Zell. Evolution strategies assisted by Gaussian processes with improved pre-selection criterion. In IEEE CEC, pages 692–699, 2003.
[50] University of California in Irvine. Repository of machine learning databases. http://www.ics.uci.edu/~mlearn, 2016.
[51] V. Volz, G. Rudolph, and B. Naujoks. Investigating uncertainty propagation in surrogate-assisted evolutionary algorithms. In GECCO'17, pages 881–888, 2017.
[52] A.G. Wilson, Z. Hu, R. Salakhutdinov, and E.P. Xing. Deep kernel learning. In ICAIS, pages 370–378, 2016.
[53] Z.Z. Zhou, Y.S. Ong, P.B. Nair, A.J. Keane, and K.Y. Lum. Combining global and local surrogate models to accelerate evolutionary optimization. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 37:66–76, 2007.