     Combining Gaussian Processes with Neural
    Networks for Active Learning in Optimization

                     Jiří Růžička1, Jan Koza1, Jiří Tumpach2,
                      Zbyněk Pitra1, and Martin Holeňa1,2,3
                1
                  Czech Technical University, Prague, Czech Republic
                   2
                      Charles University, Prague, Czech Republic
3
  Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic



        Abstract. One area where active learning plays an important role is
        black-box optimization of objective functions with expensive evaluations.
        To deal with such evaluations, continuous black-box optimization has
        adopted an approach called surrogate modelling or metamodelling, which
        consists in replacing the true black-box objective in some of its evalu-
        ations with a suitable regression model, the selection of evaluations for
        replacement being an active learning task. This paper concerns surrogate
        modelling in the context of a surrogate-assisted variant of the continuous
        black-box optimizer Covariance Matrix Adaptation Evolution Strategy.
        It reports the experimental investigation of surrogate models combining
        artificial neural networks with Gaussian processes, for which it considers
        six different covariance functions. The experiments were performed on
        the set of 24 noiseless benchmark functions of the platform Comparing
        Continuous Optimizers COCO with 5 different dimensionalities. Their
        results revealed that the most suitable covariance function for this com-
        bined kind of surrogate models is the rational quadratic, followed by the
        Matérn 5/2 and the squared exponential. Moreover, the rational quadratic
        and squared exponential covariances were found interchangeable in the
        sense that for no function, group of functions, dimension, or combination
        thereof was the performance of the respective surrogate models signifi-
        cantly different.

        Keywords: active learning, black-box optimization, artificial neural net-
        works, Gaussian processes, covariance functions


1      Introduction
One area where active learning plays a very important role is black-box opti-
mization, in particular optimization of black-box objective functions with ex-
pensive evaluations. It is immaterial whether that expensiveness is due to time-
consuming computation like in long simulations [15], or due to evaluation in
costly experiments like in some areas of science [3]. To deal with such ex-
pensive evaluations, continuous black-box optimization has in the late 1990s
and early 2000s adopted an approach called surrogate modelling or metamod-
elling [6, 12, 14, 32, 40, 43, 46]. In this case, the goal of the surrogate model is to
decrease the total number of evaluations of the true objective function. Basically, a surrogate model is any regression model that, with a sufficient fidelity, approximates the true black-box objective function and replaces it in some of its evaluations. The decision at which points to evaluate the expensive black-box objective function, and at which to use its surrogate approximation instead, is an active learning task.
    This work-in-progress paper concerns surrogate modelling in the context of
a state-of-the-art method for continuous black-box optimization, the Covariance
Matrix Adaptation Evolution Strategy (CMA-ES ) [20, 23]. It reports the first
results of our investigation of surrogate models based on combining artificial
neural networks (ANNs) with Gaussian processes (GPs). This investigation has
been motivated by the importance of surrogate models based on ANNs alone
[19,27–29,40,50] and especially on GPs alone [5,6,12–14,31,32,35,47,48], as well
as by the high popularity of ANNs in the last 10–15 years. To our knowledge, this
is the first time that ANN+GP combinations have been investigated for possible
application in surrogate modelling. On the other hand, research into combining
parametric ANN models with nonparametric GP models has been around for
nearly a decade, at first due to the increasing popularity of neural networks,
later also due to recent theoretical results concerning relationships of asymptotic
properties of important kinds of ANNs to properties of GPs [33, 37, 39]. The
integration of GP with neural learning has been proposed on two different levels:
(i) Proper integration of an ANN with a GP, in which the GP forms the final,
     output layer of the ANN [8, 49].
(ii) Only a transfer of the layered structure, which is a crucial feature of ANNs,
     to the GP context, leading to the concept of deep GPs (DGPs) [7,11,24,25].
In the reported investigation, we employed proper integration, using a GP as
the final layer of an ANN.
    The rest of the paper is organized as follows. In the next section, the theo-
retical fundamentals of GP regression and integration with ANNs are recalled.
Section 3 describes active learning in a surrogate-assisted variant of CMA-ES.
Replacing GPs in that variant with several ANN+GP combinations is then ex-
perimentally tested in Section 4. Finally, the concluding Section 5 also outlines
our future research plans.

2     Gaussian Processes and Their Integration with Neural
      Networks
2.1    Gaussian Processes
A Gaussian process on a set X ⊂ Rd, d ∈ N, is a collection of random variables (f(x))x∈X, any finite number of which has a joint Gaussian distribution [45]. It is completely specified by a mean function mGP : X → R, typically assumed constant, and by a covariance function κ : X × X → R such that for x, x′ ∈ X,

                                     E f(x) = mGP,                                 (1)
                              cov(f(x), f(x′)) = κ(x, x′).                         (2)

Therefore, a GP is usually denoted GP(mGP, κ) or GP(mGP, κ(x, x′)).
    The value of f(x) is typically accessible only as a noisy observation y = f(x) + ε, where ε is a zero-mean Gaussian noise with a variance σn² > 0. Then

                           cov(y, y′) = κ(x, x′) + σn² I(x = x′),                  (3)

where I(proposition) equals 1 for a true proposition and 0 for a false one.
    Consider now the prediction of the random variable f(x⋆) in a point x⋆ ∈ X if we already know the observations y1, …, yn in points x1, …, xn. Introduce the vectors x = (x1, …, xn)⊤, y = (y1, …, yn)⊤ = (f(x1) + ε, …, f(xn) + ε)⊤, k⋆ = (κ(x1, x⋆), …, κ(xn, x⋆))⊤ and the matrix K ∈ Rn×n such that (K)i,j = κ(xi, xj) + σn² I(i = j). Then the probability density of the vector y of observations is

          p(y; mGP, κ, σn²) = exp(−½ (y − mGP)⊤ K⁻¹ (y − mGP)) / √((2π)ⁿ det(K)),  (4)

where det(A) denotes the determinant of a matrix A. Further, as a consequence of the assumption of a joint Gaussian distribution, the conditional distribution of f(x⋆) conditioned on y is also Gaussian, namely

          N(mGP(x⋆) + k⋆⊤ K⁻¹ (y − mGP), κ(x⋆, x⋆) − k⋆⊤ K⁻¹ k⋆).                  (5)
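For illustration, the prediction (5) can be computed directly from its definition. The following minimal NumPy sketch (an illustration only, not the implementation used in the reported experiments) returns the posterior mean and variance of f(x⋆) for an arbitrary covariance function κ:

    import numpy as np

    def gp_posterior(X, y, x_star, kappa, m_gp, sigma_n2):
        """Posterior mean and variance of f(x*) given noisy observations y at the
        points in X, following (4)-(5); kappa is any covariance function kappa(x, x')."""
        n = len(X)
        K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
        K += sigma_n2 * np.eye(n)                  # (K)ij = kappa(xi, xj) + sigma_n^2 I(i = j)
        k_star = np.array([kappa(x, x_star) for x in X])
        alpha = np.linalg.solve(K, y - m_gp)       # K^{-1} (y - m_GP)
        mean = m_gp + k_star @ alpha
        var = kappa(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
        return mean, var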

    According to (3), the relationship between the observations y and y′ is determined by the covariance function κ. In the reported research, we have considered 6 kinds of covariance functions, listed below; a small illustrative sketch of them follows the list. In their definitions, the notation r = ‖x′ − x‖ is used, and among the parameters of κ, aka hyperparameters of the GP, frequently encountered are σf², ℓ > 0, called signal variance and characteristic length scale, respectively. Other parameters will be introduced for each covariance function separately.
 (i) Linear: κLIN(x, x′) = σ0² + x⊤x′, with a bias σ0².
 (ii) Quadratic is the square of the linear covariance: κQUAD(x, x′) = (σ0² + x⊤x′)².
 (iii) Rational quadratic: κRQ(x, x′) = σf² (1 + r²/(2αℓ²))^(−α), with α > 0.
 (iv) Squared exponential: κSE(x, x′) = σf² exp(−r²/(2ℓ²)).
 (v) Matérn 5/2: κMA5(x, x′) = σf² (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ).
 (vi) One composite covariance function, namely the sum of κSE and κQUAD: κSE+Q(x, x′) = κSE(x, x′) + κQUAD(x, x′).
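The sketch below, purely illustrative and with placeholder hyperparameter values (in the experiments, the hyperparameters are estimated by likelihood maximization, cf. Section 2.2), transcribes the covariance functions (i)–(vi) so that they can be plugged into the prediction sketch above:

    import numpy as np

    def k_lin(x, xp, sigma0_2=1.0):
        return sigma0_2 + x @ xp

    def k_quad(x, xp, sigma0_2=1.0):
        return (sigma0_2 + x @ xp) ** 2

    def k_rq(x, xp, sigma_f2=1.0, ell=1.0, alpha=1.0):
        r2 = np.sum((xp - x) ** 2)
        return sigma_f2 * (1.0 + r2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

    def k_se(x, xp, sigma_f2=1.0, ell=1.0):
        r2 = np.sum((xp - x) ** 2)
        return sigma_f2 * np.exp(-r2 / (2.0 * ell ** 2))

    def k_mat52(x, xp, sigma_f2=1.0, ell=1.0):
        r = np.linalg.norm(xp - x)
        return sigma_f2 * (1.0 + np.sqrt(5.0) * r / ell + 5.0 * r ** 2 / (3.0 * ell ** 2)) \
               * np.exp(-np.sqrt(5.0) * r / ell)

    def k_se_quad(x, xp):
        return k_se(x, xp) + k_quad(x, xp)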

2.2      GP as the Output Layer of a Neural Network
An approach integrating a GP into an ANN as its output layer has been inde-
pendently proposed in [8] and [49]. It relies on the following two assumptions:
   1. If nI denotes the number of the ANN input neurons, then the ANN
computes a mapping net of nI -dimensional input values into the set X on
which the GP is defined. Consequently, the number nO of neurons in the last hidden layer fulfills X ⊂ RnO, and the ANN maps an input v into a point x = net(v) ∈ X, corresponding to an observation f(x) + ε governed by the GP (Figure 1). From the point of view of the ANN inputs, the GP is now GP(mGP(net(v)), κ(net(v), net(v′))).
    2. The GP mean mGP is assumed to be a known constant, thus not contributing to the GP hyperparameters and independent of net.

Fig. 1. Schema of the integration of a GP into an ANN as its output layer. The figure depicts, from bottom to top: the input layer with input values v, the 1st hidden layer, further hidden layers, the last hidden layer producing x, and the output GP layer producing y together with the distribution of y.

    Due to the assumption 2., the GP depends only on the parameters θκ of the covariance function. As to the ANN, it depends on the one hand on the vector θW of its weights and biases, on the other hand on the network architecture, which we will treat as fixed before network training.
    Consider now n inputs to the neural network, v1, …, vn, mapped to the inputs x1 = net(v1), …, xn = net(vn) of the GP, corresponding to the observations y = (y1, …, yn)⊤. Then the log-likelihood of θ = (θκ, θW) is

    L(θ) = ln p(y; mGP, κ, σn²) = −½ (y − mGP)⊤ K⁻¹ (y − mGP) − (n/2) ln(2π) − ½ ln det(K + σn² In),  (6)

where mGP is the constant assumed in 2., and

    (K)i,j = κ(net(vi), net(vj)).                                                  (7)
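One way to express such an integration in GPyTorch [17], the system used for the experiments in Section 4, is sketched below. This is a minimal illustration with placeholder layer sizes, activation function, and a squared exponential covariance, not the exact implementation used in the experiments:

    import torch
    import gpytorch

    class FeatureExtractor(torch.nn.Module):
        """The ANN part: maps nI-dimensional inputs v to nO-dimensional points x = net(v)."""
        def __init__(self, n_in, n_out):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(n_in, n_out), torch.nn.Tanh(),
                torch.nn.Linear(n_out, n_out))
        def forward(self, v):
            return self.net(v)

    class ANNGPModel(gpytorch.models.ExactGP):
        """GP whose covariance is evaluated on net(v), cf. (7)."""
        def __init__(self, train_v, train_y, likelihood, n_in, n_out):
            super().__init__(train_v, train_y, likelihood)
            self.feature_extractor = FeatureExtractor(n_in, n_out)
            self.mean_module = gpytorch.means.ConstantMean()       # constant m_GP (assumption 2.)
            self.covar_module = gpytorch.kernels.ScaleKernel(      # signal variance sigma_f^2 ...
                gpytorch.kernels.RBFKernel())                      # ... times the squared exponential
        def forward(self, v):
            x = self.feature_extractor(v)                          # x = net(v)
            return gpytorch.distributions.MultivariateNormal(
                self.mean_module(x), self.covar_module(x))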

    Let model training, searching for the vector (θκ, θW), be performed in the simplest but, in the context of neural networks, also the most frequent way – as gradient descent. The partial derivatives forming ∇(θκ,θW) L can be computed as:

        ∂L/∂(θκ)ℓ = Σ_{i,j=1,…,n} (∂L/∂Ki,j) (∂Ki,j/∂(θκ)ℓ),                       (8)

        ∂L/∂(θW)ℓ = Σ_{i,j,k=1,…,n} (∂L/∂Ki,j) (∂Ki,j/∂xk) (∂net(vk)/∂(θW)ℓ).      (9)

In (8), the partial derivatives ∂L/∂Ki,j, i, j = 1, …, n, are components of the matrix derivative ∂L/∂K, for which the calculations of matrix differential calculus [36] together with (4) and (6) yield

        ∂L/∂K = ½ (K⁻¹ y y⊤ K⁻¹ − K⁻¹).                                            (10)
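In practice, these derivatives need not be coded by hand: automatic differentiation applies the chain rule of (8)–(10) to all components of (θκ, θW) at once. A minimal training loop for the model sketched above, using the GPyTorch settings later listed in Table 1 (Adam, learning rate 0.0005, 1000 iterations) and placeholder training data, might look as follows:

    import torch
    import gpytorch

    # placeholder training data: n = 40 inputs of dimension 5
    train_v = torch.randn(40, 5)
    train_y = torch.randn(40)

    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = ANNGPModel(train_v, train_y, likelihood, n_in=5, n_out=3)

    model.train(); likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)        # Table 1 settings
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)  # the likelihood (6)

    for _ in range(1000):
        optimizer.zero_grad()
        output = model(train_v)          # Gaussian distribution of f(net(v_1)), ..., f(net(v_n))
        loss = -mll(output, train_y)     # negative log-likelihood L(theta), to be minimized
        loss.backward()                  # gradients (8)-(9) obtained via the chain rule / autograd
        optimizer.step()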

3       Surrogate Modelling in the CMA-ES Context
3.1      Surrogate Models for Continuous Black-Box Optimization
Basically, the purpose of surrogate modelling – to approximate an unknown func-
tional dependence – coincides with the purpose of response surface modelling in
the design of experiments [26, 38]. Therefore, it is not surprising that typical
response surface models, i.e., low-order polynomials, also belong to the most traditional and most successful surrogate models [1, 2, 21, 30, 43]. Other frequently used kinds of surrogate models are artificial neural networks of the kind multilayer perceptron (MLP) or radial basis function network [19, 27–29, 40, 50], and the models to which the previous section was devoted – GPs, in surrogate modelling also known as kriging [5, 6, 12–14, 31, 32, 35, 47, 48]. Occasionally encountered are support vector regression [9, 34] and random forests [4, 41].
    From the point of view of active learning, the most attractive kind of surrogate models are GPs, due to the fact that a GP estimate f(x) of the value of the true objective function for an input x is not a point, but a random variable. Its Gaussian distribution makes it possible to define alternative criteria according to which individuals for evaluation by the true objective function can be selected, most importantly the following two (a small computational sketch of both follows the list):

 – Probability of improvement of the estimate f(x) with respect to a reference value V (typically the minimal value of the true objective function found so far),

                              PoI(f(x); V) = P(f(x) ≤ V),                          (11)

   which can be estimated using the Gaussian distribution of the GP.
 – Expected improvement with respect to V,

                         EI(f(x); V) = E (V − f(x)) I(f(x) < V).                   (12)
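Since the GP posterior of f(x) at a point x is Gaussian with some mean µ and standard deviation s, both criteria have simple closed forms. The following sketch is illustrative only:

    import numpy as np
    from scipy.stats import norm

    def poi(mu, s, V):
        """Probability of improvement (11): P(f(x) <= V) for f(x) ~ N(mu, s^2)."""
        return norm.cdf((V - mu) / s)

    def ei(mu, s, V):
        """Expected improvement (12): E[(V - f(x)) I(f(x) < V)] for f(x) ~ N(mu, s^2)."""
        z = (V - mu) / s
        return (V - mu) * norm.cdf(z) + s * norm.pdf(z)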

3.2      Covariance Matrix Adaptation Evolution Strategy and Its
        Surrogate-Assisted Variant DTS-CMA-ES
The CMA-ES algorithm performs unconstrained optimization on Rd by means of iterative sampling of populations of size λ from a d-dimensional Gaussian distribution N(m, σ²C), and uses a given number µ of parents, namely the sampled points with the lowest objective function values, to update the parameters of that distribution. Hence, it updates the expected value m, which is used as the current point estimate of the function optimum, the matrix C, and the step size σ. The CMA-ES is invariant with respect to monotonic transformations of the objective function. Hence, to make use of the evaluations of the objective function in a set of points, it needs to know only the ordering of those evaluations. Details of the algorithm can be found in [20, 23].
    During the more than 20 years of CMA-ES existence, a number of surrogate-assisted variants of this algorithm have been proposed; a survey can be found in [5, 42]. Here, we will pay attention only to the most recent GP-based among them, the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES) [5]. It employs two GPs f1 and f2, trained consecutively, to find an evaluation of the population x1, …, xλ, with f1 used for active learning of training data for f2. Due to the CMA-ES invariance with respect to monotonic transformations, it evaluates the difference between predictions only according to the difference in the ordering of those predictions, more precisely, according to the ranking difference error (RDE). The RDE of y ∈ Rλ with respect to y′ ∈ Rλ considering the k best components is defined as

    RDE≤k(y, y′) = ( Σ_{i : (ρ(y′))i ≤ k} |(ρ(y′))i − (ρ(y))i| ) / ( max_{π∈Π(λ)} Σ_{i=1,…,k} |i − π⁻¹(i)| ),  (13)

where Π(λ) denotes the set of permutations of {1, …, λ} and ρ(y) denotes the ordering of the components of y, i.e., (∀y ∈ Rλ) ρ(y) ∈ Π(λ) & (ρ(y))i < (ρ(y))j ⇒ yi ≤ yj.
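For illustration, (13) can be computed as follows; the sketch obtains the normalizing denominator by solving the underlying assignment problem numerically, which is an implementation choice of this sketch rather than something prescribed above:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def ranking(y):
        """rho(y): ranks 1..lambda, a smaller component gets a smaller rank."""
        return np.argsort(np.argsort(y)) + 1

    def rde(y, y_ref, k):
        """Ranking difference error RDE_{<=k}(y, y_ref) of (13)."""
        lam = len(y)
        r, r_ref = ranking(np.asarray(y)), ranking(np.asarray(y_ref))
        sel = r_ref <= k                               # components ranked among the k best by y_ref
        numerator = np.sum(np.abs(r_ref[sel] - r[sel]))
        # denominator: max over permutations pi of sum_{i=1..k} |i - pi^{-1}(i)|, i.e. an
        # assignment of k distinct positions out of 1..lambda maximizing the total shift
        cost = np.abs(np.arange(1, k + 1)[:, None] - np.arange(1, lam + 1)[None, :])
        rows, cols = linear_sum_assignment(cost, maximize=True)
        return numerator / cost[rows, cols].sum()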
   The algorithm DTS-CMA-ES is described in Algorithm 1, using the following
notation:

 – A for an archive – a set of points that have already been evaluated by the true black-box objective function BB;
 – dσ²C for the Mahalanobis distance given by σ²C:

                    dσ²C(x, x′) = √((x − x′)⊤ σ⁻² C⁻¹ (x − x′));                   (14)

 – Nk(x; A) for the set of a given number k of dσ²C-nearest neighbours of x ∈ Rd with respect to the archive A;
 – fi(x1, …, xλ) = (fi(x1), …, fi(xλ)), for i = 1, 2;
 – Th = ∪_{j=1,…,λ} {x ∈ Nh(xj; A) | dσ²C(x, xj) < rmax} with rmax > 0, for h = 1, …, |A|;
 – k(A) = max{h : |Th| ≤ Nmax}, with Nmax ∈ N (an illustrative sketch of this training-set selection follows the list);
 – ρPoI for the decreasing ordering of f1(x1), …, f1(xλ) according to the probability of improvement with respect to the lowest BB value found so far,

        i < j ⇒ PoI((ρPoI(f1(x1, …, xλ)))i; V) ≥ PoI((ρPoI(f1(x1, …, xλ)))j; V),   (15)

   where V = minx∈A BB(x).
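A purely illustrative sketch of this training-set selection, with placeholder arrays for the archive and the population and not taken from the DTS-CMA-ES implementation, might look as follows:

    import numpy as np

    def mahalanobis(x, x2, C_inv, sigma):
        d = x - x2
        return np.sqrt(d @ C_inv @ d) / sigma          # distance (14) given sigma^2 C

    def training_set(archive_X, population_X, C_inv, sigma, r_max, N_max):
        """Select T_{k(A)}: for growing h, pool the h nearest archive points of each
        population member lying within r_max, and keep the largest pool of size <= N_max."""
        dist = np.array([[mahalanobis(a, x, C_inv, sigma) for x in population_X]
                         for a in archive_X])          # |A| x lambda distance matrix
        best_T = np.zeros(len(archive_X), dtype=bool)
        for h in range(1, len(archive_X) + 1):
            T_h = np.zeros(len(archive_X), dtype=bool)
            for j in range(population_X.shape[0]):
                nearest = np.argsort(dist[:, j])[:h]   # N_h(x_j; A)
                T_h[nearest[dist[nearest, j] < r_max]] = True
            if T_h.sum() <= N_max:
                best_T = T_h                           # the last qualifying h is k(A)
        return archive_X[best_T]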

Algorithm 1 Algorithm DTS-CMA-ES
Require: x1, …, xλ ∈ Rd, µ, A, σ and C – step size and matrix from the CMA-ES
    distribution, Nmax ∈ N such that Nmax ≥ λ, rmax > 0, β, εmin, εmax, αmin, αmax ∈ (0, 1)
 1: if this is the 1st call of the algorithm in the current CMA-ES run then
 2:     set α = ε = 0.05
 3: else
 4:     take over the returned values of α, ε from its previous call in the run
 5: end if
 6: Train a Gaussian process f1 on Tk(A), estimating mGP, σn, σf, ℓ through maximization of the likelihood (4)
 7: Evaluate BB(xj) for xj such that (ρPoI(f1(x1, …, xλ)))j ≤ ⌈αλ⌉ and not yet BB-evaluated
 8: Update A to A ∪ {xj | (ρPoI(f1(x1, …, xλ)))j ≤ ⌈αλ⌉}
 9: Train a Gaussian process f2 on Tk(A), estimating mGP, σn, σf, ℓ through maximization of the likelihood (4)
10: For xj such that (ρPoI(f1(x1, …, xλ)))j ≤ ⌈αλ⌉, update f2(xj) = BB(xj)
11: Update ε to (1 − β)ε + β RDE≤µ(f1(x1, …, xλ), f2(x1, …, xλ)) and α to
    αmin + (αmax − αmin) max(0, min(1, (ε − εmin)/(εmax − εmin)))
12: Update the value f2(xj) to f2(xj) − min{f2(xj′) | (ρPoI(f1(x1, …, xλ)))j′ > ⌈αλ⌉}
    + min{f2(xj′) | (ρPoI(f1(x1, …, xλ)))j′ ≤ ⌈αλ⌉} for j fulfilling (ρPoI(f1(x1, …, xλ)))j > ⌈αλ⌉
13: Return f2(x1), …, f2(xλ), ε, α



4       Experiments with ANN+GP Integration in the
       DTS-CMA-ES
4.1      Experimental Setup
For the experiments, we have used the 24 noiseless benchmark functions available on the platform Comparing Continuous Optimizers (COCO) [10, 22]. Those benchmarks form five groups with different properties:
 1. separable functions: f1 sphere, f2 separable ellipsoid, f3 separable Rastrigin,
    f4 Büche-Rastrigin, f5 linear slope;
 2. moderately ill-conditioned functions: f6 attractive sector, f7 step ellipsoid,
    f8 Rosenbrock, f9 rotated Rosenbrock;
 3. highly ill-conditioned functions: f10 ellipsoid with high conditioning, f11 dis-
    cus, f12 bent cigar, f13 sharp ridge, f14 different powers;
 4. multi-modal functions with global structure: f15 non-separable Rastrigin, f16
    Weierstrass, f17 Schaffers F7, f18 ill-conditioned Schaffers F7, f19 composite
    Griewank-Rosenbrock;
 5. multi-modal weakly structured functions: f20 Schwefel, f21 Gallagher’s Gaus-
    sian 101-me points, f22 Gallagher’s Gaussian 21-hi points, f23 Katsuura, f24
    Lunacek bi-Rastrigin.
All benchmark functions were optimized on the closed cube [−5, 5]d , where d
is the dimension of the input space, and the initial CMA-ES population was
sampled uniformly on [−4, 4]d. For each noiseless function, 25 different variants were used, obtained as follows:
 – each of the functions is scalable to any dimension d ≥ 2; we have used the five dimensions 2, 3, 5, 10, and 20;
 – for each function in each dimension, 5 different instances were used, mutually differing through translations and/or rotations.
    Each variant f of each benchmark function was optimized for 250d evaluations unless it was terminated earlier due to indicated convergence. To evaluate the success of the optimization at its end, we used the approach from [22] and calculated the proportion of the achieved optimization target in the interval [10⁻⁸, 10²]. However, instead of calculating such a score from a given number of discrete targets log-uniformly distributed in that interval as in [22], we calculated its continuous counterpart as the ratio of the logarithmic length λlog of the subinterval of [10⁻⁸, 10²] corresponding to the achieved distance f∗ to the minimum of f at the end of the optimization to the logarithmic length of the whole interval [10⁻⁸, 10²],

       r = λlog([10⁻⁸, 10²] ∩ [f∗, +∞)) / λlog([10⁻⁸, 10²]) = max(0, min(10, 2 − log₁₀ f∗)) / 10.   (16)
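As a tiny worked illustration of (16), the score of a single run can be obtained from the achieved distance f∗ as follows:

    import math

    def score(f_star):
        """Continuous counterpart of the COCO target proportion, eq. (16)."""
        return max(0.0, min(10.0, 2.0 - math.log10(f_star))) / 10.0

    # e.g. a run ending 1e-3 away from the optimum scores 0.5, a run reaching 1e-8
    # scores 1.0, and a run ending at distance 100 or worse scores 0.0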

    For all experiments, we used the existing implementation of DTS-CMA-ES at https://github.com/bajeluk/surrogate-cmaes, into which we implemented the ANN+GP surrogate models using the system GPyTorch [17]; our implementation is available at https://github.com/c0zzy/surrogate-networks. As to the tunable parameters of DTS-CMA-ES, we used the same values as in [5]. As to the tunable parameters of GPyTorch, we fixed the five listed in Table 1 and utilized the default values of the remaining ones.
    Finally, for the neural networks in the ANN+GP combinations, we used fully connected multilayer perceptrons with one hidden layer of a sufficiently low number of neurons to assure their trainability with the comparatively small archives typically available in the DTS-CMA-ES. More precisely, all the ANNs used the fully connected topologies (d – nO – nO) with

        nO = 2 for d = 2,    nO = 3 for d ∈ {3, 5},    nO = 5 for d ∈ {10, 20}.    (17)
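A minimal PyTorch sketch of the network sizing (17) and of the fully connected d–nO–nO topology is given below; it plays the role of the feature extractor sketched in Section 2.2, and the activation function is a placeholder, since it is not specified above:

    import torch

    def hidden_width(d):
        """n_O of (17) as a function of the input dimension d in {2, 3, 5, 10, 20}."""
        return 2 if d == 2 else 3 if d in (3, 5) else 5

    def build_mlp(d):
        n_o = hidden_width(d)
        return torch.nn.Sequential(              # fully connected d - n_O - n_O topology
            torch.nn.Linear(d, n_o), torch.nn.Tanh(),
            torch.nn.Linear(n_o, n_o))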


4.2    First Results and Their Discussion
The first work-in-progress results presented in this paper compare ANN+GP combinations with different covariance functions of the GP. Table 2 reports the scores of each such combination for each noiseless benchmark function, averaged over the 25 combinations of 5 instances and 5 dimensions. In Tables 3 and 4, the scores are reported for the above-defined groups of benchmark functions and for the considered dimensions, respectively. Hence, the score averaging in Table 3 includes in addition all functions of the respective group, whereas in Table 4 the scores are averaged over the 120 instances of the 24 benchmark functions. In all three tables, the ANN+GP combination with the highest average score is in bold.


Table 1. Settings for important GPyTorch parameters. For all its remaining tunable
parameters, the default values were used

                          Optimizer           Adam
                          Learning rate       0.0005
                          Iterations          1000
                          Noise               0.0001
                          Length scale bounds 0.01, 100


Table 2. Score of the 6 compared ANN+GP combinations for each of the noiseless
COCO benchmark functions, averaged over the 25 combinations of the 5 instances
of that function and the 5 considered dimensions. For each function, the ANN+GP
combination with the highest average score is in bold

      κLIN κQUAD κSE κMA5 κRQ κSE+Q       κLIN κQUAD κSE κMA5 κRQ κSE+Q
  f1 0.38    0.45 0.54 0.54 0.56 0.47 f13 0.13   0.15 0.22 0.25 0.27 0.23
  f2 0.02    0.11 0.22 0.20 0.26 0.19 f14 0.41   0.45 0.55 0.54 0.54 0.47
  f3 0.09    0.09 0.10 0.10 0.11 0.09 f15 0.09   0.09 0.09 0.10 0.10 0.09
  f4 0.07    0.07 0.08 0.08 0.08 0.09 f16 0.14   0.14 0.20 0.12 0.15 0.12
  f5 1.00 1.00 1.00 1.00 1.00 1.00 f17 0.26      0.30 0.33 0.34 0.33 0.32
  f6 0.09    0.10 0.13 0.17 0.15 0.16 f18 0.21   0.25 0.27 0.25 0.28 0.27
  f7 0.28    0.38 0.51 0.43 0.54 0.45 f19 0.19   0.20 0.20 0.20 0.20 0.21
  f8 0.15    0.17 0.29 0.29 0.28 0.27 f20 0.17   0.17 0.18 0.17 0.17 0.18
  f9 0.15    0.15 0.18 0.24 0.30 0.26 f21 0.19   0.18 0.16 0.18 0.16 0.17
  f10 0.02   0.07 0.10 0.07 0.10 0.07 f22 0.15   0.11 0.16 0.13 0.16 0.14
  f11 0.05   0.09 0.12 0.10 0.13 0.12 f23 0.17   0.16 0.17 0.17 0.17 0.16
  f12 0.03   0.07 0.08 0.10 0.13 0.09 f24 0.07   0.07 0.07 0.07 0.07 0.07




    From Tables 2–4, we can see that the highest average score has been by far most frequently achieved by ANN+GP combinations with the rational quadratic covariance function κRQ: for 14 out of the 24 functions, 3 out of the 5 groups of functions, and 4 out of the 5 considered dimensions. The rational quadratic covariance function shares the first place with the squared exponential covariance κSE for the function f22 Gallagher’s Gaussian 21-hi points, as well as for the dimension 10. In addition, for the function f5 linear slope, ANN+GP combinations with all considered covariance functions achieve the highest possible average score 1. Apart from these shared first places, other covariance functions lead to ANN+GP combinations with the highest average score in the following cases:

 – the covariance κSE for the function f14 different powers as well as for the group of multi-modal globally structured functions;
 – the Matérn 5/2 covariance κMA5 for the functions f6 attractive sector, f8 Rosenbrock, f17 Schaffers F7, f23 Katsuura, and f24 Lunacek bi-Rastrigin, as well as for the dimension 20;
 – the composite covariance κSE+Q for the functions f4 Büche-Rastrigin, f19 composite Griewank-Rosenbrock, and f20 Schwefel;
 – the linear covariance κLIN for the function f21 Gallagher’s Gaussian 101-me points, as well as for the group of multi-modal weakly structured functions, to which also f21 belongs.


Table 3. Score of the 6 compared ANN+GP combinations for each group of the
noiseless COCO benchmark functions, averaged over the 100 or 125 (depending on the
group) combinations of all instances of the functions belonging to that group and the 5
considered dimensions. For each group, the ANN+GP combination with the highest
average score is in bold

                                            κLIN κQUAD κSE κMA5 κRQ κSE+Q
 Separable                                 0.317 0.348 0.390 0.388 0.403 0.370
 Moderately ill-conditioned                0.172 0.204 0.281 0.290 0.323 0.288
 Highly ill-conditioned                    0.134 0.170 0.218 0.218 0.241 0.197
 Multi-modal functions globally structured 0.182 0.200 0.221 0.204 0.218 0.206
 Multi-modal weakly structured             0.153 0.143 0.151 0.149 0.149 0.147

Table 4. Score of the 6 compared ANN+GP combinations for each considered dimen-
sion, averaged over the 120 considered instances of the 24 noiseless COCO benchmark
functions. For each dimension, the ANN+GP combination with the highest average
score is in bold

                         κLIN κQUAD   κSE  κMA5   κRQ κSE+Q
                      2 0.268 0.322 0.386 0.389 0.398 0.365
                      3 0.230 0.254 0.326 0.307 0.365 0.302
                      5 0.178 0.194 0.225 0.223 0.244 0.224
                     10 0.162 0.166 0.185 0.184 0.185 0.174
                     20 0.124 0.131 0.133 0.138 0.131 0.135
    The results in Tables 2–4 reflect both systematic differences due to different covariance functions and random differences due to noise. To assess the influence of the different covariance functions without the interference of noise, the obtained differences were tested for statistical significance. To this end, the null hypotheses that the means of the random variables that produced the scores for the individual ANN+GP combinations are all identical were tested, using the Friedman test with post-hoc identification of the pairs that lead to the rejection of the null hypothesis, if it is rejected. Both the Friedman test and its post-hoc tests were performed at the 5% family-wise significance level for multiple-hypotheses testing, and the family-wise significances were assessed from the achieved significances of the individual tests (p-values) by means of the Holm method [16].
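An illustrative sketch of such a testing procedure is given below, assuming the scores are arranged in a matrix with one row per compared case and one column per covariance function; the use of the Wilcoxon signed-rank test for the post-hoc pairwise comparisons is an assumption of this sketch, not necessarily the post-hoc test used for the reported results:

    from itertools import combinations
    import numpy as np
    from scipy.stats import friedmanchisquare, wilcoxon
    from statsmodels.stats.multitest import multipletests

    def compare_covariances(scores, names, alpha=0.05):
        """scores: (n_cases, n_covariances) array of the scores described in Section 4.1."""
        stat, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
        print(f"Friedman test: p = {p:.4f}")
        if p >= alpha:
            return
        pairs = list(combinations(range(scores.shape[1]), 2))
        p_raw = [wilcoxon(scores[:, a], scores[:, b]).pvalue for a, b in pairs]
        reject, p_holm, _, _ = multipletests(p_raw, alpha=alpha, method="holm")
        for (a, b), rej, ph in zip(pairs, reject, p_holm):
            if rej:
                print(f"{names[a]} vs {names[b]}: significant (Holm-adjusted p = {ph:.4f})")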

    For individual functions, the Friedman test was always based on the 25 combinations of their 5 instances and the 5 considered dimensions. The test rejected the null hypothesis for all functions except the f5 linear slope. The results of the post-hoc tests are summarized in Table 5. The value in each row r and column c ≠ r tells for how many among the remaining 23 functions the covariance function in the row was, as part of an ANN+GP combination, significantly better than the one in the column, or equivalently, for how many the one in the column was significantly worse than the one in the row. This means that the ANN+GP combination with the covariance function in the row yielded a higher average score than with the one in the column, and the post-hoc test rejected at the 5% family-wise significance level the hypothesis of equal mean values of the random variables that produced both scores. For individual function groups, the Friedman tests were based on the 100 or 125 combinations of the functions belonging to the group, their 5 instances and the 5 considered dimensions. For individual dimensions, they were based on the 120 combinations of the 24 functions and their 5 instances. Both the 5 tests for the function groups and the 5 tests for the dimensions rejected their null hypotheses. The results of their post-hoc tests are summarized in Table 6. In Tables 5 and 6, the value in a cell is in bold if the covariance function in the row was more often significantly better than significantly worse than the one in the column.


Table 5. The number of functions, among the 23 for which the null hypothesis of the
Friedman test was rejected, for which the covariance function in the row was, as part
of an ANN+GP combination, significantly better than the one in the column. That
value is in bold if it is in addition higher than the number of functions for which the
covariance function in the row was significantly worse than the one in the column

               Covariance function κLIN κQUAD κSE κMA5 κRQ κSE+Q
                      κLIN          –     0    0   0    0     0
                     κQUAD          0     –    0   0    0     0
                       κSE          2     0    –   0    0    14
                      κMA5          1     0    0   –    9    14
                      κRQ           6     1    0 14     –    18
                     κSE+Q          0     0    9   9    5     –



    From Tables 5 and 6, we can observe that the covariance function κLIN was never either significantly better or significantly worse than κQUAD. This can be interpreted as interchangeability of both functions from the point of view of being used as covariances in the ANN+GP surrogate models for DTS-CMA-ES. The same relationship holds also for the pairs of covariance functions (κSE, κMA5) and (κSE, κRQ). It is particularly important in connection with the last-mentioned pair, in view of the fact that κRQ was the covariance most often yielding the highest score, and κSE was in this respect the 3rd among the individual benchmark functions and the 2nd among the groups of functions and among the considered dimensions. Altogether, these two interchangeable covariance functions yielded the highest score for 15 of the 24 considered benchmarks, for 4 of the 5 benchmark function groups and for 4 of the 5 considered dimensions.


Table 6. The number of function groups (left), respectively considered dimensions
(right), for which the covariance function in the row was, as part of an ANN+GP
combination, significantly better than the one in the column. That value is in bold if it
is in addition higher than the number of function groups, respectively dimensions, for
which the covariance function in the row was significantly worse than the one in the
column

    Covariance       for function groups             for dimensions
     function κLIN κQUAD κSE κMA5 κRQ κSE+Q κLIN κQUAD κSE κMA5 κRQ κSE+Q
       κLIN    –      0     0    0    0  0   –     0     0    0     0 0
      κQUAD    0      –     0    0    0  0   0     –     0    0     0 0
        κSE    4      2     –    0    0  4   4     2     –    0     0 4
       κMA5    4      2     0    –    0  4   4     2     0    –     1 4
       κRQ     4      2     0    5    –  5   4     2     0    4     – 4
      κSE+Q    4      1     1    1    0  –   4     1     1    1     1 –


5      Conclusion and Future Work

This work-in-progress paper presented an experimental investigation of surro-
gate models combining artificial neural networks with Gaussian processes in the
context of a sophisticated surrogate-assisted variant of the black-box optimizer
CMA-ES, the DTS-CMA-ES, which consecutively trains two surrogate mod-
els, using the first for active learning of training data for the second. In the
experiments, a comprehensive comparison of ANN+GP surrogate models with
six different covariance functions was performed on the 24 noiseless benchmark
functions of the COCO platform [10, 22] in 5 dimensions. The results revealed that the most suitable covariance function for this combined kind of surrogate models is the rational quadratic, followed by the Matérn 5/2 and the squared exponential. Moreover, the rational quadratic and squared exponential were found interchangeable in the sense that for no function, no group of functions, no dimension, and no function-dimension combination was either of these covariance functions significantly better than the other.
    As usual with work in progress, very much is still left for the future. Most importantly, the ANN+GP surrogate models need to be compared with pure GP surrogates. Superficially, that should be no problem because the original implementation of DTS-CMA-ES uses GPs alone. However, the DTS-CMA-ES implementation relying on the system GPML [44] and our implementation of the ANN+GP surrogates relying on the system GPyTorch [17] do not allow a comparison in which the difference between pure GP and ANN+GP surrogates would reflect only the added combination of both kinds of models. According to our experience, that difference is much more due to the incompatibility between GPML and GPyTorch. The GPML does not include any ANN extension, whereas GPyTorch also includes pure GPs; however, their predictive accuracy is substantially lower than that of their counterparts with the same covariance function implemented in GPML. Hence, arriving at an unbiased comparison of ANN+GP and pure GP surrogate models will still need a lot of implementation effort.
    Further directions of our future research include on the one hand deep GPs, already mentioned in the Introduction, and on the other hand a reconsideration of the approach employed in GPyTorch – training the ANN and the GP forming the combined surrogate model together by means of likelihood maximization. Whereas maximum likelihood is indeed the commonly used objective function for GP learning [45], successful and efficient ANN learning algorithms typically rely on other objectives [18]. Therefore, we would also like to investigate a cyclic interleaving of a GP-learning phase with an ANN-learning phase, where the length of the latter will depend on the relationship of its success to the success of the former.
    Finally, we intend to perform research into transfer learning of such combined surrogate models: an ANN+GP model with a deep neural network will be trained on data from many optimization runs, and then the model used in a new run of the same optimizer will be obtained through additional learning restricted only to the GP and the last 1–2 layers of the ANN.

Acknowledgment
The research reported in this paper has been supported by the Czech Science
Foundation (GAČR) grant 18-18080S. For J. Tumpach, it has been also partially
supported by the Charles University student grant 260575. Computational re-
sources were supplied by the project e-INFRA LM2018140 provided within the
program Projects of Large Research, Development and Innovations Infrastruc-
tures.


References
 1. Auger, A., Brockhoff, D., Hansen, N.: Benchmarking the local metamodel cma-es
    on the noiseless BBOB’2013 test bed. In: GECCO’13. pp. 1225–1232 (2013)
 2. Auger, A., Schoenauer, M., Vanhaecke, N.: LS-CMA-ES: A second-order algorithm
    for covariance matrix adaptation. In: Parallel Problem Solving from Nature - PPSN
    VIII. pp. 182–191 (2004)
 3. Baerns, M., Holeňa, M.: Combinatorial Development of Solid Catalytic Materials.
    Design of High-Throughput Experiments, Data Analysis, Data Mining. Imperial
    College Press / World Scientific, London (2009)
 4. Bajer, L., Pitra, Z., Holeňa, M.: Benchmarking Gaussian processes and random
    forests surrogate models on the BBOB noiseless testbed. In: GECCO’15 Compan-
    ion. pp. 1143–1150 (2015)
 5. Bajer, L., Pitra, Z., Repický, J., Holeňa, M.: Gaussian process surrogate models
    for the CMA evolution strategy. Evolutionary Computation 27, 665–697 (2019)

 6. Booker, A., Dennis, J., Frank, P., Serafini, D., V., T., Trosset, M.: A rigorous
    framework for optimization by surrogates. Structural and Multidisciplinary Opti-
    mization 17, 1–13 (1999)
 7. Bui, T., Hernandez-Lobato, D., Hernandez-Lobato, J., Li, Y., Turner, R.: Deep
    gaussian processes for regression using approximate expectation propagation. In:
    ICML. pp. 1472–1481 (2016)
 8. Calandra, R., Peters, J., Rasmussen, C., Deisenroth, M.: Manifold gaussian pro-
    cesses for regression. In: IJCNN. pp. 3338–3345 (2016)
 9. Clarke, S., Griebsch, J., Simpson, T.: Analysis of support vector regression for
    approximation of complex engineering analyses. Journal of Mechanical Design 127,
    1077–1087 (2005)
10. The COCO platform (2016), http://coco.gforge.inria.fr
11. Cutajar, K., Bonilla, E., Michiardi, P., Filippone, M.: Random feature expansions
    for deep gaussian processes. In: ICML. pp. 884–893 (2017)
12. El-Beltagy, M., Nair, P., Keane, A.: Metamodeling techniques for evolutionary
    optimization of computationally expensive problems: Promises and limitations. In:
    Proceedings of the Genetic and Evolutionary Computation Conference. pp. 196–
    203. Morgan Kaufmann Publishers (1999)
13. Emmerich, M., Giannakoglou, K., Naujoks, B.: Single- and multi-objective evolu-
    tionary optimization assisted by Gaussian random field metamodels. IEEE Trans-
    actions on Evolutionary Computation 10, 421–439 (2006)
14. Emmerich, M., Giotis, A., Özdemir, M., Bäck, T., Giannakoglou, K.: Metamodel-
    assisted evolution strategies. In: PPSN VII. pp. 361–370. ACM (2002)
15. Forrester, A., Sobester, A., Keane, A.: Engineering Design via Surrogate Modelling:
    A Practical Guide. John Wiley and Sons, New York (2008)
16. Garcia, S., Herrera, F.: An extension on ”Statistical Comparisons of Classifiers
    over Multiple Data Sets” for all pairwise comparisons. Journal of Machine Learning
    Research 9, 2677–2694 (2008)
17. Gardner, J.R., Pleiss, G., Bindel, D., Weinberger, K.Q., Wilson, A.G.: Gpytorch:
    Blackbox matrix-matrix gaussian process inference with gpu acceleration (2019)
18. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
    (2016)
19. Gutmann, H.: A radial basis function method for global optimization. Journal of
    Global Optimization 19, 201–227 (2001)
20. Hansen, N.: The CMA evolution strategy: A comparing review. In: Towards a New
    Evolutionary Computation. pp. 75–102. Springer (2006)
21. Hansen, N.: A global surrogate assisted CMA-ES. In: GECCO’19. pp. 664–672
    (2019)
22. Hansen, N., Auger, A., Ros, R., Mersmann, O., Tušar, T., Brockhoff, D.: COCO: a
    platform for comparing continuous optimizers in a black box setting. Optimization
    Methods and Software 35, doi:10.1080/10556788.2020.1808977 (2020)
23. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution
    strategies. Evolutionary Computation 9, 159–195 (2001)
24. Hebbal, A., Brevault, L., Balesdent, M., Talbi, E., Melab, N.: Efficient global
    optimization using deep gaussian processes. In: IEEE CEC. pp. 1–12, doi
    10.1109/CEC40672.2018 (2018)
25. Hernández-Muñoz, G., Villacampa-Calvo, C., Hernández-Lobato, D.: Deep gaus-
    sian processes using expectation propagation and monte carlo methods. In: ECML
    PKDD. pp. 1–17, paper no. 128 (2020)

26. Hosder, S., Watson, L., Grossman, B.: Polynomial response surface approxima-
    tions for the multidisciplinary design optimization of a high speed civil transport.
    Optimization and Engineering 2, 431–452 (2001)
27. Jin, Y., Hüsken, M., Olhofer, M., B., S.: Neural networks for fitness approxima-
    tion in evolutionary optimization. In: Jin, Y. (ed.) Knowledge Incorporation in
    Evolutionary Computation, pp. 281–306. Springer (2005)
28. Jin, Y., Olhofer, M., Sendhoff, B.: Managing approximate models in evolutionary
    aerodynamic design optimization. In: CEC 2001. pp. 592–599 (2001)
29. Jin, Y., Olhofer, M., Sendhoff, B.: A framework for evolutionary optimization with
    approximate fitness functions. IEEE Transactions on Evolutionary Computation
    6, 481–494 (2002)
30. Kern, S., Hansen, N., Koumoutsakos, P.: Local metamodels for optimization using
    evolution strategies. In: PPSN IX. pp. 939–948 (2006)
31. Kruisselbrink, J., Emmerich, M., Deutz, A., Bäck, T.: A robust optimization ap-
    proach using kriging metamodels for robustness approximation in the CMA-ES.
    In: IEEE CEC. pp. 1–8 (2010)
32. Leary, S., Bhaskar, A., Keane, A.: A derivative based surrogate model for approx-
    imating and optimizing the output of an expensive computer simulation. Journal
    of Global Optimization 30, 39–58 (2004)
33. Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., et al.: Deep neural
    networks as Gaussian processes. In: ICLR. pp. 1–17 (2018)
34. Loshchilov, I., Schoenauer, M., Sebag, M.: Intensive surrogate model exploitation
    in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In: GECCO’13. pp. 439–
    446 (2013)
35. Lu, J., Li, B., Jin, Y.: An evolution strategy assisted by an ensemble of local
    gaussian process models. In: GECCO’13. pp. 447–454 (2013)
36. Magnus, J., Neudecker, H.: Matrix Differential Calculus with Applications in
    Statistics and Econometrics. John Wiley and Sons, Chichester (2007)
37. Matthews, A., Hron, J., Rowland, M., Turner, R.: Gaussian process behaviour in
    wide deep neural networks. In: ICLR. pp. 1–15 (2019)
38. Myers, R., Montgomery, D., Anderson-Cook, C.: Response Surface Methodology:
    Process and Product Optimization Using Designed Experiments. John Wiley and
    Sons, Hoboken (2009)
39. Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., et al.: Bayesian deep convolutional
    networks with many channels are Gaussian processes. In: ICLR. pp. 1–35 (2019)
40. Ong, Y., Nair, P., Keane, A., Wong, K.: Surrogate-assisted evolutionary optimiza-
    tion frameworks for high-fidelity engineering design problems. In: Jin, Y. (ed.)
    Knowledge Incorporation in Evolutionary Computation. pp. 307–331. Springer
    (2005)
41. Pitra, Z., Repický, J., Holeňa, M.: Boosted regression forest for the doubly trained
    surrogate covariance matrix adaptation evolution strategy. In: ITAT 2018. pp. 72–
    79 (2018)
42. Pitra, Z., Hanuš, M., Koza, J., Tumpach, J., Holena, M.: Interaction between model
    and its evolution control in surrogate-assisted cma evolution strategy. In: GECCO
    ’21: Genetic and Evolutionary Computation Conference. pp. 528–536 (2021)
43. Rasheed, K., Ni, X., Vattam, S.: Methods for using surrogate models to speed up
    genetic algorithm optimization: Informed operators and genetic engineering. In:
    Jin, Y. (ed.) Knowledge Incorporation in Evolutionary Computation. pp. 103–123.
    Springer (2005)
44. Rasmussen, C., Williams, C.: Gpml 4.0, matlab toolbox for gaussian processes for
    machine learning, http://www.gaussianprocess.org/gpml/code/matlab/doc/

45. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT Press,
    Cambridge (2006)
46. Ratle, A.: Kriging as a surrogate fitness landscape in evolutionary optimization.
    Artificial Intelligence for Engineering Design, Analysis and Manufacturing 15, 37–
    49 (2001)
47. Ulmer, H., Streichert, F., Zell, A.: Evolution strategies assisted by Gaussian pro-
    cesses with improved pre-selection criterion. In: IEEE CEC. pp. 692–699 (2003)
48. Volz, V., Rudolph, G., Naujoks, B.: Investigating uncertainty propagation in
    surrogate-assisted evolutionary algorithms. In: GECCO’17. pp. 881–888 (2017)
49. Wilson, A., Hu, Z., Salakhutdinov, R., Xing, E.: Deep kernel learning. In: ICAIS.
    pp. 370–378 (2016)
50. Zhou, Z., Ong, Y., Nair, P., Keane, A., Lum, K.: Combining global and local
    surrogate models to accelerate evolutionary optimization. IEEE Transactions on
    Systems, Man and Cybernetics. Part C: Applications and Reviews 37, 66–76 (2007)