=Paper=
{{Paper
|id=Vol-3079/ial2021_paper9
|storemode=property
|title=Combining Gaussian Processes with Neural Networks for Active Learning in Optimization
|pdfUrl=https://ceur-ws.org/Vol-3079/ial2021_paper9.pdf
|volume=Vol-3079
|authors=Jiří Růžička,Jan Koza,Jiří Tumpach,Zbyněk Pitra,Martin Holena
|dblpUrl=https://dblp.org/rec/conf/pkdd/RuzickaKTPH21
}}
==Combining Gaussian Processes with Neural Networks for Active Learning in Optimization==
Combining Gaussian Processes with Neural Networks for Active Learning in Optimization

Jiří Růžička¹, Jan Koza¹, Jiří Tumpach², Zbyněk Pitra¹, and Martin Holeňa¹,²,³

¹ Czech Technical University, Prague, Czech Republic
² Charles University, Prague, Czech Republic
³ Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic

© 2021 for this paper by its authors. Use permitted under CC BY 4.0.

Abstract. One area where active learning plays an important role is black-box optimization of objective functions with expensive evaluations. To deal with such evaluations, continuous black-box optimization has adopted an approach called surrogate modelling or metamodelling, which consists in replacing the true black-box objective in some of its evaluations with a suitable regression model, the selection of evaluations for replacement being an active learning task. This paper concerns surrogate modelling in the context of a surrogate-assisted variant of the continuous black-box optimizer Covariance Matrix Adaptation Evolution Strategy. It reports an experimental investigation of surrogate models combining artificial neural networks with Gaussian processes, for which it considers six different covariance functions. The experiments were performed on the set of 24 noiseless benchmark functions of the platform Comparing Continuous Optimizers (COCO) with 5 different dimensionalities. Their results revealed that the most suitable covariance function for this combined kind of surrogate models is the rational quadratic, followed by the Matérn 5/2 and the squared exponential. Moreover, the rational quadratic and squared exponential covariances were found interchangeable in the sense that for no function, no group of functions, no dimension, and no combination of them was the performance of the respective surrogate models significantly different.

Keywords: active learning, black-box optimization, artificial neural networks, Gaussian processes, covariance functions

1 Introduction

One area where active learning plays a very important role is black-box optimization, in particular optimization of black-box objective functions with expensive evaluations. It is immaterial whether that expensiveness is due to time-consuming computation like in long simulations [15], or due to evaluation in costly experiments like in some areas of science [3]. To deal with such expensive evaluations, continuous black-box optimization has in the late 1990s and early 2000s adopted an approach called surrogate modelling or metamodelling [6, 12, 14, 32, 40, 43, 46]. In this case, the goal of the surrogate model is to decrease the total number of evaluations of the true objective function. Basically, a surrogate model is any regression model that approximates the true black-box objective function with sufficient fidelity and replaces it in some of its evaluations. The decision in which points to evaluate the expensive black-box objective function, and in which to use its surrogate approximation, is an active learning task.

This work-in-progress paper concerns surrogate modelling in the context of a state-of-the-art method for continuous black-box optimization, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [20, 23]. It reports the first results of our investigation of surrogate models based on combining artificial neural networks (ANNs) with Gaussian processes (GPs).
This investigation has been motivated by the importance of surrogate models based on ANNs alone [19, 27–29, 40, 50] and especially on GPs alone [5, 6, 12–14, 31, 32, 35, 47, 48], as well as by the high popularity of ANNs in the last 10–15 years. To our knowledge, this is the first time that ANN+GP combinations have been investigated for possible application in surrogate modelling. On the other hand, research into combining parametric ANN models with nonparametric GP models has been around for nearly a decade, at first due to the increasing popularity of neural networks, later also due to recent theoretical results concerning relationships of asymptotic properties of important kinds of ANNs to properties of GPs [33, 37, 39]. The integration of GPs with neural learning has been proposed on two different levels:

(i) Proper integration of an ANN with a GP, in which the GP forms the final, output layer of the ANN [8, 49].
(ii) Only a transfer of the layered structure, which is a crucial feature of ANNs, to the GP context, leading to the concept of deep GPs (DGPs) [7, 11, 24, 25].

In the reported investigation, we employed proper integration, using a GP as the final layer of an ANN.

The rest of the paper is organized as follows. In the next section, the theoretical fundamentals of GP regression and its integration with ANNs are recalled. Section 3 describes active learning in a surrogate-assisted variant of CMA-ES. Replacing GPs in that variant with several ANN+GP combinations is then experimentally tested in Section 4. Finally, the concluding Section 5 also outlines our future research plans.

2 Gaussian Processes and Their Integration with Neural Networks

2.1 Gaussian Processes

A Gaussian process on a set X ⊂ R^d, d ∈ N, is a collection of random variables (f(x))_{x∈X}, any finite number of which has a joint Gaussian distribution [45]. It is completely specified by a mean function m_GP : X → R, typically assumed constant, and by a covariance function κ : X × X → R such that for x, x′ ∈ X,

    E f(x) = m_GP,                              (1)
    cov(f(x), f(x′)) = κ(x, x′).                (2)

Therefore, a GP is usually denoted GP(m_GP, κ) or GP(m_GP, κ(x, x′)). The value of f(x) is typically accessible only as a noisy observation y = f(x) + ε, where ε is zero-mean Gaussian noise with variance σ_n² > 0. Then

    cov(y, y′) = κ(x, x′) + σ_n² I(x = x′),     (3)

where I(proposition) equals 1 for a true proposition and 0 for a false one.

Consider now the prediction of the random variable f(x⋆) in a point x⋆ ∈ X if we already know the observations y_1, ..., y_n in points x_1, ..., x_n. Introduce the vectors x = (x_1, ..., x_n)^⊤, y = (y_1, ..., y_n)^⊤ = (f(x_1) + ε, ..., f(x_n) + ε)^⊤, k⋆ = (κ(x_1, x⋆), ..., κ(x_n, x⋆))^⊤ and the matrix K ∈ R^{n×n} such that (K)_{i,j} = κ(x_i, x_j) + σ_n² I(i = j). Then the probability density of the vector y of observations is

    p(y; m_GP, κ, σ_n²) = exp(−(1/2)(y − m_GP)^⊤ K^{−1} (y − m_GP)) / √((2π)^n det(K)),     (4)

where det(A) denotes the determinant of a matrix A. Further, as a consequence of the assumption of a joint Gaussian distribution, the conditional distribution of f(x⋆) conditioned on y is also Gaussian, namely

    N(m_GP(x⋆) + k⋆^⊤ K^{−1}(y − m_GP), κ(x⋆, x⋆) − k⋆^⊤ K^{−1} k⋆).     (5)

According to (3), the relationship between the observations y and y′ is determined by the covariance function κ.
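To make the predictive distribution (5) concrete, the following minimal NumPy sketch computes the posterior mean and variance at a query point from noisy observations. It is an illustration only: the squared-exponential covariance, the function names and the numerical example are our assumptions, not part of the implementation used later in the paper.

```python
import numpy as np

def sq_exp(a, b, sigma_f=1.0, ell=1.0):
    """Squared-exponential covariance kappa_SE(a, b) from Section 2.1."""
    r2 = np.sum((a - b) ** 2)
    return sigma_f**2 * np.exp(-r2 / (2 * ell**2))

def gp_predict(X, y, x_star, kappa=sq_exp, m_gp=0.0, sigma_n=1e-2):
    """Posterior mean and variance of f(x_star) according to Eq. (5)."""
    n = len(X)
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    K += sigma_n**2 * np.eye(n)            # (K)_ij = kappa(x_i, x_j) + sigma_n^2 I(i = j)
    k_star = np.array([kappa(X[i], x_star) for i in range(n)])
    alpha = np.linalg.solve(K, y - m_gp)   # K^{-1}(y - m_GP), solved rather than inverted
    mean = m_gp + k_star @ alpha
    var = kappa(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mean, var

# Usage: five noisy samples of a 1D function, prediction at x = 0.5
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(5, 1))
y = np.sin(X[:, 0]) + 1e-2 * rng.standard_normal(5)
print(gp_predict(X, y, np.array([0.5])))
```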
In the reported research, we have considered six kinds of covariance functions, listed below. In their definitions, the notation r = ‖x′ − x‖ is used, and among the parameters of κ, aka hyperparameters of the GP, frequently encountered are σ_f², ℓ > 0, called signal variance and characteristic length scale, respectively. Other parameters will be introduced for each covariance function separately.

(i) Linear: κ_LIN(x, x′) = σ_0² + x^⊤ x′, with a bias σ_0².
(ii) Quadratic is the square of the linear covariance: κ_QUAD(x, x′) = (σ_0² + x^⊤ x′)².
(iii) Rational quadratic: κ_RQ(x, x′) = σ_f² (1 + r²/(2αℓ²))^{−α}, with α > 0.
(iv) Squared exponential: κ_SE(x, x′) = σ_f² exp(−r²/(2ℓ²)).
(v) Matérn 5/2: κ_MA5(x, x′) = σ_f² (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ).
(vi) One composite covariance function, namely the sum of κ_SE and κ_QUAD: κ_SE+Q(x, x′) = κ_SE(x, x′) + κ_QUAD(x, x′).

2.2 GP as the Output Layer of a Neural Network

An approach integrating a GP into an ANN as its output layer has been independently proposed in [8] and [49]. It relies on the following two assumptions:

1. If n_I denotes the number of the ANN input neurons, then the ANN computes a mapping net of n_I-dimensional input values into the set X on which the GP is defined. Consequently, the number n_O of neurons in the last hidden layer fulfills X ⊂ R^{n_O}, and the ANN maps an input v into a point x = net(v) ∈ X, corresponding to an observation f(x) + ε governed by the GP (Figure 1). From the point of view of the ANN inputs, the GP is now GP(m_GP(net(v)), κ(net(v), net(v′))).
2. The GP mean m_GP is assumed to be a known constant, thus not contributing to the GP hyperparameters and independent of net.

[Fig. 1. Schema of the integration of a GP into an ANN as its output layer: input layer (input values v), 1st to last hidden layer (x), and the output GP layer (y and the distribution of y).]

Due to assumption 2, the GP depends only on the parameters θ^κ of the covariance function. As to the ANN, it depends on the one hand on the vector θ^W of its weights and biases, on the other hand on the network architecture, which we will treat as fixed before network training. Consider now n inputs to the neural network, v_1, ..., v_n, mapped to the inputs x_1 = net(v_1), ..., x_n = net(v_n) of the GP, corresponding to the observations y = (y_1, ..., y_n)^⊤. Then the log-likelihood of θ = (θ^κ, θ^W) is

    L(θ) = ln p(y; m_GP, κ, σ_n²)
         = −(1/2)(y − m_GP)^⊤ K^{−1}(y − m_GP) − (n/2) ln(2π) − (1/2) ln det(K + σ_n² I_n),     (6)

where m_GP is the constant assumed in 2., and

    (K)_{i,j} = κ(net(v_i), net(v_j)).     (7)

Let model training, i.e., searching for the vector (θ^κ, θ^W), be performed in the simplest but, in the context of neural networks, also the most frequent way – as gradient descent. The partial derivatives forming ∇_{(θ^κ, θ^W)} L can be computed as

    ∂L/∂θ_ℓ^κ = Σ_{i,j=1}^{n} (∂L/∂K_{i,j}) (∂K_{i,j}/∂θ_ℓ^κ),     (8)
    ∂L/∂θ_ℓ^W = Σ_{i,j,k=1}^{n} (∂L/∂K_{i,j}) (∂K_{i,j}/∂x_k) (∂net(v_k)/∂θ_ℓ^W).     (9)

In (8), the partial derivatives ∂L/∂K_{i,j}, i, j = 1, ..., n, are components of the matrix derivative ∂L/∂K, for which the calculations of matrix differential calculus [36] together with (4) and (6) yield

    ∂L/∂K = (1/2)(K^{−1} y y^⊤ K^{−1} − K^{−1}).     (10)
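For illustration, the sketch below assembles such a proper ANN+GP integration in GPyTorch, the library on which the experiments in Section 4 rely, following its usual exact-GP interface: the network maps v to x = net(v), a constant mean and a rational quadratic kernel act on x, and (θ^κ, θ^W) are trained jointly by maximizing the likelihood (6). The network width, kernel choice, learning rate and class names here are illustrative assumptions, not the configuration used in the experiments.

```python
import torch
import gpytorch

class FeatureExtractor(torch.nn.Module):
    """Fully connected MLP computing x = net(v), cf. assumption 1 in Section 2.2."""
    def __init__(self, d_in, n_out):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(d_in, n_out), torch.nn.Tanh(),
            torch.nn.Linear(n_out, n_out),
        )
    def forward(self, v):
        return self.layers(v)

class ANNGPModel(gpytorch.models.ExactGP):
    """GP whose covariance is evaluated on the last hidden layer of the ANN."""
    def __init__(self, train_v, train_y, likelihood, d_in, n_out=3):
        super().__init__(train_v, train_y, likelihood)
        self.net = FeatureExtractor(d_in, n_out)
        self.mean_module = gpytorch.means.ConstantMean()      # constant m_GP (assumption 2)
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RQKernel())                      # rational quadratic kappa_RQ
    def forward(self, v):
        x = self.net(v)                                       # x = net(v)
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Joint gradient-based training of (theta_kappa, theta_W) by likelihood maximization, Eq. (6)
d = 5
v_train = torch.rand(40, d) * 10 - 5
y_train = torch.sin(v_train).sum(dim=1)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ANNGPModel(v_train, y_train, likelihood, d_in=d)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(v_train), y_train)   # negative log marginal likelihood
    loss.backward()
    optimizer.step()
```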
3 Surrogate Modelling in the CMA-ES Context

3.1 Surrogate Models for Continuous Black-Box Optimization

Basically, the purpose of surrogate modelling – to approximate an unknown functional dependence – coincides with the purpose of response surface modelling in the design of experiments [26, 38]. Therefore, it is not surprising that typical response surface models, i.e., low-order polynomials, also belong to the most traditional and most successful surrogate models [1, 2, 21, 30, 43]. Other frequently used kinds of surrogate models are artificial neural networks of the multilayer perceptron (MLP) or radial basis function network kind [19, 27–29, 40, 50], and the models to which the previous section was devoted – GPs, in surrogate modelling also known as kriging [5, 6, 12–14, 31, 32, 35, 47, 48]. Occasionally encountered are support vector regression [9, 34] and random forests [4, 41].

From the point of view of active learning, the most attractive kind of surrogate models are GPs, due to the fact that a GP estimate f(x) of the value of the true objective function for an input x is not a point, but a random variable. Its Gaussian distribution allows defining alternative criteria according to which individuals for evaluation by the true objective function can be selected, most importantly:

– Probability of improvement of the estimate f(x) with respect to a reference value V (typically the minimal value of the true objective function found so far),

    PoI(f(x); V) = P(f(x) ≤ V),     (11)

  which can be estimated using the Gaussian distribution of the GP.
– Expected improvement with respect to V,

    EI(f(x), V) = E[(V − f(x)) I(f(x) < V)].     (12)
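Both criteria (11) and (12) have closed forms under the Gaussian predictive distribution (5). The following short sketch evaluates them from a predictive mean and standard deviation using SciPy; it is an illustration under our own naming, not the DTS-CMA-ES code.

```python
import numpy as np
from scipy.stats import norm

def poi(mean, std, v):
    """Probability of improvement P(f(x) <= V) for a Gaussian prediction, Eq. (11)."""
    return norm.cdf((v - mean) / std)

def ei(mean, std, v):
    """Expected improvement E[(V - f(x)) I(f(x) < V)] for a Gaussian prediction, Eq. (12)."""
    z = (v - mean) / std
    return (v - mean) * norm.cdf(z) + std * norm.pdf(z)

# Usage: the GP predicts mean 1.2 with std 0.5; the best objective value found so far is V = 1.0
print(poi(1.2, 0.5, 1.0), ei(1.2, 0.5, 1.0))
```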
3.2 Covariance Matrix Adaptation Evolution Strategy and Its Surrogate-Assisted Variant DTS-CMA-ES

The CMA-ES algorithm performs unconstrained optimization on R^d by means of iterative sampling of populations of size λ from a d-dimensional Gaussian distribution N(m, σ²C), and uses a given parent number µ among the sampled points corresponding to the lowest objective function values to update the parameters of that distribution. Hence, it updates the expected value m, which is used as the current point estimate of the function optimum, the matrix C and the step-size σ. The CMA-ES is invariant with respect to monotonous transformations of the objective function. Hence, to make use of the evaluations of the objective function in a set of points, it needs to know only the ordering of those evaluations. Details of the algorithm can be found in [20, 23].

During the more than 20 years of CMA-ES existence, a number of surrogate-assisted variants of this algorithm have been proposed; a survey can be found in [5, 42]. Here, we will pay attention only to the most recent GP-based among them, the Doubly Trained Surrogate CMA-ES (DTS-CMA-ES) [5], a surrogate-assisted variant of CMA-ES. It employs two GPs f_1 and f_2, trained consecutively, to find an evaluation of the population x_1, ..., x_λ, with f_1 used for active learning of training data for f_2. Due to the CMA-ES invariance with respect to monotonous transformations, it evaluates the difference between predictions only according to the difference in the ordering of those predictions, more precisely, according to the ranking difference error (RDE). The RDE of y ∈ R^λ with respect to y′ ∈ R^λ, considering the k best components, is defined as

    RDE^{≤k}(y, y′) = ( Σ_{i: (ρ(y′))_i ≤ k} |(ρ(y′))_i − (ρ(y))_i| ) / ( max_{π∈Π(λ)} Σ_{i=1}^{k} |i − π^{−1}(i)| ),     (13)

where Π(λ) denotes the set of permutations of {1, ..., λ} and ρ(y) denotes the ordering of the components of y, i.e., (∀y ∈ R^λ) ρ(y) ∈ Π(λ) and (ρ(y))_i < (ρ(y))_j ⇒ y_i ≤ y_j.

The algorithm DTS-CMA-ES is described in Algorithm 1, using the following notation:

– A for an archive – a set of points that have already been evaluated by the true black-box objective function BB;
– d_{σ²C} for the Mahalanobis distance given by σ²C:

    d_{σ²C}(x, x′) = √((x − x′)^⊤ σ^{−2} C^{−1} (x − x′));     (14)

– N_k(x; A) for the set of a given number k of d_{σ²C}-nearest neighbours of x ∈ R^d with respect to the archive A;
– f_i(x_1, ..., x_λ) = (f_i(x_1), ..., f_i(x_λ)), for i = 1, 2;
– T_h = ∪_{j=1}^{λ} {x ∈ N_h(x_j; A) | d_{σ²C}(x, x_j) < r_max}, with r_max > 0, for h = 1, ..., |A|;
– k(A) = max{h : |T_h| ≤ N_max}, with N_max ∈ N;
– ρ_PoI for the decreasing ordering of f_1(x_1), ..., f_1(x_λ) according to the probability of improvement with respect to the lowest BB value found so far,

    i < j ⇒ PoI((ρ_PoI(f_1(x_1, ..., x_λ)))_i; V) ≥ PoI((ρ_PoI(f_1(x_1, ..., x_λ)))_j; V),     (15)

  where V = min_{x∈A} BB(x).

Algorithm 1 Algorithm DTS-CMA-ES
Require: x_1, ..., x_λ ∈ R^d, µ, A, σ and C – step size and matrix from the CMA-ES distribution, N_max ∈ N such that N_max ≥ λ, r_max > 0, β, ε_min, ε_max, α_min, α_max ∈ (0, 1)
1: if this is the 1st call of the algorithm in the current CMA-ES run then
2:   set α = ε = 0.05
3: else
4:   take over the returned values of α, ε from its previous call in the run
5: end if
6: Train a Gaussian process f_1 on T_{k(A)}, estimating m_GP, σ_n, σ_f, ℓ through maximization of the likelihood (4)
7: Evaluate BB(x_j) for x_j such that (ρ_PoI(f_1(x_1, ..., x_λ)))_j ≤ ⌈αλ⌉ and not yet BB-evaluated
8: Update A to A ∪ {x_j | (ρ_PoI(f_1(x_1), ..., f_1(x_λ)))_j ≤ ⌈αλ⌉}
9: Train a Gaussian process f_2 on T_{k(A)}, estimating m_GP, σ_n, σ_f, ℓ through maximization of the likelihood (4)
10: For x_j such that (ρ_PoI(f_1(x_1, ..., x_λ)))_j ≤ ⌈αλ⌉, update f_2(x_j) = BB(x_j)
11: Update ε to (1 − β)ε + β RDE^{≤µ}(f_1(x_1, ..., x_λ), f_2(x_1, ..., x_λ)) and α to α_min + (α_max − α_min) max(0, min(1, (ε − ε_min)/(ε_max − ε_min)))
12: Update the value f_2(x_j) to f_2(x_j) − min{f_2(x_{j′}) | (ρ_PoI(f_1(x_1, ..., x_λ)))_{j′} > ⌈αλ⌉} + min{f_2(x_{j′}) | (ρ_PoI(f_1(x_1, ..., x_λ)))_{j′} ≤ ⌈αλ⌉} for j fulfilling (ρ_PoI(f_1(x_1, ..., x_λ)))_j > ⌈αλ⌉
13: Return f_2(x_1), ..., f_2(x_λ), ε, α
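A small sketch of the ranking difference error (13) is given below for concreteness. It follows the reconstruction of the formula above; the brute-force maximization in the denominator and all names are our own illustrative choices, adequate only for the small population sizes used in the example.

```python
import numpy as np
from itertools import permutations

def ordering(y):
    """rho(y): ranks of the components of y, 1 for the smallest value."""
    ranks = np.empty(len(y), dtype=int)
    ranks[np.argsort(y)] = np.arange(1, len(y) + 1)
    return ranks

def rde(y, y_ref, k):
    """Ranking difference error RDE^{<=k}(y, y_ref) of Eq. (13)."""
    lam = len(y)
    r, r_ref = ordering(y), ordering(y_ref)
    num = sum(abs(int(r_ref[i]) - int(r[i])) for i in range(lam) if r_ref[i] <= k)
    # the denominator is the largest value the numerator can take; pi^{-1} runs over all
    # permutations exactly when pi does, so a brute-force maximum suffices for small lambda
    den = max(sum(abs(i - sigma[i - 1]) for i in range(1, k + 1))
              for sigma in permutations(range(1, lam + 1)))
    return num / den

# Usage: surrogate predictions y compared with reference values y_ref, mu = 3 best points considered
y = np.array([0.3, 0.1, 0.9, 0.5, 0.2])
y_ref = np.array([0.2, 0.1, 0.8, 0.9, 0.3])
print(rde(y, y_ref, k=3))
```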
4 Experiments with ANN+GP Integration in the DTS-CMA-ES

4.1 Experimental Setup

For the experiments, we have used the 24 noiseless benchmark functions available on the platform Comparing Continuous Optimizers (COCO) [10, 22]. Those benchmarks form five groups with different properties:

1. separable functions: f1 sphere, f2 separable ellipsoid, f3 separable Rastrigin, f4 Büche-Rastrigin, f5 linear slope;
2. moderately ill-conditioned functions: f6 attractive sector, f7 step ellipsoid, f8 Rosenbrock, f9 rotated Rosenbrock;
3. highly ill-conditioned functions: f10 ellipsoid with high conditioning, f11 discus, f12 bent cigar, f13 sharp ridge, f14 different powers;
4. multi-modal functions with global structure: f15 non-separable Rastrigin, f16 Weierstrass, f17 Schaffers F7, f18 ill-conditioned Schaffers F7, f19 composite Griewank-Rosenbrock;
5. multi-modal weakly structured functions: f20 Schwefel, f21 Gallagher's Gaussian 101-me points, f22 Gallagher's Gaussian 21-hi points, f23 Katsuura, f24 Lunacek bi-Rastrigin.

All benchmark functions were optimized on the closed cube [−5, 5]^d, where d is the dimension of the input space, and the initial CMA-ES population was sampled uniformly on [−4, 4]^d. For each noiseless function, 25 different variants were used, obtained as follows:

– each of the functions is scalable to any dimension d ≥ 2; we have used the five dimensions 2, 3, 5, 10, 20;
– for each function in each dimension, 5 different instances were used, mutually differing through translations and/or rotations.

Each variant f of each benchmark function was optimized for 250d evaluations unless it was terminated earlier due to indicated convergence. To evaluate the success of optimization at its end, we used the approach of [22], calculating the proportion of achieved optimization targets in the interval [10^−8, 10^2]. However, instead of calculating such a score from a given number of discrete targets log-uniformly distributed in that interval as in [22], we calculated its continuous counterpart as the ratio of the logarithmic length λ_log of the subinterval of [10^−8, 10^2] corresponding to the achieved distance f* to the minimum of f at the end of the optimization, to the logarithmic length of the whole interval [10^−8, 10^2],

    r = λ_log([10^−8, 10^2] ∩ [f*, +∞)) / λ_log([10^−8, 10^2]) = max(0, min(10, 2 − log_10 f*)) / 10.     (16)

For all experiments, we used the existing implementation of DTS-CMA-ES at https://github.com/bajeluk/surrogate-cmaes, into which we implemented the ANN+GP surrogate models using the system GPyTorch [17]; our implementation is available at https://github.com/c0zzy/surrogate-networks. As to the tunable parameters of DTS-CMA-ES, we used the same values as [5]. As to the tunable parameters of GPyTorch, we fixed the five listed in Table 1 and utilized the default values of the remaining ones.

Table 1. Settings for important GPyTorch parameters. For all its remaining tunable parameters, the default values were used

    Optimizer            Adam
    Learning rate        0.0005
    Iterations           1000
    Noise                0.0001
    Length scale bounds  0.01, 100

Finally, for the neural networks in the ANN+GP combinations, we used fully connected multilayer perceptrons with one hidden layer of a sufficiently low number of neurons to assure their trainability with the comparatively small archives typically available in the DTS-CMA-ES. More precisely, all the ANNs used the fully connected topologies (d − n_O − n_O) with

    n_O = 2 if d = 2,   n_O = 3 if d = 3, 5,   n_O = 5 if d = 10, 20.     (17)
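For concreteness, the two small helpers below evaluate the continuous score (16) and the hidden-layer width (17); the function names are ours and only illustrate the formulas.

```python
import math

def coco_score(f_star):
    """Continuous score r of Eq. (16): fraction of the log-scaled target interval [1e-8, 1e2] reached,
    where f_star is the distance of the best found value to the function minimum."""
    return max(0.0, min(10.0, 2.0 - math.log10(f_star))) / 10.0

def hidden_width(d):
    """Width n_O of the two ANN layers in the (d - n_O - n_O) topologies of Eq. (17)."""
    return 2 if d == 2 else (3 if d in (3, 5) else 5)

# Usage: a 10-dimensional run ending at distance 1e-3 to the optimum
print(hidden_width(10), coco_score(1e-3))   # 5 0.5
```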
4.2 First Results and Their Discussion

The first work-in-progress results presented in this paper compare ANN+GP combinations with different covariance functions of the GP. Table 2 reports the scores of each such combination for each noiseless benchmark function, averaged over the 25 combinations of 5 instances and 5 dimensions. In Tables 3 and 4, the scores are reported for the above-defined groups of benchmark functions and for the considered dimensions, respectively. Hence, the score averaging in Table 3 includes in addition all functions of the respective group, whereas in Table 4 the scores are averaged over the 120 instances of the 24 benchmark functions. In all three tables, the ANN+GP combination with the highest average score is in bold.

Table 2. Score of the 6 compared ANN+GP combinations for each of the noiseless COCO benchmark functions, averaged over the 25 combinations of the 5 instances of that function and the 5 considered dimensions. For each function, the ANN+GP combination with the highest average score is in bold

         κLIN  κQUAD  κSE   κMA5  κRQ   κSE+Q            κLIN  κQUAD  κSE   κMA5  κRQ   κSE+Q
    f1   0.38  0.45   0.54  0.54  0.56  0.47      f13    0.13  0.15   0.22  0.25  0.27  0.23
    f2   0.02  0.11   0.22  0.20  0.26  0.19      f14    0.41  0.45   0.55  0.54  0.54  0.47
    f3   0.09  0.09   0.10  0.10  0.11  0.09      f15    0.09  0.09   0.09  0.10  0.10  0.09
    f4   0.07  0.07   0.08  0.08  0.08  0.09      f16    0.14  0.14   0.20  0.12  0.15  0.12
    f5   1.00  1.00   1.00  1.00  1.00  1.00      f17    0.26  0.30   0.33  0.34  0.33  0.32
    f6   0.09  0.10   0.13  0.17  0.15  0.16      f18    0.21  0.25   0.27  0.25  0.28  0.27
    f7   0.28  0.38   0.51  0.43  0.54  0.45      f19    0.19  0.20   0.20  0.20  0.20  0.21
    f8   0.15  0.17   0.29  0.29  0.28  0.27      f20    0.17  0.17   0.18  0.17  0.17  0.18
    f9   0.15  0.15   0.18  0.24  0.30  0.26      f21    0.19  0.18   0.16  0.18  0.16  0.17
    f10  0.02  0.07   0.10  0.07  0.10  0.07      f22    0.15  0.11   0.16  0.13  0.16  0.14
    f11  0.05  0.09   0.12  0.10  0.13  0.12      f23    0.17  0.16   0.17  0.17  0.17  0.16
    f12  0.03  0.07   0.08  0.10  0.13  0.09      f24    0.07  0.07   0.07  0.07  0.07  0.07

Table 3. Score of the 6 compared ANN+GP combinations for each group of the noiseless COCO benchmark functions, averaged over the 100 or 125 (depending on the group) combinations of all instances of the functions belonging to that group and the 5 considered dimensions. For each group, the ANN+GP combination with the highest average score is in bold

                                                  κLIN   κQUAD  κSE    κMA5   κRQ    κSE+Q
    Separable                                     0.317  0.348  0.390  0.388  0.403  0.370
    Moderately ill-conditioned                    0.172  0.204  0.281  0.290  0.323  0.288
    Highly ill-conditioned                        0.134  0.170  0.218  0.218  0.241  0.197
    Multi-modal functions globally structured     0.182  0.200  0.221  0.204  0.218  0.206
    Multi-modal weakly structured                 0.153  0.143  0.151  0.149  0.149  0.147

Table 4. Score of the 6 compared ANN+GP combinations for each considered dimension, averaged over the 120 considered instances of the 24 noiseless COCO benchmark functions.
For each dimension, the ANN+GP combination with the highest average score is in bold

         κLIN   κQUAD  κSE    κMA5   κRQ    κSE+Q
    2    0.268  0.322  0.386  0.389  0.398  0.365
    3    0.23   0.254  0.326  0.307  0.365  0.302
    5    0.178  0.194  0.225  0.223  0.244  0.224
    10   0.162  0.166  0.185  0.184  0.185  0.174
    20   0.124  0.131  0.133  0.138  0.131  0.135

From Tables 2–4, we can see that the highest average score has by far most frequently been achieved by ANN+GP combinations with the rational quadratic covariance function κRQ: for 14 out of the 24 functions, 3 out of the 5 groups of functions, and 4 out of the 5 considered dimensions. The rational quadratic covariance function shares the first place with the squared exponential covariance κSE for the function f22 Gallagher's Gaussian 21-hi points, as well as for the dimension 10. In addition, for the function f5 linear slope, ANN+GP combinations achieve the highest possible average score 1 with all considered covariance functions. Apart from these shared first places, other covariance functions lead to ANN+GP combinations with the highest average score in the following cases:

– the covariance κSE for the function f14 different powers as well as for the group of multi-modal globally structured functions;
– the Matérn 5/2 covariance κMA5 for the functions f6 attractive sector, f8 Rosenbrock, f17 Schaffers F7, f23 Katsuura, and f24 Lunacek bi-Rastrigin, as well as for the dimension 20;
– the composite covariance κSE+Q for the functions f4 Büche-Rastrigin, f19 composite Griewank-Rosenbrock, and f20 Schwefel;
– the linear covariance κLIN for the function f21 Gallagher's Gaussian 101-me points, as well as for the group of multi-modal weakly structured functions, to which f21 also belongs.

The results in Tables 2–4 reflect both systematic differences due to different covariance functions and random differences due to noise. To assess the influence of different covariance functions without the interference of noise, the obtained differences were tested for statistical significance. To this end, the null hypotheses that the means of the random variables that produced the scores for the individual ANN+GP combinations are all identical were tested, using the Friedman test with post-hoc identification of the pairs that lead to the rejection of the null hypothesis if it is rejected. Both the Friedman test and its post-hoc tests were performed on the family-wise significance level of 5% for multiple-hypotheses testing, and the family-wise significances were assessed from the achieved significances of the individual tests (p-values) by means of the Holm method [16].

For individual functions, the Friedman test was always based on the 25 combinations of their 5 instances and the 5 considered dimensions. The test rejected the null hypothesis for all functions except f5 linear slope. The results of the post-hoc tests are summarized in Table 5. The value in each row r and column c ≠ r tells for how many among the remaining 23 functions the covariance function in the row was, as part of an ANN+GP combination, significantly better than the one in the column, or equivalently, the one in the column significantly worse than the one in the row. This means that the ANN+GP combination with the covariance function in the row yielded a higher average score than with the one in the column, and the post-hoc test rejected on the family-wise significance level 5% the hypothesis of equal mean values of the random variables that produced both scores.

For individual function groups, the Friedman tests were based on the 100 or 125 combinations of the functions belonging to the group, their 5 instances and the 5 considered dimensions. For individual dimensions, they were based on the 120 combinations of the 24 functions and their 5 instances. Both the 5 tests for the function groups and the 5 tests for the dimensions rejected their null hypotheses. The results of their post-hoc tests are summarized in Table 6. In Tables 5 and 6, the value in a cell is in bold if the covariance function in the row was more often significantly better than significantly worse than the one in the column.
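The sketch below indicates how such a Friedman test with Holm-corrected post-hoc pairwise comparisons can be run in Python. It is an illustration only: the data are placeholders, and pairwise Wilcoxon signed-rank tests stand in here for the rank-based post-hoc statistics of [16] actually referenced above.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# scores[i, j]: score of covariance j on the i-th (instance, dimension) combination
rng = np.random.default_rng(1)
covariances = ["LIN", "QUAD", "SE", "MA5", "RQ", "SE+Q"]
scores = rng.random((25, 6))   # placeholder data: 25 combinations x 6 covariances

# Friedman test of the null hypothesis that all six covariances have identical mean scores
stat, p = friedmanchisquare(*[scores[:, j] for j in range(6)])
print(f"Friedman: chi2 = {stat:.2f}, p = {p:.3f}")

if p < 0.05:
    # post-hoc pairwise comparisons with Holm correction of the family-wise error rate
    pairs, p_vals = [], []
    for a in range(6):
        for b in range(a + 1, 6):
            pairs.append((covariances[a], covariances[b]))
            p_vals.append(wilcoxon(scores[:, a], scores[:, b]).pvalue)
    reject, p_holm, _, _ = multipletests(p_vals, alpha=0.05, method="holm")
    for (a, b), pr, rej in zip(pairs, p_holm, reject):
        print(f"{a} vs {b}: Holm-adjusted p = {pr:.3f}, significant = {rej}")
```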
Table 5. The number of functions among those 23 for which the null hypothesis of the Friedman test was rejected, for which the covariance function in the row was, as part of an ANN+GP combination, significantly better than the one in the column. That value is in bold if it is in addition higher than the number of functions for which the covariance function in the row was significantly worse than the one in the column

    Covariance function   κLIN  κQUAD  κSE  κMA5  κRQ  κSE+Q
    κLIN                  –     0      0    0     0    0
    κQUAD                 0     –      0    0     0    0
    κSE                   2     0      –    0     0    14
    κMA5                  1     0      0    –     9    14
    κRQ                   6     1      0    14    –    18
    κSE+Q                 0     0      9    9     5    –

Table 6. The number of function groups (left), respectively considered dimensions (right), for which the covariance function in the row was, as part of an ANN+GP combination, significantly better than the one in the column. That value is in bold if it is in addition higher than the number of function groups, respectively dimensions, for which the covariance function in the row was significantly worse than the one in the column

    Covariance   for function groups                      for dimensions
    function     κLIN  κQUAD  κSE  κMA5  κRQ  κSE+Q       κLIN  κQUAD  κSE  κMA5  κRQ  κSE+Q
    κLIN         –     0      0    0     0    0           –     0      0    0     0    0
    κQUAD        0     –      0    0     0    0           0     –      0    0     0    0
    κSE          4     2      –    0     0    4           4     2      –    0     0    4
    κMA5         4     2      0    –     0    4           4     2      0    –     1    4
    κRQ          4     2      0    5     –    5           4     2      0    4     –    4
    κSE+Q        4     1      1    1     0    –           4     1      1    1     1    –

From Tables 5 and 6, we can observe that the covariance function κLIN was never either significantly better or significantly worse than κQUAD. This can be interpreted as interchangeability of both functions from the point of view of being used as covariances in the ANN+GP surrogate models for DTS-CMA-ES. The same relationship holds also for the pairs of covariance functions (κSE, κMA5) and (κSE, κRQ). It is particularly important in connection with the last mentioned pair, in view of the fact that κRQ was the covariance most often yielding the highest score, and κSE was in this respect the 3rd among the individual benchmark functions and the 2nd among groups of functions and among the considered dimensions. Altogether, these two interchangeable covariance functions yielded the highest score for 15 of the 24 considered benchmarks, for 4 of the 5 benchmark function groups and for 4 of the 5 considered dimensions.

5 Conclusion and Future Work

This work-in-progress paper presented an experimental investigation of surrogate models combining artificial neural networks with Gaussian processes in the context of a sophisticated surrogate-assisted variant of the black-box optimizer CMA-ES, the DTS-CMA-ES, which consecutively trains two surrogate models, using the first for active learning of training data for the second. In the experiments, a comprehensive comparison of ANN+GP surrogate models with six different covariance functions was performed on the 24 noiseless benchmark functions of the COCO platform [10, 22] in 5 dimensions. The results revealed that the most suitable covariance function for this combined kind of surrogate models is the rational quadratic, followed by the Matérn 5/2 and the squared exponential. Moreover, the rational quadratic and squared exponential were found interchangeable in the sense that for no function, no group of functions, no dimension, and no function-dimension combination was either of these covariance functions significantly better than the other.

As is usual with work in progress, very much is still left for the future. Most importantly, the ANN+GP surrogate models need to be compared with pure GP surrogates.
Superficially, that should be no problem, because the original implementation of DTS-CMA-ES uses GPs alone. However, the DTS-CMA-ES implementation relying on the system GPML [44] and our implementation of the ANN+GP surrogates relying on the system GPyTorch [17] do not allow a comparison in which the difference between pure GP and ANN+GP surrogates would reflect only the added combination of both kinds of models. According to our experience, that difference is due much more to the incompatibility between GPML and GPyTorch. GPML does not include any ANN extension, whereas GPyTorch also includes pure GPs; however, their predictive accuracy is substantially lower than that of their counterparts with the same covariance function implemented in GPML. Hence, arriving at an unbiased comparison of ANN+GP and pure GP surrogate models will still need a lot of implementation effort.

Further directions of our future research include, on the one hand, deep GPs, mentioned already in the Introduction, and on the other hand, a reconsideration of the approach employed in GPyTorch – training the ANN and the GP forming the combined surrogate model together by means of likelihood maximization. Whereas maximum likelihood is indeed the commonly used objective function for GP learning [45], successful and efficient ANN learning algorithms typically rely on other objectives [18]. Therefore, we would also like to investigate a cyclic interleaving of a GP-learning phase with an ANN-learning phase, where the length of the latter will depend on the relationship of its success to the success of the former. Finally, we intend to perform research into transfer learning of such combined surrogate models: an ANN+GP model with a deep neural network will be trained on data from many optimization runs, and then the model used in a new run of the same optimizer will be obtained through additional learning restricted only to the GP and the last 1–2 layers of the ANN.

Acknowledgment

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S. For J. Tumpach, it has also been partially supported by the Charles University student grant 260575. Computational resources were supplied by the project e-INFRA LM2018140 provided within the program Projects of Large Research, Development and Innovations Infrastructures.

References

1. Auger, A., Brockhoff, D., Hansen, N.: Benchmarking the local metamodel CMA-ES on the noiseless BBOB'2013 test bed. In: GECCO'13. pp. 1225–1232 (2013)
2. Auger, A., Schoenauer, M., Vanhaecke, N.: LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. In: Parallel Problem Solving from Nature – PPSN VIII. pp. 182–191 (2004)
3. Baerns, M., Holeňa, M.: Combinatorial Development of Solid Catalytic Materials. Design of High-Throughput Experiments, Data Analysis, Data Mining. Imperial College Press / World Scientific, London (2009)
4. Bajer, L., Pitra, Z., Holeňa, M.: Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In: GECCO'15 Companion. pp. 1143–1150 (2015)
5. Bajer, L., Pitra, Z., Repický, J., Holeňa, M.: Gaussian process surrogate models for the CMA evolution strategy. Evolutionary Computation 27, 665–697 (2019)
6. Booker, A., Dennis, J., Frank, P., Serafini, D., Torczon, V., Trosset, M.: A rigorous framework for optimization by surrogates. Structural and Multidisciplinary Optimization 17, 1–13 (1999)
7. Bui, T., Hernandez-Lobato, D., Hernandez-Lobato, J., Li, Y., Turner, R.: Deep Gaussian processes for regression using approximate expectation propagation. In: ICML. pp. 1472–1481 (2016)
8. Calandra, R., Peters, J., Rasmussen, C., Deisenroth, M.: Manifold Gaussian processes for regression. In: IJCNN. pp. 3338–3345 (2016)
9. Clarke, S., Griebsch, J., Simpson, T.: Analysis of support vector regression for approximation of complex engineering analyses. Journal of Mechanical Design 127, 1077–1087 (2005)
10. The COCO platform (2016), http://coco.gforge.inria.fr
11. Cutajar, K., Bonilla, E., Michiardi, P., Filippone, M.: Random feature expansions for deep Gaussian processes. In: ICML. pp. 884–893 (2017)
12. El-Beltagy, M., Nair, P., Keane, A.: Metamodeling techniques for evolutionary optimization of computationally expensive problems: Promises and limitations. In: Proceedings of the Genetic and Evolutionary Computation Conference. pp. 196–203. Morgan Kaufmann Publishers (1999)
13. Emmerich, M., Giannakoglou, K., Naujoks, B.: Single- and multi-objective evolutionary optimization assisted by Gaussian random field metamodels. IEEE Transactions on Evolutionary Computation 10, 421–439 (2006)
14. Emmerich, M., Giotis, A., Özdemir, M., Bäck, T., Giannakoglou, K.: Metamodel-assisted evolution strategies. In: PPSN VII. pp. 361–370. ACM (2002)
15. Forrester, A., Sobester, A., Keane, A.: Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley and Sons, New York (2008)
16. Garcia, S., Herrera, F.: An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
17. Gardner, J.R., Pleiss, G., Bindel, D., Weinberger, K.Q., Wilson, A.G.: GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration (2019)
18. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
19. Gutmann, H.: A radial basis function method for global optimization. Journal of Global Optimization 19, 201–227 (2001)
20. Hansen, N.: The CMA evolution strategy: A comparing review. In: Towards a New Evolutionary Computation. pp. 75–102. Springer (2006)
21. Hansen, N.: A global surrogate assisted CMA-ES. In: GECCO'19. pp. 664–672 (2019)
22. Hansen, N., Auger, A., Ros, R., Mersmann, O., Tušar, T., Brockhoff, D.: COCO: a platform for comparing continuous optimizers in a black box setting. Optimization Methods and Software 35, doi:10.1080/10556788.2020.1808977 (2020)
23. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9, 159–195 (2001)
24. Hebbal, A., Brevault, L., Balesdent, M., Talbi, E., Melab, N.: Efficient global optimization using deep Gaussian processes. In: IEEE CEC. pp. 1–12, doi 10.1109/CEC40672.2018 (2018)
25. Hernández-Muñoz, G., Villacampa-Calvo, C., Hernández-Lobato, D.: Deep Gaussian processes using expectation propagation and Monte Carlo methods. In: ECML PKDD. pp. 1–17, paper no. 128 (2020)
26. Hosder, S., Watson, L., Grossman, B.: Polynomial response surface approximations for the multidisciplinary design optimization of a high speed civil transport. Optimization and Engineering 2, 431–452 (2001)
27. Jin, Y., Hüsken, M., Olhofer, M., Sendhoff, B.: Neural networks for fitness approximation in evolutionary optimization. In: Jin, Y. (ed.) Knowledge Incorporation in Evolutionary Computation, pp. 281–306. Springer (2005)
28. Jin, Y., Olhofer, M., Sendhoff, B.: Managing approximate models in evolutionary aerodynamic design optimization. In: CEC 2001. pp. 592–599 (2001)
29. Jin, Y., Olhofer, M., Sendhoff, B.: A framework for evolutionary optimization with approximate fitness functions. IEEE Transactions on Evolutionary Computation 6, 481–494 (2002)
30. Kern, S., Hansen, N., Koumoutsakos, P.: Local metamodels for optimization using evolution strategies. In: PPSN IX. pp. 939–948 (2006)
31. Kruisselbrink, J., Emmerich, M., Deutz, A., Bäck, T.: A robust optimization approach using kriging metamodels for robustness approximation in the CMA-ES. In: IEEE CEC. pp. 1–8 (2010)
32. Leary, S., Bhaskar, A., Keane, A.: A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. Journal of Global Optimization 30, 39–58 (2004)
33. Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., et al.: Deep neural networks as Gaussian processes. In: ICLR. pp. 1–17 (2018)
34. Loshchilov, I., Schoenauer, M., Sebag, M.: Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In: GECCO'13. pp. 439–446 (2013)
35. Lu, J., Li, B., Jin, Y.: An evolution strategy assisted by an ensemble of local Gaussian process models. In: GECCO'13. pp. 447–454 (2013)
36. Magnus, J., Neudecker, H.: Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, Chichester (2007)
37. Matthews, A., Hron, J., Rowland, M., Turner, R.: Gaussian process behaviour in wide deep neural networks. In: ICLR. pp. 1–15 (2019)
38. Myers, R., Montgomery, D., Anderson-Cook, C.: Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley and Sons, Hoboken (2009)
39. Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., et al.: Bayesian deep convolutional networks with many channels are Gaussian processes. In: ICLR. pp. 1–35 (2019)
40. Ong, Y., Nair, P., Keane, A., Wong, K.: Surrogate-assisted evolutionary optimization frameworks for high-fidelity engineering design problems. In: Jin, Y. (ed.) Knowledge Incorporation in Evolutionary Computation. pp. 307–331. Springer (2005)
41. Pitra, Z., Repický, J., Holeňa, M.: Boosted regression forest for the doubly trained surrogate covariance matrix adaptation evolution strategy. In: ITAT 2018. pp. 72–79 (2018)
42. Pitra, Z., Hanuš, M., Koza, J., Tumpach, J., Holeňa, M.: Interaction between model and its evolution control in surrogate-assisted CMA evolution strategy. In: GECCO '21: Genetic and Evolutionary Computation Conference. pp. 528–536 (2021)
43. Rasheed, K., Ni, X., Vattam, S.: Methods for using surrogate models to speed up genetic algorithm optimization: Informed operators and genetic engineering. In: Jin, Y. (ed.) Knowledge Incorporation in Evolutionary Computation. pp. 103–123. Springer (2005)
44. Rasmussen, C., Williams, C.: GPML 4.0, MATLAB toolbox for Gaussian processes for machine learning, http://www.gaussianprocess.org/gpml/code/matlab/doc/
45. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
46. Ratle, A.: Kriging as a surrogate fitness landscape in evolutionary optimization. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 15, 37–49 (2001)
47. Ulmer, H., Streichert, F., Zell, A.: Evolution strategies assisted by Gaussian processes with improved pre-selection criterion. In: IEEE CEC. pp. 692–699 (2003)
48. Volz, V., Rudolph, G., Naujoks, B.: Investigating uncertainty propagation in surrogate-assisted evolutionary algorithms. In: GECCO'17. pp. 881–888 (2017)
49. Wilson, A., Hu, Z., Salakhutdinov, R., Xing, E.: Deep kernel learning. In: ICAIS. pp. 370–378 (2016)
50. Zhou, Z., Ong, Y., Nair, P., Keane, A., Lum, K.: Combining global and local surrogate models to accelerate evolutionary optimization. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews 37, 66–76 (2007)