Suitability of Modern Neural Networks for Active and Transfer Learning in Surrogate-Assisted Black-Box Optimization

Martin Holeňa1,2, Jan Koza2
1 Czech Academy of Sciences, Institute of Computer Science, Prague, Czech Republic
2 Czech Technical University, Faculty of Information Technology, Prague, Czech Republic

Abstract
Active learning plays a crucial role in black-box optimization, especially for objective functions that are expensive to evaluate. Continuous black-box optimization has adopted an approach called surrogate modelling, where the original black-box objective is approximated with a regression model. An active learning task in this context is to decide which points should be evaluated using the original objective to update the surrogate model. Apart from low-order polynomials, the first surrogate models were artificial neural networks of two kinds: multilayer perceptrons and radial basis function networks. In the late 2000s, neural networks were superseded by other kinds of surrogate models, primarily Gaussian processes. However, over the last 15 years, neural networks have seen significant and successful development, suggesting that they once again have the potential to serve as promising surrogate models. This paper reviews possible research directions concerning that potential and recalls initial results from investigations in some of these directions. Finally, it contributes to those results by investigating the state-of-the-art black-box optimizer CMA-ES surrogate-assisted by two variants of random-activation-function neural network ensembles.

1. Introduction

One area where active learning plays a really important role is black-box optimization (BBO), i.e., optimization of objective functions for which no analytical description is provided. It employs optimization methods that need as input only points in the search space paired with the respective values of the objective function, obtained in a non-analytical way, e.g.
from sensors, in experiments, or through numerical simulations. Most frequently used are evolutionary optimization approaches, such as evolution strategies, genetic algorithms, and differential evolution, or other metaheuristics, such as particle swarm optimization. Because BBO methods receive only information about values of the objective function, they typically need many such values. This is a problem in situations when evaluating the black-box objective function is time-consuming and/or expensive. That is frequently the case if it is evaluated empirically in experiments. For example, for the evolutionary optimization tasks described in the book [1], the evaluation of a comparatively small generation of a genetic algorithm can sometimes take more than a week and cost more than 10,000 €. To deal with expensive evaluations, continuous BBO has in the late 1990s and early 2000s adopted an approach called surrogate modelling or metamodelling [2, 3, 4, 5, 6, 7, 8]. In principle, a surrogate model is any regression model that approximates the original black-box objective function with sufficient fidelity, restricting the necessity of its evaluation to only a small proportion of points, whereas everywhere else, only the surrogate model is used. Selecting the points in which the original objective function should be evaluated is a step in which active learning is involved. However, it is not active learning of a regression model, although the surrogate model itself is a regression model. The reason is that its utility functions are not based on the model, unlike the commonly used utility functions of uncertainty decrease, model performance, diversity, or surprise-novelty. Instead, they are based on the BBO task, the most common being minimizing the objective function for a given evaluation budget, and minimizing the evaluation budget for a given

IAL@ECML-PKDD’24: 8th Intl. Worksh. & Tutorial on Interactive Adaptive Learning, Sep.
9th, 2024, Vilnius, Lithuania
martin@cs.cas.cz (M. Holeňa); kozajan@fit.cvut.cz (J. Koza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

objective-function threshold. Nevertheless, even active learning in surrogate-assisted BBO follows the basic principle of active learning: to actively select the next model inputs according to the considered utility function. The earliest kinds of surrogate models in continuous BBO were low-order polynomials and artificial neural networks (ANNs) of the multilayer perceptron (MLP) kind. The former have always remained a suitable choice in situations when enough evaluations of the original black-box objective function are affordable for the approximation properties of polynomials to take effect. On the other hand, surrogate modelling for substantially fewer evaluations of the original objective has during the last two decades undergone further development. MLPs were soon replaced with another kind of ANNs, radial basis function networks (RBFs), which better fit local peculiarities of an objective function landscape. Those networks, however, have since the late 2000s been superseded by other kinds of surrogate models, primarily Gaussian processes (GPs), but also ranking support vector machines (RSVMs) and random forests (RFs). GPs are currently the most successful kind of surrogate models for BBO of functions with complicated multimodal landscapes under small evaluation budgets, mainly due to their ability to assess the uncertainty of the estimate of the original objective function at a given point, or more precisely, to provide the probability distribution of this estimate. That property of GPs allows combining the original BBO method, e.g. an evolutionary one, with Bayesian optimization.
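The predictive distribution that makes GPs attractive as surrogate models can be illustrated with a minimal sketch (this is a generic textbook construction, not an implementation from any of the cited works): a zero-mean GP with a squared-exponential covariance function, conditioned on a few observations, yields at every point a Gaussian estimate whose variance grows away from the data.

```python
import numpy as np

def sq_exp_kernel(A, B, length=1.0):
    # squared-exponential covariance k(a, b) = exp(-|a - b|^2 / (2 * length^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xs, length=1.0, noise=1e-6):
    # posterior predictive mean and variance of a zero-mean GP at points Xs,
    # conditioned on the observations (X, y)
    K = sq_exp_kernel(X, X, length) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(Xs, X, length)
    Kss = sq_exp_kernel(Xs, Xs, length)
    mean = Ks @ np.linalg.solve(K, y)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# a toy 1-D objective observed at 5 points
X = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
y = np.sin(X).ravel()
mean, var = gp_posterior(X, y, np.array([[0.0], [3.0]]))
# the estimate at 0.0 (inside the data) is far more certain than at 3.0 (outside)
assert var[0] < var[1]
```

It is exactly this point-wise variance that acquisition functions in Bayesian optimization, discussed later, exploit.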
Consequently, only little attention has been paid to ANN-based surrogate models in continuous BBO during the last 15 years. This contrasts with the intense and successful development of the ANN area during that time, which suggests that ANNs again have the potential to serve as promising surrogate models. This paper attempts to make a small contribution to research into that potential, presenting in addition a review of possible directions for such research, connected with different classes of neural networks. Moreover, it also points out that ANNs can serve as the basis for transfer learning between surrogate-assisted BBO of different functions. The next section surveys important aspects and key methods concerning surrogate-assisted continuous BBO. The review of possible research directions concerning the usability of modern neural networks in surrogate-assisted BBO is presented in Section 3. Finally, Section 4 reports an experimental contribution to one of those research directions.

2. Surrogate-Assisted Continuous BBO

Surrogate modelling for continuous BBO relies on the combination and interaction of three components: a regression model serving as a surrogate of the original black-box objective function, a BBO method seeking the optimum of that objective function, and a strategy deciding when to evaluate the original objective function and when its surrogate model. That strategy is, in the context of evolutionary BBO, usually called evolution control [9, 10, 11, 12, 13]. There are two other aspects, namely observing constraints on the feasible set of the black-box objective function (cf. e.g. [14, 15]) and generalizing surrogate modelling from a single objective to multiple objectives (cf. e.g. [16, 17]); however, we will restrict our attention to single-objective unconstrained optimization.
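To make the interplay of the three components concrete, the following sketch wires them together in a deliberately toy form: random Gaussian search in place of a real BBO method, 1-nearest-neighbour regression in place of a real surrogate model, and a fixed schedule (every gap-th generation evaluated with the true objective) as a primitive generation-based evolution control. All names and design choices here are illustrative assumptions, not taken from any cited method.

```python
import random

def nn_surrogate(archive):
    # 1-nearest-neighbour regression: a crude stand-in for a real surrogate model
    def model(x):
        nearest = min(archive, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return nearest[1]
    return model

def surrogate_assisted_search(f, dim, budget, pop=8, gap=3, sigma=0.5):
    random.seed(0)
    mean = [2.0] * dim                    # start away from the optimum of the toy objective
    archive = []                          # points evaluated with the true objective
    gen, spent = 0, 0
    while spent < budget:
        cand = [[m + random.gauss(0.0, sigma) for m in mean] for _ in range(pop)]
        if gen % gap == 0 or len(archive) < 2:
            vals = [f(x) for x in cand]   # generation evaluated with the true objective
            archive += list(zip(cand, vals))
            spent += pop
        else:
            model = nn_surrogate(archive) # generation evaluated with the surrogate only
            vals = [model(x) for x in cand]
        mean = min(zip(cand, vals), key=lambda p: p[1])[0]  # greedy move to the best candidate
        gen += 1
    return min(archive, key=lambda p: p[1])

best_x, best_f = surrogate_assisted_search(lambda x: sum(v * v for v in x), dim=2, budget=40)
assert 0.0 <= best_f < 8.0   # improves on the starting point (2, 2), where f = 8
```

Real methods replace each placeholder: CMA-ES as the optimizer, a GP or ANN as the surrogate, and an adaptive evolution control instead of the fixed schedule.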
As already mentioned in the introduction, the regression models that are the most suitable kind of surrogate models if sufficiently many evaluations of the original black-box objective function are affordable are low-order polynomials, typically quadratic functions [18, 19, 20, 21, 22]. For substantially fewer evaluations, the most traditional kind have been MLPs [23, 9], soon replaced with RBFs [24, 25, 26, 21, 22], and since the late 2000s with GPs, a.k.a. kriging [27, 28, 11, 29, 30]. Occasionally, RBFs were used as local models in combination with GP-based global models [31]. Other kinds of surrogate models employed during the last decade include decision trees [32], RFs [33, 34, 32], and RSVMs [35, 36]. The last kind has the exceptional property of invariance with respect to order-preserving transformations of the objective function. This is important in situations when the BBO algorithm possesses such invariance, a frequently encountered property of evolutionary algorithms. On the other hand, the surrogate modelling methods proposed in [11] and [28] use GPs to perform preselection based on a partial ordering that is also invariant with respect to order-preserving transformations. More importantly, the adaptive function value warping approach recently proposed in [37] aims at providing such invariance to any surrogate model. As a final remark on different kinds of surrogate models, important works on this topic always consider several kinds [38, 12, 39, 20, 32], to compare them and select the best among them, and in [22, 39] also to aggregate their results, thus providing a team of surrogate models. As to the BBO methods, not only the two most important kinds of surrogate models, i.e. low-order polynomials [18, 19, 20] and GPs [26, 28, 11, 29, 30], but also the less common RBFs, RFs, and RSVMs [24, 36, 33, 34] are most often combined with the Covariance Matrix Adaptation Evolution Strategy (CMA-ES).
That is not surprising, because CMA-ES has already in the 2000s become a state-of-the-art approach to single-objective unconstrained continuous BBO. Basically, CMA-ES evolves a Gaussian estimate of the position of the minimum of the original objective function. That evolution relies on simultaneous adaptation of the vector mean of the Gaussian estimate, of the scalar step size, and of the covariance matrix. For more details of this sophisticated evolution strategy, the reader is referred to the journal papers [40, 41]. GPs were also combined with other evolutionary optimization methods [27, 42], and GPs, polynomials, and RBFs were combined with particle swarm optimization [22] and with memetic optimization [25]. Moreover, GPs are used in black-box optimization in two different ways. In connection with evolutionary and similar BBO methods, they serve as a regression model evaluated instead of the original objective function. In addition, they also play a key role in Bayesian optimization, which then relies on GP-estimates of probability distributions of values of the original objective. Those probability distributions enable several ways of searching for optima of that objective function, each of them governed by a specific assessment of uncertainty of the objective function estimate, commonly called an acquisition function [43, 44, 45]. Occasionally, Bayesian optimization is combined with CMA-ES. For example, in [46], optimization switches from the most traditional Bayesian optimization method, EGO (Efficient Global Optimization) [43], to CMA-ES. Finally, evolution control has, since the first surrogate-assisted BBO methods, been performed basically in two ways: generation-based and individual-based. In the generation-based way, in some generations all points are evaluated with the true objective function, and in the remaining generations with the model.
On the other hand, in every generation of the individual-based evolution control, a preselection of points to be evaluated with the true objective function is performed, based on the evaluation of all points with the model [9]. In most of the surrogate-assisted methods, however, the evolution control is specifically tailored to the respective method. Notably, the authors of [13] investigated mutually replacing the evolution controls of two important polynomial-assisted methods, lmm-CMA [18, 19] and lq-CMA-ES [20], and of two variants of the GP-assisted method DTS-CMA-ES [47, 12]. According to their findings, the success of those important methods is definitely not limited to using the respective specifically tailored evolution control. The surrogate-assisted black-box optimization methods constructing several surrogate models simultaneously either aggregate them into a team [25, 22] or complement the evolution control with a classifier selecting the most appropriate among those models. Important examples of classifiers used in this context are ANNs [48, 49, 50] and classification trees [51, 52]. Their learning can be viewed as metalearning because it is based on metafeatures, i.e. properties empirically characterizing the objective function landscape and the BBO method [21, 32, 49, 53]. Apart from classification according to the appropriateness of the surrogate model for the considered data, metalearning can also be used for regression of the model error on the combination of values of metafeatures [54].

3. Usability of Modern Neural Networks in Surrogate-Assisted BBO

This section primarily reviews eight kinds of modern neural networks that we consider worth research into their ability to serve as surrogate models in BBO. A high-level overview of those kinds of ANNs is given in Table 1, which for each of them mentions whether such research has already started. In Subsection 3.1, two kinds integrating GPs into ANNs are recalled.
Subsection 3.2 recalls three kinds of ANNs providing the most advantageous property of GPs, their ability to estimate the distribution of black-box objective function values. Finally, in Subsection 3.3, three well-known kinds of modern neural networks, namely variational autoencoders, transformers, and generative adversarial networks, are recalled due to the fact that they have already proven useful in the related area of Bayesian optimization. In addition, Subsection 3.4 is devoted to knowledge transfer in surrogate-assisted BBO, which relates to the usability of modern neural networks through their important role in transfer learning.

Table 1
A high-level overview of the kinds of ANNs that we consider worth research with respect to surrogate modelling for BBO

    Kind of ANNs + main references                       Research into its ability to serve
                                                         as surrogate model in BBO
    MLPs with a GP as the final layer [55, 56]           First investigations [57, 58]
    Deep GP networks [59, 60, 61, 62, 63]                Not yet
    Tangent kernel networks [64, 65]                     Not yet
    Prior networks [66, 67, 68, 69, 70]                  First investigations [71]
    Ensembles of neural networks [72, 73, 74, 75, 76]    First investigations [this paper]
    Variational autoencoders [77, 78]                    Not yet
    Generative adversarial networks [79, 80]             Not yet
    Transformers [81, 82]                                Not yet

3.1. Integration of GPs into ANNs

The integration of GPs into ANNs has been proposed on two different levels:

1. At the layer level – a GP serves as the final layer of an MLP [55, 56]. Integration on that level is based on the following two assumptions:
(i) If 𝑛𝐼 denotes the number of the ANN input neurons, then the ANN computes a mapping net of 𝑛𝐼-dimensional input values into the set 𝒳 on which the GP is defined. Consequently, the number 𝑛𝑂 of neurons in the last hidden layer fulfills 𝒳 ⊂ R𝑛𝑂, and the ANN maps an input 𝑣 into a point 𝑥 = net(𝑣) ∈ 𝒳, corresponding to an observation 𝑓(𝑥) + 𝜀 governed by the GP, where 𝜀 is zero-mean Gaussian noise.
From the point of view of the ANN inputs, the GP is now 𝒢𝒫(𝑚GP(net(·)), 𝜅(net(·), net(·))), where 𝑚GP is the mean function and 𝜅 is the covariance function of the GP [83].
(ii) The GP mean 𝑚GP is assumed to be a known constant, thus not contributing to the GP hyperparameters, and independent of net.

2. At the level of individual neurons – GPs can replace all hidden and output neurons of an MLP. This kind of neural network is commonly called a deep Gaussian process [59, 60, 61, 62, 63, 84, 85, 86, 87, 88, 89, 90].

Integration on both levels has been developed primarily for Bayesian modelling and optimization. Nevertheless, GPs integrated as the last layer of MLPs have been used as surrogate models in a CMA-ES-driven BBO [57, 58]. In particular, those surrogate models incorporate GPs with five commonly employed covariance functions (linear, quadratic, rational quadratic, squared exponential, and Matérn 5/2), as well as with one composite covariance function superposing the quadratic and squared exponential. Those 6 models were compared in [57] from the point of view of regression accuracy, evaluated on a large dataset collected during many previous runs of DTS-CMA-ES on the collection of 24 noiseless benchmarks from the Comparing Continuous Optimizers platform [91, 92] (cf. Section 4) in dimensions 2, 3, 5, 10, and 20. Then in [58], they were compared on the same benchmarks in the same dimensions from the point of view of the success of surrogate-assisted optimization with CMA-ES. Unfortunately, neither of those comparisons included more traditional surrogate models or the CMA-ES without surrogate assistance. To our knowledge, the only comparison that included both a GP integrated as the last layer of an MLP and more traditional surrogate models was the comparison from the point of view of regression accuracy in [93]. However, it included only one such integrated surrogate model, with the
GP using the simplest covariance function, the linear one, in addition to the traditional GP-based surrogate models with eight different covariance functions, including the five listed above.

3.2. ANNs Estimating the Distribution of Black-Box Objective Function Values

In our opinion, the property of GPs most advantageous from the point of view of surrogate modelling is that they estimate the whole distribution of a predicted value of the original black-box objective function. Recall from Section 2 that due to that property, ensembles of regression trees (RFs) are also used as surrogate models [33, 34, 32]. This draws attention to those modern neural networks that also allow estimation of such a distribution. Basically, there are three classes of them, differing in the way that estimate can be obtained.

1. The multivariate normal distribution underlying GPs is actually the asymptotic distribution for network width increasing to infinity. Such results have been established for several kinds of ANNs [94, 88, 95, 96, 97]. In addition, closely related is the infinite-width limit of the neural tangent kernel, which governs the kernel gradient of the functional cost used in MLP regression [64, 65]. Although those results have great theoretical value, there can be a serious disparity between the infinite-width results and their finite-width counterparts [77]. Therefore, it is unclear whether they can be applied to surrogate modelling.

2. The distribution of a predicted value, or more precisely the parameters of such a distribution, can be directly learned by an ANN. The best-known kind of such neural networks are the prior networks, learning the parameters of a normal-inverse Wishart distribution, which is the conjugate prior to a multivariate normal distribution [66, 67, 68, 69, 70, 98]. Prior networks belong to a broader class of evidential neural networks [99, 100, 101, 102, 103].
Their name refers to the fact that they follow the basic principle of the Dempster-Shafer theory of evidence [104] – to fall back onto prior belief for unfamiliar data.

3. An estimate of the distribution of a predicted value is produced by an ensemble of neural networks. Important kinds of such ensembles are ensembles obtained through diversification of training data [105, 106]; ensembles obtained through diversification of network properties [107, 108, 109], a specific subgroup of which are ensembles in which the diversification is achieved through diverse activation functions [76]; ensembles obtained through negative correlation learning [110, 111, 112]; bagging ensembles [72, 113]; boosting ensembles [114, 115]; deep ensembles [73, 74, 116], including deep echo-state network ensembles [117]; and anchored ensembles [75], with their later modification, random activation function (RAF) ensembles [76]. RAF ensembles take over the principle of anchored ensembles that regularization is performed not with respect to zero, but with respect to the initialization values of the parameters, which are assumed normally distributed. Unlike an anchored ensemble, however, an RAF ensemble uses varied activation functions from an a priori specified set of size 𝑛AF. From that set, the activation function is chosen randomly, apart from the first 𝑛AF members of the ensemble, among which each activation function occurs exactly once. We consider this last-mentioned kind of ensembles to be the state of the art.

To our knowledge, the only ANNs estimating the distribution of function values that have already been used as surrogate models in BBO are prior networks. In [71], the prediction accuracy of four versions was evaluated on the above-mentioned dataset from previous runs of DTS-CMA-ES. This direction of research is continued by the present paper: Section 4 reports results for CMA-ES surrogate-assisted by two variants of RAF ensembles.
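To convey the flavour of such ensembles, the following is a much-simplified sketch, not the implementation [131] used later in this paper: each member gets a different activation function (cycling through the set once for the first 𝑛AF members, then drawing at random, as in RAF ensembles), but instead of the full anchored training of [75, 76], the hidden weights stay at their random initialization and only the linear output layer is fitted by ridge regression. The predictive distribution is summarized by the empirical mean and standard deviation across members. The activation set itself is an illustrative assumption.

```python
import numpy as np

# an a priori specified set of activation functions (illustrative choice)
ACTIVATIONS = [np.tanh,
               lambda z: np.maximum(z, 0.0),                 # ReLU
               np.sin,
               lambda z: z / (1.0 + np.abs(z)),              # softsign
               lambda z: np.where(z > 0.0, z, np.expm1(z))]  # ELU

class RAFLikeEnsemble:
    def __init__(self, dim, members=7, hidden=64, ridge=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        n_af = len(ACTIVATIONS)
        # first n_af members use each activation exactly once, the rest draw randomly
        self.acts = [ACTIVATIONS[i] if i < n_af else ACTIVATIONS[rng.integers(n_af)]
                     for i in range(members)]
        self.W = [rng.normal(size=(dim, hidden)) for _ in range(members)]
        self.b = [rng.normal(size=hidden) for _ in range(members)]
        self.out = [None] * members
        self.ridge = ridge

    def fit(self, X, y):
        # ridge regression on the fixed random hidden features of each member
        for i, act in enumerate(self.acts):
            H = act(X @ self.W[i] + self.b[i])
            A = H.T @ H + self.ridge * np.eye(H.shape[1])
            self.out[i] = np.linalg.solve(A, H.T @ y)

    def predict(self, X):
        preds = np.stack([act(X @ W + b) @ w for act, W, b, w
                          in zip(self.acts, self.W, self.b, self.out)])
        return preds.mean(axis=0), preds.std(axis=0)  # empirical mean and spread

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X[:, 0] - X[:, 1]                    # a toy objective to regress
ens = RAFLikeEnsemble(dim=2)
ens.fit(X, y)
mean, std = ens.predict(X)
assert np.sqrt(np.mean((mean - y) ** 2)) < 0.5 * y.std()
```

The per-member diversity of activation functions is what makes the spread of predictions informative about model uncertainty, in analogy to the GP predictive variance.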
3.3. ANNs Found Useful in Bayesian Optimization

Recall from Section 2 that GPs, simultaneously with their importance as surrogate models in BBO with non-Bayesian methods, such as CMA-ES, also play a crucial role in Bayesian optimization. That is why this subsection lists three well-known kinds of modern neural networks that have recently been found useful in Bayesian optimization. In our opinion, this indicates that they are worth investigating as to whether they could also be used in surrogate-assisted BBO.

1. Variational autoencoders have been utilized in Bayesian optimization because they allow for optimization in a lower-dimensional latent space [77, 78].

2. The generative adversarial networks (GANs) paradigm has recently been shown to be applicable to BBO: a generator proposes samples that align with the distribution of low values (or even the optimal value) of the black-box function, while one or more discriminators classify samples based on whether they belong to that distribution [79, 80].

3. Transformers have proven effective in estimating complex prior distributions for Bayesian optimization [81, 82]. Notably, an OptFormer transformer trained on Google Vizier [118], the largest hyperparameter optimization (HPO) database, achieved superior HPO outcomes compared to GP-based Bayesian optimization [81]. Furthermore, the recently introduced transformer-based Prior-data Fitted Networks [82] can mimic Gaussian processes and Bayesian networks, while also incorporating additional information into the prior.

3.4. ANN-Based Transfer Learning for Surrogate-Assisted Black-Box Optimization

Obtaining accurate surrogate models in the initial stages of BBO is challenging due to the scarcity of data points with evaluated objective function values. That can be mitigated by leveraging knowledge transfer, i.e. transfer learning.
Indeed, the connection of modern kinds of neural networks with transfer learning is even more obvious than their connection with active learning: transfer learning is nowadays one of the areas where ANNs play the most important role [119, 120, 121]. Different types of ANNs have been utilized to this end, including convolutional [122, 123], recurrent [124], autoencoder [125, 126], GAN [127, 128, 129], and transformer networks [81]. In the context of the research direction pursued in this paper, most interesting are those that also have connections to BBO:

(i) Four ANN-based transfer learning approaches draw inspiration from the GAN paradigm. CoGAN trains two GANs to generate the source and target, respectively, achieves a domain-invariant feature space by tying the parameters of the high layers of the two GANs, and performs domain adaptation by training a classifier on the discriminator output [130]. Adversarial discriminative domain adaptation first learns a discriminative representation using the labels in the source domain, and then, using a domain-adversarial loss, a separate encoding that maps the target data to the same space through an asymmetric mapping [127]. Minimax-game-based selective transfer learning employs a selector and a discriminator to identify source domain data resembling the target domain's distribution and to distinguish genuine target domain data from the selected source domain data, respectively [129]. Selective adversarial network addresses negative transfer by excluding outlier classes from the source domain selection and maximizing the similarity between source and target domain data distributions [128].

(ii) An autoencoder for transfer learning, described in [125, 126], incorporates embedding and label encoding layers. The embedding layer reduces the disparity between instance distributions from the source and target domains, while the label encoding layer utilizes a softmax regression model to encode label information from the source domain.
(iii) The transformer OptFormer has demonstrated competitiveness with specific transfer learning methods, although its usage leans more toward metalearning than traditional transfer learning [81].

4. Experimental Evaluation of RAF Ensembles

This section describes a small experimental contribution to one of the research directions surveyed above: RAF ensembles are experimentally evaluated as surrogate models for CMA-ES. The experiments were performed on probably the most commonly used platform for experimenting in continuous optimization, COCO (Comparing Continuous Optimizers) [92]. COCO contains several suites of benchmark functions; our evaluation was performed with the most traditional suite, the bbob suite [92]. It consists of 24 dimension-scalable noiseless benchmark functions, the definitions of which have been given in [91]. Each function is used in 15 differently rotated and/or translated instances. The employed benchmarks forming the bbob suite are surveyed in Appendix A.

4.1. Considered Variants of RAF Ensembles

As activation functions forming an RAF ensemble, we employed those included in the implementation [131], to which the RAF paper [76] refers. They are listed in Appendix B. We used them in two variants of RAF ensembles:

1. An RAF ensemble of size 5, trained directly using the above-mentioned implementation [131] and aggregated by the empirical mean. In the results, it will be denoted simply RAF.

2. An ensemble of size 5, in which the differences of the values of the original black-box objective function with respect to its median are first transformed to their logarithms, before using [131] in the logarithmic scale to train the ensemble. This transformation attempts to deal with situations when the function returns values close to the median in many points.
The aggregation function is again the empirical mean, which in terms of the data before the logarithmic transformation actually corresponds to the empirical geometric mean. That version will be denoted RAF-log in the results.

4.2. Considered CMA-ES Variants for Comparison

CMA-ES surrogate-assisted by the above-mentioned two variants of RAF ensembles was compared with CMA-ES without surrogate modelling, as well as with two earlier surrogate-assisted variants of CMA-ES:

3. CMA-ES without surrogate modelling was used in an implementation that is in the COCO data archive [132] called default-CMA-ES, and described as "default CMA-ES from the pycma module, version 3.3.0". It will be denoted simply default in the results.

4. DTS-CMA-ES [12], using a surrogate GP with the Matérn 5/2 covariance function. In the results, it will be denoted simply DTS.

5. lq-CMA-ES [20], which will be denoted simply lq in the results.

4.3. Evolution Control

Whereas DTS-CMA-ES and lq-CMA-ES each have their own evolution control, for the two variants of RAF ensembles it was necessary to propose when to evaluate a given point 𝑥 by the original black-box objective function 𝐹bb, and when by its surrogate model 𝐹sm. We decided to use a modification of the lq-CMA-ES evolution control. That modification is described below in Algorithm 1, using the notation 𝜏((𝑦1, . . . , 𝑦𝑘), (𝑧1, . . . , 𝑧𝑘)) for the Kendall correlation coefficient between the sequences (𝑦1, . . . , 𝑦𝑘) and (𝑧1, . . . , 𝑧𝑘), and the notation 𝜌 for the ranking function on R𝑑, i.e., 𝜌 : R𝑑 → Π(𝑑) with Π(𝑑) denoting the set of permutations of {1, . . . , 𝑑}, such that

∀𝑦 ∈ R𝑑 : (𝜌(𝑦))𝑖 < (𝜌(𝑦))𝑗 ⇒ 𝑦𝑖 ≤ 𝑦𝑗 .   (1)

4.4. Results

In Tables 2 and 3, the two considered variants of RAF ensembles and the three other considered CMA-ES variants are compared based on the difference between the optimal value of the objective function and the value achieved for a given evaluation budget.
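Returning briefly to the notation of Subsection 4.3: the ranking function 𝜌 of (1) and a naive Kendall 𝜏 can be implemented directly. These are pure-Python sketches for illustration, not the code used in the experiments; the invariance check at the end ties back to the order-preserving transformations discussed in Section 2.

```python
import math
from itertools import combinations

def rho(y):
    # ranking function of eq. (1): rho(y)_i is the 1-based rank of y_i, so rho(y)
    # is a permutation of {1, ..., d} and rho(y)_i < rho(y)_j implies y_i <= y_j
    order = sorted(range(len(y)), key=lambda i: y[i])
    r = [0] * len(y)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def kendall_tau(y, z):
    # naive O(k^2) Kendall rank correlation coefficient (no tie handling)
    pairs = list(combinations(range(len(y)), 2))
    concordant = sum((y[i] - y[j]) * (z[i] - z[j]) > 0 for i, j in pairs)
    discordant = sum((y[i] - y[j]) * (z[i] - z[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)

y = [0.3, 1.2, -0.5, 0.9]
r = rho(y)
assert sorted(r) == [1, 2, 3, 4]
assert all(y[i] <= y[j] for i in range(4) for j in range(4) if r[i] < r[j])
assert kendall_tau(y, y) == 1.0 and kendall_tau(y, [-v for v in y]) == -1.0

# rank-based quantities are invariant under order-preserving transformations
z = [0.1, 0.8, -0.2, 0.4]
g = lambda v: math.exp(v) + v ** 3       # strictly increasing on the whole real line
assert kendall_tau([g(v) for v in y], z) == kendall_tau(y, z)
```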
The achieved values were averaged over the 15 instances provided by the COCO benchmark suite in each dimension for each of the 24 noiseless functions listed in Appendix A. The comparisons were performed separately for each of the five above-described groups of those functions, and subsequently also for all 24 of them together, each time including the instances in dimensions 2, 3, 5, 10, and 20. Hence, for each evaluation budget, six comparisons were performed.

Table 2
Comparison of CMA-ES surrogate-assisted by RAF and by RAF-log with CMA-ES without surrogate modelling, with lq-CMA-ES, and with DTS-CMA-ES, for the evaluation budget 3×dimension. Each cell of each sub-table records the number of function-dimension combinations for which the method in the row achieved, within the evaluation budget, a lower value, averaged over the 15 COCO instances, than the method in the column. Ties within the considered precision are halved between both methods. If the Friedman test rejected the hypothesis of equivalence of all methods, and according to the subsequent Wilcoxon signed-rank test with Holm correction the method in the row is significantly better than the method in the column, the number in the cell is in bold, with * for the familywise level 5 % and with ** for the familywise level 1 %.
Separable functions
                    RAF      RAF-log  DTS-CMA-ES  lq-CMA-ES  default CMA-ES
    RAF              -        10.5      7           0          13.5
    RAF-log         14.5       -       10           0.5        13.5
    DTS-CMA-ES      18        15        -           1          17.5
    lq-CMA-ES       25**      24.5**   24**         -          23**
    default CMA-ES  11.5      11.5      7.5         2           -

Functions with low or moderate conditioning
                    RAF      RAF-log  DTS-CMA-ES  lq-CMA-ES  default CMA-ES
    RAF              -        10        3.5         2           6.5
    RAF-log         10         -        4.5         3.5         7
    DTS-CMA-ES      16.5*     15.5      -           8          15*
    lq-CMA-ES       18**      16.5     12           -          16**
    default CMA-ES  13.5      13        5           4           -

Unimodal functions with high conditioning
                    RAF      RAF-log  DTS-CMA-ES  lq-CMA-ES  default CMA-ES
    RAF              -        12.5      4           2          10.5
    RAF-log         12.5       -        9.5         5          12
    DTS-CMA-ES      21*       15.5      -          10          16
    lq-CMA-ES       23**      20*      15           -          18.5
    default CMA-ES  14.5      13        9           6.5         -

Multi-modal functions with adequate global structure
                    RAF      RAF-log  DTS-CMA-ES  lq-CMA-ES  default CMA-ES
    RAF              -        16        9.5         9          10
    RAF-log          9         -        8.5         8.5         6.5
    DTS-CMA-ES      15.5      16.5      -          14.5        15
    lq-CMA-ES       16        17       10.5         -          13
    default CMA-ES  15        18.5     10          12           -

Multi-modal functions with weak global structure
                    RAF      RAF-log  DTS-CMA-ES  lq-CMA-ES  default CMA-ES
    RAF              -        11        5.5         5           6.5
    RAF-log         14         -        8.5         5           6
    DTS-CMA-ES      19.5      16.5      -          13.5        15
    lq-CMA-ES       20*       20       11.5         -          15.5
    default CMA-ES  18.5      18       10           9.5         -

All noiseless benchmark functions
                    RAF      RAF-log  DTS-CMA-ES  lq-CMA-ES  default CMA-ES
    RAF              -        60       29.5        18          47
    RAF-log         60         -       41          22          45
    DTS-CMA-ES      90.5**    79**      -          47          78.5**
    lq-CMA-ES      102**      98**     73           -          86**
    default CMA-ES  73        75       41.5        34           -

Table 3
Comparison of CMA-ES surrogate-assisted by RAF and by RAF-log with CMA-ES without surrogate modelling, with lq-CMA-ES, and with DTS-CMA-ES, for the evaluation budget 50×dimension. Each cell of each sub-table records the number of function-dimension combinations for which the method in the row achieved, within the evaluation budget, a lower value, averaged over the 15 COCO instances, than the method in the column. Ties within the considered precision are halved between both methods.
If the Friedman test rejected the hypothesis of equivalence of all methods, and according to the subsequent Wilcoxon signed-rank test with Holm correction, the method in the row is significantly better than the method in the column, the number in the cell is in bold with * for the familywise level 5 %, and with ** for the familywise level 1 %. Separable functions RAF RAF-log DTS-CMA-ES lq-CMA-ES default CMA-ES RAF - 7 4 1.5 11 RAF-log 18 - 6 0.5 12 DTS-CMA-ES 21** 19 - 7.5 19 lq-CMA-ES 23.5** 24.5** 17.5 - 22.5** default CMA-ES 14 13 6 2.5 - Functions with low or moderate conditioning RAF RAF-log DTS-CMA-ES lq-CMA-ES default CMA-ES RAF - 15.5 6 0.5 4.5 RAF-log 4.5 - 6 0 2 DTS-CMA-ES 14 14 - 11.5 14 lq-CMA-ES 19.5** 20** 8.5 - 18.5* default CMA-ES 15.5 18** 6 1.5 - Unimodal functions with high conditioning RAF RAF-log DTS-CMA-ES lq-CMA-ES default CMA-ES RAF - 21* 0 0 4 RAF-log 4 - 0 0 1 DTS-CMA-ES 25** 25** - 7.5 25** lq-CMA-ES 25** 25** 17.5* - 25** default CMA-ES 21 24** 0 0 - Multi-modal functions with adequate global structure RAF RAF-log DTS-CMA-ES lq-CMA-ES CMA-ES alone RAF - 15 13 10 11.5 RAF-log 10 - 11.5 9 12 DTS-CMA-ES 12 13.5 - 13.5 15 lq-CMA-ES 15 16 11.5 - 13 default CMA-ES 13.5 13 10 12 - Multi-modal functions with weak global structure RAF RAF-log DTS-CMA-ES lq-CMA-ES default CMA-ES RAF - 9 3.5 8 14 RAF-log 16 - 6.5 12 18.5 DTS-CMA-ES 21.5 18.5 - 20 23** lq-CMA-ES 17 13 5 - 20* default CMA-ES 11 6.5 2 5 - All noiseless benchmark functions RAF RAF-log DTS-CMA-ES lq-CMA-ES default CMA-ES RAF - 67.5 26.5 20 45 RAF-log 52.5 - 30 21.5 45.5 DTS-CMA-ES 93.5** 90** - 60 96** lq-CMA-ES 100** 98.5** 60 - 99** default CMA-ES 75 74.5 24 21 - 55 Martin Holeňa et al. CEUR Workshop Proceedings 47–67 Algorithm 1 Evolution control used for RAF and RAF-log ensembles. Require: points 𝑥1 , . . . 
, 𝑥𝜆 ∈ R𝑑 , in which the surrogate model 𝐹sm trained on some archive 𝐴 has been evaluated; thus 𝜆 is the population size 1: Set 𝑘 = ⌊1 + max(0.02𝜆, 4)⌋; the number of 𝐹bb evaluations 2: Set 𝑄 = {𝑥𝑗 |(𝜌(𝐹sm (𝑥1 ), . . . , 𝐹sm (𝑥𝜆 )))𝑗 ≤ 𝑘}; points with the 𝑘 smallest 𝐹sm values 3: In 𝑥 ∈ 𝑄 for which 𝐹bb (𝑥) is not yet known, evaluate 𝐹bb (𝑥) 4: Order the elements of 𝑄 as (𝑥1𝑄 , . . . , 𝑥𝑘𝑄 ) decreasingly with respect to their 𝐹bb (𝑥) values 5: Set ℓ = max(1, ⌊𝑘 + 1 − max(15, 0.75𝜆)⌋); the lower index for computing 𝜏 between 𝐹bb and 𝐹sm 6: while 𝑘 < 𝜆 & 𝜏 ((𝐹bb (𝑥ℓ ), . . . , 𝐹bb (𝑥𝑘 )), (𝐹sm (𝑥ℓ ), . . . , 𝐹sm (𝑥𝑘 ))) < 0.85 do 7: Update 𝑄 = 𝑄 ∪ {𝑥𝑗 |(𝜌(𝐹sm (𝑥1 ), . . . , 𝐹sm (𝑥𝜆 )))𝑗 ≤ ⌈1.5𝑘⌉} 8: In 𝑥 ∈ 𝑄 for which 𝐹bb (𝑥) is not yet known, evaluate 𝐹bb (𝑥) 9: Update 𝑘 = ⌈1.5𝑘⌉, ℓ = max(1, ⌊𝑘 + 1 − max(15, 0.75𝜆)⌋) 10: Order the elements of 𝑄 as (𝑥1𝑄 , . . . , 𝑥𝑘𝑄 ) decreasingly with respect to their 𝐹bb (𝑥) values 11: end while 12: Update 𝐴 = 𝐴 ∪ {𝑥𝑗 |𝐹bb (𝑥) has been evaluated in 𝑥𝑗 } 13: if 𝑘 = 𝜆 then 14: Return 𝐴 and {(𝑥1 , 𝐹bb (𝑥1 )), . . . , (𝑥𝜆 ), 𝐹bb (𝑥𝜆 ))} 15: else 16: Return 𝐴 and {(𝑥𝑖 , 𝐹bb (𝑥𝑖 ))|𝑥𝑖 ∈ 𝑄} ∪ {(𝑥𝑖 , 𝐹sm (𝑥𝑖 ))|𝑖 = 1, . . . , 𝜆, 𝑥𝑖 ̸∈ 𝑄} 17: end if performed. The comparisons in Table 2 were conducted for the evaluation budget 3×dimension, while the comparisons in Table 3 were conducted for the evaluation budget 50×dimension. The results of each of those 12 comparisons were subsequently assessed for statistical significance. First, the hypothesis that all five considered methods are equivalent was tested by the Friedman test. With the exception of both comparisons for multi-modal functions with adequate global structure, the test rejected that hypothesis on the familywise significance level 5%, using the Holm procedure for multiple-hypothesis correction [133]. This rejection justified testing the equivalence of any two among the five methods. 
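The evolution control of Algorithm 1 above can be sketched in Python as follows. This is a slightly simplified sketch under our own naming (`evolution_control`, the dictionary cache of evaluations are not from the paper); scipy's `kendalltau` stands in for the rank correlation τ, and the ranking ρ is realized by sorting the surrogate values:

```python
import math
import numpy as np
from scipy.stats import kendalltau

def evolution_control(points, f_sm, f_bb, archive, tau_threshold=0.85):
    """Sketch of Algorithm 1: evaluate the black-box objective f_bb in the
    points ranked best by the surrogate f_sm, and enlarge that set until the
    Kendall tau between surrogate and true values reaches tau_threshold,
    or all lambda points have been evaluated."""
    lam = len(points)
    sm_values = np.array([f_sm(x) for x in points])
    order = np.argsort(sm_values)          # ascending: best surrogate value first
    k = int(1 + max(0.02 * lam, 4))        # initial number of f_bb evaluations
    bb_values = {}                         # cache of true evaluations by index
    while True:
        for j in order[:k]:
            if j not in bb_values:
                bb_values[j] = f_bb(points[j])
        ell = max(1, int(k + 1 - max(15, 0.75 * lam)))
        idx = order[ell - 1:k]             # indices entering the tau computation
        tau, _ = kendalltau([bb_values[j] for j in idx],
                            [sm_values[j] for j in idx])
        if np.isnan(tau):                  # constant values: agreement undefined
            break
        if k >= lam or tau >= tau_threshold:
            break
        k = min(lam, math.ceil(1.5 * k))   # enlarge the evaluated set
    archive.extend((points[j], bb_values[j]) for j in bb_values)
    # true values where known, surrogate values everywhere else
    return [(points[j], bb_values.get(j, sm_values[j])) for j in range(lam)]
```

With a perfect surrogate (here f_sm = f_bb), the Kendall tau is 1 on the first check, so only the initial ⌊1 + max(0.02λ, 4)⌋ points are evaluated by the black box.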
We adopted the arguments of [134] that, in machine learning, the Wilcoxon signed-rank test is more appropriate for this purpose than the post-hoc tests presented in [135] and [133]. If, for a particular pair of methods, the Wilcoxon signed-rank test rejected the hypothesis that they are equivalent, then in the respective table, the count in the row corresponding to the method that was more frequently better is shown in bold, with asterisks indicating the familywise significance level. The results in Tables 2–3 primarily confirm the superior performance of the methods lq-CMA-ES and DTS-CMA-ES. In the two comparisons based on all 120 function-dimension combinations of the noiseless benchmark functions, each of them is, for both considered budgets, significantly better not only than the default CMA-ES, but also than CMA-ES surrogate-assisted by either of the two variants of RAF ensembles. Moreover, among the 10 comparisons based on the individual groups of functions, lq-CMA-ES is significantly better than the default CMA-ES 6 times, than CMA-ES surrogate-assisted by RAF 7 times, and than CMA-ES surrogate-assisted by RAF-log 5 times. For DTS-CMA-ES, the results of the 10 comparisons based on the individual groups of functions are less convincing: it is significantly better 3 times than the default CMA-ES, 3 times than CMA-ES surrogate-assisted by RAF, and only once than CMA-ES assisted by RAF-log. As to the comparison between the two variants of RAF ensembles, the differences between them were not significant, apart from the unimodal functions with high conditioning, for which CMA-ES achieves significantly better results if assisted by RAF than if assisted by RAF-log. The progress of optimization achieved by each of the compared methods is illustrated, in dimensions 2, 3, and 5, by means of optimization-progress plots. They show the average difference Δf between the optimal and the achieved value of the objective function over the 15 COCO instances. For that illustration, we have chosen the functions f9 (Figure 1), f18 (Figure 2), and f20 (Figure 3).
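The significance-testing pipeline just described (a Friedman test over all methods, followed by pairwise Wilcoxon signed-rank tests with Holm correction) can be sketched as follows; the function name and the dict-based interface are our own, and scipy is assumed:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

def compare_methods(results, alpha=0.05):
    """results: dict mapping method name -> array of averaged best values,
    one entry per function-dimension combination (lower is better).
    Returns the pairs whose difference is significant after Holm correction,
    or an empty list if the Friedman test does not reject equivalence."""
    names = list(results)
    _, p = friedmanchisquare(*(results[n] for n in names))
    if p >= alpha:
        return []
    # pairwise Wilcoxon signed-rank tests
    pairs, pvals = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            _, pw = wilcoxon(results[names[i]], results[names[j]])
            pairs.append((names[i], names[j]))
            pvals.append(pw)
    # Holm step-down correction at familywise level alpha
    order = np.argsort(pvals)
    significant = []
    m = len(pvals)
    for rank, idx in enumerate(order):
        if pvals[idx] >= alpha / (m - rank):
            break
        significant.append(pairs[idx])
    return significant
```

The Holm step-down loop compares the sorted p-values with alpha/m, alpha/(m-1), ... and stops at the first non-rejection, which keeps the familywise error rate at alpha.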
We can see that optimization using CMA-ES surrogate-assisted by RAF or RAF-log sometimes leads to a similarly fast decrease of the objective function as, or even a faster one than, optimization using the state-of-the-art methods DTS-CMA-ES or lq-CMA-ES.

Figure 1: Progress of optimization by the compared methods (RAF, RAF-log, DTS-CMA-ES, lq-CMA-ES, default CMA-ES) up to the budget 250×dimension for the benchmark function f9 – Rosenbrock rotated. Panels: dimensions 2, 3, and 5; axes: log10(Δf) against the number of evaluations / D. Each curve is the average over the 15 COCO instances of this function.

In Figure 1, this is the case for RAF-log in dimension 2. In Figure 2, dimension 3, CMA-ES surrogate-assisted by RAF reaches lower values of the objective function than any other of the compared methods, whereas in dimension 2, CMA-ES surrogate-assisted by either RAF or RAF-log leads to a similarly fast decrease of f18 as DTS-CMA-ES, but a slower one than lq-CMA-ES. Finally, in Figure 3, dimensions 3 and 5, CMA-ES surrogate-assisted by either RAF or RAF-log leads to a similarly fast decrease of f20 as lq-CMA-ES, but a slower one than DTS-CMA-ES.

Figure 2: Progress of optimization by the compared methods up to the budget 250×dimension for the benchmark function f18 – Schaffers F7 function, moderately ill-conditioned. Panels: dimensions 2, 3, and 5; axes: log10(Δf) against the number of evaluations / D. Each curve is the average over the 15 COCO instances of this function.

5. Conclusion

The paper was motivated by our opinion that the intense and successful development of artificial neural networks during the last 15 years suggests that they again have the potential to be important for active learning in surrogate-assisted BBO.
It surveyed possible directions of research into that potential, including the closely connected research into neural-network-based transfer learning for surrogate modelling. Moreover, it recalled the first published investigations in some of those directions, and added a new contribution to the emerging mosaic of those investigations.

Figure 3: Progress of optimization by the compared methods up to the budget 250×dimension for the benchmark function f20 – Schwefel. Panels: dimensions 2, 3, and 5; axes: log10(Δf) against the number of evaluations / D. Each curve is the average over the 15 COCO instances of this function.

The fact that the main purpose of the experimental section of the paper is to contribute to this mosaic of emerging investigations should be emphasized especially in the context of the obtained experimental results. It puts into perspective the findings that there is no significant difference between using CMA-ES surrogate-assisted by RAF ensembles and using CMA-ES alone, and that results with RAF-ensemble-based surrogate models are significantly worse than results with the state-of-the-art surrogate-assisted CMA-ES variants lq-CMA-ES and DTS-CMA-ES. This is an obvious limitation not only of RAF ensembles, but of all of the above surveyed kinds of neural networks that have so far been investigated as surrogate models for CMA-ES. On the other hand, as the survey has shown, there are many more possibilities for such investigations within future research.

Acknowledgements

The research reported in this paper has been supported by the German Research Foundation (DFG) funded project 467401796, and by the Czech Technical University grant SGS 23/205/OHK3/3T/18. The authors are very grateful to Jaroslav Langer for his crucial contribution to the RAF experiments.

References

[1] M.
Baerns, M. Holeňa, Combinatorial Development of Solid Catalytic Materials. Design of High-Throughput Experiments, Data Analysis, Data Mining, Imperial College Press / World Scientific, London, 2009.
[2] A. Booker, J. Dennis, P. Frank, D. Serafini, V. Torczon, M. Trosset, A rigorous framework for optimization by surrogates, Structural and Multidisciplinary Optimization 17 (1999) 1–13.
[3] M. El-Beltagy, P. Nair, A. Keane, Metamodeling techniques for evolutionary optimization of computationally expensive problems: Promises and limitations, in: Proceedings of the Genetic and Evolutionary Computation Conference, Morgan Kaufmann Publishers, 1999, pp. 196–203.
[4] A. Ratle, Kriging as a surrogate fitness landscape in evolutionary optimization, Artificial Intelligence for Engineering Design, Analysis and Manufacturing 15 (2001) 37–49.
[5] M. Emmerich, A. Giotis, M. Özdemir, T. Bäck, K. Giannakoglou, Metamodel-assisted evolution strategies, in: PPSN, ACM, 2002, pp. 361–370.
[6] S. Leary, A. Bhaskar, A. Keane, A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation, Journal of Global Optimization 30 (2004) 39–58.
[7] Y. Ong, P. Nair, A. Keane, K. Wong, Surrogate-assisted evolutionary optimization frameworks for high-fidelity engineering design problems, in: Y. Jin (Ed.), Knowledge Incorporation in Evolutionary Computation, Springer, 2005, pp. 307–331.
[8] K. Rasheed, X. Ni, S. Vattam, Methods for using surrogate models to speed up genetic algorithm optimization: Informed operators and genetic engineering, in: Y. Jin (Ed.), Knowledge Incorporation in Evolutionary Computation, Springer, 2005, pp. 103–123.
[9] Y. Jin, M. Olhofer, B. Sendhoff, A framework for evolutionary optimization with approximate fitness functions, IEEE Transactions on Evolutionary Computation 6 (2002) 481–494.
[10] D. Büche, N. Schraudolph, P.
Koumoutsakos, Accelerating evolutionary algorithms with Gaussian process fitness function models, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 35 (2005) 183–194.
[11] C. Huang, B. Radi, A. El Hami, H. Bai, CMA evolution strategy assisted by kriging model and approximate ranking, Applied Intelligence 48 (2018) 4288–4204.
[12] L. Bajer, Z. Pitra, J. Repický, M. Holeňa, Gaussian process surrogate models for the CMA evolution strategy, Evolutionary Computation 27 (2019) 665–697.
[13] Z. Pitra, M. Hanuš, J. Koza, J. Tumpach, M. Holeňa, Interaction between model and its evolution control in surrogate-assisted CMA evolution strategy, in: GECCO, 2021, paper no. 358.
[14] P. Dufossé, N. Hansen, Augmented Lagrangian, penalty techniques and surrogate modeling for constrained optimization with CMA-ES, in: GECCO, 2021, pp. 519–527.
[15] N. Sakamoto, Y. Akimoto, Adaptive ranking-based constraint handling for explicitly constrained black-box optimization, Evolutionary Computation 30 (2022) 503–529.
[16] I. Loshchilov, M. Schoenauer, M. Sebag, A mono surrogate for multiobjective optimization, in: GECCO, 2010, pp. 471–478.
[17] F. Gibson, R. Everson, J. Fieldsend, Guiding surrogate-assisted multi-objective optimisation with decision maker preferences, in: GECCO, 2022, pp. 786–795.
[18] S. Kern, N. Hansen, P. Koumoutsakos, Local metamodels for optimization using evolution strategies, in: PPSN, 2006, pp. 939–948.
[19] A. Auger, D. Brockhoff, N. Hansen, Benchmarking the local metamodel CMA-ES on the noiseless BBOB'2013 test bed, in: GECCO, 2013, pp. 1225–1232.
[20] N. Hansen, A global surrogate assisted CMA-ES, in: GECCO, 2019, pp. 664–672.
[21] H. Yu, C. Sun, Y. Tan, J. Zeng, Y. Jin, An adaptive model selection strategy for surrogate-assisted particle swarm optimization algorithm, in: IEEE SSCI, 2016, pp. 1–8.
[22] H. Wang, Y. Jin, J.
Doherty, Committee-based active learning for surrogate-assisted particle swarm optimization of expensive problems, IEEE Transactions on Cybernetics 47 (2017) 2664–2677.
[23] M. Papadrakakis, N. Lagaros, Y. Tsompanakis, Structural optimization using evolution strategies and neural networks, Computer Methods in Applied Mechanics and Engineering 156 (1998) 309–333.
[24] H. Ulmer, F. Streichert, A. Zell, Model-assisted steady state evolution strategies, in: GECCO, Springer, 2003, pp. 610–621.
[25] D. Lim, Y. Ong, Y. Jin, B. Sendhoff, A study on metamodeling techniques, ensembles, and multi-surrogates in evolutionary computation, in: GECCO, 2007, pp. 1288–1295.
[26] L. Bajer, M. Holeňa, Surrogate model for continuous and discrete genetic optimization based on RBF networks, in: Intelligent Data Engineering and Automated Learning, Springer, 2010, pp. 251–258.
[27] L. Na, Q. Feng, Z. Liang, W. Zhong, Gaussian process assisted coevolutionary estimation of distribution algorithm for computationally expensive problems, Journal of Central South University of Technology 19 (2012) 443–452.
[28] V. Volz, G. Rudolph, B. Naujoks, Investigating uncertainty propagation in surrogate-assisted evolutionary algorithms, in: GECCO, 2017, pp. 881–888.
[29] L. Toal, D. Arnold, Simple surrogate model assisted optimization with covariance matrix adaptation, in: PPSN, 2020, pp. 184–197.
[30] Z. Li, T. Gao, B. Wang, Elite-driven surrogate assisted CMA-ES algorithm by improved lower confidence bound method, Engineering with Computers, Springer, 2022, doi: 10.1007/s00366-022-01642-5.
[31] Z. Zhou, Y. Ong, P. Nair, A. Keane, K. Lum, Combining global and local surrogate models to accelerate evolutionary optimization, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews 37 (2007) 66–76.
[32] B. Saini, M. López-Ibáñez, K. Miettinen, Automatic surrogate modelling technique selection based on features of optimization problems, in: GECCO, 2019, pp. 1765–1772.
[33] N. Belkhir, J. Dréo, P. Savéant, M. Schoenauer, Per instance algorithm configuration of CMA-ES with limited budget, in: GECCO, 2017, pp. 681–688.
[34] Z. Pitra, J. Repický, M. Holeňa, Boosted regression forest for the doubly trained surrogate covariance matrix adaptation evolution strategy, in: ITAT 2018, 2018, pp. 72–79.
[35] T. Runarsson, Ordinal regression in evolutionary computation, in: PPSN, 2006, pp. 1048–1057.
[36] I. Loshchilov, M. Schoenauer, M. Sebag, Comparison-based optimizers need comparison-based surrogates, in: PPSN, 2010, pp. 364–373.
[37] A. Abbasnejad, D. Arnold, Adaptive function value warping for surrogate model assisted evolutionary optimization, in: PPSN, 2022, pp. 76–89.
[38] I. Couckuyt, D. Gorissen, Automatic surrogate model type selection during the optimization of expensive black-box problems, in: Winter Simulation Conference, 2011, pp. 4285–4293.
[39] H. Dong, S. Sun, B. Song, P. Wang, Multi-surrogate-based global optimization using a score-based infill criterion, Structural and Multidisciplinary Optimization 59 (2019) 485–506.
[40] N. Hansen, A. Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation 9 (2001) 159–195.
[41] N. Hansen, The CMA evolution strategy: A comparing review, in: Towards a New Evolutionary Computation, Springer, 2006, pp. 75–102.
[42] M. Wu, A. Karkar, B. Liu, A. Yakovlev, G. Gielen, V. Grout, Network on chip optimization based on surrogate model assisted evolutionary algorithms, in: IEEE CEC, 2014, pp. 3266–3271.
[43] D. Jones, M. Schonlau, W. Welch, Efficient global optimization of expensive black-box functions, Journal of Global Optimization 13 (1998) 455–492.
[44] J. Knowles, ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems, IEEE Transactions on Evolutionary Computation 10 (2006) 50–66.
[45] Y. Diouane, V. Picheny, R. Le Riche, A.
Di Perrotolo, TREGO: a trust-region framework for efficient global optimization, Journal of Global Optimization 85 (2022), doi: 10.1007/s10898-022-01245-w.
[46] H. Mohammadi, R. Le Riche, E. Touboul, Making EGO and CMA-ES complementary for global optimization, in: Learning and Intelligent Optimization, Springer, 2015, pp. 287–292.
[47] Z. Pitra, L. Bajer, M. Holeňa, Doubly trained evolution control for the surrogate CMA-ES, in: PPSN, 2016, pp. 59–68.
[48] Y. He, Y. Yuen, Black box algorithm selection by convolutional neural network, in: LOD, 2020, pp. 264–280.
[49] M. Pikalov, V. Mironovich, Automated parameter choice with exploratory landscape analysis and machine learning, in: GECCO, 2021, pp. 1982–1985.
[50] R. Prager, M. Seiler, H. Trautmann, P. Kerschke, Towards feature-free automated algorithm selection for single-objective continuous black box optimization, in: IEEE SSCI, 2021, pp. 1–8.
[51] Z. Pitra, L. Bajer, M. Holeňa, Knowledge-based selection of Gaussian process surrogates, in: ECML Workshop IAL, 2019, pp. 48–63.
[52] Z. Pitra, J. Repický, M. Holeňa, Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy, in: GECCO, ACM, 2019, pp. 691–699.
[53] M. V. Seiler, R. P. Prager, P. Kerschke, H. Trautmann, A collection of deep learning-based feature-free approaches for characterizing single-objective continuous fitness landscapes, in: GECCO, 2022, pp. 657–665.
[54] A. Jankovic, G. Popovski, T. Eftimov, C. Doerr, The impact of hyper-parameter tuning for landscape-aware performance regression and algorithm selection, in: GECCO, 2021, pp. 687–696.
[55] R. Calandra, J. Peters, C. Rasmussen, M. Deisenroth, Manifold Gaussian processes for regression, in: IJCNN, 2016, pp. 3338–3345.
[56] A. Wilson, Z. Hu, R. Salakhutdinov, E. Xing, Deep kernel learning, in: AISTATS, 2016, pp. 370–378.
[57] J. Koza, J. Tumpach, Z. Pitra, M.
Holeňa, Combining Gaussian processes and neural networks in surrogate modeling for covariance matrix adaptation evolution strategy, in: IAL Workshop, ECML PKDD, 2021, pp. 1–10.
[58] J. Ružička, J. Koza, J. Tumpach, Z. Pitra, M. Holeňa, Combining Gaussian processes with neural networks for active learning in optimization, in: ECML Workshop IAL, 2021, pp. 105–120.
[59] H. Salimbeni, M. Deisenroth, Doubly stochastic variational inference for deep Gaussian processes, in: NeurIPS, 2017, pp. 1–16.
[60] K. Blomqvist, S. Kaski, M. Heinonen, Deep convolutional Gaussian processes, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2020, pp. 582–597.
[61] G. Hernández-Muñoz, C. Villacampa-Calvo, D. Hernández-Lobato, Deep Gaussian processes using expectation propagation and Monte Carlo methods, in: ECML PKDD, 2020, pp. 479–494.
[62] D. Ming, D. Williamson, S. Guillas, Deep Gaussian process emulation using stochastic imputation, Technometrics 65 (2022) 150–161.
[63] A. Sauer, R. Gramacy, D. Higdon, Active learning for deep Gaussian process surrogates, Technometrics 65 (2023) 4–18.
[64] A. Jacot, F. Gabriel, C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, in: NeurIPS, 2018, pp. 1–10.
[65] R. Novak, L. Xiao, J. Hron, J. Lee, A. Alemi, et al., Neural tangents: Fast and easy infinite neural networks in Python, in: ICLR, 2020, pp. 1–19.
[66] A. Malinin, M. Gales, Predictive uncertainty estimation via prior networks, in: NeurIPS, 2018, pp. 1–17.
[67] M. Biloš, B. Charpentier, S. Günnemann, Uncertainty on asynchronous time event prediction, in: NeurIPS, 2019, pp. 1–10.
[68] A. Malinin, M. Gales, Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness, in: NeurIPS, 2019, pp. 1–12.
[69] J. Nandy, W. Hsu, M. Lee, Towards maximizing the representation gap between in-domain and
out-of-distribution examples, in: NeurIPS, 2020, pp. 1–12.
[70] X. Zhao, F. Chen, S. Hu, J. Cho, Uncertainty aware semi-supervised learning on graph data, in: NeurIPS, 2020, pp. 1–10.
[71] J. Tumpach, J. Koza, Z. Pitra, M. Holeňa, Neural-network-based estimation of normal distributions in black-box optimization, in: ESANN, 2022, pp. 1–6.
[72] C. Valle, F. Saravia, H. Allende, R. Monge, C. Fernández, Parallel approach for ensemble learning with locally coupled neural networks, Neural Processing Letters 32 (2010) 277–291.
[73] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: NeurIPS, 2017, pp. 1–12.
[74] R. Hu, Q. Huang, S. Chang, H. Wang, J. He, The MBPEP: A deep ensemble pruning algorithm providing high quality uncertainty prediction, Applied Intelligence 49 (2019) 2942–2955.
[75] T. Pearce, F. Leibfried, A. Brintrup, M. Zaki, A. Neely, Uncertainty in neural networks: Approximately Bayesian ensembling, in: AISTATS, 2020, pp. 1–30.
[76] Y. Stoyanova, S. Ghandi, M. Tavakol, Toward robust uncertainty estimation with random activation functions, in: AAAI Conference on Artificial Intelligence, 2023, pp. 1–13.
[77] S. Kim, P. Lu, C. Lob, J. Smith, J. Snoek, et al., Deep learning for Bayesian optimization of scientific problems with high-dimensional structure, Transactions on Machine Learning Research 1 (2022), OpenReview tPMQ6Je2rB.
[78] A. Tripp, E. Daxberger, J. Hernández-Lobato, Sample-efficient optimization in the latent space of deep generative models via weighted retraining, in: NeurIPS, 2020, pp. 1–14.
[79] M. Gillhofer, H. Ramsauer, J. Brandstetter, B. Schäfl, S. Hochreiter, A GAN based solver of black-box inverse problems, in: NeurIPS, 2019, pp. 1–5.
[80] M. Lu, S. Ning, S. Liu, F. Sun, B. Zhang, et al., OPT-GAN: A broad-spectrum global optimizer for black-box problems by learning distribution, 2022. ArXiv: 2102.03888v5.
[81] Y. Chen, X. Song, C. Lee, Z.
Wang, Q. Zhang, et al., Towards learning universal hyperparameter optimizers with transformers, in: NeurIPS, 2022, pp. 1–16.
[82] S. Müller, M. Feurer, N. Hollmann, F. Hutter, PFNs4BO: In-context learning for Bayesian optimization, in: ICML, 2023, pp. 1–27.
[83] C. Rasmussen, C. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006.
[84] A. Damianou, N. Lawrence, Deep Gaussian processes, in: AISTATS, 2013, pp. 1–9.
[85] T. Bui, D. Hernández-Lobato, J. Hernández-Lobato, Y. Li, R. Turner, Deep Gaussian processes for regression using approximate expectation propagation, in: ICML, 2016, pp. 1472–1481.
[86] K. Cutajar, E. Bonilla, P. Michiardi, M. Filippone, Random feature expansions for deep Gaussian processes, in: ICML, 2017, pp. 884–893.
[87] A. Hebbal, L. Brevault, M. Balesdent, E. Talbi, N. Melab, Efficient global optimization using deep Gaussian processes, in: IEEE CEC, 2018, pp. 1–12.
[88] A. Matthews, J. Hron, M. Rowland, R. Turner, Gaussian process behaviour in wide deep neural networks, in: ICLR, 2019, pp. 1–15.
[89] A. Hebbal, L. Brevault, M. Balesdent, E. Talbi, N. Melab, Bayesian optimization using deep Gaussian processes, 2019. ArXiv: 1905.03350v1.
[90] H. Yu, B. Low, P. Jaillet, D. Liu, Convolutional normalizing flows for deep Gaussian processes, in: IJCNN, 2021, pp. 1–5.
[91] N. Hansen, S. Finck, R. Ros, A. Auger, Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions, Technical Report, INRIA, Paris Saclay, 2009.
[92] N. Hansen, A. Auger, R. Ros, O. Mersmann, T. Tušar, D. Brockhoff, COCO: a platform for comparing continuous optimizers in a black box setting, Optimization Methods and Software 35 (2021) 114–144.
[93] J. Koza, J. Tumpach, Z. Pitra, M. Holeňa, Using past experience for configuration of Gaussian processes in black-box optimization, in: LION, 2021, pp. 167–182.
[94] J. Lee, Y. Bahri, R. Novak, S. Schoenholz, et al., Deep neural networks as Gaussian processes, in: ICLR, 2018, pp. 1–17.
[95] R. Novak, L. Xiao, J. Lee, Y. Bahri, et al., Bayesian deep convolutional networks with many channels are Gaussian processes, in: ICLR, 2019, pp. 1–35.
[96] B. He, B. Lakshminarayanan, Y. Teh, Bayesian deep ensembles via the neural tangent kernel, in: NeurIPS, 2020, pp. 1–13.
[97] B. Paria, B. Póczos, P. Ravikumar, A. Suggala, et al., Be greedy – a simple algorithm for blackbox optimization using neural networks, in: ICML Workshop on Adaptive Experimental Design and Active Learning in the Real World, 2022, pp. 1–27.
[98] A. Malinin, S. Chervontsev, I. Povilkov, M. Gales, Regression prior networks, 2020. ArXiv: 2006.11590v2.
[99] A. Amini, W. Schwarting, A. Soleimany, D. Rus, Deep evidential regression, in: NeurIPS, 2020, pp. 1–11.
[100] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, in: NeurIPS, 2018, pp. 1–11.
[101] D. Oh, B. Shin, Improving evidential deep learning via multi-task learning, in: AAAI Conference on Artificial Intelligence, 2022, pp. 1–14.
[102] Y. Tong, P. Xu, T. Denoeux, An evidential classifier based on Dempster-Shafer theory and deep learning, Neurocomputing 450 (2021) 275–293.
[103] D. Ulmer, A survey on evidential deep learning for single-pass uncertainty estimation, 2021. ArXiv: 2110.03051v2.
[104] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, 1976.
[105] J. Ling, Z. Zhou, Causal discovery based on neural network ensemble method, Journal of Software 15 (2004) 1479–1484.
[106] H. Chen, S. Yuan, K. Jiang, Wrapper approach for learning neural network ensemble by feature selection, in: Advances in Neural Networks – ISNN 2005, Springer, 2005, pp. 526–531.
[107] D. Partridge, Network generalization differences quantified, Neural Networks 9 (1996) 263–271.
[108] M. Jang, S. Cho, Observational learning algorithm for an ensemble of neural networks, Pattern Analysis and Applications 5 (2002) 154–167.
[109] Z.
Wang, S. Chen, Z. Chen, An active learning approach for neural network ensemble, Journal of Computer Research and Development 42 (2005) 375–380.
[110] M. Islam, X. Yao, K. Murase, A constructive algorithm for training cooperative neural network ensembles, IEEE Transactions on Neural Networks 14 (2003) 820–834.
[111] M. Alhamdoosh, D. Wang, Fast decorrelated neural network ensembles with random weights, Information Sciences 264 (2014) 104–117.
[112] K. Dai, J. Zhao, F. Cao, A novel decorrelated neural network ensemble algorithm for face recognition, Knowledge Based Systems 89 (2015) 541–552.
[113] J. Ling, Z. Chen, Z. Zhou, Feature selection based neural network ensemble method, Journal of Fudan University (Natural Sciences) 43 (2004) 685–688.
[114] T. Yang, C. Zhang, Freeway incident detection based on AdaBoost RBF neural network, Computer Engineering and Applications 32 (2008) 223–225.
[115] H. Liu, G. Chen, G. Song, T. Han, AdaBoost based ensemble of neural networks in analog circuit fault diagnosis, Chinese Journal of Scientific Instrument 4 (2010) 851–856.
[116] A. Ashukha, A. Lyzhov, D. Molchanov, D. Vetrov, Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, in: ICLR, 2020, pp. 1–30.
[117] P. McDermott, C. Wikle, Deep echo state networks with uncertainty quantification for spatio-temporal forecasting, Environmetrics 30 (2019), paper no. e2553.
[118] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, et al., Google Vizier: A service for black-box optimization, in: Knowledge Discovery and Data Mining, 2017, pp. 1487–1496.
[119] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: NeurIPS, 2014, pp. 1–9.
[120] E. Tzeng, J. Hoffman, T. Darrell, K. Saenko, Simultaneous deep transfer across domains and tasks, in: ICCV, 2015, pp. 4068–4076.
[121] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, D. Erhan, Domain separation networks,
in: NeurIPS, 2016, pp. 1–9.
[122] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.
[123] M. Long, H. Zhu, J. Wang, M. Jordan, Deep transfer learning with joint adaptation networks, in: ICML, 2017, pp. 3470–3479.
[124] W. Cui, G. Zheng, Z. Shen, S. Jiang, W. Wang, Transfer learning for sequences via learning to collocate, in: ICLR, 2019, pp. 1487–1496.
[125] F. Zhuang, X. Cheng, P. Luo, S. Pan, Supervised representation learning: Transfer learning with deep autoencoders, in: IJCAI, 2015, pp. 4119–4125.
[126] F. Zhuang, X. Cheng, P. Luo, S. Pan, Q. He, Supervised representation learning with double encoding-layer autoencoder for transfer learning, ACM Transactions on Intelligent Systems and Technology 9 (2018) 1–17.
[127] E. Tzeng, J. Hoffman, K. Saenko, T. Darrell, Adversarial discriminative domain adaptation, in: CVPR, 2017, pp. 1–10.
[128] Z. Cao, M. Long, J. Wang, M. Jordan, Partial transfer learning with selective adversarial networks, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2724–2732.
[129] B. Wang, M. Qiu, X. Wang, Y. Li, Y. Gong, et al., A minimax game for instance based selective transfer learning, in: KDD, 2019, pp. 34–43.
[130] M. Liu, Coupled generative adversarial networks, in: NeurIPS, 2016, pp. 1–9.
[131] Y. Stoyanova, YanasGH/RAFs, 2023. https://github.com/YanasGH/RAFs.
[132] COCO Data Archive, Algorithm data sets for the bbob test suite, 2023. https://numbbo.github.io/data-archive/bbob/.
[133] S. Garcia, F. Herrera, An extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
[134] A. Benavoli, G. Corani, F.
Mangili, Should we really use post-hoc tests based on mean-ranks?, Journal of Machine Learning Research 17 (2016) 1–10.
[135] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.

A. Employed Benchmarks

The functions in the bbob suite are divided into five groups:

1. Separable functions (Figure 4).
   • f1: sphere;
   • f2: ellipsoidal;
   • f3: Rastrigin;
   • f4: Büche-Rastrigin;
   • f5: linear slope.

2. Functions with low or moderate conditioning (Figure 5).
   • f6: attractive sector;
   • f7: step ellipsoidal;
   • f8: Rosenbrock;
   • f9: Rosenbrock rotated.

3. Unimodal functions with high conditioning (Figure 6).
   • f10: ellipsoidal;
   • f11: discus;
   • f12: bent cigar;
   • f13: sharp ridge;
   • f14: different powers.

Figure 4: Separable functions. From left to right: sphere, ellipsoidal, Rastrigin, Büche-Rastrigin, maximized linear slope.

Figure 5: Functions with low or moderate conditioning. From left to right: attractive sector, step ellipsoidal, Rosenbrock, Rosenbrock rotated.

Figure 6: Unimodal functions with high conditioning. From left to right: ellipsoidal, discus, bent cigar, sharp ridge, different powers.

4. Multi-modal functions with adequate global structure (Figure 7).
   • f15: Rastrigin;
   • f16: Weierstrass;
   • f17: Schaffers F7 function;
   • f18: Schaffers F7 function, moderately ill-conditioned;
   • f19: composite Griewank-Rosenbrock function F8F2.

Figure 7: Multi-modal functions with adequate global structure. From left to right: Rastrigin, Weierstrass, Schaffers F7 function, moderately ill-conditioned Schaffers F7 function, composite Griewank-Rosenbrock function F8F2.

5. Multi-modal functions with weak global structure (Figure 8).
   • f20: Schwefel;
   • f21: Gallagher's Gaussian 101-me peaks;
   • f22: Gallagher's Gaussian 21-hi peaks;
   • f23: Katsuura;
   • f24: Lunacek bi-Rastrigin.
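To make the groups concrete, here are raw, untransformed definitions of three of the listed functions; the actual bbob instances additionally apply shifts, rotations, and other transformations [91], and the Python names are ours:

```python
import math

def sphere(x):
    """f1 (raw form): sum of squared coordinates; minimum 0 at the origin."""
    return sum(xi ** 2 for xi in x)

def rastrigin(x):
    """f3 (raw form): highly multi-modal; minimum 0 at the origin."""
    return 10 * len(x) + sum(xi ** 2 - 10 * math.cos(2 * math.pi * xi)
                             for xi in x)

def rosenbrock(x):
    """f8 (raw form): narrow curved valley; minimum 0 at (1, ..., 1)."""
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (1 - x[i]) ** 2
               for i in range(len(x) - 1))
```

These raw forms already illustrate the group labels: sphere is separable and unimodal, Rastrigin is multi-modal with a clear global structure, and Rosenbrock is unimodal but ill-conditioned along its valley.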
Figure 8: Multi-modal functions with weak global structure. From left to right: Schwefel, Gallagher's Gaussian 101-me peaks, Gallagher's Gaussian 21-hi peaks, Katsuura, Lunacek bi-Rastrigin.

B. Activation Functions Employed to Form an RAF Ensemble

• Gauss error function

  \operatorname{erf}(x) = \begin{cases} \int_0^{x} e^{-t^2}\,\mathrm{d}t & \text{if } x \ge 0, \\ -\int_0^{-x} e^{-t^2}\,\mathrm{d}t & \text{if } x < 0. \end{cases}   (2)

• Gaussian error linear unit

  \operatorname{gelu}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right).   (3)

• Scaled exponential linear unit

  \operatorname{selu}(x) = \begin{cases} c\,x & \text{if } x \ge 0, \\ c\,\alpha\,(e^{x} - 1) & \text{if } x < 0, \end{cases} \quad \text{where } c, \alpha > 0.   (4)

  In the employed TensorFlow implementation, c = 1.05070098, α = 1.67326324.

• Softsign activation function

  \operatorname{softsign}(x) = \frac{x}{|x| + 1}.   (5)

• Hyperbolic tangent

  \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.   (6)
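For illustration, these five activations can be implemented in a few lines; the function names and the helper `random_activation`, which samples one ensemble member's activation in the spirit of an RAF ensemble, are our own sketch, and Python's `math.erf` is the standard, normalized Gauss error function:

```python
import math
import random

# Constants of the TensorFlow SELU implementation mentioned in the text.
SELU_C, SELU_ALPHA = 1.05070098, 1.67326324

def gelu(x):
    # x/2 * (1 + erf(x / sqrt(2))), cf. equation (3)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def selu(x):
    # c*x for x >= 0, c*alpha*(e^x - 1) otherwise, cf. equation (4)
    return SELU_C * x if x >= 0 else SELU_C * SELU_ALPHA * (math.exp(x) - 1.0)

def softsign(x):
    # x / (|x| + 1), cf. equation (5)
    return x / (abs(x) + 1.0)

# The five candidate activations of an RAF ensemble member.
ACTIVATIONS = [math.erf, gelu, selu, softsign, math.tanh]

def random_activation(rng=random):
    """Sample the activation function of one ensemble member at random."""
    return rng.choice(ACTIVATIONS)
```

Each network of the ensemble would be built with one activation drawn by `random_activation`, so that the disagreement between members reflects both weight initialization and activation choice.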