<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Suitability of Modern Neural Networks for Active and Transfer Learning in Surrogate-Assisted Black-Box Optimization</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
							<email>martin@cs.cas.cz</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Czech Academy of Sciences</orgName>
								<orgName type="department" key="dep2">Institute of Computer Science</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jan</forename><surname>Koza</surname></persName>
							<email>kozajan@fit.cvut.cz</email>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Information Technology</orgName>
								<orgName type="institution">Czech Technical University</orgName>
								<address>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Suitability of Modern Neural Networks for Active and Transfer Learning in Surrogate-Assisted Black-Box Optimization</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6C30A2C47C606570F7BCA77DFF9D1E5E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:23+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Active learning plays a crucial role in black-box optimization, especially for objective functions that are expensive to evaluate. Continuous black-box optimization has adopted an approach called surrogate modelling, in which the original black-box objective is approximated with a regression model. An active learning task in this context is to decide which points should be evaluated with the original objective in order to update the surrogate model. Apart from low-order polynomials, the first surrogate models were artificial neural networks of the kinds multilayer perceptron and radial basis function network. In the late 2000s, neural networks were superseded by other kinds of surrogate models, primarily Gaussian processes. However, over the last 15 years, neural networks have seen significant and successful development, suggesting that they once again have the potential to serve as promising surrogate models. This paper reviews possible research directions concerning that potential and recalls initial results from investigations in some of these directions. Finally, it contributes to those results by investigating the state-of-the-art black-box optimizer CMA-ES surrogate-assisted by two variants of random-activation-function neural network ensembles.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>One area where active learning plays a particularly important role is black-box optimization (BBO), i.e., optimization of objective functions for which no analytical description is provided. It employs optimization methods that need as input only points in the search space paired with the respective values of the objective function obtained in a non-analytical way, e.g. from sensors, in experiments, or through numerical simulations. Most frequently used are evolutionary optimization approaches, such as evolution strategies, genetic algorithms, and differential evolution, or other metaheuristics, such as particle swarm optimization.</p><p>Because BBO methods receive only information about values of the objective function, they typically need many such values. This is a problem in situations when evaluating the black-box objective function is time-consuming and/or expensive. That is frequently the case if it is evaluated empirically in experiments. For example, for the evolutionary optimization tasks described in the book <ref type="bibr" target="#b0">[1]</ref>, the evaluation of a comparatively small generation of a genetic algorithm can sometimes take more than a week and cost more than 10,000 €. To deal with expensive evaluations, continuous BBO adopted in the late 1990s and early 2000s an approach called surrogate modelling or metamodelling <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>. 
In principle, a surrogate model is any regression model that approximates the original black-box objective function with sufficient fidelity, restricting the necessity of evaluating the original objective to only a small proportion of points; everywhere else, only the surrogate model is used.</p><p>Selecting the points at which the original objective function should be evaluated is the step in which active learning is involved. However, it is not active learning of a regression model, although the surrogate model itself is a regression model. The reason is that its utility functions are not based on the model, unlike the commonly used utility functions of uncertainty decrease, model performance, diversity, or surprise-novelty. Instead, they are based on the BBO task, the most common being minimizing the objective function for a given evaluation budget, and minimizing the evaluation budget for a given objective-function threshold. Nevertheless, even active learning in surrogate-assisted BBO follows the basic principle of active learning: to actively select the next model inputs according to the considered utility function.</p><p>The earliest kinds of surrogate models in continuous BBO were low-order polynomials and artificial neural networks (ANNs) of the kind multilayer perceptron (MLP). The former have always remained a suitable choice in situations when enough evaluations of the original black-box objective function are affordable for the approximation properties of polynomials to take effect. On the other hand, surrogate modelling for substantially fewer evaluations of the original objective has undergone further development during the last two decades. MLPs were soon replaced with another kind of ANNs, radial basis function networks (RBFs), which better fit local peculiarities of an objective function landscape. 
Those networks, however, have since the late 2000s been superseded by other kinds of surrogate models, primarily Gaussian processes (GPs), but also ranking support vector machines (RSVMs) and random forests (RFs). GPs are currently the most successful kind of surrogate models for BBO with a small evaluation budget on functions with complicated multimodal landscapes, mainly due to their ability to assess the uncertainty of the estimate of the original objective function at a given point, more precisely, to provide the probability distribution of this estimate. That property of GPs makes it possible to combine the original BBO method, e.g. an evolutionary one, with Bayesian optimization.</p><p>Consequently, little attention has been paid to ANN-based surrogate models in continuous BBO during the last 15 years. This contrasts with the intense and successful development of the ANN area during that time, which suggests that ANNs again have the potential to serve as promising surrogate models. This paper attempts to make a small contribution to research into that potential, additionally presenting a review of possible directions for such research, connected with different classes of neural networks. Moreover, it also points out that ANNs can serve as the basis for transfer learning between surrogate-assisted BBO of different functions.</p><p>The next section surveys important aspects and key methods concerning surrogate-assisted continuous BBO. The review of possible research directions concerning the usability of modern neural networks in surrogate-assisted BBO is presented in Section 3. Finally, Section 4 reports an experimental contribution to one of those research directions.</p></div>
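To make the surrogate-modelling principle outlined above concrete, the following minimal sketch actively selects only the most promising candidate for each expensive evaluation, using a low-order polynomial surrogate. All names are illustrative and of our own choosing; this corresponds to none of the methods cited in this paper.

```python
import numpy as np

def surrogate_assisted_minimize(objective, dim, budget, seed=0):
    """Minimal sketch of surrogate-assisted BBO: fit a cheap surrogate on the
    points evaluated so far, and actively select only the most promising
    candidate for the next expensive evaluation."""
    rng = np.random.default_rng(seed)
    # initial design: a few random points evaluated with the true objective
    X = rng.uniform(-5.0, 5.0, size=(5, dim))
    y = np.array([objective(x) for x in X])
    while len(y) < budget:
        # surrogate: low-order polynomial (quadratic features + least squares)
        Phi = np.hstack([X, X**2, np.ones((len(X), 1))])
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        # sample candidates around the incumbent and rank them by the surrogate
        best = X[np.argmin(y)]
        C = best + rng.normal(scale=1.0, size=(50, dim))
        Pc = np.hstack([C, C**2, np.ones((50, 1))])
        # only the candidate with the lowest surrogate prediction is evaluated
        # with the expensive true objective
        x_next = C[np.argmin(Pc @ w)]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    return X[np.argmin(y)], y.min()
```

On a simple quadratic test function, this loop spends only `budget` true evaluations, while the surrogate ranks all remaining candidates.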
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Surrogate-Assisted Continuous BBO</head><p>Surrogate modelling for continuous BBO relies on the combination and interaction of three components: a regression model serving as a surrogate of the original black-box objective function, a BBO method seeking the optimum of that objective function, and a strategy determining when to evaluate the original objective function and when its surrogate model. In the context of evolutionary BBO, that strategy is usually called evolution control <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. There are two further aspects, namely observing constraints on the feasible set of the black-box objective function (cf. e.g. <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>) and generalizing surrogate modelling from a single objective to multiple objectives (cf. e.g. <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>); however, we will restrict our attention to single-objective unconstrained optimization.</p><p>As already mentioned in the introduction, if sufficiently many evaluations of the original black-box objective function are affordable, the most suitable kind of surrogate models are low-order polynomials, typically quadratic functions <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b20">20,</ref><ref type="bibr" target="#b21">21,</ref><ref type="bibr" target="#b22">22]</ref>. 
For substantially fewer evaluations, the most traditional kind has been MLPs <ref type="bibr" target="#b23">[23,</ref><ref type="bibr" target="#b8">9]</ref>, soon replaced with RBFs <ref type="bibr" target="#b24">[24,</ref><ref type="bibr" target="#b25">25,</ref><ref type="bibr" target="#b26">26,</ref><ref type="bibr" target="#b21">21,</ref><ref type="bibr" target="#b22">22]</ref>, and since the late 2000s with GPs, a.k.a. kriging <ref type="bibr" target="#b27">[27,</ref><ref type="bibr" target="#b28">28,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b29">29,</ref><ref type="bibr" target="#b30">30]</ref>. Occasionally, RBFs were used as local models in combination with GP-based global models <ref type="bibr" target="#b31">[31]</ref>. Other kinds of surrogate models employed during the last decade include decision trees <ref type="bibr" target="#b32">[32]</ref>, RFs <ref type="bibr" target="#b33">[33,</ref><ref type="bibr" target="#b34">34,</ref><ref type="bibr" target="#b32">32]</ref>, and RSVMs <ref type="bibr" target="#b35">[35,</ref><ref type="bibr" target="#b36">36]</ref>. The last kind has the exceptional property of invariance with respect to order-preserving transformations of the objective function. This is important in situations when the BBO algorithm possesses such invariance, a property frequently encountered in evolutionary algorithms. On the other hand, the surrogate modelling methods proposed in <ref type="bibr" target="#b10">[11]</ref> and <ref type="bibr" target="#b28">[28]</ref> use GPs to perform preselection based on a partial ordering that is also invariant with respect to order-preserving transformations. More importantly, the adaptive function-value warping approach recently proposed in <ref type="bibr" target="#b37">[37]</ref> aims at providing such invariance to any surrogate model. 
As a final remark on different kinds of surrogate models, important works on that topic typically consider several kinds <ref type="bibr" target="#b38">[38,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b39">39,</ref><ref type="bibr" target="#b20">20,</ref><ref type="bibr" target="#b32">32]</ref>, to compare them and select the best among them, and in <ref type="bibr" target="#b22">[22,</ref><ref type="bibr" target="#b39">39]</ref> also to aggregate their results, thus providing a team of surrogate models.</p><p>As to the BBO methods, not only the two most important kinds of surrogate models, i.e. low-order polynomials <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b20">20]</ref> and GPs <ref type="bibr" target="#b26">[26,</ref><ref type="bibr" target="#b28">28,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b29">29,</ref><ref type="bibr" target="#b30">30]</ref>, but also the less common RBFs, RFs, and RSVMs <ref type="bibr" target="#b24">[24,</ref><ref type="bibr" target="#b36">36,</ref><ref type="bibr" target="#b33">33,</ref><ref type="bibr" target="#b34">34]</ref> are most often combined with the covariance matrix adaptation evolution strategy (CMA-ES). That is not surprising, because CMA-ES became a state-of-the-art approach to single-objective unconstrained continuous BBO already in the 2000s. Basically, CMA-ES evolves a Gaussian estimate of the position of the minimum of the original objective function. That evolution relies on the simultaneous adaptation of the mean vector of the Gaussian estimate, of the scalar step size, and of the covariance matrix. For more details of this sophisticated evolution strategy, the reader is referred to the journal papers <ref type="bibr" target="#b40">[40,</ref><ref type="bibr" target="#b41">41]</ref>. 
GPs have also been combined with other evolutionary optimization methods <ref type="bibr" target="#b27">[27,</ref><ref type="bibr" target="#b42">42]</ref>, and GPs, polynomials, and RBFs have been combined with particle swarm optimization <ref type="bibr" target="#b22">[22]</ref> and with memetic optimization <ref type="bibr" target="#b25">[25]</ref>. Moreover, GPs are used in black-box optimization in two different ways. In connection with evolutionary and similar BBO methods, they serve as a regression model evaluated instead of the original objective function. In addition, they also play a key role in Bayesian optimization, which relies on GP estimates of probability distributions of values of the original objective. Those probability distributions enable several ways of searching for optima of that objective function, each of them governed by a specific assessment of the uncertainty of the objective-function estimate, commonly called an acquisition function <ref type="bibr" target="#b43">[43,</ref><ref type="bibr" target="#b44">44,</ref><ref type="bibr" target="#b45">45]</ref>. Occasionally, Bayesian optimization is combined with CMA-ES. For example, in <ref type="bibr" target="#b46">[46]</ref>, optimization switches from the most traditional Bayesian optimization method, EGO (Efficient Global Optimization) <ref type="bibr" target="#b43">[43]</ref>, to CMA-ES.</p><p>Finally, since the first surrogate-assisted BBO methods, evolution control has been performed in basically two ways: generation-based and individual-based. In generation-based evolution control, all points are evaluated with the true objective function in some generations, and with the model in the remaining generations. In individual-based evolution control, on the other hand, all points of each generation are first evaluated with the model, and based on those evaluations, a preselection of points to be evaluated with the true objective function is performed <ref type="bibr" target="#b8">[9]</ref>. 
In most surrogate-assisted methods, however, the evolution control is specifically tailored to the respective method. Notably, the authors of <ref type="bibr" target="#b12">[13]</ref> investigated mutually replacing the evolution control of two important polynomial-assisted methods, lmm-CMA <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> and lq-CMA-ES <ref type="bibr" target="#b20">[20]</ref>, and of two variants of the GP-assisted method DTS-CMA-ES <ref type="bibr" target="#b47">[47,</ref><ref type="bibr" target="#b11">12]</ref> with each other's evolution control. According to their findings, the success of those important methods is definitely not limited to using the respective specifically tailored evolution control. Surrogate-assisted black-box optimization methods constructing several surrogate models simultaneously either aggregate them into a team <ref type="bibr" target="#b25">[25,</ref><ref type="bibr" target="#b22">22]</ref> or complement the evolution control with a classifier selecting the most appropriate among those models. Important examples of classifiers used in this context are ANNs <ref type="bibr" target="#b48">[48,</ref><ref type="bibr" target="#b49">49,</ref><ref type="bibr" target="#b50">50]</ref> and classification trees <ref type="bibr" target="#b51">[51,</ref><ref type="bibr" target="#b52">52]</ref>. Their learning can be viewed as metalearning because it is based on metafeatures, i.e. properties empirically characterizing the objective function landscape and the BBO method <ref type="bibr" target="#b21">[21,</ref><ref type="bibr" target="#b32">32,</ref><ref type="bibr" target="#b49">49,</ref><ref type="bibr" target="#b53">53]</ref>. Apart from classification according to the appropriateness of a surrogate model for the considered data, metalearning can also be used for regression of the model error on the combination of values of metafeatures <ref type="bibr" target="#b54">[54]</ref>.</p></div>
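The two basic kinds of evolution control described above can be illustrated by the following schematic sketch. The function names and the selection criterion are illustrative assumptions of ours, not taken from any cited method.

```python
import numpy as np

def generation_based_control(gen_index, period=3):
    """Generation-based evolution control: every `period`-th generation is
    evaluated with the true objective, the remaining ones with the surrogate."""
    return "true_objective" if gen_index % period == 0 else "surrogate"

def individual_based_control(population, surrogate, n_true):
    """Individual-based evolution control: evaluate the whole population with
    the surrogate, then preselect the `n_true` most promising individuals for
    evaluation with the true objective."""
    predictions = np.array([surrogate(x) for x in population])
    order = np.argsort(predictions)    # ascending: best predicted first
    return population[order[:n_true]]  # points sent to the expensive objective
```

In practice, the preselection criterion may combine the predicted value with the model's uncertainty, which is exactly where the distribution-estimating models of Section 3 come into play.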
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Usability of Modern Neural Networks in Surrogate-Assisted BBO</head><p>This section primarily reviews eight kinds of modern neural networks that we consider worth research into their ability to serve as surrogate models in BBO. A high-level overview of those kinds of ANNs is given in Table <ref type="table" target="#tab_0">1</ref>, which for each of them mentions whether such research has already started. In Subsection 3.1, two kinds integrating GPs into ANNs are recalled. Subsection 3.2 recalls three kinds of ANNs providing the most advantageous property of GPs, their ability to estimate the distribution of black-box objective function values. Finally, in Subsection 3.3, three well-known kinds of modern neural networks, namely variational autoencoders, transformers, and generative adversarial networks, are recalled because they have already proven useful in the related area of Bayesian optimization. In addition, Subsection 3.4 is devoted to knowledge transfer in surrogate-assisted BBO, which relates to the usability of modern neural networks through their important role in transfer learning.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1: Considered kinds of modern ANNs, with their main references and the state of research into their ability to serve as surrogate models in BBO</head><table><row><cell>ANNs + main references</cell><cell>Research into the ability to serve as surrogate model in BBO</cell></row><row><cell>MLPs with a GP as the final layer <ref type="bibr" target="#b55">[55,</ref><ref type="bibr" target="#b56">56]</ref></cell><cell>First investigations <ref type="bibr" target="#b57">[57,</ref><ref type="bibr" target="#b58">58]</ref></cell></row><row><cell>Deep GP networks <ref type="bibr" target="#b59">[59,</ref><ref type="bibr" target="#b60">60,</ref><ref type="bibr" target="#b61">61,</ref><ref type="bibr" target="#b62">62,</ref><ref type="bibr" target="#b63">63]</ref></cell><cell>Not yet</cell></row><row><cell>Tangent kernel networks <ref type="bibr" target="#b64">[64,</ref><ref type="bibr" target="#b65">65]</ref></cell><cell>Not yet</cell></row><row><cell>Prior networks <ref type="bibr" target="#b66">[66,</ref><ref type="bibr" target="#b67">67,</ref><ref type="bibr" target="#b68">68,</ref><ref type="bibr" target="#b69">69,</ref><ref type="bibr" target="#b70">70]</ref></cell><cell>First investigations <ref type="bibr" target="#b71">[71]</ref></cell></row><row><cell>Ensembles of neural networks <ref type="bibr" target="#b72">[72,</ref><ref type="bibr" target="#b73">73,</ref><ref type="bibr" target="#b74">74,</ref><ref type="bibr" target="#b75">75,</ref><ref type="bibr" target="#b76">76]</ref></cell><cell>First investigations [this paper]</cell></row><row><cell>Variational autoencoders <ref type="bibr" target="#b77">[77,</ref><ref type="bibr" target="#b78">78]</ref></cell><cell>Not yet</cell></row><row><cell>Generative adversarial networks <ref type="bibr" target="#b79">[79,</ref><ref type="bibr" target="#b80">80]</ref></cell><cell>Not yet</cell></row><row><cell>Transformers <ref type="bibr" target="#b81">[81,</ref><ref type="bibr" target="#b82">82]</ref></cell><cell>Not yet</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Integration of GPs into ANNs</head><p>The integration of GPs into ANNs has been proposed on two different levels:</p><p>1. At the layer level -a GP serves as the final layer of an MLP <ref type="bibr" target="#b55">[55,</ref><ref type="bibr" target="#b56">56]</ref>. Integration on that level is based on the following two assumptions: (i) If 𝑛 𝐼 denotes the number of ANN input neurons, then the ANN computes a mapping net of 𝑛 𝐼 -dimensional input values into the set 𝒳 on which the GP is defined. Consequently, the number 𝑛 𝑂 of neurons in the last hidden layer fulfills 𝒳 ⊂ R 𝑛 𝑂 , and the ANN maps an input 𝑣 into a point 𝑥 = net(𝑣) ∈ 𝒳 , corresponding to an observation 𝑓 (𝑥 + 𝜀) governed by the GP, where 𝜀 is a zero-mean Gaussian noise. From the point of view of the ANN inputs, the GP is now 𝒢𝒫(𝑚 GP (net(•)), 𝜅(net(•), net(•))), where 𝑚 GP is the mean function and 𝜅 is the covariance function of the GP <ref type="bibr" target="#b83">[83]</ref>. (ii) The GP mean 𝜇 is assumed to be a known constant, thus not contributing to the GP hyperparameters, and independent of net. 2. At the level of individual neurons -GPs can replace all hidden and output neurons of an MLP. This kind of neural networks is commonly called deep Gaussian process <ref type="bibr" target="#b59">[59,</ref><ref type="bibr" target="#b60">60,</ref><ref type="bibr" target="#b61">61,</ref><ref type="bibr" target="#b62">62,</ref><ref type="bibr" target="#b63">63,</ref><ref type="bibr" target="#b84">84,</ref><ref type="bibr" target="#b85">85,</ref><ref type="bibr" target="#b86">86,</ref><ref type="bibr" target="#b87">87,</ref><ref type="bibr" target="#b88">88,</ref><ref type="bibr" target="#b89">89,</ref><ref type="bibr" target="#b90">90]</ref>.</p><p>Integration on both levels has been developed primarily for Bayesian modelling and optimization. 
Nevertheless, GPs integrated as the last layer of MLPs have been used as surrogate models in a CMA-ES-driven BBO <ref type="bibr" target="#b57">[57,</ref><ref type="bibr" target="#b58">58]</ref>. In particular, those surrogate models incorporate GPs with five commonly employed covariance functions: linear, quadratic, rational quadratic, squared exponential, and Matérn 5/2, as well as with one composite covariance function superposing the quadratic and squared exponential ones. Those six models were compared in <ref type="bibr" target="#b57">[57]</ref> from the point of view of regression accuracy, evaluated on a large dataset collected during many previous runs of DTS-CMA-ES on the collection of 24 noiseless benchmarks from the Comparing Continuous Optimizers platform <ref type="bibr" target="#b91">[91,</ref><ref type="bibr" target="#b92">92]</ref> (cf. Section 4) in dimensions 2, 3, 5, 10, and 20. Then, in <ref type="bibr" target="#b58">[58]</ref>, they were compared on the same benchmarks in the same dimensions from the point of view of the success of surrogate-assisted optimization with CMA-ES. Unfortunately, neither of those comparisons included more traditional surrogate models, nor the CMA-ES without surrogate assistance. To our knowledge, the only comparison that included both a GP integrated as the last layer of an MLP and more traditional surrogate models was the comparison from the point of view of regression accuracy in <ref type="bibr" target="#b93">[93]</ref>. However, it included only one such integrated surrogate model, with the GP using the simplest covariance function, the linear one, in addition to the traditional GP-based surrogate models with eight different covariance functions, including the five listed above.</p></div>
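The layer-level integration recalled above can be illustrated by a simplified sketch of GP prediction on top of a fixed MLP feature map. This is our own illustrative construction (untrained weights, zero constant mean, squared-exponential covariance), not the implementation used in the cited works.

```python
import numpy as np

def net(V, W1, W2):
    """A tiny fixed MLP feature map; its last hidden layer plays the role of
    the set X on which the GP is defined (weights are illustrative, not trained)."""
    return np.tanh(V @ W1) @ W2

def sq_exp_kernel(A, B, length=1.0):
    """Squared-exponential covariance kappa, evaluated on network outputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_last_layer_predict(V_train, y_train, V_test, W1, W2, noise=1e-6):
    """Posterior mean and variance of the composite model
    GP(m(net(.)), kappa(net(.), net(.))) with a constant zero mean."""
    X, Xs = net(V_train, W1, W2), net(V_test, W1, W2)
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(Xs, X)
    mean = Ks @ np.linalg.solve(K, y_train)
    # prior variance kappa(x, x) = 1, reduced by what the data explain
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```

In the cited approaches, the network weights and GP hyperparameters are trained jointly; here they are fixed only to keep the sketch short.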
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">ANNs Estimating the Distribution of Black-Box Objective Function Values</head><p>In our opinion, the property of GPs most advantageous from the point of view of surrogate modelling is that they estimate the whole distribution of a predicted value of the original black-box objective function. Recall from Section 2 that, due to that property, ensembles of regression trees (RFs) are also used as surrogate models <ref type="bibr" target="#b33">[33,</ref><ref type="bibr" target="#b34">34,</ref><ref type="bibr" target="#b32">32]</ref>. This draws attention to those modern neural networks that also allow estimation of such a distribution. Basically, there are three classes of them, differing in how that estimate can be obtained.</p><p>1. The multivariate normal distribution underlying GPs is actually the asymptotic distribution for the network width increasing to infinity. Such results have been established for several kinds of ANNs <ref type="bibr" target="#b94">[94,</ref><ref type="bibr" target="#b88">88,</ref><ref type="bibr" target="#b95">95,</ref><ref type="bibr" target="#b96">96,</ref><ref type="bibr" target="#b97">97]</ref>. In addition, closely related is the infinite width limit of the neural tangent kernel, which governs the kernel gradient of the functional cost used in MLP regression <ref type="bibr" target="#b64">[64,</ref><ref type="bibr" target="#b65">65]</ref>.</p><p>Although those results have great theoretical value, there can be a serious disparity between the infinite width results and their finite width counterparts <ref type="bibr" target="#b77">[77]</ref>. Therefore, it is unclear whether they can be applied to surrogate modelling. 2. The distribution of a predicted value, or more precisely the parameters of such a distribution, can be directly learned by an ANN. 
The best-known kind of such neural networks are prior networks, which learn the parameters of a normal-inverse-Wishart distribution, the conjugate prior to a multivariate normal distribution <ref type="bibr" target="#b66">[66,</ref><ref type="bibr" target="#b67">67,</ref><ref type="bibr" target="#b68">68,</ref><ref type="bibr" target="#b69">69,</ref><ref type="bibr" target="#b70">70,</ref><ref type="bibr" target="#b98">98]</ref>. Prior networks belong to a broader class of evidential neural networks <ref type="bibr" target="#b99">[99,</ref><ref type="bibr" target="#b100">100,</ref><ref type="bibr" target="#b101">101,</ref><ref type="bibr" target="#b102">102,</ref><ref type="bibr" target="#b103">103]</ref>. Their name refers to the fact that they follow the basic principle of the Dempster-Shafer theory of evidence <ref type="bibr" target="#b104">[104]</ref>: to fall back onto prior belief for unfamiliar data. 3. An estimate of the distribution of a predicted value is produced by an ensemble of neural networks.</p><p>Important kinds of such ensembles are ensembles obtained through diversification of training data <ref type="bibr" target="#b105">[105,</ref><ref type="bibr" target="#b106">106]</ref>; ensembles obtained through diversification of network properties <ref type="bibr" target="#b107">[107,</ref><ref type="bibr" target="#b108">108,</ref><ref type="bibr" target="#b109">109]</ref>, a specific subgroup of which are ensembles in which the diversification is achieved through diverse activation functions <ref type="bibr" target="#b76">[76]</ref>; ensembles obtained through negative correlation learning <ref type="bibr" target="#b110">[110,</ref><ref type="bibr" target="#b111">111,</ref><ref type="bibr" target="#b112">112]</ref>; bagging ensembles <ref type="bibr" target="#b72">[72,</ref><ref type="bibr" target="#b113">113]</ref>; boosting ensembles <ref type="bibr" target="#b114">[114,</ref><ref type="bibr" target="#b115">115]</ref>; deep ensembles <ref type="bibr" target="#b73">[73,</ref><ref type="bibr" target="#b74">74,</ref><ref type="bibr" target="#b116">116]</ref>, including deep echo-state network ensembles <ref type="bibr" target="#b117">[117]</ref>; and anchored ensembles <ref type="bibr" target="#b75">[75]</ref>, with a later modification, random activation function (RAF) ensembles <ref type="bibr" target="#b76">[76]</ref>. RAF ensembles take over the principle of anchored ensembles that regularization is performed not with respect to zero, but with respect to the initialization values of the parameters, which are assumed normally distributed. Differently from an anchored ensemble, however, an RAF ensemble uses varied activation functions from an a priori specified set of size 𝑛 AF . From that set, the activation function is chosen randomly, apart from the first 𝑛 AF members of the ensemble, among which each activation function occurs exactly once. We consider this last kind of ensembles to be the state of the art.</p><p>To our knowledge, the only ANNs estimating the distribution of function values that have already been used as surrogate models in BBO are prior networks. In <ref type="bibr" target="#b71">[71]</ref>, the prediction accuracy of four versions was evaluated on the above-mentioned dataset from previous runs of DTS-CMA-ES. This direction of research is continued by the present paper: Section 4 reports results for CMA-ES surrogate-assisted by two variants of RAF ensembles.</p></div>
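A minimal sketch of the RAF-ensemble idea may help to fix it. Under simplifying assumptions of our own (only the output layer is fitted, by anchored ridge regression; all names and the activation set are illustrative), it could look as follows:

```python
import numpy as np

# the a priori specified set of activation functions, here of size n_AF = 3
ACTIVATIONS = [np.tanh, lambda z: np.maximum(z, 0.0), np.sin]

def init_member(rng, dim, hidden=16):
    """Draw initial parameters from a normal prior; they also serve as anchors."""
    W = rng.normal(size=(dim, hidden))
    v = rng.normal(size=hidden)
    return {"W": W, "v": v, "anchor_v": v.copy()}

def fit_member(m, act, X, y, lam=1e-2):
    """Anchored ridge regression on the output layer: the penalty pulls the
    weights toward their initialization values, not toward zero."""
    H = act(X @ m["W"])
    A = H.T @ H + lam * np.eye(H.shape[1])
    m["v"] = np.linalg.solve(A, H.T @ y + lam * m["anchor_v"])
    return m

def raf_ensemble_predict(X_train, y_train, X_test, n_members=6, seed=0):
    """RAF ensemble sketch: each of the first n_AF members gets a distinct
    activation function, the remaining members draw one at random; the spread
    of member predictions estimates the predictive distribution."""
    rng = np.random.default_rng(seed)
    preds = []
    for i in range(n_members):
        act = ACTIVATIONS[i if i < len(ACTIVATIONS)
                          else rng.integers(len(ACTIVATIONS))]
        m = fit_member(init_member(rng, X_train.shape[1]), act, X_train, y_train)
        preds.append(act(X_test @ m["W"]) @ m["v"])
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

The returned per-point standard deviation is what a surrogate-assisted optimizer can use in place of the GP posterior variance.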
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">ANNs Found Useful in Bayesian Optimization</head><p>Recall from Section 2 that GPs, alongside their importance as surrogate models in BBO with non-Bayesian methods such as CMA-ES, also play a crucial role in Bayesian optimization. That is why this subsection lists three well-known kinds of modern neural networks that have recently been found useful in Bayesian optimization. In our opinion, this indicates that it is worth investigating whether they could also be used in surrogate-assisted BBO.</p><p>1. Variational autoencoders have been utilized in Bayesian optimization because they allow for optimization in a lower-dimensional latent space <ref type="bibr" target="#b77">[77,</ref><ref type="bibr" target="#b78">78]</ref>. 2. The generative adversarial networks (GANs) paradigm has recently been shown to be applicable to BBO: a generator proposes samples that align with the distribution of low values (or even the optimal value) of the black-box function, while one or more discriminators classify samples based on whether they belong to that distribution <ref type="bibr" target="#b79">[79,</ref><ref type="bibr" target="#b80">80]</ref>. 3. Transformers have proven effective in estimating complex prior distributions for Bayesian optimization <ref type="bibr" target="#b81">[81,</ref><ref type="bibr" target="#b82">82]</ref>. Notably, the OptFormer transformer, trained on Google Vizier <ref type="bibr" target="#b118">[118]</ref>, the largest hyperparameter optimization (HPO) database, achieved superior HPO outcomes compared to GP-based Bayesian optimization <ref type="bibr" target="#b81">[81]</ref>. Furthermore, the recently introduced transformer-based prior-data fitted networks <ref type="bibr" target="#b82">[82]</ref> can mimic GPs and Bayesian networks, while also incorporating additional information into the prior.</p></div>
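As a concrete example of the acquisition functions mentioned in Section 2, which any of the distribution-estimating models above could feed, the expected improvement criterion underlying EGO can be computed from a surrogate's predictive mean and standard deviation. This is the standard textbook formula for minimization, not code from any cited work.

```python
import math

def expected_improvement(mu, sigma, best_f):
    """Expected improvement (EI) at a candidate point, for minimization:
    `mu` and `sigma` are the surrogate's predictive mean and standard
    deviation there, `best_f` the best objective value found so far."""
    if sigma <= 0.0:
        return 0.0  # no predictive uncertainty, no expected improvement
    z = (best_f - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # exploitation term (low predicted value) + exploration term (high sigma)
    return (best_f - mu) * cdf + sigma * pdf
```

The point maximizing EI over the search space is the one actively selected for the next expensive evaluation.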
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">ANN-Based Transfer Learning for Surrogate-Assisted Black-Box Optimization</head><p>Obtaining accurate surrogate models in the initial stages of BBO is challenging due to the scarcity of data points with evaluated objective function values. That can be mitigated by leveraging knowledgetransfer learning. And a connection of modern kinds of neural networks with transfer learning is even more obvious than with active learning. Indeed, transfer learning is nowadays one of the areas where ANNs play most important role <ref type="bibr" target="#b119">[119,</ref><ref type="bibr" target="#b120">120,</ref><ref type="bibr" target="#b121">121]</ref>. Different types of ANNs have been utilized to this end, including convolutional <ref type="bibr" target="#b122">[122,</ref><ref type="bibr" target="#b123">123]</ref>, recurrent <ref type="bibr" target="#b124">[124]</ref>, autoencoder <ref type="bibr" target="#b125">[125,</ref><ref type="bibr" target="#b126">126]</ref>, GAN <ref type="bibr" target="#b127">[127,</ref><ref type="bibr" target="#b128">128,</ref><ref type="bibr" target="#b129">129]</ref>, and transformer <ref type="bibr" target="#b81">[81]</ref>. In the context of the research direction pursued in this paper, most interesting are those that also have connections to BBO: (i) Four ANN-based transfer learning approaches draw inspiration from the GAN paradigm. CoGAN trains two GANs to generate the source and target, respectively, achieves a domain invariant feature space by tying the high layers parameters of the two GANs, and performs domain adaptation by training a classifier on the discriminator output <ref type="bibr" target="#b130">[130]</ref>. 
Adversarial discriminative domain adaptation first learns a discriminative representation using the labels in the source domain and then, using a domain-adversarial loss, learns a separate encoding that maps the target data to the same space through an asymmetric mapping <ref type="bibr" target="#b127">[127]</ref>. Minimax-game-based selective transfer learning employs a selector and a discriminator to identify source domain data resembling the target domain's distribution, and to distinguish genuine target domain data from selected source domain data, respectively <ref type="bibr" target="#b129">[129]</ref>. Selective adversarial network addresses negative transfer by excluding outlier classes from the source domain selection and maximizing the similarity between source and target domain data distributions <ref type="bibr" target="#b128">[128]</ref>. (ii) An autoencoder for transfer learning, described in <ref type="bibr" target="#b125">[125,</ref><ref type="bibr" target="#b126">126]</ref>, incorporates embedding and label encoding layers. The embedding layer reduces the disparity between instance distributions from the source and target domains, while the label encoding layer utilizes a softmax regression model to encode label information from the source domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(iii)</head><p>The transformer OptFormer has demonstrated competitiveness with specific transfer learning methods, although its usage leans more toward metalearning than traditional transfer learning <ref type="bibr" target="#b81">[81]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Evaluation of RAF Ensembles</head><p>This section describes a small experimental contribution to one of the above surveyed possible research directions: RAF ensembles are experimentally evaluated as surrogate models for CMA-ES. The experiments were performed on the probably most commonly used platform for experimenting in continuous optimization -COCO ( Comparing Continuous Optimizers) <ref type="bibr" target="#b92">[92]</ref>. COCO contains severeal suites of benchmark functions, our evaluation was performed with the most traditional suite, which is the bbob suite <ref type="bibr" target="#b92">[92]</ref>. It consists of 24 dimension-scalable noiseless benchmark functions, the definitions of which have been given in <ref type="bibr" target="#b91">[91]</ref>. Each function is used in 15 differently rotated and/or translated instances. The employed benchmarks forming the bobo suite are surveyed in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Considered Variants of RAF Ensembles</head><p>As activation functions forming an RAF ensemble, we employed those included in the implementation <ref type="bibr" target="#b131">[131]</ref>, to which the RAF paper refers <ref type="bibr" target="#b76">[76]</ref>. They are listed in Appendix B. We used them in two variants of RAF ensembles:</p><p>1. An RAF ensemble of size 5 trained directly using the above mentioned implementation <ref type="bibr" target="#b131">[131]</ref>, and aggregated by the empirical mean. In the results, it will be denoted simply RAF. 2. An ensemble of size 5, in which the differences of values of the original black-box objective function with respect to its median are first transformed to their logarithms before using <ref type="bibr" target="#b131">[131]</ref> in the logarithmic scale to train the ensemble. This transformation attempts to deal with situations when the function returns in many points values close to the median. The aggregation function is again the empirical mean, which in terms of the data before the logarithmic transformation actually corresponds to the empirical geometric mean. That version will be in the results denoted RAF-log.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Considered CMA-ES Variants for Comparison</head><p>CMA-ES surrogate-assisted by the above mentioned two variants of RAF ensembles was compared with CMA-ES without surrogate modelling, as well as with two earlier surrogate-assisted variants of CMA-ES:</p><p>3. CMA-ES without surrogate modelling was used in an implementation that is in the COCO data archive <ref type="bibr" target="#b132">[132]</ref> called default-CMA-ES, and described as "default CMA-ES from the pycma module, version 3.3.0". Here, it will be in the results denoted simply default. 4. DTS-CMA-ES <ref type="bibr" target="#b11">[12]</ref>, using a surrogate GP with the covariance function Matérn 5  2 . In the results, it will be denoted simply DTS. <ref type="bibr" target="#b20">[20]</ref>, which will be in the results denoted simply lq.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">lq-CMA-ES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Evolution Control</head><p>Whereas DTS-CMA-ES and lq-CMA-ES have each their own evolution control, for the two variants of RAF ensembles was necessary to propose when to evaluate a given point 𝑥 by the original black-box objective function 𝐹 bb , and when by its surrogate model 𝐹 sm . We decided to use a modification of the lq-CMA-ES evolution control. That modification is described below in Algorithm 1 using the notation 𝜏 ((𝑦 1 , . . . , 𝑦 𝑘 ), (𝑧 1 , . . . , 𝑧 𝑘 )) for the Kendall correlation coefficient between the sequences (𝑦 1 , . . . , 𝑦 𝑘 ) and (𝑧 1 , . . . , 𝑧 𝑘 ), and the notation 𝜌 for the ranking function on R 𝑑 , i.e., 𝜌 : R 𝑑 → Π(𝑑) with Π(𝑑) denoting the set of permutations of {1, . . . , 𝑑}</p><formula xml:id="formula_0">such that ∀𝑦 ∈ R 𝑑 : (𝜌(𝑦)) 𝑖 &lt; (𝜌(𝑦)) 𝑗 ⇒ 𝑦 𝑖 ≤ 𝑦 𝑗 . (1)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Results</head><p>In Tables <ref type="table">2-3</ref>, the two considered variants of RAF ensembles, and three considered other CMA-ES variants, are compared based on the difference between the optimal value of the objective function, and its value achieved for a given evaluation budget. The achieved values were averaged over the 15 instances provided by the COCO benchmark suite in each dimension for each of the 24 noiseless functions listed in Appendix A. The comparisons were performed separately for each of the five above described groups of those functions, and subsequently also for all 24 of them, each time including the instances in dimensions 2, 3, 5, 10, and 20. For each evaluation budget, hence, six evaluations were</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Comparison of CMA-ES surrogate-assisted by RAF, and by RAF-log, with CMA-ES without surrogate modelling, with lq-CMA-ES, and with DTS-CMA-ES, for evaluation budget 3×dimension. Each cell of each sub-table records the number of function-dimension combinations, for which the method in the row achieved with the evaluation budget a lower value, averaged over the 15 COCO instances, than the method in the column. Ties within the considered precision are halved between both methods. If the Friedman test rejected the hypothesis of equivalence of all methods, and according to the subsequent Wilcoxon signed-rank test with Holm correction, the method in the row is significantly better than the method in the column, the number in the cell is in bold with * for the familywise level 5 %, and with ** for the familywise level 1 %. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Comparison of CMA-ES surrogate-assisted by RAF, and by RAF-log, with CMA-ES without surrogate modelling, with lq-CMA-ES, and with DTS-CMA-ES, for evaluation budget 50×dimension. Each cell of each sub-table records the number of function-dimension combinations, for which the method in the row achieved with the evaluation budget a lower value, averaged over the 15 COCO instances, than the method in the column. Ties within the considered precision are halved between both methods. If the Friedman test rejected the hypothesis of equivalence of all methods, and according to the subsequent Wilcoxon signed-rank test with Holm correction, the method in the row is significantly better than the method in the column, the number in the cell is in bold with * for the familywise level 5 %, and with ** for the familywise level 1 %.  <ref type="table">2</ref> were conducted for the evaluation budget 3×dimension, while the comparisons in Table <ref type="table">3</ref> were conducted for the evaluation budget 50×dimension.</p><p>The results of each of those 12 comparisons were subsequently assessed for statistical significance. First, the hypothesis that all five considered methods are equivalent was tested by the Friedman test. With the exception of both comparisons for multi-modal functions with adequate global structure, the test rejected that hypothesis on the familywise significance level 5%, using the Holm procedure for multiple-hypothesis correction <ref type="bibr" target="#b133">[133]</ref>. This rejection justified testing the equivalence of any two among the five methods. We adopted the arguments of <ref type="bibr" target="#b134">[134]</ref> that, in machine learning, the Wilcoxon signed-rank test is more appropriate for this purpose than the post-hoc tests presented in <ref type="bibr" target="#b135">[135]</ref> and <ref type="bibr" target="#b133">[133]</ref>. 
If, for two particular methods, the Wilcoxon signed-rank test rejected the hypothesis that they are equivalent, then in the respective table, their comparison in the row corresponding to the method that was more frequently better is shown in bold italics.</p><p>The results in Tables 2-3 primarily confirm the superior performance of the methods lq-CMA-ES and DTS-CMA-ES. In the two comparisons based on all 120 noiseless benchmark functions, each of them is, for both considered budgets, significantly better not only than default CMA-ES, but also than CMA-ES surrogate-assisted by the two variants of RAF ensembles. Moreover, among the 10 comparisons based on individual groups of functions, lq-CMA-ES is 6 times significantly better than default CMA-ES, and 7 times and 5 times significantly better than CMA-ES surrogate-assisted by RAF and by RAF-log, respectively. For DTS-CMA-ES, the results of the 10 comparisons based on individual groups of functions are less convincing: it is significantly better 3 times than default CMA-ES, 3 times than CMA-ES surrogate-assisted by RAF, and only once than CMA-ES assisted by RAF-log. As to a comparison between the two variants of RAF ensembles, the differences between them were not significant, apart from unimodal functions with high conditioning, for which CMA-ES achieves significantly better results when assisted by RAF than when assisted by RAF-log.</p><p>The different progress of optimization performed by each of the compared methods is illustrated, always in three particular dimensions, by means of optimization-progress plots. They show the average difference Δ 𝑓 between the optimal and achieved value of the objective function over the 15 COCO instances. For that illustration, we have chosen the functions 𝑓 9 (Figure <ref type="figure" target="#fig_0">1</ref>), 𝑓 18 (Figure <ref type="figure" target="#fig_1">2</ref>), and 𝑓 20 (Figure <ref type="figure" target="#fig_2">3</ref>). 
We can see that optimization using CMA-ES surrogate-assisted by RAF or RAF-log sometimes leads to a similarly fast decrease of the objective function as, or even a faster one than, optimization using the state-of-the-art methods DTS-CMA-ES or lq-CMA-ES. In Figure <ref type="figure" target="#fig_0">1</ref>, this is the case for RAF-log in dimension 2. In Figure <ref type="figure" target="#fig_1">2</ref>, dimension  </p></div>
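The Holm correction used for the pairwise Wilcoxon tests above is a simple step-down procedure and can be sketched in plain Python. This is an illustrative sketch of the standard procedure, not the code used to produce the tables.

```python
# Holm step-down adjustment of a family of p-values: sort them ascending,
# multiply the k-th smallest by (m - k), and enforce monotonicity so that a
# hypothesis can only be rejected if all those with smaller p-values are too.
def holm_adjust(p_values):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for k, i in enumerate(order):
        running_max = max(running_max, (m - k) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Three hypothetical pairwise p-values; a method pair is significantly
# different at the familywise level alpha iff its adjusted p-value < alpha.
adj = holm_adjust([0.01, 0.04, 0.03])
```

Comparing each adjusted p-value against 0.05 or 0.01 then yields the * and ** markers of the kind shown in Tables 2-3.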
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The paper was motivated by our opinion that the intense and successful development of artificial neural networks during the last 15 years suggests that they again have the potential to be important for active learning in surrogate-assisted BBO. It surveyed possible directions of research into that potential, including closely connected research into neural-network-based transfer learning for surrogate modelling. Moreover, it recalled the first published investigations in some of those directions, and added a new contribution to the emerging mosaic of those investigations. The fact that the main purpose of the experimental section of the paper is to contribute to the mosaic of emerging investigations should be epmhasized especially in context of the obtained experimental results. It justifiess that there is no significant difference between using CMA-ES surrogate-assisted by RAF ensembles and using it alone, as well as that results with RAF-ensemble-based surrogate models are significantly worse than results with the state-of-the-art surrogate-assisted CMA-ES variants, lq-CMA-ES, and DTS-CMA-ES. This is an obvious limitation not only of RAF ensembles, but of all above surveyed kinds of neural networks that have been so far investigated as surrogate models for CMA-ES. On the other hand, as the survey has shown, there are many more other possibilities for such investigations within future research.    </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Progress of optimization by the compared methods up to the budget 250×dimension for the benchmark function 𝑓 9 -Rosenbrock rotated. Each curve is the average of the 15 COCO instances of this function.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Progress of optimization by the compared methods up to the budget 250×dimension for the benchmark function 𝑓 18 -Schaffers F7 function, moderately ill-conditioned. Each curve is the average of the 15 COCO instances of this function.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Progress of optimization by the compared methods up to the budget 250×dimension for the benchmark function 𝑓 20 -Schwefel. Each curve is the average of the 15 COCO instances of this function.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Separable functions. From left to right: sphere, ellipsoidal, Rastrigin, Büche-Rastrigin, maximized linear slope.</figDesc><graphic coords="20,94.57,65.60,406.14,82.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Functions with low or moderate conditioning. From left to right: attractive sector, step ellipsoidal, Rosenbrock, Rosenbrock rotated.</figDesc><graphic coords="20,94.57,334.53,406.13,83.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Unimodal functions with high conditioning. From left to right: ellipsoidal, discus, bent cigar, sharp ridge, different powers.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Multi-modal functions with adequate global structure. From left to right: Rastrigin, Weierstrass, Schaffers F7 function, moderately ill-conditioned Schaffers F7 function, composite Griewank-Rosenbrock function F8F2.</figDesc><graphic coords="20,94.57,619.81,406.13,82.21" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>A high-level overview of kinds of ANNs that we consider worth a research with respect to surrogate modelling for BBO</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Evolution control used for RAF and RAF-log ensembles.Require: points 𝑥 1 , . . . , 𝑥 𝜆 ∈ R 𝑑 , in which the surrogate model 𝐹 sm trained on some archive 𝐴 has been evaluated; thus 𝜆 is the population size 1: Set 𝑘 = ⌊1 + max(0.02𝜆, 4)⌋; the number of 𝐹 bb evaluations 2: Set 𝑄 = {𝑥 𝑗 |(𝜌(𝐹 sm (𝑥 1 ), . . . , 𝐹 sm (𝑥 𝜆 ))) 𝑗 ≤ 𝑘}; points with the 𝑘 smallest 𝐹 sm values 3: In 𝑥 ∈ 𝑄 for which 𝐹 bb (𝑥) is not yet known, evaluate 𝐹 bb (𝑥) 4: Order the elements of 𝑄 as (𝑥 1 𝑄 , . . . , 𝑥 𝑘 𝑄 ) decreasingly with respect to their 𝐹 bb (𝑥) values 5: Set ℓ = max(1, ⌊𝑘 + 1 − max(15, 0.75𝜆)⌋); the lower index for computing 𝜏 between 𝐹 bb and 𝐹 sm 6: while 𝑘 &lt; 𝜆 &amp; 𝜏 ((𝐹 bb (𝑥 ℓ ), . . . , 𝐹 bb (𝑥 𝑘 )), (𝐹 sm (𝑥 ℓ ), . . . , 𝐹 sm (𝑥 𝑘 ))) &lt; 0.85 do 𝑄 ∪ {𝑥 𝑗 |(𝜌(𝐹 sm (𝑥 1 ), . . . , 𝐹 sm (𝑥 𝜆 ))) 𝑗 ≤ ⌈1.5𝑘⌉} 𝑄 for which 𝐹 bb (𝑥) is not yet known, evaluate 𝐹 bb (𝑥) Order the elements of 𝑄 as (𝑥 1 𝑄 , . . . , 𝑥 𝑘 𝑄 ) decreasingly with respect to their 𝐹 bb (𝑥) values 11: end while 12: Update 𝐴 = 𝐴 ∪ {𝑥 𝑗 |𝐹 bb (𝑥) has been evaluated in 𝑥 𝑗 } 13: if 𝑘 = 𝜆 then Return 𝐴 and {(𝑥 1 , 𝐹 bb (𝑥 1 )), . . . , (𝑥 𝜆 ), 𝐹 bb (𝑥 𝜆 ))} Return 𝐴 and {(𝑥 𝑖 , 𝐹 bb (𝑥 𝑖 ))|𝑥 𝑖 ∈ 𝑄} ∪ {(𝑥 𝑖 , 𝐹 sm (𝑥 𝑖 ))|𝑖 = 1, . . . , 𝜆, 𝑥 𝑖 ̸ ∈ 𝑄} 17: end if performed. The comparisons in Table</figDesc><table><row><cell cols="3">Martin Holeňa et al. 
CEUR Workshop Proceedings</cell><cell></cell><cell></cell><cell>47-67</cell></row><row><cell cols="4">Separable functions RAF-log DTS-CMA-ES 7 4 -6 Update 𝑘 = ⌈1.5𝑘⌉, ℓ = max(1, ⌊𝑘 + 1 − max(15, 0.75𝜆)⌋) RAF -RAF-log RAF 18 Algorithm 1 7: Update 𝑄 = 8: In 𝑥 ∈ 9:</cell><cell>lq-CMA-ES 1.5 0.5</cell><cell>default CMA-ES 11 12</cell></row><row><cell>DTS-CMA-ES 10:</cell><cell>21**</cell><cell>19</cell><cell>-</cell><cell>7.5</cell><cell>19</cell></row><row><cell>lq-CMA-ES</cell><cell>23.5**</cell><cell>24.5**</cell><cell>17.5</cell><cell>-</cell><cell>22.5**</cell></row><row><cell>default CMA-ES</cell><cell>14</cell><cell>13</cell><cell>6</cell><cell>2.5</cell><cell>-</cell></row><row><cell>14: RAF 15: else</cell><cell>RAF -</cell><cell cols="3">Functions with low or moderate conditioning RAF-log DTS-CMA-ES lq-CMA-ES 15.5 6 0.5</cell><cell>default CMA-ES 4.5</cell></row><row><cell>RAF-log 16:</cell><cell>4.5</cell><cell>-</cell><cell>6</cell><cell>0</cell><cell>2</cell></row><row><cell>DTS-CMA-ES</cell><cell>14</cell><cell>14</cell><cell>-</cell><cell>11.5</cell><cell>14</cell></row><row><cell>lq-CMA-ES</cell><cell>19.5**</cell><cell>20**</cell><cell>8.5</cell><cell>-</cell><cell>18.5*</cell></row><row><cell>default CMA-ES</cell><cell>15.5</cell><cell>18**</cell><cell>6</cell><cell>1.5</cell><cell>-</cell></row><row><cell></cell><cell></cell><cell cols="3">Unimodal functions with high conditioning</cell><cell></cell></row><row><cell></cell><cell>RAF</cell><cell>RAF-log</cell><cell>DTS-CMA-ES</cell><cell>lq-CMA-ES</cell><cell>default 
CMA-ES</cell></row><row><cell>RAF</cell><cell>-</cell><cell>21*</cell><cell>0</cell><cell>0</cell><cell>4</cell></row><row><cell>RAF-log</cell><cell>4</cell><cell>-</cell><cell>0</cell><cell>0</cell><cell>1</cell></row><row><cell>DTS-CMA-ES</cell><cell>25**</cell><cell>25**</cell><cell>-</cell><cell>7.5</cell><cell>25**</cell></row><row><cell>lq-CMA-ES</cell><cell>25**</cell><cell>25**</cell><cell>17.5*</cell><cell>-</cell><cell>25**</cell></row><row><cell>default CMA-ES</cell><cell>21</cell><cell>24**</cell><cell>0</cell><cell>0</cell><cell>-</cell></row><row><cell></cell><cell cols="4">Multi-modal functions with adequate global structure</cell><cell></cell></row><row><cell></cell><cell>RAF</cell><cell>RAF-log</cell><cell>DTS-CMA-ES</cell><cell>lq-CMA-ES</cell><cell>CMA-ES alone</cell></row><row><cell>RAF</cell><cell>-</cell><cell>15</cell><cell>13</cell><cell>10</cell><cell>11.5</cell></row><row><cell>RAF-log</cell><cell>10</cell><cell>-</cell><cell>11.5</cell><cell>9</cell><cell>12</cell></row><row><cell>DTS-CMA-ES</cell><cell>12</cell><cell>13.5</cell><cell>-</cell><cell>13.5</cell><cell>15</cell></row><row><cell>lq-CMA-ES</cell><cell>15</cell><cell>16</cell><cell>11.5</cell><cell>-</cell><cell>13</cell></row><row><cell>default CMA-ES</cell><cell>13.5</cell><cell>13</cell><cell>10</cell><cell>12</cell><cell>-</cell></row><row><cell></cell><cell cols="4">Multi-modal functions with weak global structure</cell><cell></cell></row><row><cell></cell><cell>RAF</cell><cell>RAF-log</cell><cell>DTS-CMA-ES</cell><cell>lq-CMA-ES</cell><cell>default 
CMA-ES</cell></row><row><cell>RAF</cell><cell>-</cell><cell>9</cell><cell>3.5</cell><cell>8</cell><cell>14</cell></row><row><cell>RAF-log</cell><cell>16</cell><cell>-</cell><cell>6.5</cell><cell>12</cell><cell>18.5</cell></row><row><cell>DTS-CMA-ES</cell><cell>21.5</cell><cell>18.5</cell><cell>-</cell><cell>20</cell><cell>23**</cell></row><row><cell>lq-CMA-ES</cell><cell>17</cell><cell>13</cell><cell>5</cell><cell>-</cell><cell>20*</cell></row><row><cell>default CMA-ES</cell><cell>11</cell><cell>6.5</cell><cell>2</cell><cell>5</cell><cell>-</cell></row><row><cell></cell><cell></cell><cell cols="2">All noiseless benchmark functions</cell><cell></cell><cell></cell></row><row><cell></cell><cell>RAF</cell><cell>RAF-log</cell><cell>DTS-CMA-ES</cell><cell>lq-CMA-ES</cell><cell>default CMA-ES</cell></row><row><cell>RAF</cell><cell>-</cell><cell>67.5</cell><cell>26.5</cell><cell>20</cell><cell>45</cell></row><row><cell>RAF-log</cell><cell>52.5</cell><cell>-</cell><cell>30</cell><cell>21.5</cell><cell>45.5</cell></row><row><cell>DTS-CMA-ES</cell><cell>93.5**</cell><cell>90**</cell><cell>-</cell><cell>60</cell><cell>96**</cell></row><row><cell>lq-CMA-ES</cell><cell>100**</cell><cell>98.5**</cell><cell>60</cell><cell>-</cell><cell>99**</cell></row><row><cell>default CMA-ES</cell><cell>75</cell><cell>74.5</cell><cell>24</cell><cell>21</cell><cell>-</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>3, CMA-ES surrogate-assisted by RAF reaches lower values of the objective function than any other of the compared methods, whereas in dimension 2, CMA-ES surrogate-assisted by any of RAF or RAF-log leads to a similarly fast decrease of 𝑓 18 as DTS-CMA-ES but slower than lq-CMA-ES. Finally, in Figure3, dimensions 3 and 5, CMA-ES surrogate-assisted by any of RAF or RAF-log leads to a similarly fast decrease of 𝑓 18 as lq-CMA-ES, but slower than DTS-CMA-ES.</figDesc><table><row><cell>Martin Holeňa et al. CEUR Workshop Proceedings</cell><cell>47-67</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Martin Holeňa et al. CEUR Workshop Proceedings 47-67</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgemengt</head><p>The research reported in this paper has been supported by the German Research Foundation (DFG) funded project 467401796, and by the Czech Technical University grant SGS 23/205/OHK3/3T/18. The authors are very grateful to Jaroslav Langer for his crucial contribution to the RAF experiments.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Employed Benchmarks</head><p>The functions in the bbob suite are divided into five groups:</p><p>1. Separable functions (Figure <ref type="figure">4</ref>).</p><p>• 𝑓 1 : sphere; • 𝑓 2 : ellipsoidal; • 𝑓 3 : Rastrigin; • 𝑓 4 : Büche-Rastrigin; • 𝑓 5 : linear slope.</p><p>2. Functions with low or moderate conditioning (Figure <ref type="figure">5</ref>).</p><p>• 𝑓 6 : attractive sector; • 𝑓 7 : step ellipsoidal; • 𝑓 8 : Rosenbrock; • 𝑓 8 : Rosenbrock rotated.</p><p>3. Unimodal functions with high conditioning (Figure <ref type="figure">6</ref>).</p><p>• 𝑓 10 : ellipsoidal; • 𝑓 11 : discus; • 𝑓 12 : bent cigar; 5. Multi-modal functions with weak global structure (Figure <ref type="figure">8</ref>).  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Activation Functions Employed to Form an RAF Ensemble</head><p>• Gauss error function</p><p>• Gaussian error linear unit</p><p>• Scaled exponential linear unit</p><p>where 𝑐, 𝛼 &gt; 0. In the employed Tensorflow implementation, 𝑐 = 1.05070098, 𝛼 = 1.67326324. </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Combinatorial Development of Solid Catalytic Materials</title>
		<author>
			<persName><forename type="first">M</forename><surname>Baerns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Design of High-Throughput Experiments, Data Analysis, Data Mining</title>
				<meeting><address><addrLine>London</addrLine></address></meeting>
		<imprint>
			<publisher>Imperial College Press / World Scientific</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A rigorous framework for optimization by surrogates</title>
		<author>
			<persName><forename type="first">A</forename><surname>Booker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dennis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Serafini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">V</forename></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Trosset</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Structural and Multidisciplinary Optimization</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="1" to="13" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Metamodeling techniques for evolutionary optimization of computaitonally expensive problems: Promises and limitations</title>
		<author>
			<persName><forename type="first">M</forename><surname>El-Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Keane</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Genetic and Evolutionary Computation Conference</title>
				<meeting>the Genetic and Evolutionary Computation Conference</meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="196" to="203" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Kriging as a surrogate fitness landscape in evolutionary optimization</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ratle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence for Engineering Design, Analysis and Manufacturing</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="37" to="49" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Metamodel-assisted evolution strategies</title>
		<author>
			<persName><forename type="first">M</forename><surname>Emmerich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Özdemir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bäck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Giannakoglou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PPSN, ACM</title>
				<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="361" to="370" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Leary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bhaskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Keane</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Global Optimization</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="39" to="58" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Surrogate-assisted evolutionary optimization frameworks for high-fidelity engineering design problems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Keane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Incorporation in Evolutionary Computation</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="307" to="331" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Methods for using surrogate modesl to speed up genetic algorithm oprimization: Informed operators and genetic engineering</title>
		<author>
			<persName><forename type="first">K</forename><surname>Rasheed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vattam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Incorporation in Evolutionary Computation</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="103" to="123" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A framework for evolutionary optimization with approximate fitness functions</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Olhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sendhoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Evolutionary Computation</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="481" to="494" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Accelerating evolutionary algorithms with Gaussian process fitness function models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Büche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schraudolph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Koumoutsakos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="183" to="194" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">CMA evolution strategy assisted by kriging model and approximate ranking</title>
		<author>
			<persName><forename type="first">C</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Radi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">El</forename><surname>Hami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="page" from="4288" to="4304" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Gaussian process surrogate models for the CMA evolution strategy</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bajer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Repický</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Evolutionary Computation</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="665" to="697" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Interaction between model and its evolution control in surrogate-assisted CMA evolution strategy</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hanuš</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Koza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tumpach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page">358</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>paper no</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Augmented Lagrangian, penalty techniques and surrogate modeling for constrained optimization with CMA-ES</title>
		<author>
			<persName><forename type="first">P</forename><surname>Dufossé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="519" to="527" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Adaptive ranking-based constraint handling for explicitly constrained black-box optimization</title>
		<author>
			<persName><forename type="first">N</forename><surname>Sakamoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Akimoto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Evolutionary Computation</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="503" to="529" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A mono surrogate for multiobjective optimization</title>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schoenauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sebag</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="471" to="478" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Guiding surrogate-assisted multi-objective optimisation with decision maker preferences</title>
		<author>
			<persName><forename type="first">F</forename><surname>Gibson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Everson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fieldsend</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="786" to="795" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Local metamodels for optimization using evolution strategies</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Koumoutsakos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PPSN</title>
		<imprint>
			<biblScope unit="page" from="939" to="948" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Benchmarking the local metamodel CMA-ES on the noiseless BBOB&apos;2013 test bed</title>
		<author>
			<persName><forename type="first">A</forename><surname>Auger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Brockhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="1225" to="1232" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">Martin</forename><surname>Holeňa</surname></persName>
		</author>
		<title level="m">CEUR Workshop Proceedings</title>
				<imprint>
			<biblScope unit="page" from="47" to="67" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A global surrogate assisted CMA-ES</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="664" to="672" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">An adaptive model selection strategy for surrogate-assisted particle swarm optimization algorithm</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE SSCI</title>
		<imprint>
			<biblScope unit="page" from="1" to="8" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Committee-based active learning for surrogate-assisted particle swarm optimization of expensive problems</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Doherty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Cybernetics</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="2664" to="2677" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Structural optimization using evolution strategies and neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Papadrakakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lagaros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tsompanakis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Methods in Applied Mechanics and Engineering</title>
		<imprint>
			<biblScope unit="volume">156</biblScope>
			<biblScope unit="page" from="309" to="333" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Model-assisted steady state evolution strategies</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ulmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Streichert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="610" to="621" />
			<date type="published" when="2003">2003</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A study on metamodeling techniques, ensembles, and multi-surrogates in evolutionary computation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sendhoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="1288" to="1295" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Surrogate model for continuous and discrete genetic optimization based on RBF networks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bajer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Intelligent Data Engineering and Automated Learning</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="251" to="258" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Gaussian process assisted coevolutionary estimation of distribution algorithm for computationally expensive problems</title>
		<author>
			<persName><forename type="first">L</forename><surname>Na</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Central South University of Technology</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="443" to="452" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Investigating uncertainty propagation in surrogate-assisted evolutionary algorithms</title>
		<author>
			<persName><forename type="first">V</forename><surname>Volz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rudolph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Naujoks</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="881" to="888" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Simple surrogate model assisted optimization with covariance matrix adaptation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Toal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Arnold</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PPSN</title>
		<imprint>
			<biblScope unit="page" from="184" to="197" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Elite-driven surrogate assisted CMA-ES algorithm by improved lower confidence bound method</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00366-022-01642-5</idno>
	</analytic>
	<monogr>
		<title level="j">Engineering with Computers</title>
		<imprint>
			<biblScope unit="page">10</biblScope>
			<date type="published" when="2022">2022</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Combining global and local surrogate models to accelerate evolutionary optimization</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Keane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man and Cybernetics. Part C: Applications and Reviews</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="66" to="76" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Automatic surrogate modelling technique selection based on features of optimization problems</title>
		<author>
			<persName><forename type="first">B</forename><surname>Saini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>López-Ibáñez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Miettinen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="1765" to="1772" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Per instance algorithm configuration of CMA-ES with limited budget</title>
		<author>
			<persName><forename type="first">N</forename><surname>Belkhir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dréo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Savéant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schoenauer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="681" to="688" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Boosted regression forest for the doubly trained surrogate covariance matrix adaptation evolution strategy</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Repický</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ITAT</title>
		<imprint>
			<biblScope unit="page" from="72" to="79" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Ordinal regression in evolutionary computation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Runarsson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PPSN</title>
		<imprint>
			<biblScope unit="page" from="1048" to="1057" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Comparison-based optimizers need comparison-based surrogates</title>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schoenauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sebag</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PPSN</title>
		<imprint>
			<biblScope unit="page" from="364" to="373" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Adaptive function value warping for surrogate model assisted evolutionary optimization</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abbasnejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Arnold</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PPSN</title>
		<imprint>
			<biblScope unit="page" from="76" to="89" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Automatic surrogate model type selection during the optimization of expensive black-box problems</title>
		<author>
			<persName><forename type="first">I</forename><surname>Couckuyt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gorissen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Winter Simulation Conference</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="4285" to="4293" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Multi-surrogate-based global optimization using a score-based infill criterion</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Structural and Multidisciplinary Optimization</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="485" to="506" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Completely derandomized self-adaptation in evolution strategies</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ostermeier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Evolutionary Computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="159" to="195" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">The CMA evolution strategy: A comparing review</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Towards a New Evolutionary Computation</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="75" to="102" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Network on chip optimization based on surrogate model assisted evolutionary algorithms</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Karkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yakovlev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gielen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Grout</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE CEC</title>
		<imprint>
			<biblScope unit="page" from="3266" to="3271" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Efficient global optimization of expensive black-box functions</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schonlau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Welch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Global Optimization</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="455" to="492" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Knowles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Evolutionary Computation</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="50" to="66" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">TREGO: a trust-region framework for efficient global optimization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Diouane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Picheny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Le Riche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Scotto Di Perrotolo</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10898-022-01245-w</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Global Optimization</title>
		<imprint>
			<biblScope unit="volume">85</biblScope>
			<biblScope unit="page">10</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">Making EGO and CMA-ES complementary for global optimization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Mohammadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Le Riche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Touboul</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Learning and Intelligent Optimization</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="287" to="292" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">Doubly trained evolution control for the surrogate CMA-ES</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bajer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PPSN</title>
		<imprint>
			<biblScope unit="page" from="59" to="68" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Black box algorithm selection by convolutional neural network</title>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yuen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LOD</title>
		<imprint>
			<biblScope unit="page" from="264" to="280" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Automated parameter choice with exploratory landscape analysis and machine learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pikalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mironovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="1982" to="1985" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Towards feature-free automated algorithm selection for single-objective continuous black box optimization</title>
		<author>
			<persName><forename type="first">R</forename><surname>Prager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seiler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Trautmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kerschke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE SSCI</title>
		<imprint>
			<biblScope unit="page" from="1" to="8" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">Knowledge-based selection of gaussian process surrogates</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bajer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECML Workshop IAL</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="48" to="63" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<analytic>
		<title level="a" type="main">Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Repický</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">GECCO, ACM</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="691" to="699" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<analytic>
		<title level="a" type="main">A collection of deep learning-based feature-free approaches for characterizing single-objective continuous fitness landscapes</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Seiler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Prager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kerschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Trautmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="657" to="665" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<analytic>
		<title level="a" type="main">The impact of hyper-parameter tuning for landscape-aware performance regression and algorithm selection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jankovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Popovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Eftimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Doerr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">GECCO</title>
		<imprint>
			<biblScope unit="page" from="687" to="696" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Manifold Gaussian processes for regression</title>
		<author>
			<persName><forename type="first">R</forename><surname>Calandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rasmussen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Deisenroth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJCNN</title>
		<imprint>
			<biblScope unit="page" from="3338" to="3345" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<analytic>
		<title level="a" type="main">Deep kernel learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICAIS</title>
		<imprint>
			<biblScope unit="page" from="370" to="378" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">Combining Gaussian processes and neural networks in surrogate modeling for covariance matrix adaptation evolution strategy</title>
		<author>
			<persName><forename type="first">J</forename><surname>Koza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tumpach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IAL Workshop, ECML PKDD</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">Combining Gaussian processes with neural networks for active learning in optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Růžička</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Koza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tumpach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECML Workshop IAL</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="105" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">Doubly stochastic variational inference for deep Gaussian processes</title>
		<author>
			<persName><forename type="first">H</forename><surname>Salimbeni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Deisenroth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">Deep convolutional Gaussian processes</title>
		<author>
			<persName><forename type="first">K</forename><surname>Blomqvist</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kaski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heinonen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="582" to="597" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">Deep Gaussian processes using expectation propagation and Monte Carlo methods</title>
		<author>
			<persName><forename type="first">G</forename><surname>Hernández-Muñoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Villacampa-Calvo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hernández-Lobato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECML PKDD</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="479" to="494" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b62">
	<analytic>
		<title level="a" type="main">Deep Gaussian process emulation using stochastic imputation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ming</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Williamson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Guillas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Technometrics</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page" from="150" to="161" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b63">
	<analytic>
		<title level="a" type="main">Active learning for deep Gaussian process surrogates</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gramacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Higdon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Technometrics</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page" from="4" to="18" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b64">
	<analytic>
		<title level="a" type="main">Neural tangent kernel: Convergence and generalization in neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jacot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hongler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<analytic>
		<title level="a" type="main">Neural tangents: Fast and easy infinite neural networks in python</title>
		<author>
			<persName><forename type="first">R</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="19" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<analytic>
		<title level="a" type="main">Predictive uncertainty estimation via prior networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="17" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b67">
	<analytic>
		<title level="a" type="main">Uncertainty on asynchronous time event prediction</title>
		<author>
			<persName><forename type="first">M</forename><surname>Biloš</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Charpentier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Günnemann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b68">
	<analytic>
		<title level="a" type="main">Reverse KL-divergence training of prior networks: Improved uncertainty and adversarial robustness</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b69">
	<analytic>
		<title level="a" type="main">Towards maximizing the representation gap between in-domain and out-of-distribution examples</title>
		<author>
			<persName><forename type="first">J</forename><surname>Nandy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b70">
	<analytic>
		<title level="a" type="main">Uncertainty aware semi-supervised learning on graph data</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b71">
	<analytic>
		<title level="a" type="main">Neural-network-based estimation of normal distributions in black-box optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tumpach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Koza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ESANN</title>
		<imprint>
			<biblScope unit="page" from="1" to="6" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b72">
	<analytic>
		<title level="a" type="main">Parallel approach for ensemble learning with locally coupled neural networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Valle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Saravia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Allende</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Monge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fernández</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Processing Letters</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="277" to="291" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b73">
	<analytic>
		<title level="a" type="main">Simple and scalable predictive uncertainty estimation using deep ensembles</title>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pritzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Blundell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="12" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b74">
	<analytic>
		<title level="a" type="main">The MBPEP: A deep ensemble pruning algorithm providing high quality uncertainty prediction</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="2942" to="2955" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b75">
	<analytic>
		<title level="a" type="main">Uncertainty in neural networks: Approximately Bayesian ensembling</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pearce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Leibfried</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Brintrup</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neely</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AISTATS</title>
		<imprint>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b76">
	<analytic>
		<title level="a" type="main">Toward robust uncertainty estimation with random activation functions</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Stoyanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tavakol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="13" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b77">
	<analytic>
		<title level="a" type="main">Deep learning for Bayesian optimization of scientific problems with high-dimensional structure</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Loh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note>openreview tPMQ6Je2rB</note>
</biblStruct>

<biblStruct xml:id="b78">
	<analytic>
		<title level="a" type="main">Sample-efficient optimization in the latent space of deep generative models via weighted retraining</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tripp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Daxberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Lobato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="14" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b79">
	<analytic>
		<title level="a" type="main">A GAN based solver of black-box inverse problems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gillhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ramsauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brandstetter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schäfl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="5" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b80">
	<monogr>
		<title level="m" type="main">OPT-GAN: A broad-spectrum global optimizer for black-box problems by learning distribution</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<idno>arXiv:2102.03888v5</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b81">
	<analytic>
		<title level="a" type="main">Towards learning universal hyperparameter optimizers with transformers</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b82">
	<monogr>
		<title level="m" type="main">PFNs4BO: In-context learning for Bayesian optimization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Feurer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hollmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>ICML</publisher>
			<biblScope unit="page" from="1" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b83">
	<monogr>
		<title level="m" type="main">Gaussian Processes for Machine Learning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Rasmussen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Williams</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b84">
	<analytic>
		<title level="a" type="main">Deep Gaussian processes</title>
		<author>
			<persName><forename type="first">A</forename><surname>Damianou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lawrence</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AISTATS</title>
		<imprint>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b85">
	<monogr>
		<title level="m" type="main">Deep Gaussian processes for regression using approximate expectation propagation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hernandez-Lobato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernandez-Lobato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Turner</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>ICML</publisher>
			<biblScope unit="page" from="1472" to="1481" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b86">
	<monogr>
		<title level="m" type="main">Random feature expansions for deep Gaussian processes</title>
		<author>
			<persName><forename type="first">K</forename><surname>Cutajar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bonilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Michiardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Filippone</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>ICML</publisher>
			<biblScope unit="page" from="884" to="893" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b87">
	<analytic>
		<title level="a" type="main">Efficient global optimization using deep Gaussian processes</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hebbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Brevault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Balesdent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Talbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Melab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE CEC</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="12" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b88">
	<analytic>
		<title level="a" type="main">Gaussian process behaviour in wide deep neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Matthews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rowland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Turner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="15" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b89">
	<monogr>
		<title level="m" type="main">Bayesian optimization using deep Gaussian processes</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hebbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Brevault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Balesdent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Talbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Melab</surname></persName>
		</author>
		<idno>arXiv:1905.03350v1</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b90">
	<analytic>
		<title level="a" type="main">Convolutional normalizing flows for deep Gaussian processes</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Low</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Jaillet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IJCNN</title>
		<imprint>
			<biblScope unit="page" from="1" to="5" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b91">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Finck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Auger</surname></persName>
		</author>
		<title level="m">Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions</title>
				<meeting><address><addrLine>Paris Saclay</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
		<respStmt>
			<orgName>INRIA</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b92">
	<analytic>
		<title level="a" type="main">COCO: a platform for comparing continuous optimizers in a black box setting</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Auger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Mersmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tušar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Brockhoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Optimization Methods and Software</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="114" to="144" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b93">
	<analytic>
		<author>
			<persName><forename type="first">J</forename><surname>Koza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tumpach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Using past experience for configuration of Gaussian processes in black-box optimization</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="167" to="182" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b94">
	<analytic>
		<title level="a" type="main">Deep neural networks as Gaussian processes</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bahri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schoenholz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="17" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b95">
	<analytic>
		<title level="a" type="main">Bayesian deep convolutional networks with many channels are Gaussian processes</title>
		<author>
			<persName><forename type="first">R</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bahri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="35" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b96">
	<analytic>
		<title level="a" type="main">Bayesian deep ensembles via the neural tangent kernel</title>
		<author>
			<persName><forename type="first">B</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Teh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="13" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b97">
	<analytic>
		<title level="a" type="main">Be greedy -- a simple algorithm for blackbox optimization using neural networks</title>
		<author>
			<persName><forename type="first">B</forename><surname>Paria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Póczos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ravikumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Suggala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICML Workshop on Adaptive Experimental Design and Active Learning in the Real World</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b98">
	<monogr>
		<title level="m" type="main">Regression prior networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chervontsev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Provilkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
		<idno>ArXiv: 2006.11590v2</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b99">
	<monogr>
		<title level="m" type="main">Deep evidential regression</title>
		<author>
			<persName><forename type="first">A</forename><surname>Amini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Schwarting</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Soleimany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rus</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>NeurIPS</publisher>
			<biblScope unit="page" from="1" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b100">
	<analytic>
		<title level="a" type="main">Evidential deep learning to quantify classification uncertainty</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sensoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kandemir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="11" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b101">
	<analytic>
		<title level="a" type="main">Improving evidential deep learning via multi-task learning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b102">
	<analytic>
		<title level="a" type="main">An evidential classifier based on Dempster-Shafer theory and deep learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Denoeux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">450</biblScope>
			<biblScope unit="page" from="275" to="293" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b103">
	<monogr>
		<title level="m" type="main">A survey on evidential deep learning for single-pass uncertainty estimation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ulmer</surname></persName>
		</author>
		<idno>ArXiv: 2110.03051v2</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b104">
	<monogr>
		<title level="m" type="main">A Mathematical Theory of Evidence</title>
		<author>
			<persName><forename type="first">G</forename><surname>Shafer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1976">1976</date>
			<publisher>Princeton University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b105">
	<analytic>
		<title level="a" type="main">Causal discovery based on neural network ensemble method</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Software</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1479" to="1484" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b106">
	<analytic>
		<title level="a" type="main">Wrapper approach for learning neural network ensemble by feature selection</title>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Networks -ISNN 2005</title>
				<imprint>
			<publisher>Springer</publisher>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="526" to="531" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b107">
	<analytic>
		<title level="a" type="main">Network generalization differences quantified</title>
		<author>
			<persName><forename type="first">D</forename><surname>Partridge</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="263" to="271" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b108">
	<analytic>
		<title level="a" type="main">Observational learning algorithm for an ensemble of neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Analysis and Applications</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="154" to="167" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b109">
	<analytic>
		<title level="a" type="main">An active learning approach for neural network ensemble</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Computer Research and Development</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="375" to="380" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b110">
	<analytic>
		<title level="a" type="main">A constructive algorithm for training cooperative neural network ensembles</title>
		<author>
			<persName><forename type="first">M</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Murase</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="820" to="834" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b111">
	<analytic>
		<title level="a" type="main">Fast decorrelated neural network ensembles with random weights</title>
		<author>
			<persName><forename type="first">M</forename><surname>Alhamdoosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">264</biblScope>
			<biblScope unit="page" from="104" to="117" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b112">
	<analytic>
		<title level="a" type="main">A novel decorrelated neural network ensemble algorithm for face recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge Based Systems</title>
		<imprint>
			<biblScope unit="volume">89</biblScope>
			<biblScope unit="page" from="541" to="552" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b113">
	<analytic>
		<title level="a" type="main">Feature selection based neural network ensemble method</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Fudan University (Natural Sciences)</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page" from="685" to="688" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b114">
	<analytic>
		<title level="a" type="main">Freeway incident detection based on Adaboost RBF neural network</title>
		<author>
			<persName><forename type="first">T</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Engineering and Applications</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="223" to="225" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b115">
	<analytic>
		<title level="a" type="main">AdaBoost based ensemble of neural networks in analog circuit fault diagnosis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Chinese Journal of Scientific Instrument</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="851" to="856" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b116">
	<analytic>
		<title level="a" type="main">Pitfalls of in-domain uncertainty estimation and ensembling in deep learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ashukha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lyzhov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Molchanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vetrov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b117">
	<analytic>
		<title level="a" type="main">Deep echo state networks with uncertainty quantification for spatiotemporal forecasting</title>
		<author>
			<persName><forename type="first">P</forename><surname>McDermott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wikle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Environmetrics</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page">e2553</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>article no. e2553</note>
</biblStruct>

<biblStruct xml:id="b118">
	<analytic>
		<title level="a" type="main">Google Vizier: A service for black-box optimization</title>
		<author>
			<persName><forename type="first">D</forename><surname>Golovin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Solnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Moitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kochanski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Karro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Discovery and Data Mining</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1487" to="1496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b119">
	<analytic>
		<title level="a" type="main">How transferable are features in deep neural networks?</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lipson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b120">
	<analytic>
		<title level="a" type="main">Simultaneous deep transfer across domains and tasks</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tzeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICCV</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4068" to="4076" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b121">
	<monogr>
		<title level="m" type="main">Domain separation networks</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bousmalis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trigeorgis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Silberman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Krishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>NeurIPS</publisher>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b122">
	<analytic>
		<title level="a" type="main">Learning and transferring mid-level image representations using convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Oquab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sivic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1717" to="1724" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b123">
	<monogr>
		<title level="m" type="main">Deep transfer learning with joint adaptation networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>ICML</publisher>
			<biblScope unit="page" from="3470" to="3479" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b124">
	<analytic>
		<title level="a" type="main">Transfer learning for sequences via learning to collocate</title>
		<author>
			<persName><forename type="first">W</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICLR</title>
		<imprint>
			<biblScope unit="page" from="1487" to="1496" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b125">
	<analytic>
		<title level="a" type="main">Supervised representation learning: Transfer learning with deep autoencoders</title>
		<author>
			<persName><forename type="first">F</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4119" to="4125" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b126">
	<analytic>
		<title level="a" type="main">Supervised representation learning with double encoding-layer autoencoder for transfer learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="1" to="17" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b127">
	<analytic>
		<title level="a" type="main">Adversarial discriminative domain adaptation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Tzeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b128">
	<analytic>
		<title level="a" type="main">Partial transfer learning with selective adversarial networks</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2724" to="2732" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b129">
	<analytic>
		<title level="a" type="main">A minimax game for instance based selective transfer learning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">KDD</title>
		<imprint>
			<biblScope unit="page" from="34" to="43" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b130">
	<analytic>
		<title level="a" type="main">Coupled generative adversarial networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<imprint>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b131">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Stoyanova</surname></persName>
		</author>
		<ptr target="https://github.com/YanasGH/RAFs" />
		<title level="m">YanasGH/RAFs</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b132">
	<monogr>
		<title level="m" type="main">Algorithm data sets for the bbob test suite</title>
		<author>
			<orgName>COCO Data Archive</orgName>
		</author>
		<ptr target="https://numbbo.github.io/data-archive/bbob/" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b133">
	<analytic>
		<title level="a" type="main">An extension on &quot;Statistical Comparisons of Classifiers over Multiple Data Sets&quot; for all pairwise comparisons</title>
		<author>
			<persName><forename type="first">S</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Herrera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2677" to="2694" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b134">
	<analytic>
		<title level="a" type="main">Should we really use post-hoc tests based on mean-ranks?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Benavoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Mangili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b135">
	<analytic>
		<title level="a" type="main">Statistical comparisons of classifiers over multiple data sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Demšar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
