Assessment of Surrogate Model Settings Using Landscape Analysis

Mikuláš Dvořák^1, Zbyněk Pitra^{2,3}, Martin Holeňa^3

^1 Faculty of Information Technology, Czech Technical University, Thákurova 7, Prague, Czech Republic
^2 Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Trojanova 13, Prague, Czech Republic
^3 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: This work in progress concerns the assessment of surrogate model settings for expensive black-box optimization. The assessment is performed in the context of Gaussian process models used in the Doubly Trained Surrogate (DTS) variant of the state-of-the-art black-box optimizer, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). This work focuses on the connection between the predictive accuracy of a Gaussian process surrogate model and an essential model hyper-parameter, the covariance function. The performance of DTS-CMA-ES is related to the results of landscape analysis of the objective function. To this end, various classification and regression methods are used within the traditional framework for algorithm selection proposed by Rice. Several single-label classification, multi-label classification, and regression methods are experimentally evaluated on data from DTS-CMA-ES runs on the noiseless benchmark functions from the COCO platform for comparing continuous optimizers in black-box settings.

1 Introduction

Optimization is a field of mathematics that has been studied for centuries. Many problems can be reduced to the problem of finding the global optima of a function. Gradient descent methods or analytical solutions are often used to solve these problems.

Expensive black-box optimization addresses optimization problems in situations when a mathematical definition of the optimized objective is unknown and its evaluation costs valuable resources such as money or time. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES [4]) is a stochastic method suitable for the optimization of black-box functions. A surrogate model is a regression model that can be used to approximate the unknown black-box function. Instead of evaluating the black-box function in every search point, the surrogate model is used to decrease the number of expensive evaluations based on already evaluated points. However, the combination of the CMA-ES with a surrogate model presents new challenges in tuning surrogate models to make the optimization more effective. Finally, fitness landscape analysis (FLA) is a technique that tries to characterize the structure of a fitness landscape with measurable features. As these features describe the structure of a fitness function, they could provide information based on which the most suitable surrogate model could be obtained.

This paper addresses the problem of how to select the most convenient surrogate model, in the context of various metrics quantifying the quality of the considered surrogate model, in every generation of the Doubly Trained Surrogate Covariance Matrix Adaptation Evolution Strategy (DTS-CMA-ES [14]). Later, such a metric can be used, together with a set of features from fitness landscape analysis, to train a classification model that selects the surrogate model for any black-box function. This idea is depicted in Figure 1. An accurate model selection method could be used for the DTS-CMA-ES algorithm and could potentially speed up the optimization process. This work might provide valuable insight for such a goal.

To select a surrogate model, various classification strategies can be used, and by assessing their performance the most suitable classification model can later be utilized. The selection is described in the context of a framework for algorithm selection proposed by Rice in [18]. We have started from the research in [15], where the authors used a classification tree for the selection mapping. However, the accuracy of the classification tree was not very satisfactory. Therefore, we test more classification models.

This paper is structured as follows. In Section 2, an introduction to surrogate models for surrogate-assisted CMA-ES is presented.
Section 3 discusses the design for algorithm selection utilizing fitness landscape analysis and Rice's framework. Finally, in Section 4, various classification and regression methods for selecting the most convenient surrogate model for DTS-CMA-ES are shown.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: An artificial illustration of the relation between fitness landscape analysis and surrogate modeling. The top left graph represents the data that are modeled by the surrogate model. The top right graph shows how these data could be represented in the space described by features derived from the fitness landscape analysis; the decision boundary represents a trained classification model that should choose the more accurate surrogate model in the fitness landscape analysis space. The bottom graphs show that two surrogate models can model the space in different ways and hence one might be more accurate than the other.

2 Surrogate Models in the Context of CMA Evolution Strategy

The CMA-ES is an algorithm for numerical black-box optimization. The algorithm can be simplified into a repetition of the following three steps:

(1) sample a new population of size λ from a multivariate normal distribution N(m, Σ),

(2) select the µ best offspring from the sampled population based on their respective function values,

(3) update the parameters m and Σ of the multivariate distribution with respect to the selected µ offspring.

In step (2), all λ offspring need to be evaluated in order to select the best µ offspring. A surrogate model that approximates the underlying black-box function can be used to decrease the number of needed expensive evaluations. Surrogate modeling is a technique based on building regression models of the original function using the already evaluated data points. This technique originated from response surface modeling, where the regression models are usually simple polynomial models. Response surface modeling was introduced by George E. P. Box and K. B. Wilson in 1951 [2].
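To make the role of the surrogate model in steps (1)-(3) concrete, the following minimal sketch (Python/NumPy) shows one generation in which a surrogate, trained on the archive of already evaluated points, pre-screens the sampled offspring so that only a few of them are evaluated by the expensive function. The helpers train_surrogate and expensive_f are hypothetical placeholders, and the pre-screening rule is a deliberate simplification of the doubly trained evolution control described in [14].

import numpy as np

def surrogate_assisted_generation(m, Sigma, archive_X, archive_y,
                                  expensive_f, train_surrogate,
                                  lam=12, mu=6, n_orig=3):
    """One simplified generation of a surrogate-assisted CMA-ES."""
    # (1) sample a new population of size lam from N(m, Sigma)
    pop = np.random.multivariate_normal(m, Sigma, size=lam)

    # build a regression model of the objective from the archive of
    # already evaluated points (the surrogate model)
    model = train_surrogate(archive_X, archive_y)   # hypothetical helper
    y = np.asarray(model.predict(pop), dtype=float)

    # evaluate only the n_orig most promising offspring with the expensive
    # black-box function, keep surrogate predictions for the remaining ones
    promising = np.argsort(y)[:n_orig]
    y[promising] = [expensive_f(x) for x in pop[promising]]

    # (2) select the mu best offspring based on their (partly predicted) values
    best = np.argsort(y)[:mu]
    # (3) the CMA-ES update of m and Sigma with respect to pop[best] follows here
    return pop[best], y[best]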
B. tion. Wilson in 1951 [2]. Due to a joint Gaussian distribution, Gaussian process DTS-CMA-ES is a version of the CMA-ES algorithm is described by its mean and covariance function. The utilizing surrogate models. This algorithm uses regres- mean function m(xx) and the covariance function κ(xx, x 0 ) sion models such as Gaussian process for their capability of a random variable g(xx) assigned to point x are defined √ ! as 5 2 0 5a κMat (xx, x ) = 1+a+ exp (−a) , where m(xx) = E [g(xx)] , 3` (1) √ κ(xx, x 0 ) = E (g(xx) − m(xx))(g(xx0 ) − m(xx0 ))T 5kxx − x 0 k   a= . (8) ` and the fact that m and κ define, respectively, the mean and covariance of the variables g(xx) forming the Gaussian These kernels are from the Matérn class [10]. process is sometimes denoted as Another kernel was introduced by Gibbs in [3]: 1/2 g(xx) ∼ GP(m(xx), κ(xx, x0 )). (2) D  2`i (xx)`i (xx0 ) κGibbs (xx, x0 ) = ∏ i=1 `2i (xx) + `2i (xx0 ) The posterior distribution can be inferred with rules ! (9) for conditioning Gaussians as the Gaussian distribution (xi − xi0 )2 D exp − ∑ 2 , N (µµ ∗ , Σ ∗ ), where x) + `2i (xx0 ) i=1 `i (x T µ ∗ = µ (X X ∗ ) + K ∗ K −1 ( f − µ (X X )), where `i is a positive function which can be different for T (3) each i and D is the dimension of the vector x . Making the Σ ∗ = K ∗∗ − K ∗ K −1 K ∗ , hyper-parameter ` configurable in every dimension makes this kernel more flexible. where f is a vector of measured responses, X is a matrix Also a neural network can be used as a kernel for GP. with inputs of known responses, X ∗ is a matrix with inputs How to derive the following neural network kernel is dis- of unknown responses, K i j = κ(xxi , x j ), K ∗i j = κ(xxi , x ∗j ) and cussed in [17]. K ∗∗ x∗i , x ∗j ) for the considered covariance function κ. i j = κ(x The covariance function κ must be a symmetric func- κNN (xx, x 0 ) = tion of two vector inputs, and the matrix K by means of ! κ as described above must be for any number of points x i 2 2x̃xT Σ x̃x0 (10) arcsin p , positive semidefinite; each such function is called a kernel. π (1 + 2x̃xT Σ x̃x0 )(1 + 2x̃x0T Σ x̃x0 ) The kernel defining a covariance function is as a hyper- parameter of a Gaussian process. There is a large variety where x̃x is x with an added bias component such that of available kernels, in this work we consider the follow- x̃x = (1, x1 , . . . , xD )T and Σ denotes a corresponding bias ing ones. component. Polynomial kernels are defined as follows: A new kernel can be also created using addition. For instance the addition of a SE kernel to a Q kernel results κ(xx, x 0 ) = (xxT x 0 + σ02 ) p , (4) in a new kernel defined as follows: kxx − x 0 k2   where p ∈ N and σ0 is a constant term (bias). κSE+Q (xx, x 0 ) = σ 2 exp − For p = 1 the kernel is called linear (LIN) and for p = 2 2`2 (11) the kernel is quadratic (Q). T 0 +(xx x + σ0 ) .2 2 A Squared exponential kernel (SE) is defined as kxx − x 0 k2   3 Methodology κSE (xx, x 0 ) = σ 2 exp − , (5) 2`2 To clarify characteristics of the data used to train a sur- where ` is a characteristic length-scale, a hyper-parameter rogate model in the DTS-CMA-ES algorithm, we use the determining the relationship between the distance of vec- explanation from [15]: For each generation g of the DTS- tors in the input space and correlations in the output space. CMA-ES algorithm, a set of surrogate models M are A Rational quadratic kernel (RQ) can be viewed as a trained on a training set T . The training set T is a subset generalization of the SE kernel. 
3 Methodology

To clarify the characteristics of the data used to train a surrogate model in the DTS-CMA-ES algorithm, we use the explanation from [15]: For each generation g of the DTS-CMA-ES algorithm, a set of surrogate models M is trained on a training set T. The training set T is a subset of the archive A of all evaluated data points. Afterwards, a surrogate model M ∈ M is utilized to select a new population P. The question is how to select the most convenient surrogate model using the sets A, T, and P.

3.1 Framework for Model Setting Selection

One way to describe the surrogate model selection problem is to use the framework for algorithm selection proposed by Rice in [18]. This framework is designed with five main components, which can be briefly explained for the problem of surrogate model selection as follows:

Data space is a space of possible problems. In this case, the data space contains the sets of data points that are present in the DTS-CMA-ES runs.

Algorithm space is a space of possible surrogate models to solve a problem from the data space.

Feature space is a space of possible characterizations of the data space. We use feature sets from FLA and CMA-ES features described in Subsection 3.2.

Performance space is a space describing the performance of a particular algorithm for a particular problem.

Selection mapping is a function that gives a surrogate model M, for a particular vector of features φ, such that it minimizes the model error ε.

The following diagram [13, 18] (Figure 2) illustrates the main parts of this framework and their relations. The goal is to train a classifier, represented by the selection mapping, which could later be utilized to select the best covariance function of a Gaussian process for given data.

Figure 2: Modified Rice's framework for surrogate model selection in DTS-CMA-ES.

Data space. In the DTS-CMA-ES, three sets of data points are used. The first one is an archive A containing all f-evaluated data points {(x_i, f(x_i)) | i = 1, ..., n}, where n is the number of f-evaluated points. The second one is the training set T containing f-evaluated data points which are a subset of A and are utilized for fitting a surrogate model in DTS-CMA-ES. The training set is selected to contain data points that are near the space currently searched by the CMA-ES (see [1] for training set selection methods). The last set is a sampled population P, for which the values of the black-box function are unknown. The population P is selected using the doubly trained evolution control that utilizes the predictions of the Gaussian process surrogate model. These sets change in each generation. A more detailed explanation of how the sets T and P are selected can be found in [1, 14].

Model space. The set of considered surrogate models consists of Gaussian processes with various covariance functions.

Feature space. The features are computed on the datasets A, T, and T ∪ P for each generation in the run of the DTS-CMA-ES algorithm.

Performance space. Performance can be measured with a variety of evaluation metrics, and the question is which metric would be the most convenient for the surrogate model selection task. In [16], the authors used the Ranking Difference Error (RDE). However, error measures such as the Mean Squared Error (MSE), the Mean Absolute Error (MAE), or R^2 may be more convenient for the investigation of the relationships between model performance and fitness landscape features.

Selection mapping. Utilizing the FLA features, we can construct a D-dimensional space Φ, where each dimension represents one FLA feature. In this space, we can create a classification model that will map a φ ∈ Φ to the respective best performing covariance function of the Gaussian process, learned from previous runs of the DTS-CMA-ES algorithm. The selection mapping S : Φ → M is a component that maps landscape features φ ∈ Φ to a model M ∈ M such that S(φ) maximizes the model performance.
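In code, the selection mapping amounts to a classifier over the feature space Φ. The sketch below (Python with scikit-learn) is one possible realisation; the feature extractor is a hypothetical placeholder standing in for the FLA and CMA-ES features of Subsection 3.2, and the random forest is merely an example choice of classifier.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(A, T, P):
    """Hypothetical placeholder for the feature extraction phi(A, T, P),
    i.e. the FLA and CMA-ES features of Subsection 3.2."""
    raise NotImplementedError

class SelectionMapping:
    """S: Phi -> M, learned from previous DTS-CMA-ES runs."""

    def __init__(self, kernel_names):
        self.kernel_names = list(kernel_names)   # the considered covariance functions
        self.clf = RandomForestClassifier(n_estimators=200)

    def fit(self, Phi, best_kernel_idx):
        # Phi: one row of features per generation,
        # best_kernel_idx: index of the best performing kernel for that row
        self.clf.fit(Phi, best_kernel_idx)
        return self

    def select(self, A, T, P):
        phi = np.asarray(extract_features(A, T, P))
        return self.kernel_names[int(self.clf.predict(phi.reshape(1, -1))[0])]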
3.2 Fitness Landscape Analysis

Fitness landscape analysis (FLA) aims to characterize the structure of a fitness function with measurable features. In the context of expensive black-box optimization, the feature calculation relies only on the already evaluated data points.

In [11], the authors discussed sets of low-level features that can be computed with various techniques. Some of them are not useful in the context of expensive black-box optimization because they require additional evaluations of the optimized black-box function.

Several such feature sets have been suggested in the literature to support FLA, e.g. Nearest-Better Clustering [7], Information Content of Fitness Sequences [12], or Dispersion [9]. All of the mentioned feature sets were already used in the paper [16], and we briefly describe them in the following paragraphs.

y-Distribution. This set contains features based on the distribution of the fitness function values. In [11], the authors have presented three such features: skewness, kurtosis, and the number of peaks. Both the skewness and the kurtosis of a distribution are computed from central moments. The skewness tells us how asymmetric the distribution is, and the kurtosis measures how much the distribution differs from the normal distribution in the sense of tailedness. The last feature is an estimation of the number of peaks in the y-distribution.

Levelset. Levelset features are calculated from a dataset split into two classes based on a threshold in function values. As the split value, the median or other quantile values have been studied in [11]. Linear, quadratic, and mixture discriminant analysis are used on the partitioned dataset to separate the classes. The underlying idea is that for the right choice of the threshold value, a multimodal fitness landscape cannot be separated with linear or quadratic discriminant analysis; however, mixture discriminant analysis should have a better performance on a multimodal fitness landscape. The features are defined as cross-validated misclassification errors for each type of discriminant analysis.
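As an illustration of the first two feature sets, the following sketch (Python with SciPy and scikit-learn) computes the y-distribution moments and a median-split levelset feature. It is a simplified reading of the definitions in [11], not the flacco-based implementation used in the experiments, and it omits the number-of-peaks estimate.

import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

def y_distribution_features(y):
    """Skewness and kurtosis of the observed fitness values."""
    return {"skewness": skew(y), "kurtosis": kurtosis(y)}

def levelset_feature(X, y, model=None, quantile=0.5, cv=5):
    """Cross-validated misclassification error of a discriminant-analysis
    classifier separating points below/above the chosen fitness quantile."""
    model = model or LinearDiscriminantAnalysis()
    labels = (y > np.quantile(y, quantile)).astype(int)
    accuracy = cross_val_score(model, X, labels, cv=cv).mean()
    return 1.0 - accuracy

# e.g. levelset_feature(X, y, QuadraticDiscriminantAnalysis()) for the QDA variant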
Meta-model. Features from this class are acquired from fitting a linear and a quadratic regression model. The model performance, specifically the adjusted R^2 value of the linear and quadratic models, has been used in [11], together with the minimum and the maximum of the absolute values of the linear model coefficients. For the quadratic model, the authors used the maximum absolute value divided by the minimum absolute value of the fitted model's coefficients.

Nearest-Better Clustering. The features based on Nearest-Better Clustering (NBC) have been proposed in [7]. The presented five features should help to recognize funnel structures in the fitness landscape.

Dispersion. The dispersion of a function measures how close together the sampled points are in the search space [9]. The dispersion features are derived from this idea. They average differences between dispersion values below a certain moving threshold value. To estimate the dispersion, the authors of [9] sampled the space n times and took the best b points, between which they averaged the pairwise distances. This step was repeated for two different values of n, and the final dispersion was computed by subtracting those results. That way, a difference in dispersion is estimated.

Information Content of Fitness Sequences. Information Content of Fitness Sequences (ICoFS), introduced in [12], measures how difficult it is to describe a given fitness function. For instance, a low information function would be a constant fitness function, as opposed to a high information function such as some complicated multimodal fitness function. This method uses neighboring points and compares their fitness values. The comparisons are later transformed into discrete information from which the features are computed.

CMA-ES features. The authors of [16] proposed features related to the DTS-CMA-ES algorithm. They are computed from the CMA-ES settings, from the set of points X = {x_i | i = 1, ..., n} for which the function value is known, and from DTS-CMA-ES parameters such as the following:

• The generation number g is an easy to obtain feature derived from an optimization run of DTS-CMA-ES.

• CMA-ES uses a step-size σ(g) for controlling the size of the distribution from which the CMA-ES samples new points. Therefore, the step-size can also be used as a feature.

• The evolution path p_c and the σ evolution path length features are derived from the lengths of the evolution paths used in the CMA-ES. These features encode how the path of the evolution process has changed in recent generations and measure how useful the previous steps were for the optimization.

• An additional CMA-ES feature is derived from the number of restarts of the DTS-CMA-ES algorithm. This could indicate how difficult the problem is.

• The Mahalanobis distance of the CMA-ES mean m(g) to the mean of the empirical distribution of all points X is another feature described in [16]. This feature indicates the suitability of X for training a surrogate model.

• The CMA similarity likelihood feature is the log-likelihood of all points X with respect to the CMA-ES distribution. This may also represent a measure of the suitability of the set for surrogate model training.
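The last two features in this list can be written down directly. The following sketch (Python/NumPy and SciPy; the variable names are ours, not the authors' code) computes the Mahalanobis distance of the CMA-ES mean to the empirical distribution of X, and the log-likelihood of X under the CMA-ES sampling distribution, here assumed to be N(m, σ²C).

import numpy as np
from scipy.stats import multivariate_normal

def cma_mean_mahalanobis(X, m):
    """Mahalanobis distance of the CMA-ES mean m to the empirical
    distribution of the evaluated points X (rows are points)."""
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    d = m - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def cma_similarity_loglik(X, m, sigma, C):
    """Log-likelihood of the points X under the CMA-ES sampling
    distribution N(m, sigma^2 * C)."""
    return float(multivariate_normal(mean=m, cov=sigma**2 * C).logpdf(X).sum())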
4 Experimental Evaluation

Several experiments using the data obtained during runs of the DTS-CMA-ES on benchmark functions with different surrogate models were designed. From the error measures of the used surrogate models, the best surrogate model can be selected as the one with the minimal error.

We used Gaussian processes as surrogate models. In particular, the following covariance functions were used: κ_LIN, κ_Q, κ_SE, κ_RQ, κ_Mat^{5/2}, κ_NN, κ_Gibbs, and κ_SE+Q. The parameters of the kernel defining the covariance function are found by the maximum-likelihood or the leave-one-out cross-validation method [1].

In the feature space, we measured the following low-level feature sets: y-Distribution, Levelset, Meta-Model, Nearest-Better Clustering, Dispersion, Information Content, and CMA-ES features. These sets are described in greater detail in Subsection 3.2.

Experiments are compared using two accuracies. The exact accuracy measures exact matches of the classified kernel and the true best performing kernel. The loose accuracy is calculated from loose matches, which consider a prediction correctly classified if it falls among the similarly best performing kernels. Kernels are considered similarly best performing if their error lies within the 5% quantile of the considered kernels' errors for a particular data point.

Experiments are compared to a baseline model that recommends the most frequent best performing kernel from the training set. To construct the selection mapping, various approaches can be used. With classification models we can classify the best performing kernels, or, by utilizing the information about errors, we can apply regression or multi-label classification models.

The following classifiers, or their regression counterparts, were trained: decision tree, random forest, support vector machine, and artificial neural network with two hidden dense layers (50 and 25 neurons, respectively). Results are shown in Figures 3 and 4.

The baseline model was outperformed by almost every presented classification method. Single-label classification methods have the highest accuracy, but their accuracy is very similar to that of the multi-label classification methods. The advantage of the multi-label classification is that it provides more flexibility for tuning the settings and hence provides more room for improvement.

The differences between exact and loose accuracy vary between the used error measures. For RDE, MSE, and MAE those differences are greater than for the R^2 error. This might be a consequence of choosing the 5% quantile for similarly best performing kernels.

4.1 Used Data

The problems used for retrieving the data {A^(i), T^(i), P^(i) | i = 1, ..., g} in this paper were obtained from running the DTS-CMA-ES algorithm on the Black-Box Optimization Benchmarks from the COmparing Continuous Optimisers (COCO) platform, namely, problems in dimensions 2, 3, 5, 10, and 20 on instances 11-15 [5, 6]. The sets {A^(i), T^(i), P^(i) | i = 1, ..., g} were extracted for 25 uniformly selected generations for the 8 considered surrogate models. The algorithm was terminated if one of the following two conditions was true:

(1) the target fitness value 10^{-8} is reached, or

(2) the number of evaluations of the optimized fitness function f is at least 250D, where D is the dimension of the function f.

From the data, we calculate the FLA features and the errors derived from the surrogate models' predictions. The considered error measures are: RDE, MSE, MAE, and R^2.

The data generated for this paper were already used in [16]. Compared to [16], more metrics in the performance space and more classifiers are investigated. Features were calculated using the algorithm underlying the R-package flacco [8], reimplemented in the MATLAB language.

4.2 Single-Label Classification

A classifier Sc : Φ → M was trained on the labels of the best performing models. To obtain a label for a data point, the minimal error value among the GP kernels was found and the corresponding kernel was set as the label. It is not always clear which model should be selected as a label because multiple models can have equal errors. To address this ambiguity, multi-label classifiers are tested in the next subsection.
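The label construction and the two accuracies can be summarised in a few lines (Python/NumPy; a schematic rendering of the definitions above, not the evaluation code behind Figures 3 and 4):

import numpy as np

def best_kernel_labels(errors):
    """errors: (n_points, n_kernels) matrix of a chosen error measure.
    Single-label target: index of the kernel with minimal error."""
    return errors.argmin(axis=1)

def similarly_best(errors, q=0.05):
    """Boolean mask of kernels whose error lies within the q-quantile
    of the kernel errors for each data point (the 'loose' label set)."""
    thresh = np.quantile(errors, q, axis=1, keepdims=True)
    return errors <= thresh

def exact_and_loose_accuracy(predicted, errors, q=0.05):
    exact = (predicted == best_kernel_labels(errors)).mean()
    loose_sets = similarly_best(errors, q)
    loose = loose_sets[np.arange(len(predicted)), predicted].mean()
    return exact, loose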
4.3 Multi-Label Classification

From the original dataset, not only the best, but all nearly best performing models are found and used as labels for training. The trained classifier is then capable of predicting multiple labels for given landscape features. However, for a fair comparison with the single-label classification and with the regression approach, only one label has to be predicted. To this end, a regression model is utilized to select a single label among the labels predicted by the multi-label classifier. That regression model considers only the labels predicted by the multi-label classifier, and among them, the one best according to the regression model is selected as the final label for the comparison.

Figure 3: Exact accuracy and loose accuracy for each considered model trained with landscape features and predicting the best performing model w.r.t. the Ranking Difference Error and the Mean Squared Error. Hyper-parameters for each classifier were found using a 5-fold cross-validation, and the final accuracies were measured on the test set containing 20% of the original data set.

Figure 4: Exact accuracy and loose accuracy for each considered model trained with landscape features and predicting the best performing model w.r.t. R^2 and the Mean Absolute Error. Hyper-parameters for each classifier were found using a 5-fold cross-validation, and the final accuracies were measured on the test set containing 20% of the original data set.

4.4 Regression

A regression model Sr : Φ → E^{|M|} was trained to predict the error of each surrogate model for given landscape features. The Sr model yields errors from which a minimum is found and its corresponding surrogate model is selected. Some regression models yield only one prediction. In that case, for each surrogate model one regression model is trained to predict the error, and the results are then combined.
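A minimal version of the per-model variant could look as follows (Python with scikit-learn; the random forest regressor is an example choice mirroring the classifiers listed above, and the code is an illustration rather than the authors' implementation):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

class PerKernelErrorRegression:
    """S_r: Phi -> E^{|M|}; one independent regressor per considered GP kernel."""

    def __init__(self, n_kernels):
        self.models = [RandomForestRegressor(n_estimators=200)
                       for _ in range(n_kernels)]

    def fit(self, Phi, errors):
        # errors: (n_samples, n_kernels) matrix of the chosen error measure
        for k, model in enumerate(self.models):
            model.fit(Phi, errors[:, k])
        return self

    def select(self, phi):
        """Predict the error of every kernel and return the index of the minimum."""
        preds = [m.predict(np.asarray(phi).reshape(1, -1))[0] for m in self.models]
        return int(np.argmin(preds))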
5 Conclusion

A design of various methods for classifying the data from FLA to predict the most convenient surrogate model was presented. The baseline model was outperformed by almost every presented classification method. However, the differences between the highest accuracy scores and the baseline scores are very small. From the accuracy scores in Figures 3 and 4, it is clear that both the best classifiers and the best regression models are random forests for almost all considered approaches.

The accuracy scores suggest that the classifiers did not completely solve the problem of surrogate model selection for the DTS-CMA-ES algorithm. However, this method might improve the performance of the DTS-CMA-ES algorithm, because even a small improvement in accuracy might be useful.

Possible problems might be an imbalance of classes in the training dataset and/or a similar performance of some surrogate models for some data points. The latter was addressed with the multi-label classification methods.

A further improvement of the presented methods could be achieved through improved fitness landscape analysis. This concerns, on the one hand, new suitable fitness landscape features and, on the other hand, feature selection that could reduce the feature space and improve the performance of the employed classifiers and regression models.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.
References

[1] Lukáš Bajer, Zbyněk Pitra, Jakub Repický, and Martin Holeňa. Gaussian process surrogate models for the CMA evolution strategy. Evolutionary Computation, 27(4):665–697, 2019.
[2] G. E. P. Box and K. B. Wilson. On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society: Series B (Methodological), 13(1):1–38, January 1951.
[3] Mark N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Citeseer, 1998.
[4] Nikolaus Hansen. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
[5] Nikolaus Hansen, Anne Auger, Steffen Finck, and Raymond Ros. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Technical report, Citeseer, 2010.
[6] Nikolaus Hansen, Anne Auger, Steffen Finck, and Raymond Ros. Real-Parameter Black-Box Optimization Benchmarking 2012: Experimental Setup. Technical report, Citeseer, 2012.
[7] Pascal Kerschke, Mike Preuss, Simon Wessing, and Heike Trautmann. Detecting funnel structures by means of exploratory landscape analysis. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pages 265–272, 2015.
[8] Pascal Kerschke and Heike Trautmann. Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the R-package flacco. In Applications in Statistical Computing – From Music Data Analysis to Industrial Quality Improvement, Studies in Classification, Data Analysis, and Knowledge Organization, pages 93–123. Springer, 2019.
[9] Monte Lunacek and Darrell Whitley. The dispersion metric and the CMA evolution strategy. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO '06). ACM Press, 2006.
[10] Bertil Matérn. Spatial variation. Technical report, 1960.
[11] Olaf Mersmann, Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, and Günter Rudolph. Exploratory landscape analysis. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 829–836, 2011.
[12] Mario A. Muñoz, Michael Kirley, and Saman K. Halgamuge. Exploratory landscape analysis of continuous space optimization problems using information content. IEEE Transactions on Evolutionary Computation, 19(1):74–87, 2014.
[13] Mario A. Muñoz, Yuan Sun, Michael Kirley, and Saman K. Halgamuge. Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges. Information Sciences, 317:224–245, 2015.
[14] Zbyněk Pitra, Lukáš Bajer, and Martin Holeňa. Doubly trained evolution control for the surrogate CMA-ES. In International Conference on Parallel Problem Solving from Nature, pages 59–68. Springer, 2016.
[15] Zbyněk Pitra, Lukáš Bajer, and Martin Holeňa. Knowledge-based selection of Gaussian process surrogates. In Workshop & Tutorial on Interactive Adaptive Learning, page 48, 2019.
[16] Zbyněk Pitra, Jakub Repický, and Martin Holeňa. Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 691–699, 2019.
[17] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[18] John R. Rice. The algorithm selection problem. Advances in Computers, 15:65–118, 1976.