Assessment of Surrogate Model Settings Using Landscape Analysis

Mikuláš Dvořák^1, Zbyněk Pitra^{2,3}, Martin Holeňa^3

^1 Faculty of Information Technology, Czech Technical University, Thákurova 7, Prague, Czech Republic
^2 Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Trojanova 13, Prague, Czech Republic
^3 Institute of Computer Science, Czech Academy of Sciences, Pod vodárenskou věží 2, Prague, Czech Republic

Abstract: This work in progress concerns the assessment of surrogate model settings for expensive black-box optimization. The assessment is performed in the context of Gaussian process models used in the Doubly Trained Surrogate (DTS) variant of the state-of-the-art black-box optimizer, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). This work focuses on the connection between the predictive accuracy of a Gaussian process surrogate model and an essential model hyper-parameter, the covariance function. The performance of DTS-CMA-ES is related to the results of landscape analysis of the objective function. To this end, various classification and regression methods are used within the traditional framework for algorithm selection proposed by Rice. Several single-label classification, multi-label classification, and regression methods are experimentally evaluated on data from DTS-CMA-ES runs on the noiseless benchmark functions from the COCO platform for comparing continuous optimizers in black-box settings.

1 Introduction

Optimization is a field of mathematics that has been studied for centuries. Many problems can be reduced to the problem of finding the global optima of a function. Gradient descent methods or analytical solutions are often used to solve these problems.

Expensive black-box optimization addresses optimization problems in situations when a mathematical definition of the optimized objective is unknown and its evaluation costs valuable resources such as money or time. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES [4]) is a stochastic method suitable for the optimization of black-box functions. A surrogate model is a regression model that can be used to approximate the unknown black-box function. Instead of evaluating the black-box function in every search point, the surrogate model is used to decrease the number of expensive evaluations based on already evaluated points. However, the combination of the CMA-ES with a surrogate model presents new challenges in tuning surrogate models to make the optimization more effective. Finally, fitness landscape analysis (FLA) is a technique that tries to characterize the structure of a fitness landscape with measurable features. As these features describe the structure of a fitness function, they could provide information based on which the most suitable surrogate model could be obtained.

This paper addresses the problem of how to select the most convenient surrogate model, in the context of various metrics quantifying the quality of the considered surrogate model, in every generation of the Doubly Trained Surrogate Covariance Matrix Adaptation Evolution Strategy (DTS-CMA-ES [14]). Later, such a metric can be used, together with a set of features from fitness landscape analysis, to train a classification model that selects the surrogate model for any black-box function. This idea is depicted in Figure 1. An accurate model selection method could be used for the DTS-CMA-ES algorithm and could potentially speed up the optimization process. This work might provide valuable insight for such a goal.

To select a surrogate model, various classification strategies can be used, and by assessing their performance the most suitable classification model can later be utilized. The selection is described in the context of a framework for algorithm selection proposed by Rice in [18]. We have started from the research in [15], where the authors used a classification tree for the selection mapping. However, the accuracy of the classification tree was not very satisfactory. Therefore, we test more classification models.

This paper is structured as follows. In Section 2, an introduction to surrogate models for surrogate-assisted CMA-ES is presented.
Section 3 discusses the design for algorithm selection utilizing fitness landscape analysis and Rice's framework. Finally, in Section 4, various classification and regression methods for selecting the most convenient surrogate model for DTS-CMA-ES are shown.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: An artificial illustration of the relation between fitness landscape analysis and surrogate modeling. The top left graph represents the data that are modeled by the surrogate model. The top right graph shows how these data could be represented in the space described by features derived from the fitness landscape analysis; the decision boundary represents a trained classification model that should choose the more accurate surrogate model in the fitness landscape analysis space. The bottom graphs show that two surrogate models can model the space in different ways and hence one might be more accurate than the other.

2 Surrogate Models in the Context of CMA Evolution Strategy

The CMA-ES is an algorithm for numerical black-box optimization. The algorithm can be simplified into a repetition of the following three steps:

(1) sample a new population of size λ from a multivariate normal distribution N(m, Σ),

(2) select the µ best offspring from the sampled population based on their respective function values,

(3) update the parameters m and Σ of the multivariate distribution with respect to the selected µ offspring.

In step (2), all λ offspring need to be evaluated in order to select the best µ offspring. A surrogate model that approximates the underlying black-box function can be used to decrease the number of needed expensive evaluations. Surrogate modeling is a technique based on building regression models of the original function using the already evaluated data points. This technique originated from response surface modeling, where the regression models are usually simple polynomial models. Response surface modeling was introduced by George E. P. Box and K. B. Wilson in 1951 [2].
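To make the role of the surrogate model in steps (1)-(3) concrete, the following minimal sketch (Python/NumPy) shows one generation in which a surrogate, trained on the archive of already evaluated points, pre-screens the sampled offspring so that only a few of them are evaluated by the expensive function. The helpers train_surrogate and expensive_f are hypothetical placeholders, and the pre-screening rule is a deliberate simplification of the doubly trained evolution control described in [14].

import numpy as np

def surrogate_assisted_generation(m, Sigma, archive_X, archive_y,
                                  expensive_f, train_surrogate,
                                  lam=12, mu=6, n_orig=3):
    """One simplified generation of a surrogate-assisted CMA-ES."""
    # (1) sample a new population of size lam from N(m, Sigma)
    pop = np.random.multivariate_normal(m, Sigma, size=lam)

    # build a regression model of the objective from the archive of
    # already evaluated points (the surrogate model)
    model = train_surrogate(archive_X, archive_y)   # hypothetical helper
    y = np.asarray(model.predict(pop), dtype=float)

    # evaluate only the n_orig most promising offspring with the expensive
    # black-box function, keep surrogate predictions for the remaining ones
    promising = np.argsort(y)[:n_orig]
    y[promising] = [expensive_f(x) for x in pop[promising]]

    # (2) select the mu best offspring based on their (partly predicted) values
    best = np.argsort(y)[:mu]
    # (3) the CMA-ES update of m and Sigma with respect to pop[best] follows here
    return pop[best], y[best]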
B. tion. Wilson in 1951 [2]. Due to a joint Gaussian distribution, Gaussian process DTS-CMA-ES is a version of the CMA-ES algorithm is described by its mean and covariance function. The utilizing surrogate models. This algorithm uses regres- mean function m(xx) and the covariance function κ(xx, x 0 ) sion models such as Gaussian process for their capability of a random variable g(xx) assigned to point x are defined √ ! as 5 2 0 5a κMat (xx, x ) = 1+a+ exp (−a) , where m(xx) = E [g(xx)] , 3` (1) √ κ(xx, x 0 ) = E (g(xx) − m(xx))(g(xx0 ) − m(xx0 ))T 5kxx − x 0 k   a= . (8) ` and the fact that m and κ define, respectively, the mean and covariance of the variables g(xx) forming the Gaussian These kernels are from the Matérn class [10]. process is sometimes denoted as Another kernel was introduced by Gibbs in [3]: 1/2 g(xx) ∼ GP(m(xx), κ(xx, x0 )). (2) D  2`i (xx)`i (xx0 ) κGibbs (xx, x0 ) = ∏ i=1 `2i (xx) + `2i (xx0 ) The posterior distribution can be inferred with rules ! (9) for conditioning Gaussians as the Gaussian distribution (xi − xi0 )2 D exp − ∑ 2 , N (µµ ∗ , Σ ∗ ), where x) + `2i (xx0 ) i=1 `i (x T µ ∗ = µ (X X ∗ ) + K ∗ K −1 ( f − µ (X X )), where `i is a positive function which can be different for T (3) each i and D is the dimension of the vector x . Making the Σ ∗ = K ∗∗ − K ∗ K −1 K ∗ , hyper-parameter ` configurable in every dimension makes this kernel more flexible. where f is a vector of measured responses, X is a matrix Also a neural network can be used as a kernel for GP. with inputs of known responses, X ∗ is a matrix with inputs How to derive the following neural network kernel is dis- of unknown responses, K i j = κ(xxi , x j ), K ∗i j = κ(xxi , x ∗j ) and cussed in [17]. K ∗∗ x∗i , x ∗j ) for the considered covariance function κ. i j = κ(x The covariance function κ must be a symmetric func- κNN (xx, x 0 ) = tion of two vector inputs, and the matrix K by means of ! κ as described above must be for any number of points x i 2 2x̃xT Σ x̃x0 (10) arcsin p , positive semidefinite; each such function is called a kernel. π (1 + 2x̃xT Σ x̃x0 )(1 + 2x̃x0T Σ x̃x0 ) The kernel defining a covariance function is as a hyper- parameter of a Gaussian process. There is a large variety where x̃x is x with an added bias component such that of available kernels, in this work we consider the follow- x̃x = (1, x1 , . . . , xD )T and Σ denotes a corresponding bias ing ones. component. Polynomial kernels are defined as follows: A new kernel can be also created using addition. For instance the addition of a SE kernel to a Q kernel results κ(xx, x 0 ) = (xxT x 0 + σ02 ) p , (4) in a new kernel defined as follows: kxx − x 0 k2   where p ∈ N and σ0 is a constant term (bias). κSE+Q (xx, x 0 ) = σ 2 exp − For p = 1 the kernel is called linear (LIN) and for p = 2 2`2 (11) the kernel is quadratic (Q). T 0 +(xx x + σ0 ) .2 2 A Squared exponential kernel (SE) is defined as kxx − x 0 k2   3 Methodology κSE (xx, x 0 ) = σ 2 exp − , (5) 2`2 To clarify characteristics of the data used to train a sur- where ` is a characteristic length-scale, a hyper-parameter rogate model in the DTS-CMA-ES algorithm, we use the determining the relationship between the distance of vec- explanation from [15]: For each generation g of the DTS- tors in the input space and correlations in the output space. CMA-ES algorithm, a set of surrogate models M are A Rational quadratic kernel (RQ) can be viewed as a trained on a training set T . The training set T is a subset generalization of the SE kernel. 
3 Methodology

To clarify the characteristics of the data used to train a surrogate model in the DTS-CMA-ES algorithm, we use the explanation from [15]: For each generation g of the DTS-CMA-ES algorithm, a set of surrogate models M is trained on a training set T. The training set T is a subset of the archive A of all evaluated data points. Afterwards, a surrogate model M ∈ M is utilized to select a new population P. The question is how to select the most convenient surrogate model using the sets A, T, and P.

3.1 Framework for Model Setting Selection

One way to describe the surrogate model selection problem is to use the framework for algorithm selection proposed by Rice in [18]. This framework is designed with five main components, which can be briefly explained for the problem of surrogate model selection as follows:

Data space is a space of possible problems. In this case, the data space contains the sets of data points that are present in the DTS-CMA-ES runs.

Algorithm space is a space of possible surrogate models to solve a problem from the data space.

Feature space is a space of possible characterizations of the data space. We use feature sets from FLA and CMA-ES features described in Subsection 3.2.

Performance space is a space describing the performance of a particular algorithm for a particular problem.

Selection mapping is a function that gives a surrogate model M, for a particular vector of features φ, such that it minimizes the model error ε.

The following diagram [13, 18] (Figure 2) illustrates the main parts of this framework and their relations. The goal is to train a classifier, represented by the selection mapping, which could later be utilized to select the best covariance function of a Gaussian process for given data.

Figure 2: Modified Rice's framework for surrogate model selection in DTS-CMA-ES.

Data space. In the DTS-CMA-ES, three sets of data points are used. The first one is an archive A containing all f-evaluated data points {(x_i, f(x_i)) | i = 1, ..., n}, where n is the number of f-evaluated points. The second one is the training set T containing f-evaluated data points which are a subset of A and are utilized for fitting a surrogate model in DTS-CMA-ES. The training set is selected to contain data points that are near the space currently searched by the CMA-ES (see [1] for training set selection methods). The last set is a sampled population P, for which the values of the black-box function are unknown. The population P is selected using the doubly trained evolution control that utilizes the predictions of the Gaussian process surrogate model. These sets change in each generation. A more detailed explanation of how the sets T and P are selected can be found in [1, 14].

Model space. The set of considered surrogate models consists of Gaussian processes with various covariance functions.

Feature space. The features are computed on the datasets A, T, and T ∪ P for each generation in the run of the DTS-CMA-ES algorithm.

Performance space. Performance can be measured with a variety of evaluation metrics, and the question is which metric would be the most convenient for the surrogate model selection task. In [16], the authors used the Ranking Difference Error (RDE). However, error measures such as the Mean Squared Error (MSE), the Mean Absolute Error (MAE), or R^2 may be more convenient for the investigation of the relationships between model performance and fitness landscape features.

Selection mapping. Utilizing the FLA features, we can construct a D-dimensional space Φ, where each dimension represents one FLA feature. In this space, we can create a classification model that will map a φ ∈ Φ to the respective best performing covariance function of the Gaussian process, learned from previous runs of the DTS-CMA-ES algorithm. The selection mapping S : Φ → M is a component that maps landscape features φ ∈ Φ to a model M ∈ M such that S(φ) maximizes the model performance.
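In code, the selection mapping amounts to a classifier over the feature space Φ. The sketch below (Python with scikit-learn) is one possible realisation; the feature extractor is a hypothetical placeholder standing in for the FLA and CMA-ES features of Subsection 3.2, and the random forest is merely an example choice of classifier.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(A, T, P):
    """Hypothetical placeholder for the feature extraction phi(A, T, P),
    i.e. the FLA and CMA-ES features of Subsection 3.2."""
    raise NotImplementedError

class SelectionMapping:
    """S: Phi -> M, learned from previous DTS-CMA-ES runs."""

    def __init__(self, kernel_names):
        self.kernel_names = list(kernel_names)   # the considered covariance functions
        self.clf = RandomForestClassifier(n_estimators=200)

    def fit(self, Phi, best_kernel_idx):
        # Phi: one row of features per generation,
        # best_kernel_idx: index of the best performing kernel for that row
        self.clf.fit(Phi, best_kernel_idx)
        return self

    def select(self, A, T, P):
        phi = np.asarray(extract_features(A, T, P))
        return self.kernel_names[int(self.clf.predict(phi.reshape(1, -1))[0])]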
3.2 Fitness Landscape Analysis

Fitness landscape analysis (FLA) aims to characterize the structure of a fitness function with measurable features. In the context of expensive black-box optimization, the feature calculation relies only on the already evaluated data points.

In [11], the authors discussed sets of low-level features that can be computed with various techniques. Some of them are not useful in the context of expensive black-box optimization because they require additional evaluations of the optimized black-box function.

Several such feature sets have been suggested in the literature to support FLA, e.g. Nearest-Better Clustering [7], Information Content of Fitness Sequences [12], or Dispersion [9]. All of the mentioned feature sets were already used in the paper [16], and we briefly describe them in the following paragraphs.

y-Distribution. This set contains features based on the distribution of the fitness function values. In [11], the authors have presented three such features: skewness, kurtosis, and the number of peaks. Both the skewness and the kurtosis of a distribution are computed from central moments. The skewness tells us how asymmetric the distribution is, and the kurtosis measures how much the distribution differs from the normal distribution in the sense of tailedness. The last feature is an estimation of the number of peaks in the y-distribution.

Levelset. Levelset features are calculated from a dataset split into two classes based on a threshold in function values. As the split value, the median or other quantile values have been studied in [11]. Linear, quadratic, and mixture discriminant analysis are used on the partitioned dataset to separate the classes. The underlying idea is that for the right choice of the threshold value, a multimodal fitness landscape cannot be separated with linear or quadratic discriminant analysis; however, mixture discriminant analysis should have a better performance on a multimodal fitness landscape. The features are defined as cross-validated misclassification errors for each type of discriminant analysis.
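As an illustration of the first two feature sets, the following sketch (Python with SciPy and scikit-learn) computes the y-distribution moments and a median-split levelset feature. It is a simplified reading of the definitions in [11], not the flacco-based implementation used in the experiments, and it omits the number-of-peaks estimate.

import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

def y_distribution_features(y):
    """Skewness and kurtosis of the observed fitness values."""
    return {"skewness": skew(y), "kurtosis": kurtosis(y)}

def levelset_feature(X, y, model=None, quantile=0.5, cv=5):
    """Cross-validated misclassification error of a discriminant-analysis
    classifier separating points below/above the chosen fitness quantile."""
    model = model or LinearDiscriminantAnalysis()
    labels = (y > np.quantile(y, quantile)).astype(int)
    accuracy = cross_val_score(model, X, labels, cv=cv).mean()
    return 1.0 - accuracy

# e.g. levelset_feature(X, y, QuadraticDiscriminantAnalysis()) for the QDA variant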
Meta-model. Features from this class are acquired from fitting a linear and a quadratic regression model. The model performance, specifically the adjusted R^2 value of the linear and quadratic models, has been used in [11], together with the minimum and the maximum of the absolute values of the linear model coefficients. For the quadratic model, the authors used the maximum absolute value divided by the minimum absolute value of the fitted model's coefficients.

Nearest-Better Clustering. The features based on Nearest-Better Clustering (NBC) have been proposed in [7]. The presented five features should help to recognize funnel structures in the fitness landscape.

Dispersion. The dispersion of a function measures how close together the sampled points are in the search space [9]. The dispersion features are derived from this idea. They average differences between dispersion values below a certain moving threshold value. To estimate the dispersion, the authors of [9] sampled the space n times and took the best b points, between which they averaged the pairwise distances. This step was repeated for two different values of n, and the final dispersion was computed by subtracting those results. That way, a difference in dispersion is estimated.

Information Content of Fitness Sequences. Information Content of Fitness Sequences (ICoFS), introduced in [12], measures how difficult it is to describe a given fitness function. For instance, a low information function would be a constant fitness function, as opposed to a high information function such as some complicated multimodal fitness function. This method uses neighboring points and compares their fitness values. The comparisons are later transformed into discrete information from which the features are computed.

CMA-ES features. The authors of [16] proposed features related to the DTS-CMA-ES algorithm. They are computed from the CMA-ES settings, from the set of points X = {x_i | i = 1, ..., n} for which the function value is known, and from DTS-CMA-ES parameters such as the following:

• The generation number g is an easy to obtain feature derived from an optimization run of DTS-CMA-ES.

• CMA-ES uses a step-size σ(g) for controlling the size of the distribution from which the CMA-ES samples new points. Therefore, the step-size can also be used as a feature.

• The evolution path p_c and the σ evolution path length features are derived from the lengths of the evolution paths used in the CMA-ES. These features encode how the path of the evolution process has changed in recent generations and measure how useful the previous steps were for the optimization.

• An additional CMA-ES feature is derived from the number of restarts of the DTS-CMA-ES algorithm. This could indicate how difficult the problem is.

• The Mahalanobis distance of the CMA-ES mean m(g) to the mean of the empirical distribution of all points X is another feature described in [16]. This feature indicates the suitability of X for training a surrogate model.

• The CMA similarity likelihood feature is the log-likelihood of all points X with respect to the CMA-ES distribution. This may also represent a measure of the suitability of the set for surrogate model training.
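The last two features in this list can be written down directly. The following sketch (Python/NumPy and SciPy; the variable names are ours, not the authors' code) computes the Mahalanobis distance of the CMA-ES mean to the empirical distribution of X, and the log-likelihood of X under the CMA-ES sampling distribution, here assumed to be N(m, σ²C).

import numpy as np
from scipy.stats import multivariate_normal

def cma_mean_mahalanobis(X, m):
    """Mahalanobis distance of the CMA-ES mean m to the empirical
    distribution of the evaluated points X (rows are points)."""
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    d = m - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def cma_similarity_loglik(X, m, sigma, C):
    """Log-likelihood of the points X under the CMA-ES sampling
    distribution N(m, sigma^2 * C)."""
    return float(multivariate_normal(mean=m, cov=sigma**2 * C).logpdf(X).sum())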
4 Experimental Evaluation

Several experiments using the data obtained during runs of the DTS-CMA-ES on benchmark functions with different surrogate models were designed. From the error measures of the used surrogate models, the best surrogate model can be selected as the one with the minimal error.

We used Gaussian processes as surrogate models. In particular, the following covariance functions were used: κ_LIN, κ_Q, κ_SE, κ_RQ, κ_Mat^{5/2}, κ_NN, κ_Gibbs, and κ_SE+Q. The parameters of the kernel defining the covariance function are found by the maximum-likelihood or the leave-one-out cross-validation method [1].

In the feature space, we measured the following low-level feature sets: y-Distribution, Levelset, Meta-Model, Nearest-Better Clustering, Dispersion, Information Content, and CMA-ES features. These sets are described in greater detail in Subsection 3.2.

Experiments are compared using two accuracies. The exact accuracy measures exact matches of the classified kernel and the true best performing kernel. The loose accuracy is calculated from loose matches, which consider a prediction correctly classified if it falls among the similarly best performing kernels. Kernels are considered similarly best performing if their error lies within the 5% quantile of the considered kernels' errors for a particular data point.

Experiments are compared to a baseline model that recommends the most frequent best performing kernel from the training set. To construct the selection mapping, various approaches can be used. With classification models we can classify the best performing kernels, or, by utilizing the information about errors, we can apply regression or multi-label classification models.

The following classifiers, or their regression counterparts, were trained: decision tree, random forest, support vector machine, and artificial neural network with two hidden dense layers (50 and 25 neurons, respectively). Results are shown in Figures 3 and 4.

The baseline model was outperformed by almost every presented classification method. Single-label classification methods have the highest accuracy, but their accuracy is very similar to that of the multi-label classification methods. The advantage of the multi-label classification is that it provides more flexibility for tuning the settings and hence provides more room for improvement.

The differences between exact and loose accuracy vary between the used error measures. For RDE, MSE, and MAE those differences are greater than for the R^2 error. This might be a consequence of choosing the 5% quantile for similarly best performing kernels.

4.1 Used Data

The problems used for retrieving the data {A^(i), T^(i), P^(i) | i = 1, ..., g} in this paper were obtained from running the DTS-CMA-ES algorithm on the Black-Box Optimization Benchmarks from the COmparing Continuous Optimisers (COCO) platform, namely, problems in dimensions 2, 3, 5, 10, and 20 on instances 11-15 [5, 6]. The sets {A^(i), T^(i), P^(i) | i = 1, ..., g} were extracted for 25 uniformly selected generations for the 8 considered surrogate models. The algorithm was terminated if one of the following two conditions was true:

(1) the target fitness value 10^{-8} is reached, or

(2) the number of evaluations of the optimized fitness function f is at least 250D, where D is the dimension of the function f.

From the data, we calculate the FLA features and the errors derived from the surrogate models' predictions. The considered error measures are: RDE, MSE, MAE, and R^2.

The data generated for this paper were already used in [16]. Compared to [16], more metrics in the performance space and more classifiers are investigated. Features were calculated using the algorithm underlying the R-package flacco [8], reimplemented in the MATLAB language.

4.2 Single-Label Classification

A classifier Sc : Φ → M was trained on the labels of the best performing models. To obtain a label for a data point, the minimal error value among the GP kernels was found and the corresponding kernel was set as the label. It is not always clear which model should be selected as a label because multiple models can have equal errors. To address this ambiguity, multi-label classifiers are tested in the next subsection.
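The label construction and the two accuracies can be summarised in a few lines (Python/NumPy; a schematic rendering of the definitions above, not the evaluation code behind Figures 3 and 4):

import numpy as np

def best_kernel_labels(errors):
    """errors: (n_points, n_kernels) matrix of a chosen error measure.
    Single-label target: index of the kernel with minimal error."""
    return errors.argmin(axis=1)

def similarly_best(errors, q=0.05):
    """Boolean mask of kernels whose error lies within the q-quantile
    of the kernel errors for each data point (the 'loose' label set)."""
    thresh = np.quantile(errors, q, axis=1, keepdims=True)
    return errors <= thresh

def exact_and_loose_accuracy(predicted, errors, q=0.05):
    exact = (predicted == best_kernel_labels(errors)).mean()
    loose_sets = similarly_best(errors, q)
    loose = loose_sets[np.arange(len(predicted)), predicted].mean()
    return exact, loose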
4.3 Multi-Label Classification

From the original dataset, not only the best, but all nearly best performing models are found and used as labels for training. The trained classifier is then capable of predicting multiple labels for given landscape features. However, for a fair comparison with the single-label classification and with the regression approach, only one label has to be predicted. To this end, a regression model is utilized to select a single label among the labels predicted by the multi-label classifier. That regression model considers only the labels predicted by the multi-label classifier, and among them, the one best according to the regression model is selected as the final label for the comparison.

Figure 3: Exact accuracy and loose accuracy for each considered model trained with landscape features and predicting the best performing model w.r.t. the Ranking Difference Error and the Mean Squared Error. Hyper-parameters for each classifier were found using a 5-fold cross-validation, and the final accuracies were measured on the test set containing 20% of the original data set.

Figure 4: Exact accuracy and loose accuracy for each considered model trained with landscape features and predicting the best performing model w.r.t. R^2 and the Mean Absolute Error. Hyper-parameters for each classifier were found using a 5-fold cross-validation, and the final accuracies were measured on the test set containing 20% of the original data set.

4.4 Regression

A regression model Sr : Φ → E^{|M|} was trained to predict the error of each surrogate model for given landscape features. The Sr model yields errors from which a minimum is found and its corresponding surrogate model is selected. Some regression models yield only one prediction. In that case, for each surrogate model one regression model is trained to predict the error, and the results are then combined.
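A minimal version of the per-model variant could look as follows (Python with scikit-learn; the random forest regressor is an example choice mirroring the classifiers listed above, and the code is an illustration rather than the authors' implementation):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

class PerKernelErrorRegression:
    """S_r: Phi -> E^{|M|}; one independent regressor per considered GP kernel."""

    def __init__(self, n_kernels):
        self.models = [RandomForestRegressor(n_estimators=200)
                       for _ in range(n_kernels)]

    def fit(self, Phi, errors):
        # errors: (n_samples, n_kernels) matrix of the chosen error measure
        for k, model in enumerate(self.models):
            model.fit(Phi, errors[:, k])
        return self

    def select(self, phi):
        """Predict the error of every kernel and return the index of the minimum."""
        preds = [m.predict(np.asarray(phi).reshape(1, -1))[0] for m in self.models]
        return int(np.argmin(preds))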
5 Conclusion

A design of various methods for classifying the data from FLA to predict the most convenient surrogate model was presented. The baseline model was outperformed by almost every presented classification method. However, the differences between the highest accuracy scores and the baseline scores are very small. From the accuracy scores in Figures 3 and 4, it is clear that both the best classifiers and the best regression models are random forests for almost all considered approaches.

The accuracy scores suggest that the classifiers did not completely solve the problem of surrogate model selection for the DTS-CMA-ES algorithm. However, this method might improve the performance of the DTS-CMA-ES algorithm, because even a small improvement in accuracy might be useful.

Possible problems might be an imbalance of classes in the training dataset and/or a similar performance of some surrogate models for some data points. The latter was addressed with the multi-label classification methods.

A further improvement of the presented methods could be achieved through improved fitness landscape analysis. This concerns, on the one hand, new suitable fitness landscape features and, on the other hand, feature selection that could reduce the feature space and improve the performance of the employed classifiers and regression models.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.
References

[1] Lukáš Bajer, Zbyněk Pitra, Jakub Repický, and Martin Holeňa. Gaussian process surrogate models for the CMA evolution strategy. Evolutionary Computation, 27(4):665–697, 2019.
[2] G. E. P. Box and K. B. Wilson. On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society: Series B (Methodological), 13(1):1–38, January 1951.
[3] Mark N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Citeseer, 1998.
[4] Nikolaus Hansen. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
[5] Nikolaus Hansen, Anne Auger, Steffen Finck, and Raymond Ros. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Technical report, Citeseer, 2010.
[6] Nikolaus Hansen, Anne Auger, Steffen Finck, and Raymond Ros. Real-Parameter Black-Box Optimization Benchmarking 2012: Experimental Setup. Technical report, Citeseer, 2012.
[7] Pascal Kerschke, Mike Preuss, Simon Wessing, and Heike Trautmann. Detecting funnel structures by means of exploratory landscape analysis. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pages 265–272, 2015.
[8] Pascal Kerschke and Heike Trautmann. Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the R-package flacco. In Applications in Statistical Computing – From Music Data Analysis to Industrial Quality Improvement, Studies in Classification, Data Analysis, and Knowledge Organization, pages 93–123. Springer, 2019.
[9] Monte Lunacek and Darrell Whitley. The dispersion metric and the CMA evolution strategy. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO '06). ACM Press, 2006.
[10] Bertil Matérn. Spatial variation. Technical report, 1960.
[11] Olaf Mersmann, Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, and Günter Rudolph. Exploratory landscape analysis. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 829–836, 2011.
[12] Mario A. Muñoz, Michael Kirley, and Saman K. Halgamuge. Exploratory landscape analysis of continuous space optimization problems using information content. IEEE Transactions on Evolutionary Computation, 19(1):74–87, 2014.
[13] Mario A. Muñoz, Yuan Sun, Michael Kirley, and Saman K. Halgamuge. Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges. Information Sciences, 317:224–245, 2015.
[14] Zbyněk Pitra, Lukáš Bajer, and Martin Holeňa. Doubly trained evolution control for the surrogate CMA-ES. In International Conference on Parallel Problem Solving from Nature, pages 59–68. Springer, 2016.
[15] Zbyněk Pitra, Lukáš Bajer, and Martin Holeňa. Knowledge-based selection of Gaussian process surrogates. In Workshop & Tutorial on Interactive Adaptive Learning, page 48, 2019.
[16] Zbyněk Pitra, Jakub Repický, and Martin Holeňa. Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 691–699, 2019.
[17] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[18] John R. Rice. The algorithm selection problem. Advances in Computers, 15:65–118, 1976.