<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessment of Surrogate Model Settings Using Landscape Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikuláš Dvořák</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zbyněk Pitra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University</institution>
          ,
          <addr-line>Thákurova 7, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University</institution>
          ,
          <addr-line>Trojanova 13, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Pod vodárenskou věží 2, Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work in progress concerns the assessment of surrogate model settings for expensive black-box optimization. The assessment is performed in the context of Gaussian process models used in the Doubly Trained Surrogate (DTS) variant of the state-of-the-art black-box optimizer, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). This work focuses on the connection between the predictive accuracy of the Gaussian process surrogate model and an essential model hyper-parameter, the covariance function. The performance of DTS-CMA-ES is related to the results of landscape analysis of the objective function. To this end, various classification and regression methods are used, proposed in the traditional framework for algorithm selection by Rice. Several single-label classification, multi-label classification, and regression methods are experimentally evaluated on data from DTS-CMA-ES runs on the noiseless benchmark functions from the COCO platform for comparing continuous optimizers in black-box settings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Optimization is a field of mathematics that has been
studied for centuries. Many problems can be reduced to the
problem of finding the global optima of a function.
Gradient descent methods or analytical solutions are often used
to solve these problems.</p>
      <p>Expensive black-box optimization addresses
optimization problems in which a mathematical
definition of the optimized objective is unknown and its
evaluation costs valuable resources such as money or time.</p>
      <p>
        The Covariance Matrix Adaptation Evolution Strategy
(CMA-ES [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) is a stochastic method suitable for
optimization of black-box functions. A surrogate model
is a regression model that can be used to approximate
the unknown black-box function. Instead of evaluating
the black-box function in every search point, the
surrogate model is used to decrease the number of expensive
evaluations based on already evaluated points. However,
the combination of the CMA-ES with a surrogate model
presents new challenges in tuning surrogate models to
make the optimization more effective. Finally, fitness
landscape analysis (FLA) is a technique that attempts to
characterize the structure of a fitness landscape with
measurable features. As these features describe the
structure of a fitness function, they could provide the information
needed to obtain the most suitable surrogate model.
      </p>
      <p>
        This paper addresses the problem of how to select the
most convenient surrogate model, in the context of
various metrics quantifying the quality of the considered
surrogate model, in every generation of the Doubly Trained
Surrogate Covariance Matrix Adaptation Evolution
Strategy (DTS-CMA-ES [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]). Later, such a metric can be used,
with a set of features from fitness landscape analysis, to
train a classification model that selects the surrogate model
for any black-box function. This idea is depicted in
Figure 1. An accurate model selection method could be
used for the DTS-CMA-ES algorithm and could
potentially speed up the optimization process. This work might
provide valuable insight for such a goal.
      </p>
      <p>
        To select a surrogate model, various classification
strategies can be used, and by assessing their performance
the most suitable classification model can be later utilized.
The selection is described in the context of a framework
for algorithm selection proposed by Rice in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        We have started from the research in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], where the authors
used a classification tree as the selection mapping. However,
the accuracy of the classification tree was not very
satisfactory. Therefore, we test more classification models.
      </p>
      <p>This paper is structured as follows. In Section 2, an
introduction to surrogate models for surrogate-assisted
CMA-ES is presented. Section 3 discusses the design for
algorithm selection utilizing fitness landscape analysis and
Rice’s framework. Finally, in Section 4, various
classification and regression methods for selecting the most
convenient surrogate model for DTS-CMA-ES are shown.</p>
    </sec>
    <sec id="sec-1-1">
      <title>Surrogate Models in the Context of CMA Evolution Strategy</title>
      <p>
The CMA-ES is an algorithm for numerical black-box
optimization. The algorithm can be simplified into a
repetition of the following three steps:
(1) sample a new population of size λ from
a multivariate normal distribution N(m, Σ);
(2) select the μ best offspring from the sampled
population based on their respective function values;
(3) update the parameters m and Σ of the multivariate
distribution with respect to the selected μ offspring.</p>
      <p>In step (2), all λ offspring need to be evaluated in order
to select the best μ offspring. A surrogate model that
approximates the underlying black-box function can be used
to decrease the number of needed expensive evaluations.</p>
      <p>
        Surrogate modeling is a technique based on building
regression models of the original function using the already
evaluated data points. This technique originated from
response surface modeling where the regression models
are usually simple polynomial models. Response surface
modeling was introduced by George E. P. Box and K. B.
Wilson in 1951 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>DTS-CMA-ES is a version of the CMA-ES algorithm
utilizing surrogate models. This algorithm uses
regression models such as Gaussian processes for their capability
of predicting a whole distribution instead of just a value
of the objective function. The covariance function of the
used Gaussian process is a hyper-parameter, which we are
trying to set by utilizing features from the FLA.</p>
    </sec>
    <sec id="sec-2">
      <title>Gaussian process</title>
      <p>A Gaussian process is a collection of random variables,
any finite number of which have a joint Gaussian
distribution.</p>
      <p>Due to the joint Gaussian distribution, a Gaussian process
is described by its mean and covariance function. The
mean function m(x) and the covariance function k(x, x')
of a random variable g(x) assigned to point x are defined as
\[ m(x) = \mathbb{E}[g(x)], \qquad
k(x, x') = \mathbb{E}\big[(g(x) - m(x))(g(x') - m(x'))^{T}\big], \]
and the fact that m and k define, respectively, the mean
and covariance of the variables g(x) forming the Gaussian
process is sometimes denoted as
\[ g(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big). \]</p>
      <p>The posterior distribution can be inferred with the rules
for conditioning Gaussians as the Gaussian distribution
N(m*, Σ*), where
\[ \mathbf{m}_{*} = m(X_{*}) + K_{*}^{T} K^{-1} (\mathbf{f} - m(X)), \qquad
\Sigma_{*} = K_{**} - K_{*}^{T} K^{-1} K_{*}, \]
where f is a vector of measured responses, X is a matrix
with inputs of known responses, X* is a matrix with inputs
of unknown responses, and (K)ij = k(xi, xj), (K*)ij = k(xi, xj*), and
(K**)ij = k(xi*, xj*) for the considered covariance function k.</p>
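      <p>To make the conditioning step concrete, the following minimal NumPy sketch computes the posterior mean and covariance from the formulas above, assuming a zero prior mean function and a user-supplied kernel; the function name and the small jitter term added for numerical stability are our own illustrative choices, not part of DTS-CMA-ES.</p>
      <preformat>
import numpy as np

def gp_posterior(X, f, X_star, kernel, jitter=1e-10):
    """Posterior mean and covariance of a zero-mean GP at the test inputs X_star.

    X      : (n, D) array of f-evaluated inputs
    f      : (n,) vector of measured responses
    X_star : (m, D) array of inputs with unknown responses
    kernel : callable k(A, B) returning the covariance matrix between rows of A and B
    """
    K = kernel(X, X) + jitter * np.eye(len(X))    # K_ij = k(x_i, x_j), jittered for stability
    K_s = kernel(X, X_star)                       # (K_*)_ij = k(x_i, x_j*)
    K_ss = kernel(X_star, X_star)                 # (K_**)_ij = k(x_i*, x_j*)
    K_inv_f = np.linalg.solve(K, f)
    K_inv_Ks = np.linalg.solve(K, K_s)
    m_star = K_s.T @ K_inv_f                      # posterior mean (zero prior mean assumed)
    S_star = K_ss - K_s.T @ K_inv_Ks              # posterior covariance
    return m_star, S_star
      </preformat>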
      <p>The covariance function k must be a symmetric
function of two vector inputs, and the matrix K constructed by means of
k as described above must be positive semidefinite for any
number of points xi; each such function is called a kernel.
The kernel defining a covariance function is a
hyper-parameter of a Gaussian process. There is a large variety
of available kernels; in this work we consider the
following ones.</p>
      <p>Polynomial kernels are defined as follows:</p>
      <p>\[ k(x, x') = (x^{T} x' + \sigma_0^2)^{p}, \]
where p ∈ ℕ and σ0 is a constant term (bias).</p>
      <p>For p = 1 the kernel is called linear (LIN) and for p = 2
the kernel is quadratic (Q).</p>
      <p>A squared exponential kernel (SE) is defined as
\[ k_{SE}(x, x') = \sigma^2 \exp\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right), \]
where ℓ is a characteristic length-scale, a hyper-parameter
determining the relationship between the distance of
vectors in the input space and correlations in the output space.</p>
      <p>A rational quadratic kernel (RQ) can be viewed as a
generalization of the SE kernel. The RQ kernel is defined
as
\[ k_{RQ}(x, x') = \sigma^2 \left( 1 + \frac{\lVert x - x' \rVert^2}{2\alpha\ell^2} \right)^{-\alpha}. \]
The hyper-parameter α &gt; 0 can be seen as a
decomposition of the exponential function in the SE kernel.</p>
      <p>
        Frequently, the following two kernels from the Matérn class [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are used:
\[ k_{Mat}^{3/2}(x, x') = (1 + a)\exp(-a), \quad \text{where } a = \frac{\sqrt{3}\,\lVert x - x' \rVert}{\ell}, \]
\[ k_{Mat}^{5/2}(x, x') = \left(1 + a + \frac{a^2}{3}\right)\exp(-a), \quad \text{where } a = \frac{\sqrt{5}\,\lVert x - x' \rVert}{\ell}. \]
      </p>
      <p>
        Another kernel was introduced by Gibbs in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
      <p>\[ k_{Gibbs}(x, x') = \prod_{i=1}^{D} \left( \frac{2\ell_i(x)\,\ell_i(x')}{\ell_i^2(x) + \ell_i^2(x')} \right)^{1/2} \exp\left( -\sum_{i=1}^{D} \frac{(x_i - x_i')^2}{\ell_i^2(x) + \ell_i^2(x')} \right), \]
where ℓi is a positive function which can be different for
each i and D is the dimension of the vector x. Making the
hyper-parameter ℓ configurable in every dimension makes
this kernel more flexible.</p>
      <p>
        A neural network can also be used as a kernel for a GP.
How to derive the following neural network kernel is
discussed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>\[ k_{NN}(x, x') = \frac{2}{\pi} \arcsin\left( \frac{2\tilde{x}^{T}\Sigma\tilde{x}'}{\sqrt{(1 + 2\tilde{x}^{T}\Sigma\tilde{x})(1 + 2\tilde{x}'^{T}\Sigma\tilde{x}')}} \right), \]
where x̃ is x with an added bias component such that
x̃ = (1, x1, ..., xD)^T and Σ denotes a corresponding covariance matrix.</p>
      <p>A new kernel can also be created using addition. For
instance, the addition of an SE kernel to a Q kernel results
in a new kernel defined as follows:
\[ k_{SE+Q}(x, x') = \sigma^2 \exp\left( -\frac{\lVert x - x' \rVert^2}{2\ell^2} \right) + (x^{T} x' + \sigma_0^2)^2. \]</p>
      <sec id="sec-2-1">
        <title>Methodology</title>
        <p>
          To clarify characteristics of the data used to train a
surrogate model in the DTS-CMA-ES algorithm, we use the
explanation from [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]: For each generation g of the
DTS-CMA-ES algorithm, a set of surrogate models M is
trained on a training set T. The training set T is a subset
of the archive A of all evaluated data points. Afterwards,
a surrogate model M ∈ M is utilized to select the new
population P. The question is how to select the most convenient
surrogate model using the sets A, T, and P?
        </p>
        <p>
          Framework for Model Setting Selection
One way to describe the surrogate model selection
problem is to use a framework for algorithm selection proposed
by Rice in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This framework is designed with five main
components, which for the problem of surrogate model selection
can be briefly explained as follows:
Data space is a space of possible problems. In this case,
the data space contains sets of data points that are present
in the DTS-CMA-ES runs.
        </p>
        <p>Algorithm space is a space of possible surrogate models
to solve a problem from the data space.</p>
        <p>Feature space is a space of possible characterizations of
the data space. We use feature sets from FLA and
CMA-ES features described in Subsection 3.2.
Performance space is a space describing the
performance of a particular algorithm for a particular
problem.</p>
        <p>Selection mapping is a function that gives a surrogate
model M, for a particular vector of features f, such
that it minimizes the model's error e.</p>
        <p>
          The following diagram [
          <xref ref-type="bibr" rid="ref13 ref18">13, 18</xref>
          ] (Figure 2) illustrates the
main parts of this framework and their relations.
        </p>
        <p>The goal is to train a classifier, represented by the
selection mapping, which could be later utilized to select the
best covariance function of a Gaussian process for given
data.</p>
        <p>
          Data space In the DTS-CMA-ES, three sets of data points
are used. The first one is an archive A containing all f-evaluated
data points {(xi, f(xi)) | i = 1, ..., n}, where n is
the number of f-evaluated points. The second one is the
training set T containing f-evaluated data points which
are a subset of A and are utilized for fitting a surrogate
model in DTS-CMA-ES. The training set is selected to
contain data points that are near the space currently searched
by the CMA-ES (see [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for training set selection
methods). The last set is a sampled population P, for which the
values of the black-box function are unknown. The
population P is selected using the doubly trained evolution
control that utilizes the predictions of the Gaussian process
surrogate model. These sets are changing each generation.
A more detailed explanation of how the sets T and P are
selected can be found in [
          <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
          ].
        </p>
        <p>Model space The set of considered surrogate models
consisted of Gaussian processes with various covariance
functions.</p>
        <p>Feature space The features are computed on the datasets
A, T, and T ∪ P for each generation in the run of the
DTS-CMA-ES algorithm.</p>
        <p>
          Performance space Performance can be measured with
a variety of evaluation metrics and the question is which
metric would be the most convenient for the surrogate
model selection task. In [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], the authors used the Ranking
Difference Error (RDE). However, error measures such as
Mean Squared Error (MSE), Mean Absolute Error (MAE),
or R2 may be more convenient for the investigation of
the relationships between model performance and fitness
landscape features.
        </p>
        <p>Selection mapping Utilizing the FLA features, we can
construct a D-dimensional space F, where each dimension
represents one FLA feature. In this space, we can create a
classification model that will map an f ∈ F to the
respective best performing covariance function of the Gaussian
process learned from previous runs of the DTS-CMA-ES
algorithm.</p>
        <p>The selection mapping S : F → M is a component that
maps landscape features f ∈ F to a model M ∈ M such
that S(f) maximizes the model performance.</p>
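        <p>As a sketch of the selection mapping S viewed as a classifier, the following assumes a matrix of FLA features (one row per DTS-CMA-ES generation) and, for each row, the label of the best performing covariance function; the random forest and the synthetic stand-in data are illustrative choices only.</p>
        <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Illustrative stand-ins: 200 generations with 10 FLA features each and the
# label of the best performing covariance function per generation.
F_train = rng.normal(size=(200, 10))
best_kernel = rng.choice(["LIN", "SE", "RQ", "Mat5/2", "NN"], size=200)

# Selection mapping S : F -> M realized as a classifier over kernel labels.
selector = RandomForestClassifier(n_estimators=200, random_state=0)
selector.fit(F_train, best_kernel)

# Recommend a covariance function for the features of a new generation.
f_new = rng.normal(size=(1, 10))
print(selector.predict(f_new))
        </preformat>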
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Fitness Landscape Analysis</title>
      <p>Fitness landscape analysis (FLA) aims to characterize the
structure of a fitness function with measurable features. In
the context of expensive black-box optimization, the
feature calculation relies only on the already evaluated data
points.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the authors discussed sets of low-level features
that can be computed with various techniques. Some of
them are not useful in the context of expensive black-box
optimization because they require additional evaluations
of the optimized black-box function.
      </p>
      <p>
        Several such feature sets have been suggested in the
literature to support FLA, e.g. Nearest-Better Clustering
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Information Content of Fitness Sequences [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], or
Dispersion [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. All of the mentioned feature sets were already
used in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and we briefly describe them in the
following paragraphs.</p>
      <p>y-Distribution This set contains features based on the
distribution of the fitness function values. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the authors
have presented three such features: skewness, kurtosis, and
number of peaks.
      </p>
      <p>Both the skewness and the kurtosis of a distribution are
computed from central moments. The skewness tells us
how asymmetric the distribution is and the kurtosis
measures how much the distribution differs from the normal
distribution in the sense of tailedness.</p>
      <p>The last feature is an estimation of the number of peaks
in the y-Distribution.</p>
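      <p>A small sketch of how these features might be computed from a sample of objective values, assuming SciPy is available; the histogram-based peak count below only illustrates the idea of estimating the number of peaks and is not necessarily the exact estimator proposed in the cited literature.</p>
      <preformat>
import numpy as np
from scipy.stats import skew, kurtosis

def y_distribution_features(y, bins=20):
    """Skewness, kurtosis, and a crude estimate of the number of peaks of y."""
    y = np.asarray(y, dtype=float)
    hist, _ = np.histogram(y, bins=bins, density=True)
    # count local maxima of the histogram as a rough peak estimate
    peaks = sum(
        1 for i in range(1, len(hist) - 1)
        if hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]
    )
    return {"skewness": skew(y), "kurtosis": kurtosis(y), "number_of_peaks": peaks}
      </preformat>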
      <p>
        Levelset Levelset features are calculated from a dataset
split into two classes based on a threshold in function
values. As a split value, the median value or other quantile
values have been studied in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>Linear, quadratic, and mixture discriminant analysis are
used on the partitioned dataset to separate classes. The
underlying idea is that for the right choice of the threshold
value, a multimodal fitness landscape cannot be separated
with linear or quadratic discriminant analysis. However,
mixture discriminant analysis should have a better
performance on a multimodal fitness landscape.</p>
        <p>The features are defined as cross-validated
misclassification errors for each type of discriminant analysis.</p>
        <p>Meta-model Features from this class are acquired from
fitting a linear and quadratic regression model.</p>
        <p>
          The model performance, specifically the adjusted R2
value of linear and quadratic models, has been used in
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] together with the minimum and the maximum of the
absolute values of the linear model coefficients. For the
quadratic model, the authors used the maximum absolute
value divided by the minimum absolute value of the fitted
model’s coefficients.
        </p>
        <p>
          Nearest-Better Clustering The features based on
Nearest-Better Clustering (NBC) have been proposed in
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The presented five features should help to recognize
funnel structures in the fitness landscape.
        </p>
        <p>
          Dispersion Dispersion of a function measures how close
together the sampled points are in the search space [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
The dispersion features are derived from this idea. They
average differences between dispersion values below a
certain moving threshold value.
        </p>
        <p>
          To estimate the dispersion, the authors of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] sampled
the space n times and took the best b points, between which
they averaged the pairwise distances. This step
was repeated for two different n values, and the final
dispersion was computed by subtracting those results. That
way a difference in dispersion is estimated.
        </p>
        <p>
          Information Content of Fitness Sequences Information
Content of Fitness Sequences (ICoFS) introduced in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
measures how difficult it is to describe a given fitness
function. For instance, a low-information function would be
a constant fitness function, as opposed to a high-information
function such as some complicated multimodal fitness
function.
        </p>
        <p>This method uses neighboring values and compares
their fitness values. The comparisons are later transformed
into discrete information from which the features are
computed.</p>
        <p>
          CMA-ES features The authors of [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposed features
related to the DTS-CMA-ES algorithm. They are
computed from the CMA-ES settings, from the set of points
X = {xi | i = 1, ..., n} for which the function value is
known, and from DTS-CMA-ES parameters, as follows.
        </p>
        <p>The generation number g is an easy-to-obtain feature
derived from an optimization run of DTS-CMA-ES.
The CMA-ES uses a step-size σ(g) for controlling the size
of the distribution from which the CMA-ES samples
new points. Therefore, the step-size can also be used
as a feature.</p>
        <p>The evolution path pc and the σ evolution path length
features are derived from the evolution path lengths
used in the CMA-ES. These features encode how the
path of the evolution process has changed in recent
generations and measure how useful the previous
steps were for the optimization.</p>
        <p>
          An additional CMA-ES feature is derived from the
number of restarts of the DTS-CMA-ES algorithm.
This could indicate how difficult the problem is.
The Mahalanobis distance of the CMA-ES mean m(g) to
the mean of the empirical distribution of all points
X is another feature described in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. This feature
indicates the suitability of X for training a surrogate
model.
        </p>
        <p>The CMA similarity likelihood feature is the
log-likelihood of all points X with respect to the CMA-ES
distribution. This may also represent a measure of the
suitability of the set for surrogate model training.</p>
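        <p>As an illustrative sketch, the last two features above might be computed as follows, assuming the CMA-ES mean m and covariance matrix C of the current sampling distribution are available; taking the Mahalanobis distance with respect to C is one plausible reading of the definition, and SciPy's multivariate normal is used for the log-likelihood.</p>
        <preformat>
import numpy as np
from scipy.stats import multivariate_normal

def cma_similarity_features(X, m, C):
    """Mahalanobis distance of the CMA-ES mean m to the empirical mean of X
    (taken here with respect to the CMA-ES covariance C), and the
    log-likelihood of all points X under the CMA-ES distribution."""
    X = np.asarray(X, dtype=float)
    m = np.asarray(m, dtype=float)
    C = np.asarray(C, dtype=float)
    diff = m - X.mean(axis=0)
    mahalanobis = float(np.sqrt(diff @ np.linalg.inv(C) @ diff))
    log_likelihood = float(np.sum(multivariate_normal.logpdf(X, mean=m, cov=C)))
    return mahalanobis, log_likelihood
        </preformat>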
        <sec id="sec-3-5-1">
          <title>Experimental Evaluation</title>
          <p>Several experiments were designed using the data obtained during runs
of the DTS-CMA-ES on benchmark functions with
different surrogate models. From the error
measures of the used surrogate models, the best surrogate model
can be selected as the one with the minimal error.</p>
          <p>
            We used Gaussian processes as surrogate models. In
particular, the following covariance functions were used:
kLIN, kQ, kSE, kRQ, the Matérn 5/2 kernel, kNN, kGibbs, and kSE+Q. The
parameters of the kernel defining the covariance function
are found by the maximum-likelihood or leave-one-out
cross-validation method [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
          <p>In the feature space, we measured the following
low-level feature sets: y-Distribution, Levelset, Meta-Model,
Nearest-Better Clustering, Dispersion, Information
Content, and CMA-ES features. These sets are described in
greater detail in Subsection 3.2.</p>
          <p>Experiments are compared using two accuracies. The
exact accuracy measures exact matches of the classified
kernel and the true best performing kernel. The loose
accuracy is calculated from loose matches, which consider a
prediction correctly classified if it falls among the similarly
best performing kernels. Kernels are considered similarly
best performing if their error is within the 5% quantile of the
considered kernels' errors for a particular data point.</p>
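          <p>A small sketch of both accuracies, assuming a matrix of per-kernel errors for every data point and the indices of the kernels chosen by a classifier; the 5% quantile threshold follows the definition above, and all names are illustrative.</p>
          <preformat>
import numpy as np

def exact_and_loose_accuracy(errors, predicted, quantile=0.05):
    """errors: (N, K) errors of the K kernels on N data points;
    predicted: (N,) indices of the kernels chosen by the classifier."""
    errors = np.asarray(errors, dtype=float)
    predicted = np.asarray(predicted)
    best = errors.argmin(axis=1)
    exact = float(np.mean(predicted == best))
    # loose match: the chosen kernel's error lies within the 5% quantile
    # of that data point's kernel errors
    thresholds = np.quantile(errors, quantile, axis=1)
    chosen_err = errors[np.arange(len(errors)), predicted]
    loose = float(np.mean(thresholds >= chosen_err))
    return exact, loose
          </preformat>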
          <p>Experiments are compared to a baseline model that
recommends the most frequent best performing kernel from
the training set. For the selection itself, various approaches can be
used. With classification models we can classify the best
performing kernels, or, by utilizing the information about the
errors, we can apply regression or multi-label classification
models.</p>
          <p>The following classifiers or their regression versions
were trained: decision tree, random forest, support
vector machine, and artificial neural network with two hidden
dense layers (50 and 25 neurons respectively). Results are
shown in Figures 3 and 4.</p>
          <p>The baseline model was outperformed by almost every
presented classification method. Single-label
classification methods have the highest accuracy, but their accuracy
is very similar to that of the multi-label classification methods.
The advantage of the multi-label classification is that it
provides more flexibility for tuning the settings and hence
provides more room for improvement.</p>
          <p>The differences between exact and loose accuracy vary
across the used error measures. For RDE, MSE, and MAE,
those differences are greater than for R2. This might
be a consequence of choosing the 5% quantile for similarly
best performing kernels.</p>
        </sec>
    </sec>
    <sec id="sec-4">
      <title>Used Data</title>
      <p>
        The problems used for retrieving the data {A(i), T(i), P(i) |
i = 1, ..., g} in this paper were obtained from running
the DTS-CMA-ES algorithm on the Black-Box
Optimization Benchmarks from the COmparing Continuous
Optimisers (COCO) platform, namely, problems in
dimensions 2, 3, 5, 10, and 20 on instances 11-15 [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The
sets {A(i), T(i), P(i) | i = 1, ..., g} were extracted for 25
uniformly selected generations for the 8 considered surrogate
models. The algorithm was terminated if one of the
following two conditions was true:
(1) the target fitness value 10⁻⁸ is reached, or
(2) the number of evaluations of the optimized fitness
function f is at least 250D, where D is the dimension
of the function f.
      </p>
      <p>From the data, we calculate FLA features and errors
derived from the surrogate models' predictions. The considered
error measures are: RDE, MSE, MAE, and R2.</p>
      <p>
        The data generated for this paper were already used in
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Compared to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], more metrics in the performance
space and more classifiers are investigated. Features were
calculated using the algorithm underlying the R-package
flacco [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], reimplemented in the MATLAB language.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Single-Label Classification</title>
      <p>A classifier Sc : F → M was trained on labels of the best
performing models. To obtain a label for a data point, the
minimal error value over the GP kernels was found and the
corresponding kernel was set as the label. It is not always clear which model
should be selected as a label because multiple models can
have equal errors. To address this ambiguity, multi-label
classifiers are tested in the next subsection.</p>
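      <p>A minimal sketch of this labeling step, assuming a matrix of per-kernel errors; ties between equally good kernels are resolved here simply by taking the first minimum, which is exactly the ambiguity motivating the multi-label approach below.</p>
      <preformat>
import numpy as np

def single_label(errors, kernel_names):
    """errors: (N, K) surrogate-model errors; kernel_names: K kernel labels.
    Returns the best-performing kernel label for each data point."""
    idx = np.asarray(errors).argmin(axis=1)   # ties resolved by the first minimum
    return np.asarray(kernel_names)[idx]
      </preformat>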
    </sec>
    <sec id="sec-6">
      <title>Multi-Label Classification</title>
      <p>From the original dataset, not only the best, but all nearly
best performing models are found and used as labels for Sc
training. The trained classifier is then capable of
predicting multiple labels for given landscape features. However,
for a fair comparison with the single-label classification
and with the regression approach, only one label has to be
predicted. To this end, a regression model is utilized to
predict the best performing model and thereby select a single
label among the labels predicted by the multi-label classifier:
the regression model considers only the labels predicted by the
multi-label classifier and, among them, the best one
according to the regression model is selected as the final label for
comparison.
</p>
      <p>(Figures 3 and 4: exact and loose accuracies of the single-label, multi-label, and regression approaches with decision tree, random forest, SVM, and neural network models, compared to the baseline and estimated by 5-fold cross-validation.)</p>
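      <p>One possible realization of this combination is sketched below with scikit-learn and synthetic stand-in data: a multi-label classifier (one binary classifier per kernel) predicts the set of nearly best kernels, and a regression model of the kernel errors picks a single label among them; all names, the random forest choice, and the random data are illustrative only.</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor

# Illustrative stand-ins for FLA features and per-kernel errors.
rng = np.random.default_rng(0)
F = rng.normal(size=(300, 10))                      # landscape features
E = rng.random(size=(300, 8))                       # errors of the 8 kernels
# multi-label targets: kernels whose error is within the 5% quantile per point
thresholds = np.quantile(E, 0.05, axis=1)[:, None]
Y = (thresholds >= E).astype(int)

multi = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(F, Y)
reg = MultiOutputRegressor(RandomForestRegressor(random_state=0)).fit(F, E)

f_new = rng.normal(size=(1, 10))
candidates = np.flatnonzero(multi.predict(f_new)[0])    # nearly best kernels
pred_err = reg.predict(f_new)[0]
# among the predicted labels, the one with the smallest predicted error wins
if len(candidates) == 0:
    candidates = np.arange(E.shape[1])
best = candidates[pred_err[candidates].argmin()]
print("selected kernel index:", int(best))
      </preformat>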
    </sec>
    <sec id="sec-7">
      <title>Regression</title>
      <p>A regression model Sr : F → E^|M| was trained to predict
the error of each surrogate model for given landscape
features. The Sr model yields errors from which the minimum
is found and its corresponding surrogate model is selected.</p>
      <p>Some regression models yield only one prediction. In
that case, for each surrogate model one regression model is
trained to predict the error and the results are then combined.</p>
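      <p>A sketch of this per-model variant, assuming a feature matrix F and a per-kernel error matrix E as before: one regressor is trained per surrogate model and the model with the minimal predicted error is selected; the random forest regressor is an illustrative choice.</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_per_kernel_regressors(F, E):
    """One regressor per surrogate model, each predicting that model's error."""
    E = np.asarray(E, dtype=float)
    return [RandomForestRegressor(random_state=0).fit(F, E[:, k])
            for k in range(E.shape[1])]

def select_by_regression(regressors, f_new):
    """Predict the error of every surrogate model and pick the minimal one."""
    predicted = np.array([r.predict(f_new)[0] for r in regressors])
    return int(predicted.argmin()), predicted
      </preformat>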
      <sec id="sec-7-1">
        <title>Conclusion</title>
        <p>A design of various methods for classifying the data from
FLA to predict the most convenient surrogate model was
presented. The baseline model was outperformed by
almost every presented classification method. However, the
differences between the highest accuracy scores and the
baseline scores are very small. From the accuracy scores
in Figures 3 and 4, it is clear that both the best classifiers
and the best regression models are random forests for
almost all considered approaches.</p>
        <p>The accuracy scores suggest that the classifiers did not
solve the problem of surrogate model selection for the
DTS-CMA-ES algorithm completely. However, this method
might improve the performance of the DTS-CMA-ES
algorithm because even a small improvement in accuracy
might be useful.</p>
        <p>Possible problems might be with an imbalance of
classes in the training dataset and/or with similar
performance of some surrogate models for some data points.
The latter was addressed with multi-label
classification methods.</p>
        <p>A further improvement of the presented methods could
be achieved through improved fitness landscape analysis.
This concerns, on the one hand, new suitable fitness
landscape features and, on the other hand, feature selection that
could reduce the feature space and improve the
performance of employed classifiers and regression models.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>The research reported in this paper has been supported by
the Czech Science Foundation (GAČR) grant 18-18080S.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Lukáš</given-names>
            <surname>Bajer</surname>
          </string-name>
          , Zbyněk Pitra, Jakub Repický, and
          <article-title>Martin Holeňa. Gaussian process surrogate models for the CMA evolution strategy</article-title>
          .
          <source>Evolutionary computation</source>
          ,
          <volume>27</volume>
          (
          <issue>4</issue>
          ):
          <fpage>665</fpage>
          -
          <lpage>697</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. E. P.</given-names>
            <surname>Box and K. B. Wilson</surname>
          </string-name>
          .
          <article-title>On the Experimental Attainment of Optimum Conditions</article-title>
          .
          <source>Journal of the Royal Statistical Society: Series B (Methodological)</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          , jan
          <year>1951</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Mark</surname>
            <given-names>N</given-names>
          </string-name>
          <string-name>
            <surname>Gibbs</surname>
          </string-name>
          .
          <article-title>Bayesian Gaussian processes for regression and classification</article-title>
          .
          <source>PhD thesis</source>
          , Citeseer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Nikolaus</given-names>
            <surname>Hansen</surname>
          </string-name>
          .
          <article-title>The CMA evolution strategy: a comparing review</article-title>
          .
          <source>In Towards a new evolutionary computation</source>
          , pages
          <fpage>75</fpage>
          -
          <lpage>102</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Nikolaus</given-names>
            <surname>Hansen</surname>
          </string-name>
          , Anne Auger,
          <string-name>
            <surname>Steffen Finck</surname>
            , and
            <given-names>Raymond</given-names>
          </string-name>
          <string-name>
            <surname>Ros</surname>
          </string-name>
          .
          <article-title>Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions</article-title>
          .
          <source>Technical report, Citeseer</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Nikolaus</given-names>
            <surname>Hansen</surname>
          </string-name>
          , Anne Auger,
          <string-name>
            <surname>Steffen Finck</surname>
            , and
            <given-names>Raymond</given-names>
          </string-name>
          <string-name>
            <surname>Ros</surname>
          </string-name>
          .
          <article-title>Real-Parameter Black-Box Optimization Benchmarking 2012: Experimental Setup</article-title>
          .
          <source>Technical report, Citeseer</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Kerschke</surname>
          </string-name>
          , Mike Preuss, Simon Wessing, and
          <string-name>
            <given-names>Heike</given-names>
            <surname>Trautmann</surname>
          </string-name>
          .
          <article-title>Detecting funnel structures by means of exploratory landscape analysis</article-title>
          .
          <source>In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation</source>
          , pages
          <fpage>265</fpage>
          -
          <lpage>272</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Kerschke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Heike</given-names>
            <surname>Trautmann</surname>
          </string-name>
          .
          <article-title>Comprehensive Feature-Based Landscape Analysis of Continuous and Constrained Optimization Problems Using the R-package flacco. In Applications in Statistical Computing - From Music Data Analysis to Industrial Quality Improvement, Studies in Classification, Data Analysis, and Knowledge Organization</article-title>
          , pages
          <fpage>93</fpage>
          -
          <lpage>123</lpage>
          . Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Monte</given-names>
            <surname>Lunacek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Darrell</given-names>
            <surname>Whitley</surname>
          </string-name>
          .
          <article-title>The dispersion metric and the CMA evolution strategy</article-title>
          .
          <source>In Proceedings of the 8th annual conference on Genetic and evolutionary computation - GECCO 06</source>
          . ACM Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Bertil</given-names>
            <surname>Matérn</surname>
          </string-name>
          .
          <article-title>Spatial variation</article-title>
          .
          <source>Technical report</source>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Olaf</surname>
            <given-names>Mersmann</given-names>
          </string-name>
          , Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, and
          <string-name>
            <given-names>Günter</given-names>
            <surname>Rudolph</surname>
          </string-name>
          .
          <article-title>Exploratory landscape analysis</article-title>
          .
          <source>In Proceedings of the 13th annual conference on Genetic and evolutionary computation</source>
          , pages
          <fpage>829</fpage>
          -
          <lpage>836</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Mario</surname>
            <given-names>A Muñoz</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kirley</surname>
          </string-name>
          , and
          <article-title>Saman K Halgamuge</article-title>
          .
          <article-title>Exploratory landscape analysis of continuous space optimization problems using information content</article-title>
          .
          <source>IEEE transactions on evolutionary computation</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>74</fpage>
          -
          <lpage>87</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Mario</surname>
            <given-names>A Muñoz</given-names>
          </string-name>
          , Yuan Sun,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kirley</surname>
          </string-name>
          , and
          <article-title>Saman K Halgamuge</article-title>
          .
          <article-title>Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges</article-title>
          .
          <source>Information Sciences</source>
          ,
          <volume>317</volume>
          :
          <fpage>224</fpage>
          -
          <lpage>245</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zbyněk</given-names>
            <surname>Pitra</surname>
          </string-name>
          , Lukáš Bajer, and
          <article-title>Martin Holeňa. Doubly trained evolution control for the surrogate CMA-ES</article-title>
          .
          <source>In International Conference on Parallel Problem Solving from Nature</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Zbyněk</given-names>
            <surname>Pitra</surname>
          </string-name>
          , Lukáš Bajer, and
          <article-title>Martin Holeňa. Knowledge-based Selection of Gaussian Process Surrogates</article-title>
          .
          <source>In Workshop &amp; Tutorial on Interactive Adaptive Learning, page 48</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Zbyněk</given-names>
            <surname>Pitra</surname>
          </string-name>
          , Jakub Repický, and
          <article-title>Martin Holeňa</article-title>
          .
          <article-title>Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy</article-title>
          .
          <source>In Proceedings of the Genetic and Evolutionary Computation Conference</source>
          , pages
          <fpage>691</fpage>
          -
          <lpage>699</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Carl</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          .
          <article-title>Gaussian processes for machine learning</article-title>
          . MIT Press, Cambridge, Mass,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>John R. Rice</surname>
          </string-name>
          et al.
          <article-title>The algorithm selection problem</article-title>
          .
          <source>Advances in computers</source>
          ,
          <volume>15</volume>
          :
          <fpage>65</fpage>
          -
          <lpage>118</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>