ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 163–171
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 J. Repický, L. Bajer, M. Holeňa



Traditional Gaussian Process Surrogates in the BBOB Framework

Jakub Repický¹, Lukáš Bajer¹, and Martin Holeňa²

¹ Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, Prague, Czech Republic
{j.repicky,bajeluk}@gmail.com

² Institute of Computer Science, Czech Academy of Sciences, Pod Vodárenskou věží 2, Prague, Czech Republic
martin@cs.cas.cz

Abstract: Objective function evaluation in continuous optimization tasks is often the operation that dominates the algorithm's cost, in particular in the case of black-box functions, i.e. when no analytical description is available and the function has to be evaluated empirically. In such a situation, utilizing information from a surrogate model of the objective function is a well-known technique to accelerate the search. In this paper, we review two traditional approaches to surrogate modelling based on Gaussian processes that we have newly reimplemented in MATLAB: the Metamodel Assisted Evolution Strategy using probability of improvement, and the Gaussian Process Optimization Procedure. In the research reported in this paper, both approaches have been evaluated for the first time on the Black-Box Optimization Benchmarking framework (BBOB), a comprehensive benchmark for continuous optimizers.

1 Introduction

An analytical definition of the objective function in real-world optimization tasks is sometimes hard to obtain. Therefore, neither information about the function's smoothness nor its derivatives is available. Moreover, evaluation of the function is usually expensive, as it can only be done empirically, e.g. by measurement, testing, or running a computer simulation. Such functions are called black-box.

One class of optimization algorithms successfully applied to black-box optimization is evolution strategies. An evolution strategy (ES) is an optimization method that works on a population of candidate solutions using the evolutionary operators of selection, mutation and recombination [10, 11]. In particular, the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [4] is considered to be the state-of-the-art continuous black-box optimizer. It samples each population according to a multivariate normal distribution determined by a covariance matrix. A notable property of the algorithm is the adaptation of the covariance matrix along the path of past successful search steps.

Still, the whole population must be evaluated, which might make running the algorithm for a sufficient number of generations infeasible due to expensive evaluation. To address this issue, techniques that involve a surrogate regression model of the fitness have been proposed.

Two major requirements of incorporating a surrogate model (also called a metamodel in the literature) into evolution strategies are model management and model selection.

Model management is the task of controlling the surrogate model's impact on the algorithm's convergence by using the original fitness alongside its surrogate model in the course of the search. In evolution control, a certain fraction of individuals or generations is controlled, i.e. evaluated with the fitness function, while the remainder is evaluated with the surrogate model [8]. For example, the Metamodel-Assisted Evolution Strategy (MAES) uses a surrogate model to pre-select the most promising individuals before they enter the selection procedure of a standard ES [3].

In contrast to evolution control, the surrogate approach [2] directly optimizes the model output in an iterative procedure, thus avoiding the issue of determining the correct fraction of controlled individuals. In each iteration, a fixed number of candidate solutions is found by minimizing the model with an evolution strategy. These solutions are thereafter evaluated on the real fitness and the model is updated.

Regarding model selection, Gaussian processes (GPs) are a non-parametric regression model that is appealing for the task, as it gives its prediction in terms of a Gaussian distribution. The variance of this prediction can be utilized as a confidence measure that promotes exploration of insufficiently modelled areas.

This paper reviews two traditional algorithms interconnecting Gaussian process-based surrogate models with the CMA-ES: the Metamodel-Assisted Evolution Strategy with improved pre-selection criterion by Ulmer, Streichert and Zell [13], and the Gaussian Process Optimization Procedure (GPOP) by Büche, Schraudolph and Koumoutsakos [2].

The former is a GP-based MAES with probability of improvement (POI) as a pre-selection criterion. The latter represents the surrogate approach: in each iteration, a local GP model is built and four functions designed to balance predicted value and variance are optimized. While both algorithms are GP-based, they differ both in the model-management approach and in how they utilize the GP's confidence.

The framework COCO/BBOB (Comparing Continuous Optimizers / Black-Box Optimization Benchmarking) [7] provides an experimental setup for benchmarking black-box optimizers. In particular, its noiseless testbed [6] comprises 24 functions with properties that are to different extents challenging for continuous optimizers.

Both tested methods had been proposed before the BBOB framework originated. In the research reported in this paper, we evaluated both methods on the noiseless part of the BBOB framework. For that purpose, a new implementation was required, as the original source codes were not available to us. (Our sources are available at https://github.com/repjak/surrogate-cmaes/.)

In the following, we first briefly introduce Gaussian processes as a suitable surrogate fitness model in Section 2. An exposition of the tested methods is given in Section 3 and the experimental setup in Section 4. Section 5 presents experimental results, and finally Section 6 concludes the paper.

2 Gaussian processes

Both algorithms under review feature the choice of Gaussian processes as a surrogate fitness function model.

GPs are a probabilistic model with several properties that make it well suited for fitness function modelling: its hyperparameters are comprehensible and limited in number, and it provides a confidence measure given by the standard deviation of the predicted value at new data points.

In the following, we define a GP using notation and equations from Büche [2]. Consider f: ℝ^D → ℝ an unknown real-parameter function to be approximated. A GP model is specified by a set X_N = {x_i | x_i ∈ ℝ^D, i = 1, …, N} of N training data points with known function values t_N = {t_i | t_i = f(x_i)}_{i=1}^N. The data are assumed to be a sample of a zero-mean multivariate Gaussian distribution with joint probability density

$$p(\mathbf{t}_N \mid X_N) = \frac{\exp\!\left(-\tfrac{1}{2}\mathbf{t}_N^T C_N^{-1} \mathbf{t}_N\right)}{\sqrt{(2\pi)^N \det(C_N)}} \tag{1}$$

where the covariance matrix C_N is defined by means of a covariance function C(x_i, x_j, Θ), i, j ∈ {1, …, N}, with a fixed set of hyperparameters Θ.

For a set t_{N+1} that includes a new observation t_{N+1} = f(x_{N+1}), we obtain

$$p(\mathbf{t}_{N+1} \mid X_{N+1}) = \frac{\exp\!\left(-\tfrac{1}{2}\mathbf{t}_{N+1}^T C_{N+1}^{-1} \mathbf{t}_{N+1}\right)}{\sqrt{(2\pi)^{N+1} \det(C_{N+1})}}. \tag{2}$$

Using the Bayes rule for conditional probabilities, the prediction at a new data point has the density function

$$p(t_{N+1} \mid X_{N+1}, \mathbf{t}_N) = \frac{p(\mathbf{t}_{N+1} \mid X_{N+1})}{p(\mathbf{t}_N \mid X_N)}. \tag{3}$$

The covariance matrix C_{N+1} can be written with the use of the covariance matrix C_N as

$$C_{N+1} = \begin{pmatrix} C_N & \mathbf{k} \\ \mathbf{k}^T & \kappa \end{pmatrix} \tag{4}$$

where k = (C(x_i, x_{N+1}))_{i=1}^N is a vector of covariances between the new point and X_N, and κ = C(x_{N+1}, x_{N+1}) is the new point's variance.

Using (1) and (2) together with the fact that the inverse C_{N+1}^{-1} can also be expressed by means of C_N^{-1}, (3) can be simplified to a univariate Gaussian [2]

$$p(t_{N+1} \mid X_{N+1}, \mathbf{t}_N) \propto \exp\!\left(-\frac{1}{2}\,\frac{(t_{N+1} - \hat{t}_{N+1})^2}{\sigma_{t_{N+1}}^2}\right) \tag{5}$$

with mean and variance given by

$$\hat{t}_{N+1} = \mathbf{k}^T C_N^{-1} \mathbf{t}_N, \qquad \sigma_{t_{N+1}}^2 = \kappa - \mathbf{k}^T C_N^{-1} \mathbf{k}.$$

A comprehensive exposition of Gaussian processes can be found in [9].

The covariance function plays an important role, as it expresses prior assumptions about the shape of the modelled function. The vector Θ of the model's hyperparameters is usually fitted with the maximum likelihood method.

One of the most commonly used covariance functions is the squared exponential covariance function:

$$C(x_p, x_q) = \theta \exp\!\left(-\frac{1}{2}\,\frac{(x_p - x_q)^T (x_p - x_q)}{r^2}\right) \tag{6}$$

where the parameter θ scales the covariance between two points and the radius r is the characteristic length scale of the process. When two points are close to each other compared to the characteristic length scale r, the covariance is close to one, while it exponentially decays to zero as their distance grows.

The squared exponential covariance function can be improved with automatic relevance determination (ARD):

$$C(x_p, x_q) = \theta \exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_{p,i} - x_{q,i})^2}{r_i^2}\right) \tag{7}$$
where the radii r_i scale the impact of the two points' distance on their correlation separately in each dimension.
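To make the preceding formulas concrete, the following MATLAB sketch computes the predictive mean and variance of (5) under the ARD squared exponential covariance (7). It is only an illustration: the function names are our own, and the small jitter term added for numerical stability is not part of the equations above.

```matlab
function [mu, s2] = gpPredict(X, t, xNew, theta, r)
% GP predictive mean and variance at a single point xNew, eq. (5):
% mu = k' * inv(C) * t and s2 = kappa - k' * inv(C) * k.
%   X     ... N-by-D matrix of training points
%   t     ... N-by-1 vector of observed function values
%   xNew  ... 1-by-D query point
%   theta ... signal variance, r ... 1-by-D ARD length scales
  N = size(X, 1);
  C = zeros(N);                     % covariance matrix C_N
  for p = 1:N
    for q = 1:N
      C(p, q) = seArd(X(p,:), X(q,:), theta, r);
    end
  end
  C = C + 1e-10 * eye(N);           % jitter for numerical stability (our addition)
  k = zeros(N, 1);                  % covariances between X and the new point
  for p = 1:N
    k(p) = seArd(X(p,:), xNew, theta, r);
  end
  kappa = seArd(xNew, xNew, theta, r);
  L = chol(C, 'lower');             % Cholesky factor instead of an explicit inverse
  mu = k' * (L' \ (L \ t));         % predictive mean, eq. (5)
  v = L \ k;
  s2 = kappa - v' * v;              % predictive variance, eq. (5)
end

function c = seArd(xp, xq, theta, r)
% Squared exponential covariance with ARD, eq. (7)
  c = theta * exp(-0.5 * sum(((xp - xq) ./ r).^2));
end
```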
To address the need to balance the exploitation of a fitted model's prediction and the exploration of regions where the model's confidence is low, the standard deviation of the GP's prediction can be utilized.
One possibility is to use the merit functions proposed in [12] for the purpose of balancing exploration and exploitation in engineering design optimization. Consider M to be a trained Gaussian process model. A merit function combines the goals of finding a minimum predicted by the model M and improving the accuracy of M into a single objective function:

$$f_{M,\alpha}(x) = \hat{t}(x) - \alpha\,\sigma(x) \tag{8}$$

where t̂(x) is the mean of the prediction of the function value in x, σ(x) is the prediction standard deviation and α ≥ 0 is a balancing parameter.
Another option is the probability of improvement (POI). Let us assume the Gaussian process prediction to be a random variable Y(x) with mean t̂(x) and standard deviation σ(x). For a chosen threshold T less than or equal to the so-far best obtained fitness value f_min, the probability of improvement at point x is defined as:

$$\mathrm{POI}_T(x) = p(Y(x) \leq T) = \Phi\!\left(\frac{T - \hat{t}(x)}{\sigma(x)}\right) \tag{9}$$

where Φ is the cumulative distribution function of the distribution N(0, 1).
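For illustration, both criteria are one-liners on top of a model's predictive mean and standard deviation. The sketch below uses MATLAB's normcdf for Φ; the function name and signature are our own.

```matlab
function [merit, poi] = criteria(tHat, sigma, alpha, T)
% Merit function (8) and probability of improvement (9) at a point x,
% given the predictive mean tHat and standard deviation sigma of a GP
% model at x. A sketch with our own naming; T <= fmin and alpha >= 0.
  merit = tHat - alpha * sigma;        % f_{M,alpha}(x), eq. (8)
  poi = normcdf((T - tHat) / sigma);   % POI_T(x), eq. (9)
end
```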
3 Tested Methods

3.1 GP Model Assisted Evolution Control

A standard ES operates on λ offspring generated from µ parents by the evolutionary operators of recombination and mutation. After the fitness of the offspring is evaluated, a population of the µ best individuals is selected to reproduce new offspring in the next iteration. In a (µ, λ)-ES, the µ best individuals are selected from the λ offspring, whereas in a (µ + λ)-ES, the µ best individuals are selected from the union of the offspring and their parents.

MAES [3] modifies the standard evolution strategy by producing λ_Pre > λ instead of λ individuals from µ parents by the same operators of recombination and mutation (steps 4 and 5 in Algorithm 1). Given a fitted Gaussian process M, individuals x_i, i = 1, …, λ_Pre are then pre-selected according to a criterion χ_M defined for the model M to create λ offspring (step 6).

The GP model is trained in every generation on the set of N_tr most recently evaluated individuals (step 8).

In this paper we consider two pre-selection criteria, in accordance with [13]: mean model prediction (MMP) (5), which selects the λ points with the best mean predicted fitness t̂(x), and POI (9). The authors of [13] prefer POI to the merit functions (8), as it does not depend on finding an appropriate value of the scale parameter α.

Algorithm 1 GP Model-Assisted Evolution Strategy
Input: f – fitness function
    µ – number of parents
    λ – population size
    λ_Pre – size of the pre-selected population
    N_tr – size of the training dataset
    χ_M – the pre-selection criterion that depends on a GP model M, e.g. mean model prediction or POI
 1: Pop ← generate and evaluate λ initial samples
 2: M ← a GP model trained on points from Pop
 3: while termination criteria not reached do
 4:   Offspring ← reproduce Pop into λ_Pre new points
 5:   Offspring ← mutate Offspring
 6:   Offspring ← select the best λ points according to the pre-selection criterion χ_M
 7:   evaluate Offspring with f
 8:   M ← update model M on the N_tr points most recently evaluated with f
 9:   Pop ← select the µ points best according to f
10: end while
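A possible MATLAB rendering of the pre-selection step (step 6 of Algorithm 1) is sketched below. The function name and signature are our own, and the model is assumed to expose predictions via MATLAB's predict interface, as a fitrgp model does.

```matlab
function selected = preselect(model, offspring, lambda, criterion, fmin)
% Pre-selection step 6 of Algorithm 1 (a sketch; naming is ours).
%   model     ... trained GP model, e.g. a fitrgp object
%   offspring ... lambdaPre-by-D matrix of candidate points
%   lambda    ... number of points to keep
%   criterion ... 'MMP' or 'POI'
%   fmin      ... best fitness value sampled so far (POI threshold)
  [mu, sd] = predict(model, offspring);   % predictive mean and std. dev.
  switch criterion
    case 'MMP'    % lowest mean predicted fitness, cf. (5)
      [~, order] = sort(mu, 'ascend');
    case 'POI'    % highest probability of improvement, cf. (9)
      [~, order] = sort(normcdf((fmin - mu) ./ sd), 'descend');
    otherwise
      error('Unknown pre-selection criterion: %s', criterion);
  end
  selected = offspring(order(1:lambda), :);
end
```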
3.2 Gaussian Process Optimization Procedure

The Gaussian Process Optimization Procedure is due to Büche, Schraudolph and Koumoutsakos [2]. A Gaussian process is trained on a subset of the already evaluated data and optimized by an ES instead of the original fitness. As optimization of the surrogate is cheaper than optimization of the original fitness, this can be repeated until some termination criteria are reached.

CMA-ES is used as the evolution strategy, with the number of parents µ set to 2 and the population size λ set to 10.

The pseudocode is given in Algorithm 2. After generating an initial population, a local GP model is built and utilized in an iterative procedure. Considering the possibly low accuracy and computational infeasibility of global models, the training dataset is restricted to the N_C points closest to the current best known solution x_best (step 6) and the N_R most recently evaluated points (step 7).

If the GP model M has been successfully trained, then the CMA-ES optimizes the four merit functions f_{M,α} (8), one for each α ∈ {0, 1, 2, 4}. Areas that might be approximated inaccurately are avoided by bounding the ES to the hypercube spanning the set of N_C points selected from x_best's neighborhood (steps 10 and 12). The points that are optima of the considered merit functions are evaluated by the original fitness and added to the dataset of known points.

In the case that no new solution is found, a random perturbation (step 23) is evaluated and added to the dataset. Unfortunately, the authors do not specify the parameter m that occurs in step 23. We set it to the value m = 1.

The authors used the following covariance function in their GP model:

$$C(x_p, x_q) = \theta_1 \exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_{p,i} - x_{q,i})^2}{r_i^2}\right) + \theta_2 + \delta_{pq}\,\theta_3 \tag{10}$$

where r_i, θ_1, θ_2, θ_3 > 0 and δ_pq is the Kronecker delta. The function is the sum of the squared exponential covariance function with ARD (7), a constant shift θ_2 and white noise scaled by θ_3.
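A direct MATLAB transcription of (10) might look as follows; the function name and argument order are our own choices. The Kronecker delta compares the indices p and q of the two training points, so the noise term contributes only to the diagonal of the covariance matrix.

```matlab
function c = gpopCov(xp, xq, p, q, theta1, theta2, theta3, r)
% Covariance function (10): ARD squared exponential (7) plus a constant
% shift theta2 and white noise scaled by theta3 (a sketch; naming is ours).
%   xp, xq ... 1-by-D points, p, q ... their indices in the training set
%   r      ... 1-by-D vector of ARD length scales, theta1..theta3 > 0
  c = theta1 * exp(-0.5 * sum(((xp - xq) ./ r).^2)) ...
      + theta2 ...
      + theta3 * (p == q);   % Kronecker delta term
end
```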
Algorithm 2 Gaussian Process Optimization Procedure
Input: N_C – number of training points selected according to their distance to x_best
    N_R – number of training points selected according to the most recent time of evaluation
    f – fitness function
    µ – the number of parents for CMA-ES
    λ – population size for CMA-ES
 1: {x_1, …, x_{N_C/2}} ← a set of N_C/2 points generated by CMA-ES
 2: y_i ← f(x_i), i ∈ {1, …, N_C/2}
 3: A ← {(x_i, y_i) | i ∈ {1, …, N_C/2}}
 4: x_best ← arg min_{x ∈ {x_i | i ∈ {1,…,N_C/2}}} f(x)
 5: while stopping criteria not reached do
 6:   T_C ← the N_C points from {u | ∃z: (u, z) ∈ A} closest to x_best
 7:   T_R ← the N_R points most recently evaluated
 8:   M ← a GP model trained on T_C ∪ T_R
 9:   S ← ∅
10:   d ← (d_i)_{i=1}^D, d_i = max_{x ∈ T_C}(x_i) − min_{x ∈ T_C}(x_i)
11:   for all α ∈ {0, 1, 2, 4} do
12:     B ← {x | x_best − d/2 ≤ x ≤ x_best + d/2}
13:     x ← optimize f_{M,α} within B by (µ, λ)-CMA-ES
14:     if ∄z: (x, z) ∈ A then
15:       y ← f(x)
16:       S ← S ∪ {(x, y)}
17:     end if
18:   end for
19:   if S ≠ ∅ then
20:     A ← A ∪ S
21:     x_best ← arg min_{x ∈ {u | ∃z: (u,z) ∈ A}} f(x)
22:   else
23:     x_R ← (x_best,i + (z_i d_i)/(100 m))_{i=1}^D, z_i ∼ N(0, 1)
24:     y_R ← f(x_R)
25:     A ← A ∪ {(x_R, y_R)}
26:   end if
27: end while
28: return x_best
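The core of one GPOP iteration (steps 6–12) can be sketched in MATLAB as follows. The variable names are ours, and fitrgp merely stands in for a GP trained with covariance (10), which fitrgp does not provide out of the box.

```matlab
% A sketch of steps 6-12 of Algorithm 2; variable names are ours.
%   Xall, yall ... all evaluated points (rows of Xall) in evaluation order
%   xbest      ... current best point; NC, NR ... training set sizes

[~, byDist] = sort(sum((Xall - xbest).^2, 2));  % step 6: squared distances
TC = Xall(byDist(1:NC), :);                     % NC points closest to xbest
TR = Xall(end-NR+1:end, :);                     % step 7: NR most recent points

% step 8: GP trained on the union of both sets (duplicates not removed here);
% fitrgp stands in for a GP with covariance (10)
Xtr = [TC; TR];
ytr = [yall(byDist(1:NC)); yall(end-NR+1:end)];
model = fitrgp(Xtr, ytr, 'KernelFunction', 'ardsquaredexponential');

% steps 10 and 12: hypercube B spanned by T_C, centred at xbest
d  = max(TC, [], 1) - min(TC, [], 1);
lb = xbest - d/2;   % lower bound of B for the bounded CMA-ES run
ub = xbest + d/2;   % upper bound of B
```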
4 Experimental Setup

The framework COCO/BBOB (Comparing Continuous Optimizers / Black-Box Optimization Benchmarking) [6, 7] is intended for systematic experimentation with black-box optimizers. We use the noiseless testbed of the BBOB framework, which is comprised of 24 real-parameter benchmark functions with different characteristics such as non-separability, multi-modality, ill-conditioning, etc.

Each function is defined on [−5, 5]^D for all D ≥ 2. For every function and every dimensionality, 15 trials of the optimizer are run, with a different instance of the original function used for each trial. We used dimensionalities 2, 3, 5 and 10; thus 1440 trials in total were run for each set of parameters and each method.

Since source codes were available for neither of the tested methods, we implemented them in MATLAB. For Gaussian processes, we chose MATLAB's default implementation, fitrgp, a part of the Statistics and Machine Learning Toolbox. The fitting of the GP hyperparameters was done with a quasi-Newton optimizer, which is the default in fitrgp.
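A minimal example of fitting and querying a GP surrogate with fitrgp is given below; the data are synthetic, and the options shown are only the ones relevant to this paper, not our complete experimental configuration.

```matlab
% Minimal fitrgp usage sketch (synthetic data, not our full configuration)
X = 10 * rand(40, 5) - 5;                % 40 sample points in [-5, 5]^5
y = sum(X.^2, 2);                        % e.g. the sphere function
gpr = fitrgp(X, y, ...
    'KernelFunction', 'ardsquaredexponential', ...  % cf. eq. (7)
    'FitMethod', 'exact', ...            % exact maximum likelihood fit
    'Optimizer', 'quasinewton');         % the default hyperparameter optimizer
[tHat, sigma] = predict(gpr, zeros(1, 5));  % predictive mean and std. dev.
```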
Parameters of the benchmarked algorithms were set as follows:

CMA-ES. A multi-start version with the population size doubled after each restart was used in the tests (MATLAB code v. 3.62.beta). The number of restarts was set to 4, while the other parameter settings were left at their defaults: λ = 4 + ⌊3 log D⌋, σ_start = 8/3, µ = ⌊λ/2⌋.

MAES. We implemented the GP Model Assisted Evolution Strategy on top of a framework developed for the S-CMA-ES algorithm, which employs a GP model in conjunction with a generation evolution control [1]. The S-CMA-ES implementation allows the population sampling step of the CMA-ES to be conveniently replaced with a custom procedure. In this case, a population intended for pre-selection is sampled and processed as described in Subsection 3.1. The control is then handed back to the CMA-ES.

The number of parents and the population size were set to correspond with the CMA-ES settings. Two pre-selection criteria were tested: MMP, and POI with the threshold equal to the so-far best sampled fitness value f_min. In both cases λ_Pre was set to 3λ. The training set was comprised of the 2λ most recently evaluated points. We used the same covariance function as in the GPOP case, i.e. (10).

GPOP. We adhered to [2] in the usage of the proposed covariance function (10). The termination criteria were chosen as follows:

• the number of consecutive iterations with no new solution found is larger than 2, where the tolerance on the Euclidean distance of two points for them to be considered equal is 10^−8,

• the overall change of fitness values during the last 10 iterations is lower than 10^−9,

• the target fitness value is reached.

Training set size parameters N_C and N_R were chosen in accordance with empirical results reported in [2], particularly N_C = N_R = 5·D.

Although the performance is measured in terms of the number of fitness evaluations, other operations, such as optimizing a surrogate model, may also be costly in benchmarking scenarios. If we consider functions in 10-D, a run of GPOP on one core of a computational grid took approximately 27.8 real hours per function on average. For those reasons, we limited the maximum number of fitness evaluations to 100·D for all tested methods.
5 Experimental Results

Results from the experiments on all 24 noiseless BBOB benchmark functions are presented in Figures 1–3.

The expected running time (ERT) depicted in Figure 1 depends on a given target value, f_t = f_opt + ∆f, i.e. the true optimum f_opt of the respective benchmark function raised by a small value ∆f. The ERT is computed over all relevant trials as the number of the original function evaluations (FE) executed during each trial until the target function value f_t is reached, summed over all trials and divided by the number of successful trials, i.e. trials that actually reached f_t [7]:

$$\mathrm{ERT}(f_t) = \frac{\#\mathrm{FE}(f_{best} \geq f_t)}{\#\mathrm{succ}} \tag{11}$$
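For clarity, (11) amounts to the following computation (a sketch with our own naming):

```matlab
function ert = computeERT(fevals, success)
% Expected running time, eq. (11): function evaluations spent across all
% trials, divided by the number of trials that reached the target f_t.
%   fevals  ... evaluations used by each trial (until f_t or the budget)
%   success ... logical vector, true where the trial reached f_t
  ert = sum(fevals) / sum(success);
end
```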
                                                                       frastructure MetaCentrum, provided under the programme
        In the graphs for functions 1, 2, 5, 8, 9, 12, 14, GPOP        "Projects of Large Research, Development, and Innova-
     achieved significantly better results compared to all other       tions Infrastructures" (CESNET LM2015042), is greatly
     algorithms for some dimensions. In contrast, the differ-          appreciated.
     ences between MAES and CMA-ES are rather small, re-                  The research performed by Lukáš Bajer has been sup-
     gardless of the pre-selection criterion.                          ported by the Czech Health Research Council project
        The graphs in Figure 2 summarize the performance               NV15-33250A.
     over subgroups of the benchmark functions for the high-              The research performed by Jakub Repický was sup-
     est tested dimension 10. The graphs show the proportion           ported by SVV project number 260 224.
     of algorithm runs that reached a target value ft ∈ 10[−1..2]
     (see the figure caption for further details).
        The GPOP speedup is most eminent on the group of               References
     separable functions (functions 1–5), that is functions, op-
     timization of which can be reduced to D one-dimensional            [1] L. Bajer, Z. Pitra, and M. Holeňa. Benchmarking Gaus-
     problems [6]. On the other hand, GPOP has the worst re-                sian processes and random forests surrogate models on the
                                                                            BBOB noiseless testbed. In Proceedings of the Compan-
     sults of all methods on multi-modal functions (functions
                                                                            ion Publication of the 2015 Annual Conference on Genetic
     15–19).
                                                                            and Evolutionary Computation, GECCO Companion ’15,
        Similar results may be observed on Figure 3 that for                2015.
     each function shows the dependence of the relative best fit-
                                                                        [2] D. Büche, N. N. Schraudolph, and P. Koumoutsakos. Ac-
     ness value on the number of evaluations in 10 D. As can be             celerating evolutionary algorithms with Gaussian process
     seen on the graphs for functions 13 and 24, GPOP in some               fitness function models. IEEE Transactions on Systems,
     cases outperforms all other algorithms in early stages, but            Man and Cybernetics, 35(2):183–194, 2005.
     then gets trapped in a local minimum.                              [3] M. Emmerich, A. Giotis, M. Özdemir, T. Bäck, and
        On the graph for functions 21 on Figure 3, MAES with                K. Giannakoglou. Metamodel-assisted evolution strategies.
     MMP as the pre-selection criterion visibly outperforms all             In In Parallel Problem Solving from Nature VII, pages 361–
     other algorithms.                                                      370. Springer, 2002.
[Figure 1 plots omitted: one panel per benchmark function f1–f24, comparing CMAES, MAES-MMP, MAES-POI and GPOP over dimensions 2, 3, 5 and 10.]

Figure 1: Expected running time (ERT as log10 of the number of f-evaluations) for target function value 0.1 divided by dimension, versus dimension. Different symbols correspond to different algorithms given in the legend of f1 and f24. Values are plotted up to the highest dimensionality with at least one successful trial. Light symbols give the maximum number of function evaluations from the longest trial divided by dimension. Black stars indicate a statistically better result compared to all other algorithms with p < 0.05 using the Hommel correction.
[Figure 2 plots omitted: panels for the separable (f1–f5), moderate (f6–f9), ill-conditioned (f10–f14), multi-modal (f15–f19) and weakly structured multi-modal (f20–f24) function groups, and for all functions (f1–f24), each in 10-D.]

Figure 2: Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FE/D) for 61 targets with target precision in 10^[−1..2] for different subgroups of BBOB benchmark functions in 10-D.
[Figure 3 plots omitted: one panel per benchmark function f1–f24, comparing CMA-ES, MAES-MMP, MAES-POI and GPOP.]

Figure 3: Relative fitness of the best individual as log10 value versus the number of evaluations divided by dimension for all functions in 10-D. The best fitness value at each number of evaluations is computed as the median of all trials.
References

[1] L. Bajer, Z. Pitra, and M. Holeňa. Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion '15, 2015.
[2] D. Büche, N. N. Schraudolph, and P. Koumoutsakos. Accelerating evolutionary algorithms with Gaussian process fitness function models. IEEE Transactions on Systems, Man, and Cybernetics, 35(2):183–194, 2005.
[3] M. Emmerich, A. Giotis, M. Özdemir, T. Bäck, and K. Giannakoglou. Metamodel-assisted evolution strategies. In Parallel Problem Solving from Nature VII, pages 361–370. Springer, 2002.
[4] N. Hansen. The CMA evolution strategy: a comparing review. In J. A. Lozano, P. Larrañaga, I. Inza, and E. Bengoetxea, editors, Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms, pages 75–102. Springer, 2006.
[5] N. Hansen. Injecting external solutions into CMA-ES. CoRR, abs/1110.4181, 2011.
[6] N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. Technical Report RR-6829, INRIA, 2009. Updated February 2010.
[7] N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2012: Experimental setup. Technical report, INRIA, 2012.
[8] Y. Jin and B. Sendhoff. Fitness approximation in evolutionary computation - a survey. In GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pages 1105–1111, 2002.
[9] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning series. MIT Press, 2006.
[10] I. Rechenberg. Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.
[11] H.-P. Schwefel. Numerische Optimierung von Computer-Modellen. Birkhäuser, Basel, 1977.
[12] V. Torczon and M. W. Trosset. Using approximations to accelerate engineering design optimization. In Proceedings of the 7th AIAA/USAF/NASA/ISSMO Multidisciplinary Analysis & Optimization Symposium (held at Saint Louis, Missouri), paper 98-4800, pages 507–512, 1998.
[13] H. Ulmer, F. Streichert, and A. Zell. Evolution strategies assisted by Gaussian processes with improved pre-selection criterion. In Proceedings of the 2003 IEEE Congress on Evolutionary Computation, pages 692–699, 2003.