<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing SVM, Gaussian Process and Random Forest Surrogate Models for the CMA-ES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zbyněk Pitra</string-name>
          <email>z.pitra@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukáš Bajer</string-name>
          <email>bajer@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics, Charles University in Prague</institution>
          <addr-line>Malostranské nám. 25, 118 00 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague</institution>
          <addr-line>Břehová 7, 115 19 Prague 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Academy of Sciences of the Czech Republic</institution>
          <addr-line>Pod Vodárenskou věží 2, 182 07 Prague 8</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>National Institute of Mental Health</institution>
          <addr-line>Topolová 748, 250 67 Klecany</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>186</fpage>
      <lpage>193</lpage>
      <abstract>
        <p>In practical optimization tasks, it is increasingly frequent that the objective function is black-box, which means that it cannot be described mathematically. Such functions can be evaluated only empirically, usually through some costly or time-consuming measurement, numerical simulation, or experimental testing. Therefore, an important direction of research is the approximation of these objective functions with a suitable regression model, also called a surrogate model of the objective function. This paper evaluates two different approaches to continuous black-box optimization which both integrate surrogate models with the state-of-the-art optimizer CMA-ES. The first, a Ranking SVM surrogate model, estimates the ordering of the sampled points, as the CMA-ES utilizes only the ranking of the fitness values. However, we show that a continuous Gaussian process model provides comparable results in the early stages of the optimization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Optimization of an expensive objective or fitness function
plays an important role in many engineering and research
tasks. For such functions, it is sometimes difficult to find
an exact analytical formula, or to obtain any derivatives or
information about smoothness. Instead, values for a given
input can be obtained only through expensive
and time-consuming measurements and experiments.
Those functions are called black-box, and because of the
evaluation costs, the primary criterion for assessment of
the black-box optimizers is the number of fitness function
evaluations necessary to achieve the optimal value.</p>
      <p>
        The Covariance Matrix Adaptation Evolution Strategy
(CMA-ES) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is considered the state of the art in
black-box continuous optimization. An important
property of the CMA-ES is that it advances through the
search space only according to the ordering of the function
values in the current population. Hence, its search is
rather local, which predisposes it to premature
convergence to local optima if not used with a sufficiently large
population size. This issue resulted in the development of
several restart strategies [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], such as IPOP-CMA-ES [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
BIPOP-CMA-ES [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] performing restarts with population
size successively increased, or aCMA-ES [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] using also
unsuccessful individuals for covariance matrix adaptation.
      </p>
      <p>Furthermore, the CMA-ES often requires more fitness
function evaluations to find the optimum than many
real-world experiments can afford. In order to decrease the
number of evaluations in evolutionary algorithms, it is
convenient to periodically train a surrogate model of the fitness
function and use it to evaluate new points instead
of the original function. Alternatively, the model can be used
to select the most promising points to be
evaluated by the original fitness.</p>
      <p>
        Loshchilov’s surrogate-model-based algorithm
s∗ACM-ES [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] utilizes the former approach: it
estimates the ordering of the fitness values required by the
CMA-ES using Ranking Support Vector Machines (SVM)
as an ordinal regression model. Moreover, it has been
shown [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ] that the model parameters (hyperparameters)
used to construct Ranking SVM model can be optimized
during the search by the pure CMA-ES algorithm.
Later proposed s∗ACM-ES extensions, referred to as
s∗ACM-ES-k [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and BIPOP-s∗ACM-ES-k [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], use
a more intensive exploitation of the surrogate model by
increasing population size in generations evaluated by the
model.
      </p>
      <p>
          More recently, a similar algorithm based on a
regression surrogate model, called S-CMA-ES [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], has been
presented. As opposed to the former algorithm, the
S-CMA-ES performs continuous regression by Gaussian
processes (GP) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and random forests (RF) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        This paper compares the two mentioned surrogate
CMA-ES algorithms, s∗ACM-ES-k and S-CMA-ES, and
the original CMA-ES itself. We benchmark these
algorithms on the BBOB/COCO testing set [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ] not
only in their one-population IPOP-CMA-ES version, but
also in combination with the two-population restart strategy
BIPOP-CMA-ES.
      </p>
      <p>The remainder of the paper is organized as follows.
The next section briefly describes the tested algorithms: the
CMA-ES, the BIPOP-CMA-ES, the s∗ACM-ES-k, and the
S-CMA-ES. Section 3 contains the experimental setup and
results, and Section 4 concludes the paper and suggests
further research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Algorithms</title>
      <sec id="sec-2-1">
        <title>2.1 The CMA-ES</title>
        <p>
          In each generation g, the CMA-ES [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] generates λ new
candidate solutions xk ∈ R^D, k = 1, . . . , λ, from
a multivariate normal distribution N(m(g), (σ(g))² C(g)),
where m(g) is the mean, interpretable as the current best
estimate of the optimum, σ(g) is the step size, representing
the overall standard deviation, and C(g) is the D × D
covariance matrix. The algorithm selects the μ points with the
lowest function values from the λ generated candidates to
adjust the distribution parameters for the next generation.
        </p>
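        <p>A minimal sketch of this sampling-and-selection step follows
(Python with NumPy; the subsequent update of m, σ and C from the
selected points is omitted, and all function and variable names are ours):</p>
        <preformat>
import numpy as np

def cma_es_generation(fitness, m, sigma, C, lam, mu, rng):
    """One sampling/selection step of the CMA-ES: sample lam candidates
    from N(m, sigma^2 C) and return the mu best of them."""
    # Sample lam candidates from the multivariate normal distribution.
    X = rng.multivariate_normal(m, sigma**2 * C, size=lam)
    y = np.array([fitness(x) for x in X])
    # Keep the mu points with the lowest fitness values; only this
    # ranking information is used to adapt m, sigma and C.
    best = np.argsort(y)[:mu]
    return X[best], y[best]

rng = np.random.default_rng(42)
D = 5
sphere = lambda x: float(np.dot(x, x))
X_mu, y_mu = cma_es_generation(sphere, m=np.zeros(D), sigma=0.5,
                               C=np.eye(D), lam=4 + int(3 * np.log(D)),
                               mu=4, rng=rng)
        </preformat>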
        <p>The CMA-ES uses restart strategies to deal with
multimodal fitness landscapes and to avoid being trapped
in local optima. A multi-start strategy where the
population size is doubled in each restart is referred to as
IPOP-CMA-ES [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].</p>
      </sec>
      <sec id="sec-3-1">
        <title>2.2 BIPOP-CMA-ES</title>
        <p>
          The BIPOP-CMA-ES [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], unlike the IPOP-CMA-ES,
considers two different restart strategies. In the first one,
corresponding to the IPOP-CMA-ES, the population size is
doubled in each restart irestart, using a constant initial step
size σ⁰large = σ⁰default:

          λlarge = 2^irestart λdefault .          (1)
        </p>
        <p>In the second one, the smaller population size λsmall is
computed as

          λsmall = ⌊ λdefault ( (1/2) λlarge / λdefault )^(U[0,1]²) ⌋ ,          (2)

where U[0, 1] denotes the uniform distribution on [0, 1].
The initial step size is also drawn randomly, as

          σ⁰small = σ⁰default × 10^(−2 U[0,1]) .          (3)
        </p>
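        <p>A minimal sketch of drawing a restart's population size and
initial step size according to Eqs. (1)–(3) follows (Python; the
function and variable names are ours):</p>
        <preformat>
import numpy as np

def bipop_restart_parameters(i_restart, lam_default, sigma_default, rng):
    """Draw the restart parameters of the BIPOP-CMA-ES, Eqs. (1)-(3)."""
    # Eq. (1): large population, doubled in each restart.
    lam_large = 2**i_restart * lam_default
    # Eq. (2): small population, randomly interpolated between
    # lam_default and lam_large / 2 on a logarithmic scale.
    u = rng.uniform(0.0, 1.0)
    lam_small = int(np.floor(
        lam_default * (0.5 * lam_large / lam_default)**(u**2)))
    # Eq. (3): random initial step size for the small-population regime.
    sigma_small = sigma_default * 10**(-2 * rng.uniform(0.0, 1.0))
    return lam_large, lam_small, sigma_small

rng = np.random.default_rng(1)
print(bipop_restart_parameters(3, lam_default=8, sigma_default=2.0, rng=rng))
        </preformat>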
        <p>The BIPOP-CMA-ES performs the first run using
the default population size λdefault and the initial step
size σ⁰default. In each of the following restarts, the strategy with
fewer function evaluations summed over all algorithm runs
so far is selected.</p>
      </sec>
      <sec id="sec-4">
        <title>2.3 s∗ACM-ES-k</title>
      <p>
          Loshchilov's version of the CMA-ES, which uses ordinal
regression by Ranking SVM as a surrogate model in specific
generations instead of the original function, is referred to as
s∗ACM-ES [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and its extension using a more intensive
exploitation is called s∗ACM-ES-k [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Before the main loop starts, the s∗ACM-ES-k
evaluates gstart generations by the original function; then it
repeats the following steps. First, the surrogate model
is constructed using hyperparameters θ and the original
function-evaluated points from previous generations.
Second, the surrogate model is optimized by the CMA-ES
for gm generations with population size λ = kλ λdefault and
number of selected points μ = kμ μdefault, where kλ, kμ ≥ 1.
Third, the following generation is evaluated by the
original function using λ = λdefault and μ = μdefault. To avoid a
potential divergence when gm fluctuates between 0 and 1,
kλ &gt; 1 is used only in the case gm ≥ gmλ, where gmλ
denotes the number of generations suitable for effective
exploitation of the model. Then the model error is
calculated by comparing the rankings produced by the
original and the model evaluation of the last generation. After
that, gm is adjusted in accordance with the model error.
As the last step, the s∗ACM-ES-k searches the
hyperparameter space by one generation of the CMA-ES minimizing
the model error, to find the most suitable hyperparameter
settings θnew for the next model-evaluated generations.</p>
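      <p>The ranking-based model error can be illustrated by the
following sketch (Python). The fraction of discordant pairs is a
simple stand-in for the weighted rank measure actually used by the
s∗ACM-ES family, and adjust_g_m is a hypothetical adaptation rule,
not Loshchilov's exact formula:</p>
      <preformat>
import numpy as np

def rank_error(y_true, y_model):
    """Disagreement between the original and the model ranking of the
    same points: the fraction of discordant pairs."""
    r_true = np.argsort(np.argsort(y_true))
    r_model = np.argsort(np.argsort(y_model))
    n = len(y_true)
    discordant, pairs = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            if (r_true[i] - r_true[j]) * (r_model[i] - r_model[j]) >= 0:
                continue  # concordant pair
            discordant += 1
    return discordant / pairs

def adjust_g_m(err, g_m_max=5, err_threshold=0.2):
    """Hypothetical rule: the smaller the ranking error, the more
    consecutive generations the surrogate model is trusted for."""
    return int(round(g_m_max * max(0.0, 1.0 - err / err_threshold)))

y_original = np.array([3.0, 1.0, 2.0, 5.0, 4.0])
y_surrogate = np.array([2.9, 1.2, 2.2, 4.0, 4.5])
print(adjust_g_m(rank_error(y_original, y_surrogate)))
      </preformat>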
      <p>
        The s∗ACM-ES-k version using BIPOP-CMA-ES
proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is called BIPOP-s∗ACM-ES-k.
      </p>
      </sec>
      <sec id="sec-4-2">
        <title>2.4 S-CMA-ES</title>
        <p>
As opposed to the former algorithms, a different
approach to surrogate model usage is incorporated in the
S-CMA-ES [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The algorithm is a modification of the
CMA-ES in which the original sampling and evaluation
phases are replaced by Algorithm 1 at the beginning
of each CMA-ES generation.
      </p>
      <p>In order to avoid false convergence of the algorithm
in the BBOB benchmarking toolbox, the model-predicted
values are adjusted so that they are never lower than the
minimum of the original function found so far (see step 17
in the pseudocode of Algorithm 1).</p>
      <p>The main difference between the S-CMA-ES and the
s∗ACM-ES-k lies in the manner in which the CMA-ES is
utilized. In the S-CMA-ES, model training or prediction
is performed within each generation of the CMA-ES.
In the s∗ACM-ES-k, by contrast, separate runs of
CMA-ES generations are launched to optimize either the
original fitness, the surrogate fitness, or the model
hyperparameters themselves.</p>
      <sec id="sec-4-1">
        <title>Experimental Evaluation</title>
        <p>
          The core of this paper lies in a systematic comparison of
the two mentioned approaches to using surrogate
models with the CMA-ES and the original CMA-ES
algorithm itself. The first group of surrogate-based
algorithms is formed by the S-CMA-ES algorithms using
Gaussian processes and random forests models, and the
other group is formed by the s∗ACM-ES algorithm. These
four algorithms (CMA-ES, GP-CMA-ES, RF-CMA-ES,
s∗ACM-ES) are tested in their IPOP version (based on
IPOP-CMA-ES) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and in the bi-population restart
strategy version (based on BIPOP-CMA-ES and its
derivatives) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <sec id="sec-4-1-1">
          <title>3.1 Experimental Setup</title>
          <p>
            The experimental evaluation is performed through the
noiseless part of the COCO/BBOB framework
(COmparing Continuous Optimizers / Black-Box Optimization
Benchmarking) [
            <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
            ]. It is a collection of 24 benchmark
functions with different degrees of smoothness,
uni-/multimodality, separability, conditioning, etc. Each function is
defined for any dimension D ≥ 2; the dimensions used for
our tests are 2, 5, 10, and 20. The set of functions
comprises, among others, well-known continuous optimization
benchmarks like the ellipsoid, Rosenbrock's, Rastrigin's,
Schwefel's, or Weierstrass' function.
          </p>
        <sec id="sec-5">
          <title>Algorithm 1: Surrogate CMA-ES Algorithm [3]</title>
          <p>Input: g (generation), gm (number of model generations),
σ, λ, m, C (CMA-ES internal variables),
r (maximal distance between training points and m),
nREQ (minimal number of points for model training),
nMAX (maximal number of points for model training),
A (archive), fM (model), f (original fitness function)
 1: xk ∼ N(m, σ²C), k = 1, . . . , λ  {CMA-ES sampling}
 2: if g is original-evaluated then
 3:   yk ← f(xk), k = 1, . . . , λ  {fitness evaluation}
 4:   A ← A ∪ {(xk, yk)}, k = 1, . . . , λ
 5:   (Xtr, ytr) ← {(x, y) ∈ A | (m − x)ᵀ σ C^(−1/2) (m − x) ≤ r}
 6:   if |Xtr| ≥ nREQ then
 7:     (Xtr, ytr) ← choose nMAX points if |Xtr| &gt; nMAX
 8:     {transformation to the eigenvector basis:}
        Xtr ← {(σ C^(−1/2))ᵀ xtr for each xtr ∈ Xtr}
 9:     fM ← trainModel(Xtr, ytr)
10:     mark (g + 1) as model-evaluated
11:   else
12:     mark (g + 1) as original-evaluated
13:   end if
14: else
15:   xk ← (σ C^(−1/2))ᵀ xk, k = 1, . . . , λ
16:   yk ← fM(xk), k = 1, . . . , λ  {model evaluation}
17:   {shift yk values if (min yk) &lt; best y from A:}
      yk ← yk + max{0, minA y − min yk}, k = 1, . . . , λ
18:   if gm model generations have passed then
19:     mark (g + 1) as original-evaluated
20:   end if
21: end if</p>
          <p>Output: fM, A, (yk), k = 1, . . . , λ</p>
        </sec>
      <p>The framework calls the optimizers on 15 different
instances for each function and dimension, meaning that
1440 optimization runs were called for each of the eight
considered algorithms. The graphs at the end of the
paper show detailed results in a per-function and
per-group-of-functions manner. The following paragraphs summarize
the parameters of the algorithms.</p>
      <p>
        The CMA-ES. The original CMA-ES was used in its
IPOP-CMA-ES version (Matlab code v. 3.61) with
number of restarts = 4, IncPopSize = 2, σstart = 38 , λ = 4 +
⌊3 log D⌋. The remaining settings were left at their default values.
s∗ACM-ES. We have used Loshchilov’s GECCO 2013
Matlab code xacmes.m [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in its s∗ACM-ES version,
setting the parameters CMAactive = 1, newRestartRules = 0
and withSurr = 1, modelType = 1, withModelEnsembles =
0, withModelOptimization = 1, hyper_lambda = 20, λMult
= 1, μMult = 1 and ΛminIter = 4.
      </p>
      <p>
        S-CMA-ES: GP5-CMA-ES and RF5-CMA-ES. The
number after GP/RF in the names of the algorithms denotes
the number of model-evaluated generations gm, i.e. the number
of consecutive generations evaluated by the model. All considered
S-CMA-ES versions use the distance r = 8 (see Algorithm 1). For the
GP model, the Matérn ν = 5/2 covariance function with starting
values (σn², ℓ, σf²) = log(0.01, 2, 0.5) has been used (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
for the details). We have tested an RF comprising 100
regression trees, each with at least two training points per
leaf. The CMA-ES parameters (IPOP version, σstart,
λ, IncPopSize, etc.) were the same as in the pure
CMA-ES experiments. All S-CMA-ES parameter values
were chosen according to preliminary testing on several
functions from the COCO/BBOB framework.
      </p>
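      <p>For illustration, analogous GP and RF surrogates can be set
up as follows (a sketch in Python with scikit-learn rather than the
Matlab code used in [3]; the kernel mirrors the Matérn ν = 5/2
setting above, and the training data are hypothetical):</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (ConstantKernel, Matern,
                                              WhiteKernel)

# Matern 5/2 covariance with signal variance 0.5, length scale 2 and
# noise variance 0.01 as starting values; they are further optimized
# when the model is fitted.
kernel = (ConstantKernel(constant_value=0.5)
          * Matern(length_scale=2.0, nu=2.5)
          + WhiteKernel(noise_level=0.01))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

# RF with 100 regression trees and at least two training points per leaf.
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2)

rng = np.random.default_rng(0)
X_tr = rng.standard_normal((30, 5))       # hypothetical archive points
y_tr = np.sum(X_tr**2, axis=1)            # and their fitness values
gp.fit(X_tr, y_tr)
rf.fit(X_tr, y_tr)
y_pred = gp.predict(rng.standard_normal((8, 5)))
      </preformat>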
      <p>BIPOP version of the algorithms. The bi-population
versions BIPOP-CMA-ES and BIPOP-s∗ACM-ES use the
same Loshchilov’s Matlab code xacmes.m with the
parameter BIPOP = 1. The BIPOP-GP5-CMA-ES and
BIPOP-RF5-CMA-ES algorithms are constructed in the
same manner as the S-CMA-ES was transformed from the
CMA-ES: by integrating Algorithm 1 into every
generation of the BIPOP-CMA-ES.</p>
        </sec>
        <sec id="sec-6-1">
          <title>3.2 Results</title>
          <p>The performance of the algorithms is compared in the
graphs in Figures 1–3. The graphs in Figure 1
depict the expected running time (ERT), which depends on
a given target function value ft = fopt + Δf – the true
optimum fopt of the respective benchmark function raised by
a small value Δf. The ERT is computed over all relevant
trials as the number of the original function evaluations
(FEs) executed during each trial until the best function
value reached ft, summed over all trials and divided by
the number of trials that actually reached ft [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].</p>
          <p>[Figure 1: ERT graphs per function; recoverable panel
titles: 1 Sphere, 2 Ellipsoid separable, 3 Rastrigin separable,
4 Skew Rastrigin-Bueche separable, 5 Linear slope, 6 Attractive
sector, 7 Step-ellipsoid, 8 Rosenbrock original, 9 Rosenbrock
rotated, 10 Ellipsoid, . . . ]</p>
          <p>[Figure 2: proportion of reached targets against log10 of
(# f-evals / dimension) for all functions and subgroups
(separable fcts, ill-conditioned fcts, . . . ) in 5-D, for the
algorithms saACMES, GP5-CMAES, RF5-CMAES, BIPOP-saACMES,
BIPOP-GP5, and BIPOP-RF5. The targets are chosen from
10^[−8..2] such that the bestGECCO2009 artificial algorithm
just failed to reach them within a given budget of k × DIM.]</p>
          <p>[Figure 3: the same proportions of reached targets for all
functions and subgroups in 20-D.]</p>
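          <p>A minimal sketch of the ERT computation as defined above
(Python; the trial data are hypothetical):</p>
          <preformat>
import numpy as np

def expected_running_time(fes_per_trial, reached):
    """ERT for one target ft: function evaluations summed over all
    trials (successful or not), divided by the number of trials that
    actually reached ft."""
    n_success = int(np.sum(reached))
    if n_success == 0:
        return np.inf
    return float(np.sum(fes_per_trial)) / n_success

# FEs spent per trial until ft was reached (or the budget exhausted).
fes = np.array([120, 500, 340, 500, 210])
hit = np.array([True, False, True, False, True])
print(expected_running_time(fes, hit))  # (120+500+340+500+210) / 3
          </preformat>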
      <p>As we can see in Figure 1, the 24 functions can be
roughly divided into two groups according to the
algorithm which performed best (at least in 10D and 20D).
The first group, where the CMA-ES performed
best, consists of functions 1, 3, 4, 6, and 20, while on
functions 2, 5, 7, 10, 11, 13–16, 18, 21, 23, and 24, the
GP5-CMA-ES is usually better. The usage of the BIPOP versions
generally leads to no improvement, or even to a performance
decrease.</p>
      <p>The graphs in Figures 2 and 3 summarize the
performance over subgroups of the benchmark functions and
show the proportion of algorithm runs that indeed reached the
target value ft ∈ 10^[−8..2] (ft was actually
different for each respective function, see the figure captions).
Roughly speaking, the higher the colored line, the better
the performance of the algorithm for the number of
original evaluations given on the horizontal axis.</p>
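      <p>The plotted proportion can be sketched as follows (Python;
the run data are hypothetical, with np.inf marking runs that never
reached the target):</p>
      <preformat>
import numpy as np

def proportion_reached(evals_to_target, budgets):
    """For each budget, the fraction of runs whose target was reached
    within that many original function evaluations."""
    evals = np.asarray(evals_to_target, dtype=float)
    return np.array([float(np.mean(budget >= evals)) for budget in budgets])

runs = [35, 120, np.inf, 60, 400, np.inf]
print(proportion_reached(runs, budgets=[50, 100, 1000]))
      </preformat>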
      <p>
        Thus we can see that our GP5-CMA-ES usually
outperforms the other algorithms under the evaluation
budget FEs ≤ 10^1.5 D, i.e. FEs ≤ 150 for 5D and
FEs ≤ 600 for 20D. However, as the number of
considered original evaluations rises, the original CMA-ES or the
s∗ACM-ES usually performs better. This can be
summarized as follows: our GP5-CMA-ES is convenient especially
for applications where only a very low number of function
evaluations is available, such as in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-8-1">
        <title>Conclusions &amp; Future Work</title>
        <p>
          In this paper, we have compared the surrogate-assisted
S-CMA-ES, which uses GP and RF continuous regression
models, with the s∗ACM-ES-k algorithm, based on ordinal
regression by Ranking SVM, and with the original CMA-ES, all
in their IPOP and BIPOP versions. The comparison shows
that the Gaussian-process S-CMA-ES usually outperforms the
ordinal-based s∗ACM-ES-k in the early stages of the
search, especially on multimodal functions (BBOB
functions 15–24). However, the algorithms and surrogate
models should be further analyzed and compared since,
for example, the NEWUOA [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] or SMAC [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ]
algorithms spend a considerably lower number of function
evaluations than the CMA-ES in these early optimization
phases. The BIPOP versions of the algorithms did not
improve the performance of the corresponding IPOP versions,
except on BBOB function 5.
        </p>
      <p>A natural way of improving the S-CMA-ES is to
make the number of model-evaluated generations
self-adaptive. We will additionally investigate different
properties of continuous and ordinal regression in view of their
applicability as regression models, and further identify cases and
benchmarks where ordinal regression is clearly
superior to continuous regression. Moreover, hybrid surrogate
models combining both kinds of regression will be
attempted.</p>
      <sec id="sec-9-1">
        <title>Acknowledgements</title>
        <p>This work was supported by the Czech Science
Foundation (GAČR) grant P103/13-17187S, by the Grant Agency
of the Czech Technical University in Prague with its grant
No. SGS14/205/OHK4/3T/14, and by the project
“National Institute of Mental Health (NIMH-CZ)”, grant
number CZ.1.05/2.1.00/03.0078, and the European Regional
Development Fund. Further, access to computing and
storage facilities owned by parties and projects
contributing to the National Grid Infrastructure MetaCentrum,
provided under the programme “Projects of Large
Infrastructure for Research, Development, and Innovations”
(LM2010005), is greatly appreciated.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Auger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hansen</surname>
          </string-name>
          , N.:
          <article-title>A restart CMA evolution strategy with increasing population size</article-title>
          .
          <source>In: The 2005 IEEE Congress on Evolutionary Computation</source>
          <volume>2</volume>
          ,
          <fpage>1769</fpage>
          -
          <lpage>1776</lpage>
          , IEEE, Sept. 2005
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Baerns</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holeňa</surname>
          </string-name>
          , M.:
          <article-title>Combinatorial development of solid catalytic materials. Design of high-throughput experiments, data analysis, data mining</article-title>
          . Imperial College Press / World Scientific, London,
          <year>2009</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bajer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pitra</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , Holeňa, M.:
          <article-title>Benchmarking gaussian processes and random forests surrogate models on the BBOB noiseless testbed</article-title>
          .
          <source>In: Proceedings of the 17th GECCO Conference Companion</source>
          , Madrid,
          <year>July 2015</year>
          , ACM, New York
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Classification and regression trees</article-title>
          .
          <source>Chapman &amp; Hall/CRC</source>
          , 1984
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>The CMA evolution strategy: A comparing review</article-title>
          . In: J. A.
          <string-name>
            <surname>Lozano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Larranaga</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Inza</surname>
          </string-name>
          , E. Bengoetxea, (eds),
          <source>Towards a New Evolutionary Computation, 192 in Studies in Fuzziness and Soft Computing</source>
          ,
          <fpage>75</fpage>
          -
          <lpage>102</lpage>
          , Springer Berlin Heidelberg, Jan. 2006
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed</article-title>
          .
          <source>In: Proceedings of the 11th Annual GECCO Conference Companion: Late Breaking Papers</source>
          , GECCO'
          <volume>09</volume>
          ,
          <fpage>2389</fpage>
          -
          <lpage>2396</lpage>
          , New York, NY, USA,
          <year>2009</year>
          , ACM
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ros</surname>
          </string-name>
          , R.:
          <article-title>Real-parameter black-box optimization benchmarking 2012: Experimental setup</article-title>
          .
          <source>Technical Report</source>
          , INRIA,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions</article-title>
          .
          <source>Technical Report RR-6829</source>
          , INRIA,
          <year>2009</year>
          , updated February 2010
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ros</surname>
          </string-name>
          , R.:
          <article-title>Benchmarking a weighted negative covariance matrix update on the BBOB-2010 noiseless testbed</article-title>
          .
          <source>In: Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation</source>
          , GECCO'
          <volume>10</volume>
          ,
          <fpage>1673</fpage>
          -
          <lpage>1680</lpage>
          , New York, NY, USA,
          <year>2010</year>
          , ACM
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
          </string-name>
          , K.:
          <article-title>Sequential modelbased optimization for general algorithm configuration</article-title>
          . In: C. Coello, (ed.),
          <source>Learning and Intelligent Optimization</source>
          <year>2011</year>
          , volume
          <volume>6683</volume>
          of Lecture Notes in Computer Science,
          <volume>507</volume>
          -
          <fpage>523</fpage>
          , Springer Berlin Heidelberg,
          <year>2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoos</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leyton-Brown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>An evaluation of sequential model-based optimization for expensive blackbox functions</article-title>
          .
          <source>In: Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation</source>
          ,
          <source>GECCO'13 Companion</source>
          ,
          <fpage>1209</fpage>
          -
          <lpage>1216</lpage>
          , New York, NY, USA,
          <year>2013</year>
          , ACM
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Loshchilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schoenauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Alternative restart strategies for CMA-ES</article-title>
          . In: C.
          <string-name>
            <surname>A. C. Coello</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Cutello</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Deb</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Forrest</surname>
            , G. Nicosia, and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Pavone</surname>
          </string-name>
          , (eds),
          <source>PPSN (1)</source>
          , volume
          <volume>7491</volume>
          of Lecture Notes in Computer Science,
          <volume>296</volume>
          -
          <fpage>305</fpage>
          , Springer,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Loshchilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schoenauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Self-adaptive surrogate-assisted covariance matrix adaptation evolution strategy</article-title>
          .
          <source>In: Proceedings of the 14th GECCO</source>
          , GECCO '
          <volume>12</volume>
          ,
          <fpage>321</fpage>
          -
          <lpage>328</lpage>
          , New York, NY, USA,
          <year>2012</year>
          , ACM
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Loshchilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schoenauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>BI-population CMA-ES algorithms with surrogate models and line searches</article-title>
          .
          <source>In: Genetic and Evolutionary Computation Conference (GECCO Companion)</source>
          ,
          <fpage>1177</fpage>
          -
          <lpage>1184</lpage>
          , ACM Press,
          <year>July 2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Loshchilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schoenauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES)</article-title>
          .
          <source>In: Genetic and Evolutionary Computation Conference (GECCO)</source>
          ,
          <fpage>439</fpage>
          -
          <lpage>446</lpage>
          , ACM Press,
          <year>July 2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Powell</surname>
            ,
            <given-names>M. J. D.</given-names>
          </string-name>
          :
          <article-title>The NEWUOA software for unconstrained optimization without derivatives</article-title>
          . In: G. D.
          <string-name>
            <surname>Pillo</surname>
          </string-name>
          , M. Roma, (eds),
          <source>Large-Scale Nonlinear Optimization, number 83 in Nonconvex Optimization and Its Applications</source>
          ,
          <fpage>255</fpage>
          -
          <lpage>297</lpage>
          ,
          <string-name>
            <surname>Springer</surname>
            <given-names>US</given-names>
          </string-name>
          ,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C. K. I.</given-names>
          </string-name>
          :
          <article-title>Gaussian processes for machine learning</article-title>
          .
          <source>Adaptative Computation and Machine Learning Series</source>
          , MIT Press,
          <year>2006</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>