Comparing SVM, Gaussian Process and Random Forest Surrogate Models for the CMA-ES

Zbyněk Pitra (z.pitra@gmail.com)
National Institute of Mental Health, Topolová 748, 250 67 Klecany, Czech Republic
Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Břehová 7, 115 19 Prague 1, Czech Republic

Lukáš Bajer (bajer@cs.cas.cz)
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží, 182 07 Prague 8, Czech Republic
Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, 118 00 Prague 1, Czech Republic

Martin Holeňa (holena@cs.cas.cz)
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží, 182 07 Prague 8, Czech Republic


In practical optimization tasks, the objective function is increasingly often a black box, meaning that it cannot be described mathematically. Such functions can be evaluated only empirically, usually through some costly or time-consuming measurement, numerical simulation or experimental testing. Therefore, an important direction of research is the approximation of these objective functions with a suitable regression model, also called a surrogate model of the objective function. This paper evaluates two different approaches to continuous black-box optimization, both of which integrate surrogate models with the state-of-the-art optimizer CMA-ES. The first, the Ranking SVM surrogate model, estimates the ordering of the sampled points, since the CMA-ES utilizes only the ranking of the fitness values. However, we show that a continuous Gaussian process model provides comparable results in the early stages of the optimization.

Introduction

Optimization of an expensive objective or fitness function plays an important role in many engineering and research tasks. For such functions, it is sometimes difficult to find an exact analytical formula, or to obtain any derivatives or information about smoothness. Instead, values for a given input can be obtained only through expensive and time-consuming measurements and experiments. Such functions are called black-box, and because of the evaluation costs, the primary criterion for assessing black-box optimizers is the number of fitness function evaluations necessary to achieve the optimal value.

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [5] is considered to be the state of the art in continuous black-box optimization. An important property of the CMA-ES is that it advances through the search space only according to the ordering of the function values in the current population. Hence, the search of the algorithm is rather local, which predisposes it to premature convergence in local optima if it is not used with a sufficiently large population size. This issue resulted in the development of several restart strategies [12], such as IPOP-CMA-ES [1] and BIPOP-CMA-ES [6], which perform restarts with a successively increased population size, or aCMA-ES [9], which also uses unsuccessful individuals for covariance matrix adaptation.

Furthermore, the CMA-ES often requires more fitness function evaluations to find the optimum than many real-world experiments can offer. To decrease the number of evaluations in evolutionary algorithms, it is convenient to periodically train a surrogate model of the fitness function and use it to evaluate new points instead of the original function. A second option is to use the model to select the most promising points to be evaluated by the original fitness.

Loshchilov's surrogate-model-based algorithm s*ACM-ES [13] utilizes the former approach: it estimates the ordering of the fitness values required by the CMA-ES using Ranking Support Vector Machines (SVM) as an ordinal regression model. Moreover, it has been shown [13] that the model parameters (hyperparameters) used to construct the Ranking SVM model can be optimized during the search by the pure CMA-ES algorithm. The later proposed s*ACM-ES extensions, referred to as s*ACM-ES-k [15] and BIPOP-s*ACM-ES-k [14], exploit the surrogate model more intensively by increasing the population size in generations evaluated by the model.

More recently, a similar algorithm based on a regression surrogate model, called S-CMA-ES [3], has been presented. As opposed to the former algorithm, the S-CMA-ES performs continuous regression by Gaussian processes (GP) [17] and random forests (RF) [4].

This paper compares the two mentioned surrogate CMA-ES algorithms, s*ACM-ES-k and S-CMA-ES, and the original CMA-ES itself. We benchmark these algorithms on the BBOB/COCO testing set [7,8] not only in their single-population IPOP-CMA-ES version, but also in combination with the two-population-size BIPOP-CMA-ES.

The remainder of the paper is organized as follows. The next section briefly describes the tested algorithms: the CMA-ES, the BIPOP-CMA-ES, the s*ACM-ES-k, and the S-CMA-ES. Section 3 contains the experimental setup and results, and Section 4 concludes the paper and suggests further research directions.

Algorithms

The CMA-ES

In each generation $g$, the CMA-ES [5] generates $\lambda$ new candidate solutions $x_k \in \mathbb{R}^D$, $k = 1, \dots, \lambda$, from a multivariate normal distribution $N(m^{(g)}, {\sigma^{(g)}}^2 C^{(g)})$, where $m^{(g)}$ is the mean, interpretable as the current best estimate of the optimum, $\sigma^{(g)}$ is the step size, representing the overall standard deviation, and $C^{(g)}$ is the $D \times D$ covariance matrix. The algorithm selects the $\mu$ points with the lowest function value from the $\lambda$ generated candidates to adjust the distribution parameters for the next generation.
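The sampling and selection step described above can be illustrated with a minimal sketch. It is not the reference CMA-ES implementation: the function names, the hand-written Cholesky factorization and the sphere test function are assumptions made for illustration, and the update of the mean, step size and covariance matrix is omitted entirely.

```python
import math
import random


def cholesky(C):
    """Return lower-triangular L with L * L^T == C (C symmetric positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L


def sample_population(m, sigma, C, lam, rng):
    """Draw lam candidates x_k ~ N(m, sigma^2 C) via x = m + sigma * L z."""
    L = cholesky(C)
    D = len(m)
    population = []
    for _ in range(lam):
        z = [rng.gauss(0.0, 1.0) for _ in range(D)]
        x = [m[i] + sigma * sum(L[i][j] * z[j] for j in range(i + 1))
             for i in range(D)]
        population.append(x)
    return population


def select_best(population, f, mu):
    """Return the mu candidates with the lowest function value."""
    return sorted(population, key=f)[:mu]


def sphere(x):
    """Hypothetical test function, used only for this illustration."""
    return sum(v * v for v in x)


rng = random.Random(0)
pop = sample_population(m=[1.0, 1.0], sigma=0.5,
                        C=[[1.0, 0.0], [0.0, 1.0]], lam=8, rng=rng)
parents = select_best(pop, sphere, mu=4)
```

Only the ordering of the function values enters `select_best`, which is the property the surrogate models discussed later rely on.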

The CMA-ES uses restart strategies to deal with multimodal fitness landscapes and to avoid being trapped in local optima. A multi-start strategy where the population size is doubled in each restart is referred to as IPOP-CMA-ES [1].

BIPOP-CMA-ES

The BIPOP-CMA-ES [6], unlike the IPOP-CMA-ES, considers two different restart strategies. In the first one, corresponding to the IPOP-CMA-ES, the population size is doubled in each restart $i_{\text{restart}}$ using a constant initial step-size $\sigma^0_{\text{large}} = \sigma^0_{\text{default}}$:

$$\lambda_{\text{large}} = 2^{i_{\text{restart}}} \lambda_{\text{default}}. \tag{1}$$

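In the second one, the smaller population size $\lambda_{\text{small}}$ is computed as

$$\lambda_{\text{small}} = \left\lfloor \lambda_{\text{default}} \left( \frac{1}{2} \frac{\lambda_{\text{large}}}{\lambda_{\text{default}}} \right)^{U[0,1]^2} \right\rfloor, \tag{2}$$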

where $U[0,1]$ denotes the uniform distribution in $[0,1]$. The initial step-size is also randomly drawn as

$$\sigma^0_{\text{small}} = \sigma^0_{\text{default}} \times 10^{-2U[0,1]}. \tag{3}$$

The BIPOP-CMA-ES performs the first run using the default population size $\lambda_{\text{default}}$ and the initial step-size $\sigma^0_{\text{default}}$. In the following restarts, the strategy that has so far used fewer function evaluations, summed over all algorithm runs, is selected.
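As a hedged illustration of the restart rules in Equations (1)-(3), the following sketch computes the restart parameters. The function and argument names are hypothetical, the uniform draw is passed in explicitly as `u` so that the formulas can be checked by hand, and the tie-breaking rule in `next_regime` is an assumption that the text does not specify.

```python
import math


def lambda_large(lambda_default, i_restart):
    """Equation (1): population size doubled in each restart."""
    return 2 ** i_restart * lambda_default


def lambda_small(lambda_default, lam_large, u):
    """Equation (2); u is one draw from U[0,1]."""
    ratio = 0.5 * lam_large / lambda_default
    return math.floor(lambda_default * ratio ** (u ** 2))


def sigma0_small(sigma0_default, u):
    """Equation (3); u is one draw from U[0,1]."""
    return sigma0_default * 10 ** (-2 * u)


def next_regime(evals_large, evals_small):
    """Select the regime that has used fewer function evaluations so far.

    Ties are resolved in favour of the large regime (an assumption).
    """
    return "large" if evals_large <= evals_small else "small"
```

For example, with $\lambda_{\text{default}} = 10$ and three restarts, `lambda_large(10, 3)` gives 80, and the small population size then ranges between 10 (for `u = 0`) and 40 (for `u = 1`).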

s*ACM-ES-k

Loshchilov's version of the CMA-ES, which uses ordinal regression by Ranking SVM as a surrogate model in specific generations instead of the original function, is referred to as s*ACM-ES [13], and its extension using a more intensive exploitation is called s*ACM-ES-k [15].

Before the main loop starts, the s*ACM-ES-k evaluates $g_{\text{start}}$ generations by the original function; then it repeats the following steps. First, the surrogate model is constructed using the hyperparameters $\theta$ and the original-function-evaluated points from previous generations. Second, the surrogate model is optimized by the CMA-ES for $g_m$ generations with population size $\lambda = k_\lambda \lambda_{\text{default}}$ and number of best points $\mu = k_\mu \mu_{\text{default}}$, where $k_\lambda, k_\mu \ge 1$. Third, the following generation is evaluated by the original function using $\lambda = \lambda_{\text{default}}$ and $\mu = \mu_{\text{default}}$. To avoid a potential divergence when $g_m$ fluctuates between 0 and 1, $k_\lambda > 1$ is used only if $g_m \ge g_{m_\lambda}$, where $g_{m_\lambda}$ denotes the number of generations suitable for effective exploitation using the model. Then the model error is calculated by comparing the rankings of the last generation under the original and the model evaluation. After that, $g_m$ is adjusted in accordance with the model error. As the last step, the s*ACM-ES-k searches the hyperparameter space by one generation of the CMA-ES minimizing the model error, in order to find the most suitable hyperparameter settings $\theta_{\text{new}}$ for the next model-evaluated generations.
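The text above states only that the model error compares the ranking of the last generation under the original and the model evaluation. One possible way to quantify such a comparison is the fraction of discordant pairs, sketched below. This particular measure and the function name are assumptions for illustration and need not coincide with the error used in [13, 15].

```python
from itertools import combinations


def ranking_error(y_true, y_model):
    """Fraction of point pairs ordered differently by the model and the original.

    Returns 0.0 for identical rankings and 1.0 for fully reversed rankings.
    Pairs tied in either evaluation are not counted as discordant.
    """
    pairs = list(combinations(range(len(y_true)), 2))
    if not pairs:
        return 0.0
    discordant = sum(
        1 for i, j in pairs
        if (y_true[i] - y_true[j]) * (y_model[i] - y_model[j]) < 0
    )
    return discordant / len(pairs)
```

Because the measure depends only on orderings, any monotone transformation of the model output leaves it unchanged, which matches the rank-based nature of the CMA-ES.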

The s*ACM-ES-k version using the BIPOP-CMA-ES, proposed in [14], is called BIPOP-s*ACM-ES-k.

S-CMA-ES

As opposed to the former algorithms, the S-CMA-ES [3] uses the surrogate model in a different way. The algorithm is a modification of the CMA-ES in which the original sampling and evaluation phases are replaced by Algorithm 1 at the beginning of each CMA-ES generation.

In order to avoid false convergence of the algorithm in the BBOB benchmarking toolbox, the model-predicted values are adapted so that they are never lower than the minimum of the original function found so far (see step 17 in the pseudocode).
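A minimal sketch of such an adaptation is given below. It assumes that the predictions are shifted by a common offset, which keeps their ranking intact; the text does not spell out the exact rule, so both the shift and the function name are assumptions.

```python
def adapt_predictions(y_pred, best_so_far):
    """Shift model predictions so that none is below the best original value.

    A common offset is added (an assumption), so the ordering of the
    predictions, which is all the CMA-ES uses, is preserved.
    """
    shift = max(0.0, best_so_far - min(y_pred))
    return [y + shift for y in y_pred]
```

For instance, predictions `[1.0, 2.0, 3.0]` with a best original value of 1.5 become `[1.5, 2.5, 3.5]`, while predictions that are already above the best original value are returned unchanged.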

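The main difference between the S-CMA-ES and the s*ACM-ES-k lies in how the CMA-ES is utilized. In the S-CMA-ES, the model prediction or training is performed within each generation of the CMA-ES. In the s*ACM-ES-k, on the contrary, individual generations of the CMA-ES are started to optimize either the original fitness, the surrogate fitness, or the model itself.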

Experimental Evaluation

The core of this paper lies in a systematic comparison of the two mentioned approaches to using surrogate models with the CMA-ES and the original CMA-ES algorithm itself. The first group of surrogate-based algorithms is formed by the S-CMA-ES algorithms using Gaussian process and random forest models, and the other group is formed by the s*ACM-ES algorithm. These four algorithms (CMA-ES, GP-CMA-ES, RF-CMA-ES, s*ACM-ES) are tested in their IPOP version (based on IPOP-CMA-ES) [1] and in the bi-population restart strategy version (based on BIPOP-CMA-ES and its derivatives) [6].

Experimental Setup

The experimental evaluation is performed through the noiseless part of the COCO/BBOB framework (COmparing Continuous Optimizers / Black-Box Optimization Benchmarking) [7,8]. It is a collection of 24 benchmark functions with different degrees of smoothness, uni-/multimodality, separability, conditioning, etc. Each function is defined for any dimension $D \ge 2$; the dimensions used for our tests are 2, 5, 10, and 20. The set of functions comprises, among others, well-known continuous optimization benchmarks like the ellipsoid, Rosenbrock's, Rastrigin's, Schwefel's or Weierstrass' function.

Algorithm 1, steps 3-21:
3: $y_k \leftarrow f(x_k)$, $k = 1, \dots, \lambda$ {fitness evaluation}
4: $A = A \cup \{(x_k, y_k)\}_{k=1}^{\lambda}$
5: $(X_{tr}, y_{tr}) \leftarrow \{(x, y) \in A \mid (m - x)^\top \sigma C^{-1/2} (m - x) \le r\}$
6: if $|X_{tr}| \ge n_{REQ}$ then
7: $(X_{tr}, y_{tr}) \leftarrow$ choose $n_{MAX}$ points if $|X_{tr}| > n_{MAX}$
8: {transformation to the eigenvector basis:} $X_{tr} \leftarrow \{(\sigma C^{-1/2})^\top x_{tr}$ for each $x_{tr} \in X_{tr}\}$
9: $f_M \leftarrow \text{trainModel}(X_{tr}, y_{tr})$
...
15: $x_k \leftarrow (\sigma C^{-1/2})^\top x_k$, $k = 1, \dots, \lambda$
16: $y_k \leftarrow f_M(x_k)$, $k = 1, \dots, \lambda$
...
end if
21: end if
Output: $f_M$, $A$, $(y_k)_{k=1}^{\lambda}$
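Steps 5-7 of Algorithm 1 select the training set from the archive. The following sketch illustrates the idea under simplifying assumptions that are not taken from the paper: the covariance matrix is diagonal (`c_diag`), the distance is a Mahalanobis-type distance scaled by $\sigma^2 C$, and when more than $n_{MAX}$ points qualify, the points closest to $m$ are kept. All names are hypothetical.

```python
import math


def select_training_set(archive, m, sigma, c_diag, r, n_req, n_max):
    """Pick archive points close to the mean m; return None if too few.

    archive: list of (x, y) pairs, x a list of coordinates.
    c_diag:  diagonal of the covariance matrix (diagonal C is an assumption).
    """
    def dist(x):
        return math.sqrt(sum((xi - mi) ** 2 / (sigma ** 2 * ci)
                             for xi, mi, ci in zip(x, m, c_diag)))

    near = [(x, y) for x, y in archive if dist(x) <= r]
    if len(near) < n_req:
        return None  # not enough points: no model is trained
    near.sort(key=lambda xy: dist(xy[0]))
    return near[:n_max]  # keeping the closest points is an assumption
```

Returning `None` corresponds to the case in which the condition in step 6 fails and the model is not trained in the current generation.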

The framework calls the optimizers on 15 different instances for each function and dimension, meaning that 1440 optimization runs (24 functions × 4 dimensions × 15 instances) were performed for each of the eight considered algorithms. The graphs at the end of the paper show detailed results in a per-function and per-group-of-functions manner. The following paragraphs summarize the parameters of the algorithms.

Results

The performance of the algorithms is compared in the graphs in Figures 1-3. The graphs in Figure 1 depict the expected running time (ERT), which depends on a given target function value $f_t = f_{\text{opt}} + \Delta f$, i.e. the true optimum $f_{\text{opt}}$ of the respective benchmark function raised by a small value $\Delta f$. The ERT is computed over all relevant trials as the number of original function evaluations (FEs) executed during each trial until the best function value reached $f_t$, summed over all trials and divided by the number of trials that actually reached $f_t$ [7].
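The ERT definition above can be written as a short function. This is a sketch with hypothetical names, not code from the COCO framework; it assumes that for a trial which did not reach $f_t$, the recorded number of evaluations is the total spent in that trial.

```python
def expected_running_time(evaluations, reached_target):
    """ERT: evaluations summed over all trials / number of successful trials.

    evaluations[i]    -- FEs spent in trial i until f_t was reached, or the
                         total FEs of the trial if f_t was never reached
    reached_target[i] -- True if trial i reached f_t
    """
    successes = sum(1 for ok in reached_target if ok)
    if successes == 0:
        return float("inf")  # no trial reached the target
    return sum(evaluations) / successes
```

With three trials that spent 100, 200 and 300 evaluations, of which only the first two reached the target, the ERT is (100 + 200 + 300) / 2 = 300 evaluations, so unsuccessful trials increase the ERT.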

As we can see in Figure 1, the 24 functions can be roughly divided into two groups according to the algorithm which performed the best (at least in 10D and 20D). The first group of functions where the CMA-ES performed best consists of functions 1, 3, 4, 6, and 20 while on functions 2, 5, 7, 10, 11, 13-16, 18, 21, 23, and 24, GP5-CMA-ES is usually better. The usage of the BIPOP versions generally leads to no improvement or even to performance decrease.

The graphs in Figures 2 and 3 summarize the performance over subgroups of the benchmark functions and show the proportion of algorithm runs that reached the target value $f_t \in 10^{[-8..2]}$ ($f_t$ was actually different for each respective function, see the figure captions). Roughly speaking, the higher the colored line, the better the performance of the algorithm for the number of original evaluations given on the horizontal axis.

Thus we can see that our GP5-CMA-ES usually outperforms the other algorithms when we consider the evaluation budget $\text{FEs} \le 10^{1.5} D$, i.e. roughly FEs ≤ 150 for 5D and FEs ≤ 600 for 20D. However, as the number of considered original evaluations rises, the original CMA-ES or the s*ACM-ES usually performs better. In summary, our GP5-CMA-ES is convenient especially for applications where only a very low number of function evaluations is available, such as in [2].

Conclusions & Future Work

In this paper, we have compared the surrogate-assisted S-CMA-ES, which uses GP and RF continuous regression models, with the s*ACM-ES-k algorithm based on ordinal regression by Ranking SVM, and with the original CMA-ES, all in their IPOP and BIPOP versions. The comparison shows that the Gaussian process S-CMA-ES usually outperforms the ordinal-based s*ACM-ES-k in the early stages of the search, especially on multimodal functions (BBOB functions 15-24). However, the algorithms and surrogate models should be further analyzed and compared since, for example, the NEWUOA [16] or SMAC [10,11] algorithms spend a considerably lower number of function evaluations than the CMA-ES in these early optimization phases. The BIPOP versions of the algorithms did not improve on the performance of the corresponding IPOP versions, except on BBOB function 5.

A natural way to improve the S-CMA-ES is to make the number of model-evaluated generations self-adaptive. We will additionally investigate different properties of continuous and ordinal regression in view of their applicability as regression models. Cases and benchmarks where ordinal regression is clearly superior to continuous regression will be further identified. In addition, hybrid surrogate models combining both kinds of regression will be attempted.


Algorithm 1: Surrogate CMA-ES Algorithm [3]

Input: $g$ (generation), $g_m$ (number of model generations), $\sigma$, $\lambda$, $m$, $C$ (CMA-ES internal variables), $r$ (maximal distance between training points and $m$), $n_{REQ}$ (minimal number of points for model training), $n_{MAX}$ (maximal number of points for model training), $A$ (archive), $f_M$ (model), $f$ (original fitness function)
1: $x_k \sim N(m, \sigma^2 C)$, $k = 1, \dots, \lambda$ {CMA-ES sampling}
2: if $g$ is original-evaluated then

The CMA-ES. The original CMA-ES was used in its IPOP-CMA-ES version (Matlab code v. 3.61) with number of restarts = 4, IncPopSize = 2, $\sigma_{\text{start}} = \frac{8}{3}$, $\lambda = 4 + \lfloor 3 \log D \rfloor$. The remaining settings were left at their defaults.

s*ACM-ES. We have used Loshchilov's GECCO 2013 Matlab code xacmes.m [14] in its s*ACM-ES version, setting the parameters CMAactive = 1, newRestartRules = 0 and withSurr = 1, modelType = 1, withModelEnsembles = 0, withModelOptimization = 1, hyper_lambda = 20, $\lambda_{\text{Mult}} = 1$, $\mu_{\text{Mult}} = 1$ and $\Lambda_{\text{minIter}} = 4$.

S-CMA-ES: GP5-CMA-ES and RF5-CMA-ES. The number after GP/RF in the names of the algorithms denotes the number of model-evaluated generations $g_m$, i.e. the number of consecutive generations evaluated by the model. All considered S-CMA-ES versions use the distance $r = 8$ (see Algorithm 1). For the GP model, the $K^{\nu=5/2}_{\text{Matérn}}$ covariance function with starting values $(\sigma_n^2, l, \sigma_f^2) = \log(0.01, 2, 0.5)$ has been used (see [3] for the details). We have tested RF comprising 100 regression trees, each containing at least two training points in each leaf. The CMA-ES parameters (IPOP version, $\sigma_{\text{start}}$, $\lambda$, IncPopSize etc.) were the same as in the pure CMA-ES experiments. All S-CMA-ES parameter values were chosen according to preliminary testing on several functions from the COCO/BBOB framework.

BIPOP version of the algorithms. The bi-population versions BIPOP-CMA-ES and BIPOP-s*ACM-ES use the same Loshchilov's Matlab code xacmes.m with the parameter BIPOP = 1. The BIPOP-GP5-CMA-ES and BIPOP-RF5-CMA-ES algorithms are constructed in the same manner as the S-CMA-ES was derived from the CMA-ES: by integrating Algorithm 1 into every generation of the BIPOP-s*ACM-ES.
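Two of the settings above can be made concrete with a short sketch: the default population size $\lambda = 4 + \lfloor 3 \log D \rfloor$ and the Matérn covariance function with $\nu = 5/2$. The sketch assumes the natural logarithm, as is usual for the CMA-ES, and an isotropic length-scale $l$; the function names are hypothetical and the code is not taken from the Matlab implementations used in the experiments.

```python
import math


def default_lambda(D):
    """Default CMA-ES population size, 4 + floor(3 ln D) (natural log assumed)."""
    return 4 + math.floor(3 * math.log(D))


def matern52(x1, x2, length_scale, signal_variance):
    """Matern covariance with nu = 5/2 and an isotropic length-scale.

    k(r) = sf2 * (1 + sqrt(5) r / l + 5 r^2 / (3 l^2)) * exp(-sqrt(5) r / l),
    where r is the Euclidean distance between x1 and x2.
    """
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    s = math.sqrt(5.0) * r / length_scale
    return signal_variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# Population sizes for the tested dimensions.
lambdas = {D: default_lambda(D) for D in (2, 5, 10, 20)}
# Covariance at zero distance equals the signal variance.
k0 = matern52([0.0, 0.0], [0.0, 0.0], length_scale=2.0, signal_variance=0.5)
```

Under these assumptions, the tested dimensions 2, 5, 10 and 20 give population sizes 6, 8, 10 and 12, and the covariance decreases monotonically with the distance between the two points.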

Figure 1: Expected running time (ERT, in number of $f$-evaluations, as $\log_{10}$ value) divided by dimension, versus dimension. The target function value is chosen such that the bestGECCO2009 artificial algorithm just failed to achieve an ERT of $10 \times \text{DIM}$. Different symbols correspond to different algorithms given in the legend of $f_1$ and $f_{24}$. Light symbols give the maximum number of function evaluations from the longest trial divided by dimension. Black stars indicate a statistically better result compared to all other algorithms with $p < 0.01$ and Bonferroni correction by the number of dimensions (six). Legend: ○: BIPOP-CMAES, ▽: BIPOP-GP5, ⋆: BIPOP-RF5, ◻: BIPOP-saACMES, △: CMA-ES, ♢: GP5-CMAES, as well as RF5-CMAES and saACMES.

Acknowledgements

This work was supported by the Czech Science Foundation (GA ČR) grant P103/13-17187S, by the Grant Agency of the Czech Technical University in Prague with its grant No. SGS14/205/OHK4/3T/14, and by the project "National Institute of Mental Health (NIMH-CZ)", grant number CZ.1.05/2.1.00/03.0078 (and the European Regional Development Fund). Further, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Infrastructure for Research, Development, and Innovations" (LM2010005), is greatly appreciated.

References

[1] A. Auger and N. Hansen. A restart CMA evolution strategy with increasing population size. In The 2005 IEEE Congress on Evolutionary Computation, volume 2. IEEE, Sept. 2005.

[2] M. Baerns and M. Holeňa. Combinatorial development of solid catalytic materials. Design of high-throughput experiments, data analysis, data mining. Imperial College Press / World Scientific, London, 2009.

[3] L. Bajer, Z. Pitra, and M. Holeňa. Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In Proceedings of the 17th GECCO Conference Companion, Madrid, July 2015. ACM, New York.

[4] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.

[5] N. Hansen. The CMA evolution strategy: A comparing review. In J. A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, editors, Towards a New Evolutionary Computation, number 192 in Studies in Fuzziness and Soft Computing. Springer, Berlin Heidelberg, Jan. 2006.

[6] N. Hansen. Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed. In Proceedings of the 11th Annual GECCO Conference Companion: Late Breaking Papers, GECCO '09. ACM, New York, NY, USA, 2009.

[7] N. Hansen, A. Auger, S. Finck, and R. Ros. Real-parameter black-box optimization benchmarking 2012: Experimental setup. Technical report, INRIA, 2012.

[8] N. Hansen, S. Finck, R. Ros, and A. Auger. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. Technical Report RR-6829, INRIA, 2009. Updated February 2010.

[9] N. Hansen and R. Ros. Benchmarking a weighted negative covariance matrix update on the BBOB-2010 noiseless testbed. In Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '10. ACM, New York, NY, USA, 2010.

[10] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In C. Coello, editor, Learning and Intelligent Optimization 2011, number 6683 in Lecture Notes in Computer Science. Springer, Berlin Heidelberg, 2011.

[11] F. Hutter, H. Hoos, and K. Leyton-Brown. An evaluation of sequential model-based optimization for expensive blackbox functions. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO '13 Companion, pages 1209-1216. ACM, New York, NY, USA, 2013.

[12] I. Loshchilov, M. Schoenauer, and M. Sebag. Alternative restart strategies for CMA-ES. In C. A. C. Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, and M. Pavone, editors, PPSN (1), volume 7491. Springer, 2012.

[13] I. Loshchilov, M. Schoenauer, and M. Sebag. Self-adaptive surrogate-assisted covariance matrix adaptation evolution strategy. In Proceedings of the 14th GECCO, GECCO '12. ACM, New York, NY, USA, 2012.

[14] I. Loshchilov, M. Schoenauer, and M. Sebag. BI-population CMA-ES algorithms with surrogate models and line searches. In Genetic and Evolutionary Computation Conference (GECCO Companion). ACM Press, July 2013.

[15] I. Loshchilov, M. Schoenauer, and M. Sebag. Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In Genetic and Evolutionary Computation Conference (GECCO). ACM Press, July 2013.

[16] M. J. D. Powell. The NEWUOA software for unconstrained optimization without derivatives. In G. D. Pillo and M. Roma, editors, Large-Scale Nonlinear Optimization, number 83 in Nonconvex Optimization and Its Applications. Springer US, 2006.

[17] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptative Computation and Machine Learning Series. MIT Press, 2006.