<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Non-Linear Regression Methods on Black-Box Optimization Benchmarks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Vojtěch Kopal</string-name>
          <email>vojtech.kopal@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Martin Holeňa</string-name>
          <email>martin@cs.cas.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics</institution>
          ,
          <addr-line>Malostranské nám. 25, 118 00 Praha 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Academy of Sciences of the Czech Republic</institution>
          ,
<addr-line>Pod Vodárenskou věží 2, 182 07 Praha</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>135</fpage>
      <lpage>142</lpage>
      <abstract>
<p>The paper compares several non-linear regression methods on synthetic data sets generated using standard benchmarks for continuous black-box optimization. For that comparison, we have chosen regression methods that have been used as surrogate models in such optimization: radial basis function networks, Gaussian processes, and random forests. Because the purpose of black-box optimization is frequently some kind of design of experiments, and because a role similar to that of surrogate models is played in the traditional design of experiments by response surface models, we also include a standard response surface model, i.e., polynomial regression. The methods are evaluated based on their mean-squared error and on Kendall's rank correlation coefficient between the ordering of function values according to the model and according to the function used to generate the data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        In this paper, we compare non-linear regression
methods that could be used as surrogate models for
optimization tasks. The methods are compared on synthetic data
sets generated using standard benchmarks for continuous
black-box optimization, for which we used
implementations based on definitions from Real-Parameter Black-Box
Optimization Benchmarking 2009 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Continuous black-box optimization is a task in which we
try to minimize a continuous objective function f : X ⊆
R^n → R for which we do not have an analytical
expression. Such problems arise, for example, if the values of
the objective function are results of experimental
measurements.</p>
      <p>
        For that comparison, we have chosen regression
methods that have been used as surrogate models in such
optimization: radial basis function networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
Gaussian processes [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and random forests [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>We measure the accuracy of each method based on the
mean-squared error and Kendall's rank coefficient, and based
on the results we suggest which methods work better as
surrogate models. We are interested in the properties of each
method when used as a surrogate model, though our
experiments do not replace a direct evaluation in optimization or
in evolutionary algorithms. This is the subject of two other
papers included in these proceedings.</p>
      <p>
        Other comparisons of non-linear models have been
presented. A numerical comparison of neural networks and
polynomial regression has been performed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]; in the latter, a classification and regression
tree (CART) model has also been compared. An evaluation
of Gaussian processes against other non-linear methods has
been done in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These studies compared the
accuracy of each model for prediction and did not pay
attention to surrogate models for optimization. Examples
of such works can be found in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where quadratic polynomial regression has been
compared with other methods
based on prediction accuracy and mean-squared error, and
in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], where polynomial regression is compared with
radial basis function networks based on accuracy and also on
optimization results. In this paper, we compare the
methods by means of the mean-squared error and also Kendall's
rank coefficient.
      </p>
      <p>We briefly describe the theoretical background of each
of these methods: how the corresponding models are
induced and how they are used to predict new values.
For the synthetic data, we added an overview of what the
functions look like in 3-dimensional space (Figure 1).</p>
      <p>The paper is organised as follows. In Section 2, we
recall the theoretical fundamentals of the employed
regression methods. In Section 3, we describe the setup of our
experiments and summarise the results, before the paper
concludes in Section 4.
</p>
    </sec>
    <sec id="sec-2">
      <title>Regression Methods in Data Mining</title>
      <p>With a continuously increasing amount of gathered data,
data mining techniques allow us to search for patterns in
data sets and to model the underlying reality. Various
models have been introduced in the past, ranging from
linear regression to complex nonlinear methods such as
neural networks or Gaussian processes. These models are
used to approximate a function that describes the
relationship between the target and input values.</p>
      <p>We now introduce the methods compared in this paper.
Each of these methods has its strengths and weaknesses,
which we point out in Table 2, and later we discuss
them in the context of the results of our experiments.</p>
      <p>
        [Table: time complexity of model construction: polynomial regression Θ(M^2 N); Gaussian processes Θ(N^3); random forests Θ(M K Ñ log^2 Ñ) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]; radial basis function networks polynomial time [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].]
      </p>
      <p>We assume we have a pair (X, Y), where X is a
p-dimensional data set with n points, i.e. X is a p × n matrix,
or it is a vector X = (x1, x2, ..., xn), where xi is
a column vector of size p, i.e. xi = (xi1, xi2, ..., xip), and
Y = (Y1, Y2, ..., Yn) is a vector of size n of target values
corresponding to the points of X. We use ||x|| for the
Euclidean norm of vector x. In the paper, we use the following
notation:
• X, Y, β are vectors with elements Xi, Yi, βi,
respectively; β_{j,k} is a scalar denoting the parameter in
polynomial regression for the interaction xj xk,
• f is a function and f(x) is the output of the function
corresponding to input x; for a multivariate function f,
we have either the matrix notation Y = f(X) or the vector
notation Yi = f(xi),
• f̄(X) is the average output over f(xi), ∀i ∈ {1, ..., N}.</p>
      <sec id="sec-2-1">
        <title>Polynomial Regression</title>
        <p>The simplest form of polynomial regression (PR) is
linear regression, in which the model is described by p + 1
parameters β0, β1, ..., βp,
f(Xi) = β0 + Σ_{j=1}^{p} xj βj,
which can be computed by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
β = (β0, . . . , βp) = (X^T X)^{-1} X^T y.   (1)
        </p>
        <p>Polynomial regression is still part of the linear
regression family, because the dependence on the model
parameters is linear. However, we also consider higher powers
of the input variables. For example, in the quadratic case we
add both the squared terms xi^2 for i ∈ {1, ..., p} and the
interactions xi xj for i, j ∈ {1, ..., p}, i ≠ j. Consequently, we have
(p^2 + p)/2 new variables.</p>
        <p>For our experiments, we will restrict attention to
quadratic regression,
f(Xi) = β0 + Σ_{j=1}^{p} xj βj + Σ_{j=1}^{p} xj^2 β_{p+j} + Σ_{j=1}^{p} Σ_{k&lt;j} xj xk β_{j,k}.</p>
        <p>
          This is also the standard restriction in response surface
modeling [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
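        <p>As an illustrative aside, the following minimal Python sketch builds the quadratic design matrix and solves the least-squares problem of equation (1); the data, function names and use of numpy are assumptions made for the example (the paper's experiments were implemented in Matlab).</p>
        <preformat>
# Minimal sketch of quadratic polynomial regression, cf. eq. (1).
# Illustrative only; not the authors' Matlab implementation.
import numpy as np

def quadratic_design(X):
    """Expand an (n, p) matrix into columns [1, x_j, x_j^2, x_j*x_k for k below j]."""
    n, p = X.shape
    cols = [np.ones(n)]
    cols += [X[:, j] for j in range(p)]                              # linear terms
    cols += [X[:, j] ** 2 for j in range(p)]                         # squared terms
    cols += [X[:, j] * X[:, k] for j in range(p) for k in range(j)]  # interactions
    return np.column_stack(cols)

def fit_quadratic(X, y):
    """beta = (Z^T Z)^{-1} Z^T y, computed via the numerically safer lstsq."""
    Z = quadratic_design(X)
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

def predict_quadratic(beta, X):
    return quadratic_design(X) @ beta

# Usage on random data with a sphere-like target function
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 5))
y = (X ** 2).sum(axis=1)
beta = fit_quadratic(X, y)
print(np.mean((predict_quadratic(beta, X) - y) ** 2))   # training MSE
        </preformat>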
      </sec>
      <sec id="sec-2-2">
        <title>Random Forests</title>
        <p>
          Random Forests (RF) is a model proposed by Breiman [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
and it is based on ensembles of decision trees. Due to
our interest in surrogate models for continuous black-box
optimization, we are interested in ensembles of regression
trees.
        </p>
        <p>A regression tree is a function defined by means of a
binary tree with inner nodes representing predicates, and
edges from a node to its children representing whether
the predicate is or is not fulfilled. The leaf nodes give
the predicted target value. The tree is built recursively
starting with a root node and searching for an optimal
binary predicate over the input variables. Regression trees
can be applied to data sets with both categorical/discrete
variables, and real-valued variables. Since we focus on
surrogate models for continuous black-box optimization,
we only consider real-valued predicates. For a real-valued
variable, the data set is split into two parts through
minimizing the following formula</p>
        <p>Σ_{xi ∈ R1(j,s)} (yi − c1)^2 + Σ_{xi ∈ R2(j,s)} (yi − c2)^2,
where R1, R2 are the two linearly bounded regions with
axes-perpendicular borders into which the data set is split
using a j-th variable x j and its splitting point s, and c1, c2
are the averages of function values of points belonging
to R1, R2, respectively. After finding the optimal splitting
point we recursively apply this process to both regions R1
and R2, and for each of them, only the data points in the
region are considered. This process continues until a
stopping criterion is met. This can be either the minimum
number of data points in leaves or inner nodes, or the depth of
the tree.</p>
        <p>If the regression tree finally splits the input space into
the regions R1, . . . , Rm, we can compute the prediction for
a new data point using the following formula:</p>
        <p>f(x) = Σ_{m=1}^{M} cm I(x ∈ Rm),
where cm is the average target value of the data points in region
Rm.</p>
        <p>An ensemble of regression trees averages the
predictions when presented with a new data point.</p>
        <p>There are several options for inducing a number of
trees over the same data set that will lead to low
correlation. In traditional bagging, independent subsets of the
original data used for the individual trees are obtained by
sampling from the data set uniformly and with replacement. In
addition, random subsets of input variables can be used. In
the Matlab implementation of random forests, the square root of
the number of input variables is selected by default, which is
also the setting we have used for our experiments.</p>
        <p>The model parameters are the number of trees (NT)
added to the ensemble and the minimum number of
data points in leaves (ML).</p>
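        <p>The following minimal Python sketch illustrates the split criterion above for a single real-valued variable; it is a didactic sketch with assumed names and data, not the Matlab random forest implementation used in the experiments.</p>
        <preformat>
# Find the split point s of one variable that minimizes
# sum over R1 of (y_i - c1)^2 + sum over R2 of (y_i - c2)^2.
import numpy as np

def best_split_for_variable(x, y):
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_err, best_s = np.inf, None
    for i in range(1, len(x_sorted)):
        left, right = y_sorted[:i], y_sorted[i:]
        c1, c2 = left.mean(), right.mean()
        err = ((left - c1) ** 2).sum() + ((right - c2) ** 2).sum()
        if err &lt; best_err:
            best_err = err
            best_s = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_s, best_err

# Usage: split a noisy step function; the split point should be near 0.4
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 300)
y = (x &gt; 0.4).astype(float) + 0.1 * rng.normal(size=300)
print(best_split_for_variable(x, y))
        </preformat>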
      </sec>
      <sec id="sec-2-3">
        <title>Gaussian Processes</title>
        <p>A Gaussian process (GP) is a random process such
that its restriction to any finite number of points has a
Gaussian probability distribution. A Gaussian process
GP(μ(x), κ(x, x′)) is defined by its mean function μ(x)
and its covariance function κ(x, x′).</p>
        <p>f(x) ∼ GP(μ(x), κ(x, x′))   (2)
These functions determine the mean and covariance of the
process because</p>
        <p>E[f(x)] = μ(x),
Cov[f(x), f(x′)] = E[(f(x) − μ(x))(f(x′) − μ(x′))] = κ(x, x′).   (3)</p>
        <p>The important part of modelling functions with
Gaussian processes is choosing the covariance function. An
important feature of covariance functions is that they can
be combined together using addition and multiplication,
i.e. for covariance functions κ, κ′, both κ × κ′ and κ + κ′ are
again covariance functions. Frequently used covariance
functions are: linear, periodic, squared-exponential, and
rational quadratic.
• Linear: κ_lin(x, x′) = x x′
• Periodic: κ_per(r) = exp(−(2/l^2) sin^2(π r / p))
• Squared-exponential: κ_SE(r) = exp(−r^2 / (2 l^2))
• Rational quadratic: κ_RQ(r) = (1 + r^2 / (2 α l^2))^{−α}
where r = |x − x′| and c, l, p, α are parameters of the
covariance function (because the covariance function itself
is a parameter of the Gaussian process, they are called
hyper-parameters of the process). l is a length-scale, p
defines the period, and α changes the smoothness of the rational
quadratic function. An additional parameter of the model
is the noise level (SN), which is an additive Gaussian noise
in the model.</p>
        <p>When working with multivariate data sets, the
covariance functions that have the length-scale as a
parameter can either apply the same length-scale l to all
dimensions, or the i-th dimension can have its own length-scale li.
In the first case, the covariance function has an isotropic distance
measure; the latter case uses automatic relevance
determination (ARD).</p>
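        <p>A minimal Python sketch of Gaussian-process prediction with the isotropic squared-exponential covariance and additive noise is given below; the helper names, length-scale and noise values are illustrative assumptions, and only the posterior mean is computed (the study's own models were fitted in Matlab).</p>
        <preformat>
# GP posterior mean with the squared-exponential covariance
# kappa_SE(r) = exp(-r^2 / (2 l^2)) plus additive Gaussian noise.
import numpy as np

def k_se(A, B, length_scale=1.0):
    """Covariance matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior_mean(X_train, y_train, X_test, length_scale=1.0, noise=1e-2):
    K = k_se(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_star = k_se(X_test, X_train, length_scale)
    alpha = np.linalg.solve(K, y_train)   # the O(N^3) step dominating the cost
    return K_star @ alpha

# Usage on a small 1-D example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0])
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
print(gp_posterior_mean(X, y, X_new, length_scale=1.0, noise=1e-4))
        </preformat>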
      </sec>
      <sec id="sec-2-4">
        <title>Radial Basis Function Networks</title>
        <p>A radial basis function network (RBF) is a feed-forward
neural network with one hidden layer in which the nodes
have a radial transfer function ρ. The output of the network
is given by</p>
        <p>ϕ(x) = Σ_{i=1}^{N} ai ρ(||x − ci||)   (4)
or its normalized version:</p>
        <p>ϕ(x) = Σ_{i=1}^{N} ai ρ(||x − ci||) / Σ_{i=1}^{N} ρ(||x − ci||)   (5)
where ρ(||x − ci||) usually has the form of a Gaussian:
ρ(||x − ci||) = exp(−||x − ci||^2 / (2 σi^2)).
Here ci is the center vector of the respective neuron, ai is the weight
of the neuron, and ||x − ci|| is a norm, typically the
Euclidean norm. The model parameters are the spread
constant σi^2 (SC), the maximum number of neurons (MAX) that can be
added to the network during the iterative learning process, and the
error goal (EG), which is a mean-squared error on the training
set. The maximum number of neurons or the error goal are the stopping
criteria for the network induction.</p>
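        <p>A short Python sketch of the unnormalized RBF output of equation (4) follows; the centers, weights and spread are taken as given purely for illustration, i.e. this is not the iterative network training used in the experiments.</p>
        <preformat>
# phi(x) = sum_i a_i * rho(||x - c_i||) with a Gaussian radial function.
import numpy as np

def rbf_predict(X, centers, weights, sigma):
    # pairwise squared Euclidean distances between inputs and centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    rho = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian radial function
    return rho @ weights

# Usage with arbitrary (illustrative) centers and weights
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 3))
centers = rng.uniform(-1, 1, size=(4, 3))
weights = np.array([0.5, -1.0, 2.0, 0.3])
print(rbf_predict(X, centers, weights, sigma=0.7))
        </preformat>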
      </sec>
      <sec id="sec-2-4">
        <title>Model Selection and Evaluation</title>
        <p>The parameters for the regression models were selected by
10-fold cross-validation based on the mean-squared
error (MSE),
MSE = (1/N) Σ_{i=1}^{N} (Yi − f(xi))^2.</p>
        <p>Cross-validation is suited for limited data samples,
but it is also a justified method for synthetic data.</p>
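        <p>A minimal Python sketch of this 10-fold cross-validation loop is shown below; it assumes a generic model object with fit and predict methods and is meant only to illustrate how the per-fold MSE values used later for the significance tests are obtained.</p>
        <preformat>
# 10-fold cross-validation returning the average and per-fold MSE.
import numpy as np

def cv_mse(model, X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        errors.append(np.mean((y[test] - pred) ** 2))   # fold MSE
    return np.mean(errors), errors
        </preformat>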
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments with Synthetic Data</title>
      <p>
        As we are interested primarily in the suitability of the
considered regression methods for surrogate models in
black-box optimization, we compared them on synthetic
data generated using standard benchmarks for continuous
black-box optimization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>All performed experiments were implemented in
Matlab. For each function, we have sampled 5000
p-dimensional data points, where p ∈ {5, 10, 20, 40}, and
used them for a 10-fold cross-validation to compare the
considered models. The results of the cross-validation are the MSE on the
training set, the MSE on the testing set, and Kendall's rank
correlation coefficient. The significance of the difference
between the results obtained for two models m, m′ was tested
using the independent-sample t-test
t = (res_m − res_m′) / √((1/k)(σ_m^2 + σ_m′^2)),
which we compare, for a significance level α ∈ (0, 1),
against the (1 − α/2)-quantile of the Student distribution
with 2(k − 1) degrees of freedom, where k is the number of
cross-validation folds, and res_m, σ_m^2 are computed as follows:
res_m = (1/k) Σ_{i=1}^{k} res_{m,i},
σ_m^2 = (1/(k − 1)) Σ_{i=1}^{k} (res_{m,i} − res_m)^2.</p>
      <p>For a comparison of two models, it would have been
better to use a paired t-test, which provides better estimates,
but since we had decided to use the unpaired t-test at the
beginning of our experiments, we did not have the necessary
sub-results to perform it.</p>
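      <p>For illustration, a small Python sketch of this unpaired two-sample test on per-fold results is given below; the use of scipy for the Student quantile and the example numbers are assumptions made only for the sketch.</p>
      <preformat>
# Unpaired t statistic comparing the per-fold results of two models.
import numpy as np
from scipy import stats

def compare_models(res_m, res_m2, alpha=0.05):
    k = len(res_m)
    var_m = np.var(res_m, ddof=1)
    var_m2 = np.var(res_m2, ddof=1)
    t = (np.mean(res_m) - np.mean(res_m2)) / np.sqrt((var_m + var_m2) / k)
    critical = stats.t.ppf(1 - alpha / 2, df=2 * (k - 1))   # Student quantile
    return t, abs(t) &gt; critical   # statistic and significance decision

# Usage on two hypothetical sets of 10 fold-wise MSE values
res_a = np.array([0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9, 1.1, 1.0, 1.0])
res_b = np.array([1.4, 1.6, 1.5, 1.3, 1.7, 1.5, 1.4, 1.6, 1.5, 1.5])
print(compare_models(res_a, res_b))
      </preformat>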
      <p>
        We have used the MSE together with Kendall's rank
correlation coefficient [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] between the ordering of the function
values according to the model (y1, . . . , yn) and according
to the function used to generate the data (t1, . . . , tn),
τ = ((# of concordant pairs) − (# of discordant pairs)) / ((1/2) n(n − 1)),
where for (tj, yj) and (tk, yk), different pairs of a target value
t and a predicted value y, (tj, yj) and (tk, yk) are concordant
if tj &lt; tk and yj &lt; yk, or tj &gt; tk and yj &gt; yk, and discordant
otherwise.
      </p>
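      <p>A direct Python sketch of this concordant/discordant counting follows; it is quadratic in the number of points and purely illustrative (a library implementation would normally be preferred).</p>
      <preformat>
# Kendall's rank correlation between true values t and model predictions y.
import numpy as np

def kendall_tau(t, y):
    n = len(t)
    concordant = discordant = 0
    for j in range(n):
        for k in range(j + 1, n):
            s = np.sign(t[j] - t[k]) * np.sign(y[j] - y[k])
            if s &gt; 0:
                concordant += 1
            elif s &lt; 0:
                discordant += 1
    return (concordant - discordant) / (0.5 * n * (n - 1))

# Usage: a perfectly concordant ordering gives tau = 1
print(kendall_tau(np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0, 30.0])))
      </preformat>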
      <sec id="sec-3-1">
        <title>Selection of Model Parameters</title>
        <p>For each data set, we have searched for the optimal model
parameters (in the case of a Gaussian process, these are its
hyper-parameters) minimizing the MSE. With regression
trees, we have considered different settings for the
number of trees and the minimum number of data points in leaves.
With Gaussian processes, we have tried the rational quadratic
and squared exponential covariance functions in their isotropic form,
and also the ARD version of the squared exponential. With radial basis
function networks, we have considered different settings of the
parameters: the spread constant, the MSE goal, and the maximum
number of neurons. As to polynomial
regression, we have used quadratic regression. See Table 3
for an overview of the selected parameters for each model.</p>
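        <p>As an illustration of this parameter search, the sketch below scores a small, assumed grid of random forest settings by cross-validated MSE using scikit-learn; the grid values and the choice of library are assumptions for the example and do not reproduce the Matlab setup of the experiments.</p>
        <preformat>
# Exhaustive search over a small parameter grid scored by 10-fold CV MSE.
from itertools import product
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def select_rf_parameters(X, y, n_trees=(100, 500, 1000), min_leaf=(1, 5, 20)):
    best = None
    for nt, ml in product(n_trees, min_leaf):
        model = RandomForestRegressor(n_estimators=nt, min_samples_leaf=ml)
        mse = -cross_val_score(model, X, y, cv=10,
                               scoring='neg_mean_squared_error').mean()
        if best is None or mse &lt; best[0]:
            best = (mse, {'n_estimators': nt, 'min_samples_leaf': ml})
    return best   # (best CV MSE, best parameter setting)
        </preformat>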
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>We will now present the results of our experiments. First,
we have included a detailed Table 3 with the measured values
of the MSE and the Kendall's coefficient for each data set
and each model. We can see how the optimal combinations of
parameter values for each model change with higher
dimensions. Random forests have a lower number of trees
(NT) and a higher minimum number of data points in leaves (ML).
The comparison of the performance of each method across
different dimensions of the data sets follows.</p>
        <p>Table 4 shows the results of our experiments, where we
have compared four different models across 40 different
data sets. For each model, we entered the number of times
the model was better than the other model, and we also
added how many times the result was significantly better
at the significance level 0.05.</p>
        <p>A summary of the results can be seen in Table 2, and
additional comments on the results follow. With 10
dimensions, the radial basis function networks started performing better,
although not significantly. With 20 dimensions, there are
even fewer methods that outperform polynomial regression
significantly according to the MSE, and random forests were
the weakest model of the triple RBF, GP and
RF. With 40 dimensions, there is a surprising result, since the
MSE values are much lower compared to lower
dimensions, whereas we would expect the MSE to grow with
higher dimensions. This may be an artifact of the
function definitions, which suppress higher dimensions and
may thus lower the MSE values.</p>
        <p>In summary, when comparing the MSE over all
dimensions, the Gaussian processes were the best model for our
data, followed by radial basis function networks and random forests,
with polynomial regression in the last position. With
Kendall's coefficient, the results are not that clear. Even
though the Gaussian processes have the most wins, they do
not have the most significant wins. Based on the
significant wins, the best performing model was random forests.</p>
        <p>[Figure: panels (i) f23 Katsuura Function and (j) f24 Lunacek bi-Rastrigin Function; panels (a) Polynomial regression, (b) Random forests, (c) Radial basis function networks, (d) Gaussian processes, MSE versus dimension.]</p>
        <p>With higher dimensions, when comparing the models
based on the MSE, we may notice that the results for
Gaussian processes and random forests are less significant,
which is also the case with Kendall's coefficient, where
polynomial regression gets more wins in higher
dimensions.</p>
        <p>Now we have a look at how long it takes to evaluate
the 10-fold cross-validation for the selected parameter settings
of each model (see Figure 2). With higher dimensions,
each method takes more time to evaluate. All the
computations were performed on a PC (x86-64) with an Intel Core i7 920
(4x 2.66 GHz + HyperThreading) and 6 GB RAM.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Conclusion</title>
      <p>The figures and tables presented in Results compared four
different regression methods over 40 synthetic data sets
(10 functions × 4 different dimensions) generated using
standard benchmark functions for continuous black-box
optimization. We have shown how the performance of
these methods changes with increasing dimensionality and
how the time to cross-validate the models grows. We have
compared the methods based on the MSE and on Kendall’s
coefficient. We will now comment on each of them.</p>
      <p>The Gaussian process is probably the most complex method.
With its time complexity O(N^3), it takes the longest time to
compute; some of the cross-validations, i.e. 10
constructions of the model, took up to 24 hours. This model was
better than the others according to both the MSE and Kendall's
coefficient comparisons.</p>
      <p>Random forests ended up with poorer results for the 40-dimensional
data, and overall they were slightly behind the
Gaussian processes based on the MSE. According to the
Kendall's coefficient results, they were comparable with the
Gaussian processes and, according to the number of
significant wins, they even outperformed GP. With some data
sets (f19-10d, f20-05d), we have learnt 2000 trees out of
4500 samples. In these cases, we could have compared the
results with the nearest-neighbor method.</p>
      <p>The radial basis function network has clearly poorer
results compared to Gaussian processes and random forests
according to both the MSE and the Kendall's coefficient.</p>
      <p>Even though polynomial regression was included
due to its importance as a traditional response surface
model, the method was not always worse than all the other
methods. For dimensions 20 and 40, its MSE was
comparable to that of random forests. Also, with higher
dimensions, its results based on the Kendall's coefficient
are comparable to both GP and RBF, and it even
outperformed RBF.</p>
      <p>In this paper, we have compared a selection of
non-linear methods on synthetic data sets based on their
mean-squared error and on Kendall's rank correlation
coefficient. We have chosen regression methods that have been
used as surrogate models in black-box optimization: radial
basis function networks, Gaussian processes, random forests,
and polynomial regression. A better accuracy of a
model suggests better applicability of that model as a
surrogate model for optimization. From the results, we have
learnt that Gaussian processes had better results in most
cases and thus would be a better surrogate model compared to
the others, although random forests were only slightly
behind.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>This research was partially supported by SVV project
number 260 224.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Springer Encyclopedia of Mathematics</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Alessandri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cassettari</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mosca</surname>
          </string-name>
          . R.:
          <article-title>Nonparametric nonlinear regression using polynomial and neural approximators: a numerical comparison</article-title>
          .
          <source>Computational Management Science</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>24</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bajer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
<article-title>Holeňa, M.: Surrogate model for continuous and discrete genetic optimization based on rbf networks</article-title>
          .
          <source>In: Colin Fyfe</source>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Tino</surname>
          </string-name>
          , Darryl Charles, Cesar GarciaOsorio, and Hujun Yin, (eds),
          <source>Intelligent Data Engineering and Automated Learning - IDEAL</source>
          <year>2010</year>
          , volume
          <volume>6283</volume>
          of Lecture Notes in Computer Science,
          <volume>251</volume>
          -
          <fpage>258</fpage>
          , Springer Berlin Heidelberg,
          <year>2010</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bajer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pitra</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
<article-title>Holeňa, M.: Benchmarking gaussian processes and random forests surrogate models on the bbob noiseless testbed</article-title>
          .
          <source>In: GECCO</source>
          <year>2015</year>
          ,
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>45</volume>
          (
          <issue>1</issue>
          ) (
          <year>October 2001</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Buche</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schraudolph</surname>
            ,
            <given-names>N. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koumoutsakos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Accelerating evolutionary algorithms with gaussian process fitness function models</article-title>
          .
          <source>Systems, Man, and Cybernetics</source>
          , Part C:
          <article-title>Applications</article-title>
          and Reviews, IEEE Transactions on
          <volume>35</volume>
          (
          <issue>2</issue>
          ) (May
          <year>2005</year>
          ),
          <fpage>183</fpage>
          -
          <lpage>194</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gano</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , D. E.:
          <article-title>Comparison of three surrogate modeling techniques: datascape, kriging and second order regression</article-title>
          .
          <source>In: 11th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference</source>
          ,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Real-parameter black-box optimization benchmarking 2009: noiseless functions definitions</article-title>
          .
          <source>Research Report RR-6829</source>
          ,
          <year>2009</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The elements of statistical learning</article-title>
          . Springer Series in Statistics, Springer New York Inc., New York, NY, USA,
          <year>2001</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hultquist</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A comparison of gaussian process regression, random forests and support vector regression for burn severity assessment in diseased forests</article-title>
          .
          <source>Remote Sensing Letters</source>
          <volume>5</volume>
          (
          <issue>8</issue>
          ) (
          <year>2014</year>
          ),
          <fpage>723</fpage>
          -
          <lpage>732</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kleijnen</surname>
            ,
            <given-names>J. P. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Beers</surname>
            , W., van Nieuwenhuyse,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Constrained optimization in expensive simulation: novel approach</article-title>
          .
          <source>European Journal of Operational Research</source>
          <volume>202</volume>
          (
          <issue>1</issue>
          ) (
          <year>2010</year>
          ),
          <fpage>164</fpage>
          -
          <lpage>174</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Louppe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Understanding random forests: from theory to practice</article-title>
          .
          <source>PhD thesis</source>
          ,
          <year>2014</year>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Comparison of surrogate models with different methods in groundwater remediation process</article-title>
          .
          <source>Journal of Earth System Science</source>
          <volume>123</volume>
          (
          <issue>7</issue>
          ) (
          <year>2014</year>
          ),
          <fpage>1579</fpage>
          -
          <lpage>1589</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Myers</surname>
            ,
            <given-names>R. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montgomery</surname>
            ,
            <given-names>D. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson-Cook</surname>
            ,
            <given-names>C. M.:</given-names>
          </string-name>
<article-title>Response surface methodology: process and product optimization using designed experiments, 2009</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Rasmussen</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          :
          <article-title>Evaluation of gaussian processes and other methods for non-linear regression</article-title>
          .
          <source>Technical Report</source>
          ,
          <year>1996</year>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Razi</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Athappilly</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A comparative predictive analysis of neural networks (nns), nonlinear regression and classification and regression tree (cart) models</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>29</volume>
          (
          <issue>1</issue>
          ) (
          <year>2005</year>
          ),
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Govil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Miranda</surname>
          </string-name>
          , R.:
          <article-title>A neural-network learning theory and a polynomial time rbf algorithm</article-title>
          .
          <source>Neural Networks, IEEE Transactions on 8(6)</source>
          (Nov.
          <year>1997</year>
          ),
          <fpage>1301</fpage>
          -
          <lpage>1313</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>Y. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>P. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keane</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lum</surname>
          </string-name>
          , K. Y.:
          <article-title>Combining global and local surrogate models to accelerate evolutionary optimization</article-title>
          .
          <source>Systems, Man, and Cybernetics</source>
          , Part C:
          <article-title>Applications</article-title>
          and Reviews, IEEE Transactions on
          <volume>37</volume>
          (
          <article-title>1) (Jan</article-title>
          .
          <year>2007</year>
          ),
          <fpage>66</fpage>
          -
          <lpage>76</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>