<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modifications of PI and EI under Gaussian Noise Assumption in Current Optima</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Huabing Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Bayesian optimisation is a widely used tool for hyper-parameter optimisation of black-box functions. It fits a cheaper surrogate model, such as a Gaussian process (GP), over the search space. Acquisition functions on top of the GP, such as Probability of Improvement (PI) and Expected Improvement (EI), query the distribution of the loss at unevaluated positions in order to select the most promising one. Traditionally, both acquisition functions use the current optimum directly in their computations, even though GPs assume that observations are noise corrupted. In this work, we mathematically derive modified PI and EI under a Gaussian noise assumption. The modified PI and EI are compared with the original versions on benchmark functions. We show that the modified versions converge faster in the same number of iterations and can achieve better performance on complex loss functions with fewer iterations.</p>
      </abstract>
      <kwd-group>
        <kwd>Bayesian Optimization</kwd>
        <kwd>Acquisition Functions</kwd>
        <kwd>Benchmark Functions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine learning has achieved remarkable success. However, almost all machine learning models, such as neural networks, topic models, and random forests, require hyper-parameter optimization. In practice, methods for tuning hyper-parameters include Grid Search, Random Search [
        <xref ref-type="bibr" rid="ref2">1</xref>
        ], and Gradient-based Optimization [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ]. These methods are designed to minimize empirical risk with the desired efficiency or convergence speed. Bayesian Optimization [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] is a probabilistic approach that typically employs a Gaussian process (GP) and exploits its joint prediction and uncertainty estimates to achieve derivative-free optimization. It can be used when the gradient of the function being optimized is not accessible.
      </p>
      <p>
        For Bayesian optimization, J. Snoek et al. summarize its applications in the field of machine learning, and numerical simulations show that Bayesian optimization offers high efficiency and strong convergence [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]. Martin Pelikan further finds that hierarchy can be used to reduce problem complexity in black-box optimization [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ]. K. Swersky et al. extend multi-task Gaussian processes to the Bayesian optimization framework, aiming to transfer knowledge gained from previous optimizations to new tasks in order to find optimal hyper-parameter settings more efficiently [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ]. J. Snoek et al. further explore the use of neural networks as an alternative to GPs for modelling distributions over functions [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ].
      </p>
      <p>In principle, Bayesian Optimization resembles reinforcement learning: it updates its model of the hyper-parameter space after each evaluation and then computes the location for the next evaluation. Acquisition functions are used to calculate the desirability of each unevaluated location, trading off exploration against exploitation. Typical acquisition functions are the Upper Confidence Bound (UCB), Probability of Improvement (PI), and Expected Improvement (EI). However, these acquisition functions do not fully account for the deviations introduced in the machine learning data collection process, that is, the noise contained in the current optimum. Under a Gaussian noise assumption, we propose modified versions of the PI and EI acquisitions, derive the corresponding explicit equations, and, through extensive numerical simulations and comparisons, demonstrate the feasibility of our proposed acquisition functions.</p>
      <p>Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Algorithms</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Gaussian Process</title>
      <p>
        Gaussian process [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ] regression can be considered a proxy for a black-box function which enables uncertainty quantification. A Gaussian process (GP) is an infinite-dimensional multivariate Gaussian distribution whose covariance matrix is defined by a kernel function k(⋅,⋅). Imagining that x_{1:n} forms a finite discretization of the input space, and assuming the distribution has zero mean, prior draws f can be simulated as f ∼ N(0, K).
      </p>
      <p>
        Statistical assumptions about the GP prior are represented in the kernel function. A commonly adopted kernel is the Matern kernel [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ], where ν controls the smoothness of the Gaussian process. Let r = |x_i − x_j|; the Matern class of covariance functions has the following definition:
      </p>
      <p>k(r) = (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)</p>
      <p>with positive parameter ν, length-scale ℓ, the Gamma function Γ, and the modified Bessel function K_ν. Smooth GP kernels assume that if x and x′ are close, then f(x) and f(x′) take similar values.</p>
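      <p>As an illustration, the ν = 5/2 member of the Matern family has a well-known closed form that avoids the Bessel function. A minimal sketch (the function name and default length-scale are our own choices):</p>

```python
import math

def matern52(r, ell=1.0):
    # Matern kernel with nu = 5/2: for this smoothness value the
    # Bessel-function expression reduces to a closed form.
    s = math.sqrt(5.0) * r / ell
    return (1.0 + s + s * s / 3.0) * math.exp(-s)
```

      <p>The kernel equals 1 at r = 0 and decays monotonically with distance, encoding the smoothness assumption above.</p>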
      <p>Acquisition functions take the posterior mean and variance at each unevaluated point as input and compute a value indicating how favourable it is to sample at that point. They trade off between exploitation and exploration:</p>
      <p>• Exploitation: looking for locations that minimize the posterior mean μ(x).</p>
      <p>• Exploration: looking for locations that maximize the posterior variance σ²(x).</p>
      <p>Given the observed data C = [(x_1, y_1), ⋯, (x_t, y_t)], the next point x_{t+1} is chosen by ranking the value returned by the acquisition function at candidate points: x_{t+1} = arg max_x a(x|C). The acquisition function is defined as the expected utility u at the unevaluated location x:</p>
      <p>a(x|C) = 𝔼[u(x, y)|x, C] = ∫ u(x, y) p(y|x, C) dy</p>
      <p>Given noisy observations y_{1:n} at x_{1:n}, where y_i ∼ N(f_i, σ_y²), the joint probability distribution with a new point x_{n+1} is given by</p>
      <p>[f_{1:n}; f_{n+1}] ∼ N(0, [[K + σ_y² I, k(x_{1:n}, x_{n+1})], [k(x_{n+1}, x_{1:n}), k(x_{n+1}, x_{n+1})]])</p>
      <p>where K = k(x_{1:n}, x_{1:n}). After applying the rule for conditional Gaussians, we can gather the posterior over the function value f_{n+1} | y_{1:n}, which follows a univariate Gaussian distribution N(μ(x_{n+1}), σ²(x_{n+1})) with</p>
      <p>μ(x_{n+1}) = k(x_{1:n}, x_{n+1})ᵀ (K + σ_y² I)⁻¹ y_{1:n}</p>
      <p>σ²(x_{n+1}) = k(x_{n+1}, x_{n+1}) − k(x_{1:n}, x_{n+1})ᵀ (K + σ_y² I)⁻¹ k(x_{1:n}, x_{n+1})</p>
      <p>GP regression estimates the probability distribution of the function value at unevaluated points. For each prediction location x_*, the mean μ(x_*) gives the best estimate of the function value, and the variance σ²(x_*) models the uncertainty at that point. Acquisition functions use this computed distribution to guide the search for the optimal function value.</p>
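      <p>The posterior equations above can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation: the squared-exponential kernel and the noise level are stand-in assumptions.</p>

```python
import numpy as np

def sqexp(A, B, ell=1.0):
    # squared-exponential kernel, an illustrative stand-in for the Matern kernel
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

def gp_posterior(X, y, Xstar, sigma_y=0.1, ell=1.0):
    # mu  = k(X, x*)^T (K + sigma_y^2 I)^-1 y
    # var = k(x*, x*) - k(X, x*)^T (K + sigma_y^2 I)^-1 k(X, x*)
    K = sqexp(X, X, ell) + sigma_y**2 * np.eye(len(X))
    Ks = sqexp(X, Xstar, ell)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = sqexp(Xstar, Xstar, ell) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.diag(cov)
```

      <p>Near observed data the posterior variance shrinks towards the noise floor; far from the data it reverts to the prior variance, which is what the acquisition functions below exploit.</p>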
    </sec>
    <sec id="sec-4">
      <title>2.2. Acquisition Functions</title>
      <p>()
()
()
()
()</p>
      <p>The probability p(y|x, C) here is gathered from the posterior distribution N(μ(x), σ²(x)) computed by GP regression.</p>
      <p>Probability of Improvement (PI), Expected Improvement (EI), and Entropy Search employ different utility functions. Other acquisition functions, such as the Upper Confidence Bound (UCB), directly invoke the mean and variance instead.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2.1. Probability Improvement</title>
      <p>We can understand the utility function as a reward: when f(x) ≤ ỹ, a fixed amount of value is rewarded, here 1:</p>
      <p>u(x) = 1 if f(x) ≤ ỹ, and u(x) = 0 otherwise</p>
      <p>According to this utility function, the expected utility can be written as the normal cumulative density function of (ỹ − μ(x))/σ(x):</p>
      <p>a_PI(x) = 𝔼[u(x)|C] = ∫_{−∞}^{ỹ} N(f(x); μ(x), σ²(x)) df(x) = Φ((ỹ − μ(x))/σ(x))</p>
      <p>PI only cares whether f(x) improves on ỹ; it does not count the amount of improvement. This makes PI very likely to pick points near previously sampled locations. Once the search trajectory reaches a local minimum, it gets stuck there and can hardly jump out. Therefore, PI cares only about exploitation.</p>
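      <p>The closed form above is straightforward to compute. A minimal sketch for minimization, using the standard-normal CDF via the error function (the function names are our own):</p>

```python
import math

def norm_cdf(z):
    # standard normal cumulative density function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pi_acquisition(mu, sigma, y_best):
    # P(f(x) <= y_best) under the posterior N(mu, sigma^2), for minimization
    if sigma == 0.0:
        return 1.0 if mu <= y_best else 0.0
    return norm_cdf((y_best - mu) / sigma)
```
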
    </sec>
    <sec id="sec-6">
      <title>2.2.2. Expected Improvement</title>
      <p>EI balances better between exploration and exploitation: the amount of improvement with respect to the current global optimum, ỹ − f(x), is taken into account. The utility function of EI is defined as</p>
      <p>u(x) = max(0, ỹ − f(x))</p>
      <p>Therefore, the expression for the expected utility can be derived:</p>
      <p>a_EI(x) = 𝔼[u(x)|C] = ∫_{−∞}^{∞} max(0, ỹ − f(x)) N(f(x); μ(x), σ²(x)) df(x) = ∫_{−∞}^{ỹ} (ỹ − f(x)) N(f(x); μ(x), σ²(x)) df(x) = (ỹ − μ(x)) Φ((ỹ − μ(x))/σ(x)) + σ(x) ϕ((ỹ − μ(x))/σ(x))</p>
      <p>where ϕ(⋅) is the standard normal probability density function. To obtain a higher value, the first term favours minimizing μ(x), while the second term favours maximizing σ(x); a basic equation-based trade-off between exploitation and exploration is achieved here.</p>
      <p>
        The trade-off between exploration and exploitation can be adjusted by tuning a parameter ξ in the improvement term (ỹ − μ(x) − ξ). A larger ξ favours exploration in early steps and exploitation later, but experimentally this schedule does not work well [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ].
      </p>
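      <p>The closed-form EI above, including the optional offset ξ, can be sketched as follows (function names and the ξ default are our own):</p>

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ei_acquisition(mu, sigma, y_best, xi=0.0):
    # closed-form EI for minimization; xi is the optional exploration offset
    if sigma == 0.0:
        return max(0.0, y_best - mu - xi)
    z = (y_best - mu - xi) / sigma
    return (y_best - mu - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

      <p>When μ(x) = ỹ the first term vanishes and EI reduces to σ(x)ϕ(0), so uncertain points still receive a positive score, which is exactly the exploration behaviour described above.</p>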
    </sec>
    <sec id="sec-7">
      <title>2.2.3. Modified Probability Improvement</title>
      <p>If evaluations are noise corrupted, y_i | f_i ∼ N(f_i, σ_y²), the current loss optimum ỹ is not a reliable sample. Instead of using the optimum directly, we consider using the posterior distribution N(μ(x̃), σ²(x̃)) at the current optimum location x̃. PI can be modified under this noise-corrupted condition in order to increase the robustness of the sampling process. Let</p>
      <p>• k(x, x) denote the posterior variance σ²(x) of an unevaluated point x computed from the Gaussian process;</p>
      <p>• k(x̃, x̃) denote the posterior variance σ²(x̃) of the loss optimum x̃;</p>
      <p>• k(x, x̃) denote the posterior covariance between the unevaluated point and the loss optimum.</p>
      <p>According to the rule for the variance of the difference of two dependent random variables:</p>
      <p>Var[X − Y] = Var[X] + Var[Y] − 2 × Cov[X, Y]</p>
      <sec id="sec-7-1">
        <title>Distribution of f(x) − f(x̃)</title>
        <p>The distribution of f(x) − f(x̃) can be derived:</p>
        <p>f(x) − f(x̃) ∼ N(μ(x) − μ(x̃), k(x, x) + k(x̃, x̃) − 2k(x, x̃))</p>
        <p>The utility function of Modified Probability Improvement (MPI) is rewritten as u(x) = 1 if f(x) − f(x̃) ≤ 0, and u(x) = 0 otherwise.</p>
        <p>Since the utility function only counts the improvement when f(x) − f(x̃) ≤ 0, MPI can be written as the probability of f(x) − f(x̃) ≤ 0. If X ∼ N(μ, σ²), then ℙ(X ≤ x) = Φ((x − μ)/σ). The cumulative density function for MPI can therefore be derived:</p>
        <p>a_MPI(x) = ℙ(f(x) − f(x̃) ≤ 0) = Φ((0 − (μ(x) − μ(x̃))) / √(k(x, x) + k(x̃, x̃) − 2k(x, x̃))) = Φ((μ(x̃) − μ(x)) / √(k(x, x) + k(x̃, x̃) − 2k(x, x̃)))</p>
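        <p>The MPI expression depends only on the two posterior means and the three kernel terms. A minimal sketch (argument names are our own):</p>

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mpi_acquisition(mu_x, mu_opt, k_xx, k_oo, k_xo):
    # a_MPI = Phi((mu(x~) - mu(x)) / sqrt(k(x,x) + k(x~,x~) - 2 k(x,x~)))
    rho = math.sqrt(k_xx + k_oo - 2.0 * k_xo)
    return norm_cdf((mu_opt - mu_x) / rho)
```
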
        <p>The performance of the modified versions of PI and EI is compared with the traditional PI and EI on 3 selected 2D benchmark functions. Variables including the kernel function, kernel parameters, and positions of the pre-samplings are controlled to be the same within each set of experiments. We visualise the sampling positions and global optima in the search space, as well as the current minimal loss at each iteration. The performance of the 4 acquisition functions on each benchmark function is discussed section by section.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.1. Testing on Sphere Function</title>
      <p>
        The sphere function [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ] has a single global minimum. It is bowl-shaped, convex, and unimodal. The sphere function in d dimensions is f(x) = ∑_{i=1}^{d} x_i².
      </p>
      <p>
        A lemma on the expectation of the max function applied to normally distributed random variables [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ] can be directly employed to obtain the expression for MEI. The utility function is u(x) = max(0, f(x̃) − f(x)). If s ∼ N(μ, σ²), then
      </p>
      <p>𝔼[max(0, s)] = ∫_{0}^{∞} s N(s; μ, σ²) ds = μ Φ(μ/σ) + σ ϕ(μ/σ)</p>
      <p>We already know the mean and variance of the normal distribution of f(x) − f(x̃) from the derivation above. The mean of f(x̃) − f(x) is μ(x̃) − μ(x), and the variance remains the same. Let ρ denote √(k(x, x) + k(x̃, x̃) − 2k(x, x̃)); applying the lemma:</p>
    </sec>
    <sec id="sec-9">
      <title>2.2.4. Modified Expected Improvement</title>
      <p>As in MPI, ỹ is replaced by the posterior distribution at x̃ in Modified Expected Improvement (MEI). The resulting acquisition function is</p>
      <p>a_MEI(x) = 𝔼[u(x)|C] = (μ(x̃) − μ(x)) Φ((μ(x̃) − μ(x))/ρ) + ρ ϕ((μ(x̃) − μ(x))/ρ)</p>
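      <p>MEI mirrors the EI closed form with ỹ replaced by μ(x̃) and σ(x) by ρ. A minimal sketch (argument names are our own):</p>

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mei_acquisition(mu_x, mu_opt, k_xx, k_oo, k_xo):
    # a_MEI = d * Phi(d / rho) + rho * phi(d / rho), where d = mu(x~) - mu(x)
    rho = math.sqrt(k_xx + k_oo - 2.0 * k_xo)
    d = mu_opt - mu_x
    return d * norm_cdf(d / rho) + rho * norm_pdf(d / rho)
```
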
      <p>Figure 1 shows the contour of this function. Sampling locations and loss over 45 iterations for the 4 acquisition functions are shown in Table 1, where star points represent the global optima and blue points are the sampling locations. We compare the acquisition functions in pairs. PI performed competitively in the given environment; its sampling trajectory towards the minimum almost follows the gradient direction, and after reaching the global minimum it only samples locations close to it. MPI shows similar performance; the difference is that it takes longer (more iterations) to reach the optimum, and it occasionally jumps out to sample locations far from the current minimum. EI and MEI place more weight on exploration: both search globally before starting to exploit near the loss optimum. Unlike EI, MEI converges faster after it has sampled locations close to the global minimum, and it does not frequently jump out to search locations far from the current optimum.</p>
    </sec>
    <sec id="sec-10">
      <title>3.2. Testing on Six-Hump Camel Function</title>
      <p>
        Six-hump camel function[
        <xref ref-type="bibr" rid="ref12">11</xref>
        ] has 6 local minima, 2 of which are global minima. The six-hump camel function in 2 dimensions is defined as f(x₁, x₂) = (4 − 2.1x₁² + x₁⁴/3)x₁² + x₁x₂ + (−4 + 4x₂²)x₂².
      </p>
    </sec>
    <sec id="sec-11">
      <title>3.3. Testing on Rastrigin Function</title>
      <p>
        Rastrigin function[
        <xref ref-type="bibr" rid="ref12">11</xref>
        ] is a multimodal function whose local minima are distributed on a grid throughout the search space. It has only 1 global minimum, at the centre. The Rastrigin function in d dimensions is defined as f(x) = 10d + ∑_{i=1}^{d} [x_i² − 10 cos(2πx_i)].
      </p>
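      <p>The three benchmark functions are standard test problems; their definitions, taken from the formulas above, can be sketched as follows:</p>

```python
import numpy as np

def sphere(x):
    # f(x) = sum x_i^2, global minimum 0 at the origin
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def six_hump_camel(x1, x2):
    # 2-D function with 6 local minima, 2 of them global
    return (4 - 2.1 * x1**2 + x1**4 / 3) * x1**2 + x1 * x2 + (-4 + 4 * x2**2) * x2**2

def rastrigin(x):
    # f(x) = 10 d + sum [x_i^2 - 10 cos(2 pi x_i)], global minimum 0 at the origin
    x = np.asarray(x, dtype=float)
    return float(10 * x.size + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))
```
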
    </sec>
    <sec id="sec-12">
      <title>3.3.2. Testing on Rastrigin Function in 100 Iterations</title>
      <p>Table 7 shows the sampling locations and loss of the 4 acquisition functions over 100 iterations. PI and MPI can sample locations near the global optimum, but only PI actually exploits at the optimum. EI and MEI exploit at several good local optima close to the global optimum but do not exploit at the global optimum itself. All 4 acquisition functions do explore the search space with a number of sampling locations. PI spends more iterations to obtain a relatively small loss; both MPI and MEI converge faster than PI and EI.</p>
    </sec>
    <sec id="sec-13">
      <title>3.4. Experiment Summary</title>
      <p>On simple loss functions with only a small number of minima, MPI performs better than PI, EI, and MEI, while MEI is the worst, with much larger loss and high standard deviation. On complicated loss functions with insufficient iterations, MEI and MPI are better than EI and PI. With sufficient iterations, EI is better than the other acquisition functions, and MEI is the worst. Under most conditions, the loss of MPI and MEI converges faster than that of PI and EI.</p>
    </sec>
    <sec id="sec-14">
      <title>4. Conclusions</title>
      <p>This paper discusses acquisition functions for Bayesian Optimization in machine learning applications. Traditional acquisition functions do not fully consider the systematic noise between the observed data and the ground truth. When the noise satisfies a Gaussian distribution assumption, we propose modified acquisition functions for EI and PI respectively. In addition, we believe that the following perspectives can serve as future work:</p>
      <p>• When the number of iterations grows beyond a threshold, a more complex hypothesis space could be used to construct predictions at unknown points, such as a Gaussian mixture distribution or a deep neural network with a complex structure.</p>
      <p>• When computing the acquisition function at a point, information from nearby points should be weighed at the same time; this could be realized by an algorithm similar to a random forest, in which nearby points are assigned to the same leaf node.</p>
      <p>• When the data contains non-Gaussian noise, acquisition functions should be constructed correspondingly to better balance exploration and exploitation, so as to improve optimization efficiency.</p>
    </sec>
    <sec id="sec-15">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , “
          <article-title>Random search for hyper-parameter optimization</article-title>
          .,
          <source>” Journal of machine learning research</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>2</issue>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , “
          <article-title>Gradient-based optimization of hyperparameters,” Neural computation</article-title>
          , vol.
          <volume>12</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1889</fpage>
          -
          <lpage>1900</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pelikan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cantú-Paz</surname>
          </string-name>
          , et al., “Boa:
          <article-title>The bayesian optimization algorithm</article-title>
          ,”
          <source>in Proceedings of the genetic and evolutionary computation conference GECCO-99</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>532</lpage>
          , Citeseer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Adams</surname>
          </string-name>
          , “
          <article-title>Practical bayesian optimization of machine learning algorithms</article-title>
          ,
          <source>” Advances in neural information processing systems</source>
          , vol.
          <volume>25</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pelikan</surname>
          </string-name>
          , “
          <article-title>Hierarchical bayesian optimization algorithm,” in Hierarchical Bayesian optimization algorithm</article-title>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>129</lpage>
          , Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <article-title>“Multi-task bayesian optimization</article-title>
          ,”
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Rippel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Satish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sundaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prabhat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Adams</surname>
          </string-name>
          , “
          <article-title>Scalable bayesian optimization using deep neural networks</article-title>
          ,
          <source>” in International conference on machine learning</source>
          , pp.
          <fpage>2171</fpage>
          -
          <lpage>2180</lpage>
          , PMLR,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Gaussian process for Machine Learning</article-title>
          . The MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Brochu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Cora</surname>
          </string-name>
          , and N. De Freitas, “
          <article-title>A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning</article-title>
          ,
          <source>” arXiv preprint arXiv:1012.2599</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nadarajah</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kotz</surname>
          </string-name>
          , “
          <article-title>Exact distribution of the max/min of two gaussian random variables,” IEEE Transactions on very large scale integration (VLSI) systems</article-title>
          , vol.
          <volume>16</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>212</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Molga</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Smutnicki</surname>
          </string-name>
          , “
          <article-title>Test functions for optimization needs</article-title>
          .” https://robertmarks.org/Classes/ENGR5358/Papers/functions.pdf,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>