Modifications of PI and EI under Gaussian Noise Assumption in Current Optima Huabing Wang School of Informatics, University of Edinburgh, Edinburgh, UK s1893474@ed.ac.uk Abstract Bayesian optimisation is a widely used tool for the hyper-parameter optimisations of black box functions. It implements a cheaper surrogate model such as Gaussian processes (GPs) to model search space. Acquisition functions on the top of GPs such as Probability Improvement (PI) and Expected Improvement (EI) are used to query the distribution of loss at all unevaluated positions in order to find the best one in theory. Traditionally, both acquisition functions use current optima in computations directly, but GPs assume that observations are noise corrupted. In this work, we mathematically derive modify PI and EI under Gaussian noise assumption. Modified PI and EI are compared with original versions on benchmark functions. We show that modified versions converge faster in same number of iterations and can achieve better performance in complex loss functions with reduced iterations. Keywords Bayesian Optimization, Acquisition Functions, Benchmark Functions 1. Introduction Machine learning has achieved phased success. However, almost all machine learning models need to optimize hyper-parameters, such as Neural Networks, Topic Model and Random Forest. In practice, tuning the hyper-parameters includes methods such as Grid Search, Random Search [1], and Gradient- based Optimizations [2]. These mehthods are then designed in order to minimize empirical risk with the desired efficiency or convergence speed. Bayesian Optimization [3] is acted as a probabilistic approach that majorly implements Gaussian Process (GP) and utilizes its property of both prediction and uncertainty measure to achieve derivative-free optimization. It can be used when the gradient of function for optimizing is not accessible. For Bayesian optimization, J Snoek et al. summarizes the applications in the field of machine learning, and numerical simulation shows that Bayesian optimization has the characteristics of high efficiency and strong convergence [4]. Martin Pelikan further find that hierarchy can be used to reduce problem complexity in black box optimization [5]. K Swersky et al. extends multi-task Gaussian processes to the framework of Bayesian optimization, and aims to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently [6]. J Snoek et al. further explores the use of neural networks as an alternative to GPs to model distributions over functions [7]. In fact, the principle of Bayesian Optimization is like reinforcement learning, it updates the modelling to hyper-parameters after each evaluation and then calculate the location for the next evaluation. Acquisition functions are used to calculate the desirability of each unevaluated locations, which also trades off between exploration and exploitation. Typical Acquisition functions are Upper Confidence Bound (UCB), Probability of Improvement (PI) and Expected Improvement (EI). However, the above acquisition functions does not fully take into account the deviation in the machine learning data collection process, that is, the noise contained in the current optimal. Based on the assumption of normal noise, we propose the corresponding modified version of PI and EI acquisitions, derive the corresponding explicit equations, and through a large number of numerical simulations and comparisons, the results indicates the feasibility of our proposed acquisition function. Copyright Β© 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2. Algorithms 2.1. Gaussian Process Gaussian process [8] can be considered as a proxy for a black-box function which enables uncertainty quantification. Gaussian process (GP) is infinite-dimensional multivariate Gaussian distribution. Covariance matrix of this distribution is defined by kernel functions k(β‹…,β‹…). Imagining 𝐱 forms a finite discretization of input space. Assuming the distribution has zero mean, prior draws 𝐟 can be simulated: f|𝐱 ∼ 𝒩(0, 𝐊) (ο€±) Statistical assumptions about the GP prior are represented in kernel functions. A commonly adopted kernel function is Matern kernel [8], where Ξ½ controls the smoothness of gaussian process. Let r = 𝐱i βˆ’ 𝐱j , Matern class of covariance function has the following definition: Ξ½ 21βˆ’Ξ½ √2Ξ½r √2Ξ½r k(r) = Ξ“(Ξ½) ( β„“ ) K Ξ½ ( β„“ ) (ο€²) with positive parameters Ξ½ and length-scale β„“, Gamma function Ξ“ and modified Bessel function K Ξ½ . Smooth GP kernels assumes that if x and xβ€² are close by, then f(x) and f(xβ€²) have similar values. Given noise observations 𝐲1:n at 𝐱1:n where yi ∼ 𝒩(fi , Οƒ2y ) . For a new point 𝐱n+1 , the joint probability distribution is given by 𝐲1:n 𝐊 + Οƒ2y 𝕀 𝐀(𝐱1:n , 𝐱n+1 ) (f ) ∼ 𝒩 (𝟎, [ ]) (ο€³) n+1 𝐀(𝐱n+1 , 𝐱1:n ) k(𝐱n+1 , 𝐱n+1 ) where 𝐊 = 𝐊(𝐱1:n , 𝐱1:n ). After applying the rule for conditional gaussians, we can gather the posterior over function values fn+1 |𝐲1:n , which follows a univariate Gaussian distribution 𝒩(ΞΌ(𝐱n+1 ), Οƒ2 (𝐱n+1 )) with ΞΌ(𝐱n+1 ) = 𝐀(𝐱1:n , 𝐱n+1 )(𝐊 + Οƒ2y 𝕀)βˆ’1 𝐲1:n Οƒ2 (𝐱n + 1) = k(𝐱n+1 , 𝐱n+1 ) βˆ’ 𝐀(𝐱1:n , 𝐱n+1 ) (𝐊 + Οƒ2y 𝕀)βˆ’1 𝐀(𝐱 n+1 , 𝐱1:n ) GP regression estimates the probability distribution of function values on unevaluated points. For each prediction location 𝐱 βˆ— , mean ΞΌ(𝐱 βˆ— ) gives the best estimate of the function value, and variance Οƒ2 (𝐱 βˆ— ) models the uncertainty at the point. Acquisition functions utilizes the computed distribution to guide the search for the optimal function value. 2.2. Acquisition Functions Acquisition functions take the mean and variance at each unevaluated point as input and compute a value indicating how favorable it is to sample at this point. It trades off between exploitation and exploration: β€’ Exploitation: looking for locations that minimize the posterior mean ΞΌ(𝐱). β€’ Exploration: looking for locations that maximize posterior variance Οƒ2 (𝐱) Given the data C = [(𝐱1 , y1 ), β‹― , (𝐱t , yt )] observed, the next point 𝐱t+1 is chosen by the ranking the value returned by acquisition function at candidate points. 𝐱t+1 = arg min a(𝐱|C) () 𝐱 Acquisition functions is defined as the expected utility u at the unevaluated location 𝐱: a(𝐱|C) = 𝔼(u(𝐱, y)|𝐱, C) () = ∫ u(𝐱, y) p(y|𝐱, C) dy The probability p(y|𝐱, C) here is gathered from posterior distribution 𝒩(ΞΌ(𝐱), Οƒ2 (𝐱)) calculated by GP regression. Probability Improvement (PI), Expected Improvement (EI) and Entropy Search employ different utility functions. Other acquisition functions such as Upper confidence bound(UCB) directly invoke the mean and variance instead. 2.2.1. Probability Improvement PI computes the likelihood that the function at unevaluated locations 𝐱 will return a value lower than the current minimum yΜƒ. Utility function of PI is defined as: 0, if(𝐱) > yΜƒ u(𝐱) = { (ο€Ά) 1, if(𝐱) ≀ yΜƒ We can understand utility function as a reward, when f(𝐱) ≀ yΜƒ , a certain amount of value is rewarded, here the reward is 1. According to the utility function above, the expected utility x can be Μƒβˆ’ΞΌ(x) y written as the normal commutative density function of Οƒ(x) : aPI (𝐱) = 𝔼[u(𝐱)|C] Μƒ y = βˆ«βˆ’βˆž 𝒩f(𝐱) (ΞΌ(𝐱), Οƒ2 (𝐱))d f(𝐱) (ο€·) Μƒβˆ’ΞΌ(𝐱) y = Ξ¦ ( Οƒ(𝐱) ) PI(𝐱) PI only cares whether f(𝐱) is greater than yΜƒ, but does not count the quantity of improvement. This will result PI very likely to pick points near to previously sampled locations. As the searching trajectory reach local minimum, it will be stuck here and hardly jump out. Therefore, PI only cares about exploitation. 2.2.2. Expected Improvement EI leverages better between exploration and exploitation. The amount of improvement with respect to the recent global optima yΜƒ βˆ’ f(𝐱) is taken into account. Utility function of EI is defined as: u(𝐱) = max(0, yΜƒ βˆ’ f(𝐱)) (ο€Έ) Therefore, the expression of expected utility can be derived: aEI (𝐱) = 𝔼[u(𝐱)|C] ∞ = βˆ«βˆ’βˆž max(0, yΜƒ βˆ’ f(𝐱)) 𝒩f(𝐱) (ΞΌ(𝐱), Οƒ2 (𝐱))d f(𝐱) Μƒ y = βˆ«βˆ’βˆž (yΜƒ βˆ’ f(𝐱)) 𝒩f(𝐱) (ΞΌ(𝐱), Οƒ2 (𝐱))d f(𝐱) (ο€Ή) Μƒβˆ’ΞΌ(𝐱) y Μƒβˆ’ΞΌ(𝐱) y = (yΜƒ βˆ’ ΞΌ(𝐱))Ξ¦ ( ) + Οƒ(𝐱)Ο• ( ) Οƒ(𝐱) Οƒ(𝐱) where Ο•(β‹…) is the probability density function. In order to get higher value, at the left side of equation, we want to minimize ΞΌ(𝐱); and at the right side, we want to maximize Οƒ(𝐱). A basic equation based trade off between exploitation and exploration are achieved here. The trade off between exploration and exploitation can be adjusted by tunning a parameter ΞΎ at the deduction part(yΜƒ βˆ’ ΞΌ(𝐱) βˆ’ ΞΎ). Larger ΞΎ will favour exploration in early steps and exploitaion later does not work well experimentally[9]. 2.2.3. Modified Probability Improvement If evaluations are noise corrupted yi |fi ∼ 𝒩(fi , Οƒ2y ), the current loss optimum yΜƒ is not a reliable sample. Instead of using the optimum directly, we consider to use the posterior distribution 𝒩(ΞΌ(𝐱̃), Οƒ2 (𝐱̃)) at the current optimum. PI can be modified under the noise corrupted conditions in order to increase the robustness at sampling process. Let β€’ k(𝐱, 𝐱) denotes the posterior variance Οƒ2 (𝐱) of an unevaluated point 𝐱 computed from gaussian process. β€’ k(𝐱̃, 𝐱̃) denotes the posterior variance Οƒ2 (𝐱̃) of loss optimum . β€’ k(𝐱, 𝐱̃) denotes the posterior covariance between unevaluated point and loss optimum. According to the rule of variance deduction of two dependent random variables: Var[X βˆ’ Y] = Var[X] + Var[Y] βˆ’ 2 Γ— Cov[X, Y] (ο€±ο€°) Distribution of f(𝐱) βˆ’ f(𝐱̃) can be derived: 𝒩f(𝐱)βˆ’f(𝐱̃) (ΞΌ(𝐱) βˆ’ ΞΌ(𝐱̃), k(𝐱, 𝐱) + k(𝐱̃, 𝐱̃) βˆ’ 2k(𝐱, 𝐱̃)) (ο€±ο€±) Utility function of Modified Probability Improvement (MPI) is rewritten as: 0, if(𝐱) βˆ’ f(𝐱̃) > 0 u(𝐱) = { (ο€±ο€²) 1, if(𝐱) βˆ’ f(𝐱̃) ≀ 0 Since the utility function only counts the improvement when f(𝐱) βˆ’ f(𝐱̃) ≀ 0, PI can be written as xβˆ’ΞΌ the probability of f(𝐱) βˆ’ f(𝐱̃) ≀ 0. As if X ∼ 𝒩(ΞΌ, Οƒ2 ), then β„™(X < x) = Ξ¦( Οƒ ). Cumulative density function for MPI can be derived: aMPI (x) = β„™(f(𝐱) βˆ’ f(𝐱̃) ≀ 0) 0βˆ’(ΞΌ(𝐱)βˆ’ΞΌ(𝐱̃)) = Ξ¦( ) √k(𝐱,𝐱)+k(𝐱̃,𝐱̃)βˆ’2k(𝐱,𝐱̃) (ο€±ο€³) ΞΌ(𝐱̃)βˆ’ΞΌ(𝐱) = Ξ¦( ) √k(𝐱,𝐱)+k(𝐱̃,𝐱̃)βˆ’2k(𝐱,𝐱̃) 2.2.4. Modified Expected Improvement Same as MPI, yΜƒ is replaced by the posterior distribution at 𝐱̃ in Modified Expected Improvement (MEI). Expression of utility function is: u(𝐱) = max(0, f(𝐱̃) βˆ’ f(𝐱)) () A lemma of expectation on max function applied on normal distributed random variables [10] can be directly employed to get the expression of MEI: If s ∼ 𝒩(ΞΌ, Οƒ2 ) ∞ 𝔼[max(0, s)] = ∫0 s𝒩(S; ΞΌ, Οƒ2 ) ds () ΞΌ ΞΌ = Ξ¦( )ΞΌ + Ο•( )Οƒ Οƒ Οƒ We already know the mean and variance of normal distribution of f(𝐱) βˆ’ f(𝐱̃) from equation 11. Mean of f(𝐱̃) βˆ’ f(𝐱) is ΞΌ(𝐱̃) βˆ’ ΞΌ(𝐱) , and the variance remines the same. Let ρ denotes √k(𝐱, 𝐱) + k(𝐱̃, 𝐱̃) βˆ’ 2k(𝐱, 𝐱̃), applying Lemma from equation 15: aMEI = 𝔼[u(x)|C] ΞΌ(𝐱̃)βˆ’ΞΌ(𝐱) ΞΌ(𝐱̃)βˆ’ΞΌ(𝐱) (ο€±ο€Ά) = Ξ¦( ρ )(ΞΌ(𝐱̃) βˆ’ ΞΌ(𝐱)) + Ο•( ρ )ρ 3. Experiments Performance of modified versions of PI and EI are compared with the traditional PI and EI on 3 selected 2D benchmark functions. Variables including kernel functions, kernel parameters and position of pre-samplings are controlled to be the same for each set of experiment. We will visualise sampling position and global optima in search space, and current minimal loss at each iteration. Performance of 4 acquisition functions on each benchmark function will be discussed by sections. 3.1. Testing on Spere Function Sphere function[11] has 1 global minima. It is bowl-shaped, convex and unimodal. Sphere function in d dimensions is: f(𝐱) = βˆ‘di=1 xi2 (ο€±ο€·) Figure 1: Coutours of Sphere Function Figure 1 shows the contour of this function. Sampling locations and loss in 45 iterations for 4 acuqisation functions are in table 1, where star points represents the global optima and blue points are the sampling locations. We will compare acquisition functions in pair. PI performed competitively in the given environment, its sampling trajectory to the minima almost follows the gradient direction. After it reaches the global minima, it only sample the locations close to it. MPI shows similar performance, the difference is that it takes longer time (more iterations) to reach optima, and it will occationally jump out and sample locations far from current minima. EI and MEI puts more leverage at exploration side. Both of them will search globally before start to exploit near to loss optimum. Unlike EI, MEI converges faster after it had sampled locations close to global minima, it does not frequently jump out and searching locations far from current optima. Table 1 Sampling Locations and Loss in 45 iterations on Sphere Function Table 2 Loss on Sphere Function Averaged in 10 Trails Acquisition Function Mean Loss Β± Standard Deviation PI 2.15 Γ— 10βˆ’4 Β± 3.04 Γ— 10βˆ’5 MPI 7.13 Γ— 10βˆ’5 Β± 7.07 Γ— 10βˆ’5 EI 1.42 Γ— 10βˆ’4 Β± 6.61 Γ— 10βˆ’5 MEI 1.81 Γ— 10βˆ’3 Β± 1.93 Γ— 10βˆ’4 Table 2 gives the mean and standard deviations of loss that averaged in 10 trails for each acquisation function. Each trail has 45 iterations and a unique random seed. MPI is better performed than PI, but MEI is worse than EI. Sphere function provides a simple situation where every acquisition function is able to sample near to the global optima. MPI shows a better exploitation ability in such a situation. 3.2. Testing on Six-Hump Camel Function Six-hump camel function[11] has 6 local minima, and 2 of them are global minima. Six-hump camel function in 2 dimensions is defined as: x4 f(𝐱) = (4 βˆ’ 2.1x12 + 31 ) x12 + x1 x2 + (βˆ’4 + 4x22 )x22 (ο€±ο€Έ) Figure 2: Coutours of Six-Hump Camel Function Table 3 Sampling Locations and Loss in 45 iterations Figure 2 shows the contour of this function. According to table 3, both PI and MPI could only find 1 global optima, but PI converges faster than MPI. EI and MEI is able to sample points equally near to both global optimum. MEI converges faster than EI at finding the first global minima. Table 4 Loss on Six-Hump Camel Function Averaged in 10 Trails Acquisition Function Mean Loss Β± Standard Deviation PI 1.1 Γ— 10βˆ’4 Β± 8.62 Γ— 10βˆ’5 MPI 7.19 Γ— 10βˆ’5 Β± 1.34 Γ— 10βˆ’4 EI 7.41 Γ— 10βˆ’3 Β± 5.73 Γ— 10βˆ’3 MEI 1.15 Γ— 10βˆ’2 Β± 9.02 Γ— 10βˆ’3 Table 4 gives the mean of loss and its standard deviation with 45 iterations averaged in 10 trails. Six-hump camel function is more complex compared with sphere function Through EI and MEI is able to find multiple optima, PI and MPI still achieve smaller loss by exploiting single one. Comparing acquisition functions in pairs, PI and MPI achieves similar performance, and EI still performs better than MEI. 3.3. Testing on Rastrigin Function Rastrigin function[11] is a multimodal function, and its local minimas grid distribute through out the search space. It only has 1 global minima at the center. Rastrigin function in d dimensional is defined as: f(𝐱) = 10d + βˆ‘di=1 [xi2 βˆ’ 10cos(2Ο€xi )] (ο€±ο€Ή) Figure 3 shows the contour of this function. As rastrigin function is complicated, we set two set of experiments in order to test the performance of acquisitions in 45 and 100 iterations respectatively. Figure 3: Coutours of Rastrigin Function 3.3.1. Testing on Rastrigin Function in 45 Iterations Table 5 shows the sampling locations and loss of 4 acquisition functions. Both PI and MPI exploit at a local optima close to global optima, and sampling points globally at the same time. EI and MEI find 4 local optima to exploit respectively, but since they has exploited too much, both of them do not have enough iterations to explore globally. The loss of PI and MPI converges faster than EI and MEI, and MPI is the fastest in the four acquisition functions. Table 5 Sampling Locations and Loss in 45 iterations Table 6 Loss on Rastrigin Function Averaged in 10 Trails Acquisition Function Mean Loss Β± Standard Deviation PI 9.41 Β± 4.41 MPI 2.57 Β± 2.17 EI 4.22 Β± 1.65 MEI 2.17 Β± 2.20 Table 6 gives the comparison of loss averaged in 10 trails. Here both MPI and MEI achieve a better loss than original acquisition functions. MPI is also more robust than PI as the standard deviation is smaller. 3.3.2. Testing on Rastrigin Function in 100 Iterations Table 7 shows the sampling locations and loss of 4 acquisition functions in 100 iterations. PI and MPI can sample locations near to global optima, but only PI actually exploits at the optima. EI and MEI exploits at several good local optimas close to global optima but did not exploit at global optima. All 4 acquisation functions do explore search space with a number of sampling locations. PI spends more iterations to get a relatively small loss, both MPI and MEI converge faster than PI and EI. Table 7 Sampling Locations and Loss in 100 iterations Table 8 Loss on Rastrigin Functions in 100 Iterations Averaged in 10 Trails Acquisition Function Mean Loss Β± Standard Deviation PI 1.18 Β± 1.38 MPI 1.70 Β± 1.26 EI 0.73 Β± 0.45 MEI 1.14 Β± 0.53 Table 8 gives the comparison of loss averaged in 10 trails. PI and EI are better than MPI and MEI. EI has the lowest loss with best robustness to get the good result. This shows in complex situation with sufficient iterations, EI shows its superiority on finding optimised result. 3.4. Experiment Summary In simple loss functions with only a small number of minimas, MPI performs better than PI, EI and MEI. MEI is the worst one with much bigger loss and high standard deviations. In complicated loss functions with insufficient iterations, MEI and MPI is better than EI and PI. With sufficient iterations, EI is better than other acquisition functions, and MEI is the worst. In most of the conditions, loss of MPI and MEI converge faster than PI and EI. 4. Conclusions This paper discusses the acquisition function in Bayesian Optimization in machine learning applications. Based on the traditional acquisition function, the systematic noise between the observation data and the ground truth is not fully considered. When the noise satisfies the Gaussian distribution assumption, we propose modified acquisition functions for EI and PI respectively. In addition, we believe that the following perspectives can be used as future work: β€’ When the number of iterations increases beyond a threshold, we should consider using a more complex hypothesis space to construct the prediction of unknown points, such as Gaussian mixture distribution or depth neural network with complex structure. β€’ When calculating the collection function of a point, the information of nearby points should be weighed at the same time, which can be realized by an algorithm similar to random forest, in which the nearby points will be assigned to a leaf node. β€’ When the data contains non Gaussian noise, acquisition functions should be constructed correspondingly to achieve better balancing the exploration and exploitation, so as to improve the optimization efficiency. 5. References [1] J. Bergstra and Y. Bengio, β€œRandom search for hyper-parameter optimization.,” Journal of machine learning research, vol. 13, no. 2, 2012. [2] Y. Bengio, β€œGradient-based optimization of hyperparameters,” Neural computation, vol. 12, no. 8, pp. 1889–1900, 2000. [3] M. Pelikan, D. E. Goldberg, E. CantΓΊ-Paz, et al., β€œBoa: The bayesian optimization algorithm,” in Proceedings of the genetic and evolutionary computation conference GECCO-99, vol. 1, pp. 525– 532, Citeseer, 1999. [4] J. Snoek, H. Larochelle, and R. P. Adams, β€œPractical bayesian optimization of machine learning algorithms,” Advances in neural information processing systems, vol. 25, 2012. [5] M. Pelikan, β€œHierarchical bayesian optimization algorithm,” in Hierarchical Bayesian optimization algorithm, pp. 105–129, Springer, 2005. [6] K. Swersky, J. Snoek, and R. P. Adams, β€œMulti-task bayesian optimization,” 2013. [7] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams, β€œScalable bayesian optimization using deep neural networks,” in International conference on machine learning, pp. 2171–2180, PMLR, 2015. [8] C. E. Rasmussen and C. K. I. Williams, Gaussian process for Machine Learning. The MIT Press, 2005. [9] E. Brochu, V. M. Cora, and N. De Freitas, β€œA tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” arXiv preprint arXiv:1012.2599, 2010. [10] S. Nadarajah and S. Kotz, β€œExact distribution of the max/min of two gaussian random variables,” IEEE Transactions on very large scale integration (VLSI) systems, vol. 16, no. 2, pp. 210–212, 2008. [11] M. Molga and C. Smutnicki, β€œTest functions for optimization needs.” https://robertmarks.org/Classes/ENGR5358/Papers/functions.pdf year=2005.