=Paper=
{{Paper
|id=Vol-2491/paper32
|storemode=property
|title=Maximum Entropy Bayesian Actor Critic
|pdfUrl=https://ceur-ws.org/Vol-2491/paper32.pdf
|volume=Vol-2491
|authors=Steven Homer
|dblpUrl=https://dblp.org/rec/conf/bnaic/Homer19a
}}
==Maximum Entropy Bayesian Actor Critic==
Steven Homer [0000-0002-1515-1093]

AI Lab, Vrije Universiteit Brussel, 1050 Brussels, Belgium. info@ai.vub.ac.be, https://ai.vub.ac.be

Abstract. In recent years Deep Reinforcement Learning [12] has achieved human-like performance or better on a variety of benchmarks such as the Atari Arcade [2]; however, Deep RL often suffers from sample inefficiency and convergence brittleness. That is, to learn even the simplest tasks, Deep RL requires a huge number of meaningful samples, and it converges only if the parameters are tuned just right [4]. This paper seeks to ameliorate these problems of sample inefficiency and convergence brittleness by combining two different reinforcement learning paradigms: Bayesian RL and Maximum Entropy RL. Bayesian reinforcement learning [8] uses Bayesian statistics to model the confidence in a given model, which has been shown to greatly increase sample efficiency [3]. Maximum entropy RL modifies the standard reward to promote more exploration by the agent [20]. The hope is that combining the two retains the best of both properties and avoids the problems faced in deep RL altogether. This paper first derives a soft policy gradient that introduces an entropy-weighted term into the standard policy gradient, and then applies it to the Bayesian actor critic paradigm, augmenting the parameter update rule to account for the entropy-weighted value function. After determining a closed-form solution of the gradient with the softmax policy, the method was implemented and evaluated on the Cartpole environment, signalling that there are avenues ripe for future research in this area.

Keywords: Bayesian Reinforcement Learning · Maximum Entropy Reinforcement Learning · Gaussian Processes · Bayesian Quadrature · Bayesian Actor Critic

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===

This paper is structured as follows. The Background section provides the exposition on Gaussian processes and Bayesian quadrature needed to understand the Bayesian Actor Critic model. After explaining the details of Bayesian actor critic and maximum entropy RL, the main contribution of the paper, Maximum Entropy Bayesian Actor Critic (MEBAC), is discussed. After the explication of MEBAC and a few implementation choices, the Results, Discussion, and Conclusion sections analyze the model empirically.

===2 Background===

====2.1 Policy Gradient and Actor Critic Methods====

When formulating the objective of the reinforcement learning problem, the goal is always to maximize the expected return. The way different methods formulate this objective determines the way in which the problem is approached. For instance, the Bellman equations [1] define the optimal state-value and action-value functions in terms of maximizing the expected return, resulting in value-based methods. Alternatively, one can formulate the objective in terms of the parameterization of a policy, and what results are methods that operate directly on the parameters of a policy instead of estimating value functions.

The policy gradient theorem [18] expresses this objective in terms of the policy parameters and gives the gradient of that objective. By moving along this gradient, the policy improves by directly adjusting its parameters.
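To make this parameter-update idea concrete, the following is a minimal Python sketch of a hypothetical REINFORCE-style ascent step; the rollout format, learning rate, and the grad_log_pi callable are illustrative assumptions, not the implementation described later in this paper.

<pre>
def reinforce_update(theta, episode, grad_log_pi, beta=0.01, gamma=0.99):
    """One REINFORCE-style ascent step: theta := theta + beta * G_t * grad log pi(a_t|s_t; theta).

    episode:      list of (state, action, reward) tuples from one rollout.
    grad_log_pi:  callable (state, action, theta) that returns the gradient of log pi(a|s; theta).
    """
    # Discounted returns-to-go, computed backwards through the episode.
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Ascend the sampled policy gradient for every visited state-action pair.
    for (s, a, _), G_t in zip(episode, returns):
        theta = theta + beta * G_t * grad_log_pi(s, a, theta)
    return theta
</pre>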
A baseline function can be added to the value function in the policy gradient to greatly reduce the variance of the gradient without affecting its expectation, as long as the baseline does not depend on the actions. By setting this baseline to the current estimate of the value function, we arrive at the actor-critic paradigm [11]. That is, the policy parameterized by θ is the 'actor' and the estimate of the value function is the 'critic'.

<math>\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\,\nabla\pi(a \mid s;\theta) \qquad (1)</math>

meaning

<math>\theta_{t+1} = \theta_t + \beta\,\delta_t\,\nabla\log\pi(A_t \mid S_t;\theta_t), \qquad \text{where } \delta_t = R_{t+1} + \gamma\hat{v}(S_{t+1},w) - \hat{v}(S_t,w).</math>

====2.2 Bayesian Reinforcement Learning====

Bayesian reinforcement learning is a catch-all term for any reinforcement learning method that uses Bayes' rule in some capacity. In general, the model maintains a prior distribution, representing the current approximation, which is updated to a posterior distribution as data is seen. As more data is seen, the variance of the posterior tends to decrease, signalling greater confidence in the mean of the distribution. The difficulty with this approach is that, unlike many other optimization techniques that operate using derivatives, Bayesian updates tend to contain integrals, making them analytically intractable unless the proper priors are chosen.

<math>p(h \mid D) = \frac{p(D \mid h)\,p(h)}{p(D)} \qquad (2)</math>

====2.3 Bayesian Expectation====

Many core algorithms in reinforcement learning must calculate some kind of expectation. This can be taken over any distribution, though often the expectation is taken over reward trajectories or value functions. In fact, value functions and the Bellman equation [1], perhaps the most core formalisms in all of reinforcement learning, are posed as expectations over returns.

<math>v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a], \qquad (3)</math>

<math>v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\,\big[r + \gamma v_\pi(s')\big]</math>

The primary difference between Bayesian reinforcement learning and other types of reinforcement learning is that other methods tend to use a frequentist, Monte Carlo approach to calculating the expectation. That is, by averaging over many samples of the value in question, Monte Carlo approaches find unbiased approximations of the expectation of that value. This method has the benefit of being conceptually and practically simple, but may take many, many samples to converge to a good approximation. In contrast, Bayesian reinforcement learning uses Bayesian methods to approximate this expectation. By performing a posterior update of a prior distribution whenever new data is seen, Bayesian methods can be much more sample efficient than Monte Carlo methods.

====2.4 Gaussian Processes====

Since in reinforcement learning we are often concerned with estimating functions, like state-value or action-value functions, Bayesian RL methods tend to place Gaussian processes over those functions in order to approximate them. In the same way that a Gaussian distribution is a probability distribution over points, a Gaussian process can be thought of as a distribution over functions [14]. By thinking of functions in terms of Hilbert spaces [10], one can equivalently think of a Gaussian process as a multivariate Gaussian distribution over infinitely many dimensions. A Gaussian process is fully defined by its mean and covariance, such that <math>f(\cdot) \sim \mathcal{N}(\bar{f}(\cdot), k(\cdot,\cdot))</math>, where <math>\bar{f}</math> is the mean function and <math>k</math> is the kernel function.
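As a concrete illustration of conditioning a Gaussian process on noisy observations (the posterior moments are stated formally in equations (4)-(6) just below), here is a minimal NumPy sketch; the zero prior mean, squared-exponential kernel, and noise level are assumptions made only for this example.

<pre>
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0):
    """Squared-exponential kernel matrix with entries exp(-|x - x'|^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    """Posterior mean and covariance of a zero-mean GP at X_test, given noisy observations y_train."""
    K = sq_exp_kernel(X_train, X_train)                        # [K]_ij = k(x_i, x_j)
    C = np.linalg.inv(K + noise_var * np.eye(len(X_train)))    # C = (K + Sigma)^{-1}
    k_star = sq_exp_kernel(X_train, X_test)                    # columns are k(x) at each test point
    mean = k_star.T @ C @ y_train                              # f_bar(x) + k(x)^T C (y - f_bar), with f_bar = 0
    cov = sq_exp_kernel(X_test, X_test) - k_star.T @ C @ k_star
    return mean, cov
</pre>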
<math>\mathbb{E}[f(x)] = \bar{f}(x) \quad \text{and} \quad \mathrm{Cov}[f(x), f(x')] = k(x,x') \qquad (4)</math>

with posterior moments

<math>\mathbb{E}[f(x) \mid D_M] = \bar{f}(x) + k(x)^\top C\,(y - \bar{f}), \qquad \mathrm{Cov}[f(x), f(x') \mid D_M] = k(x,x') - k(x)^\top C\, k(x'), \qquad (5)</math>

where

<math>\bar{f} = (\bar{f}(x_1), \ldots, \bar{f}(x_M))^\top, \quad y = (y(x_1), \ldots, y(x_M))^\top, \quad k(x) = (k(x_1,x), \ldots, k(x_M,x))^\top, \quad [K]_{i,j} = k(x_i, x_j), \quad C = (K + \Sigma)^{-1}. \qquad (6)</math>

====2.5 Bayesian Quadrature====

Bayesian quadrature [13] is a method for approximating integrals of the form

<math>\rho = \int f(x)\,g(x)\,dx \qquad (7)</math>

One can immediately see that if either f(x) or g(x) is a probability density, then this integral is just an expectation. Bayesian quadrature approximates the integral by modeling one of the component functions as a Gaussian process and performing a posterior update when new data is seen. It has been shown that this yields much more accurate approximations of expectations than Monte Carlo techniques [15]. Taking f(x) as the random function to be modeled, the posterior moments of the integral are

<math>\mathbb{E}[\rho \mid D_M] = \int \mathbb{E}[f(x) \mid D_M]\,g(x)\,dx, \qquad \mathrm{Var}[\rho \mid D_M] = \iint \mathrm{Cov}[f(x), f(x') \mid D_M]\,g(x)\,g(x')\,dx\,dx'. \qquad (8)</math>

Rewriting,

<math>\mathbb{E}[\rho \mid D_M] = \rho_0 + b^\top C\,(y - \bar{f}) \quad \text{and} \quad \mathrm{Var}[\rho \mid D_M] = b_0 - b^\top C\, b, \qquad (9)</math>

where

<math>b = \int k(x)\,g(x)\,dx, \qquad \rho_0 = \int \bar{f}(x)\,g(x)\,dx, \qquad b_0 = \iint k(x,x')\,g(x)\,g(x')\,dx\,dx'. \qquad (10)</math>

The only remaining issue is to ensure that these integrals are analytically tractable, which can be done by choosing <math>\bar{f}(x)</math>, <math>g(x)</math>, and <math>k(x,x')</math> appropriately.

====2.6 Bayesian Policy Gradient====

By formulating the policy gradient theorem in terms of Bayesian quadrature, a Bayesian policy gradient method [6] can be derived as follows. First, we write the policy gradient theorem in its continuous form, choosing the modeled random function f(x) and the known function g(x) from Bayesian quadrature as

<math>\nabla J(\theta) = \int dz\, \underbrace{\mu_\pi(z;\theta)\,\nabla\log\pi(a \mid s;\theta)}_{g(x)}\, \underbrace{q_\pi(z;\theta)}_{f(x)} \qquad (11)</math>

where <math>\mu_\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t P_t(s,a)</math> is the discounted state-action occupancy for policy π, and <math>z \in \mathcal{Z}</math> are state-action pairs (s, a). This results in the posterior moments

<math>\mathbb{E}[\nabla J(\theta) \mid D_t] = \int dz\, g(z;\theta)\,\mathbb{E}[q_\pi(z;\theta) \mid D_t], \qquad \mathrm{Cov}[\nabla J(\theta) \mid D_t] = \iint_{\mathcal{Z}^2} dz\,dz'\, g(z;\theta)\,\mathrm{Cov}[q_\pi(z), q_\pi(z') \mid D_t]\,g(z';\theta)^\top \qquad (12)</math>

Therefore, to calculate this integral we need the posterior moments of the Gaussian process modelling <math>q_\pi</math>.

====2.7 Gaussian Process Temporal Difference Learning====

One way of approximating the action-value function is Gaussian Process Temporal Difference learning (GPTD) [5]. This method takes the standard temporal difference learning paradigm and, instead of performing max-updates like Q-learning or expected updates like Expected Sarsa [17], places a Gaussian process over the value function v or q and performs posterior updates using the reward at each time step, approximating the value function with v̂ or q̂ respectively. As more data is seen, the variance of the estimate decreases, directly showing the confidence in the estimation.

By modelling the relationship between the reward signal r and the action-value function <math>q_\pi</math> as a temporal difference between processes with noise N, the posterior moments of the action-value process can be found, providing an approximation for <math>q_\pi</math>:

<math>r(z_t) = q_\pi(z_t) - \gamma q_\pi(z_{t+1}) + N(z_t, z_{t+1})</math>

<math>\hat{q}_\pi(z) = \mathbb{E}[q_\pi(z) \mid D_t] = k_t(z)^\top \alpha_t, \qquad \hat{s}_\pi(z,z') = \mathrm{Cov}[q_\pi(z), q_\pi(z') \mid D_t] = k(z,z') - k_t(z)^\top C_t\, k_t(z'), \qquad (13)</math>

where

<math>\alpha_t = H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} r_{t-1}, \qquad C_t = H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} H_t, \qquad (14)</math>

with <math>H_t</math> the discount matrix and <math>\Sigma_t = \sigma^2 H_t H_t^\top</math> the noise covariance.
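The GPTD weights in equation (14) translate almost directly into NumPy. The sketch below is illustrative only: it assumes a single non-terminating trajectory, so the discount matrix H_t is taken to have ones on its diagonal and -γ on its superdiagonal, and it uses a dense matrix inverse rather than the recursive or sparsified updates of the original algorithm.

<pre>
import numpy as np

def gptd_posterior_weights(K, rewards, gamma=0.99, sigma2=1.0):
    """Compute the GPTD posterior weights alpha_t and matrix C_t (equation (14)).

    K:       (t+1, t+1) kernel matrix over the visited state-action pairs.
    rewards: length-t sequence of observed rewards along a single trajectory.
    """
    rewards = np.asarray(rewards, dtype=float)
    t = len(rewards)
    H = np.eye(t, t + 1) - gamma * np.eye(t, t + 1, k=1)   # discount matrix H_t
    Sigma = sigma2 * H @ H.T                               # noise covariance
    A = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ A @ rewards                              # alpha_t
    C = H.T @ A @ H                                        # C_t
    return alpha, C
</pre>

The posterior value estimate at any state-action pair z is then k_t(z)ᵀ α_t, as in equation (13).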
====2.8 Bayesian Actor Critic====

To move to a Bayesian actor critic [7], we take the Bayesian policy gradient and estimate the action-value function using Gaussian Process Temporal Difference learning. GPTD provides the posterior moments for the Bayesian policy gradient, and by using the Fisher information kernel [16] we obtain a closed-form solution to the integral, resulting in the Bayesian Actor Critic:

<math>\mathbb{E}[\nabla J(\theta) \mid D_t] = \int dz\, g(z;\theta)\,k_t(z)^\top \alpha_t, \qquad \mathrm{Cov}[\nabla J(\theta) \mid D_t] = \iint_{\mathcal{Z}^2} dz\,dz'\, g(z;\theta)\,\big(k(z,z') - k_t(z)^\top C_t\, k_t(z')\big)\,g(z';\theta). \qquad (15)</math>

Rewriting,

<math>\mathbb{E}[\nabla J(\theta) \mid D_t] = B_t \alpha_t \quad \text{and} \quad \mathrm{Cov}[\nabla J(\theta) \mid D_t] = B_0 - B_t C_t B_t^\top, \qquad (16)</math>

where

<math>B_t = \int dz\, g(z;\theta)\,k_t(z)^\top, \qquad B_0 = \iint_{\mathcal{Z}^2} dz\,dz'\, g(z;\theta)\,k(z,z')\,g(z';\theta)^\top. \qquad (17)</math>

To make these tractable, the kernel <math>k = k_f + k_s</math> becomes a combination of the state kernel <math>k_s(s,s')</math> and the Fisher information kernel <math>k_f(z,z') = u(z;\theta)^\top G^{-1} u(z';\theta)</math>, with <math>u(z;\theta)</math> the score function and <math>G</math> the Fisher information matrix, where

<math>u(z;\theta) = \nabla\log\pi(a \mid s;\theta), \qquad U_t = [u(z_0), \ldots, u(z_t)], \qquad \hat{G}_t = \frac{1}{t+1} U_t U_t^\top, \qquad (18)</math>

meaning <math>B_t = U_t</math> and <math>B_0 = G</math>. Finally, this leads to the Bayesian actor critic policy update rule, using either the conventional or the natural gradient [7]:

<math>\theta_{t+1} = \theta_t + \beta\,\Delta\theta, \quad \text{where} \quad \Delta\theta = U_t \alpha_t \ \text{(conventional gradient)} \quad \text{or} \quad \Delta\theta = \hat{G}_t^{-1} U_t \alpha_t \ \text{(natural gradient)}. \qquad (19)</math>

Therefore, the Bayesian actor critic improves the parameters of its policy by iteratively moving along the gradient, applying the following three steps each cycle, where a cycle is M steps long:

* Use GPTD to estimate <math>\hat{q}_\pi</math>, returning <math>\alpha_t</math> and <math>C_t</math>.
* Compute <math>U_t</math> and estimate <math>\hat{G}_t</math>.
* Calculate <math>\Delta\theta</math> and update the parameters <math>\theta</math>.

====2.9 Maximum Entropy Reinforcement Learning====

Normally, the standard objective function is solely about maximizing the return. Maximum entropy reinforcement learning changes this core goal: instead of only maximizing the return, the maximum entropy objective [20] seeks to maximize both the return and the policy entropy.

<math>J^s(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big] \qquad (20)</math>

By simultaneously seeking high rewards and high entropy (that is, acting as randomly as possible), the agent explores more. In addition, if there are multiple optimal or near-optimal policies, the agent can give equal weight to all of them instead of assigning all the mass to a single policy in the set. In practice, it has been shown [9] that this objective also improves the learning rate and stability of the algorithms.

===3 Maximum Entropy Bayesian Actor Critic===

The goal of Maximum Entropy Bayesian Actor Critic (MEBAC) is to take the maximum entropy objective function and alter the Bayesian actor critic machinery to work with it efficiently. Though the Soft Actor Critic algorithm [9] was the initial inspiration for using the maximum entropy objective, the only things MEBAC takes from it are that objective and the soft value function. The soft value function alters the standard value function with an entropy-like term that gives weight to more random policy states.
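To build intuition for the soft value function, stated formally in equation (21) below, here is a small illustrative Python sketch for a discrete action set; the policy probabilities and q-values used in the example are placeholders, not values from the paper's experiments.

<pre>
import numpy as np

def soft_state_value(pi_probs, q_values):
    """Soft value v^s(s) = sum_a pi(a|s) * (q(s,a) - log pi(a|s)) for a discrete action set."""
    pi_probs = np.asarray(pi_probs, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    return float(np.sum(pi_probs * (q_values - np.log(pi_probs))))

# A near-uniform policy gets credit for its entropy on top of its expected q-value.
print(soft_state_value([0.5, 0.5], [1.0, 1.0]))    # 1.0 + entropy of about 0.693
print(soft_state_value([0.99, 0.01], [1.0, 1.0]))  # 1.0 + entropy of about 0.056
</pre>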
Formally,

<math>v^s_\pi(s) = \sum_a \pi(a \mid s)\,\big[q_\pi(s,a) - \log\pi(a \mid s)\big] \qquad (21)</math>

====3.1 Soft Policy Gradient====

In the same way as the policy gradient theorem is derived from the standard value function, we begin with the soft value function <math>v^s_\pi</math> and derive the soft policy gradient <math>\nabla J^s(\theta)</math>:

<math>
\begin{align}
\nabla v^s_\pi(s) &= \nabla \sum_a \pi(a \mid s)\,[q_\pi(s,a) - \log\pi(a \mid s)] \\
&= \sum_a \nabla\pi(a \mid s)\,[q_\pi(s,a) - \log\pi(a \mid s)] + \pi(a \mid s)\,\nabla[q_\pi(s,a) - \log\pi(a \mid s)] \\
&= \sum_a \nabla\pi(a \mid s)\,[q_\pi(s,a) - \log\pi(a \mid s)] + \pi(a \mid s)\,\nabla q_\pi(s,a) - \nabla\pi(a \mid s) \\
&= \sum_a \nabla\pi(a \mid s)\,[q_\pi(s,a) - \log\pi(a \mid s) - 1] + \pi(a \mid s)\,\nabla q_\pi(s,a) \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla\pi(a \mid x)\,[q_\pi(x,a) - \log\pi(a \mid x) - 1]
\end{align}
</math>

<math>\nabla J^s(\theta) = \nabla v^s_\pi(s_0) \propto \sum_s \mu(s) \sum_a \nabla\pi(a \mid s)\,\big[q_\pi(s,a)\ \underbrace{- \log\pi(a \mid s) - 1}_{\text{entropy "baseline"}}\big] \qquad (22)</math>

It is clear that this entropy term occupies the same location as a baseline; however, since it is a function of both the state and the actions, it is not a true baseline and therefore has an effect on the mean. This "baseline" term nudges the gradient in the direction of higher entropy, which matches the intuition behind using a maximum entropy objective.

====3.2 Max Entropy Bayesian Actor Critic====

Converting the soft policy gradient to integral form,

<math>
\begin{align}
\nabla J^s(\theta) &= \int dz\, \mu_\pi(s,a)\,\nabla\log\pi(a \mid s)\,[q_\pi(s,a) - \log\pi(a \mid s) - 1] \\
&= \int dz\, \mu_\pi(s,a)\,\nabla\log\pi(a \mid s)\,q_\pi(s,a) + \int dz\, \mu_\pi(s,a)\,\nabla\log\pi(a \mid s)\,\big(-\log\pi(a \mid s) - 1\big) \qquad (23)
\end{align}
</math>

The first integral is the same as in the regular Bayesian Actor Critic and updates the gradient in response to newly observed rewards, whereas the second integral is the contribution of the policy entropy to the gradient. This means we can use the same machinery discussed earlier for the regular Bayesian Actor Critic and augment that term with the entropy contribution to the gradient.

We could use the Bayesian quadrature approach to approximate the entropy-contribution integral as well, which would likely yield a more accurate estimate; here, however, we elect to estimate it with a simple unweighted average. Similar to the estimate of the Fisher kernel <math>\hat{G}</math>, we define an augmented score function w(z) and form an unbiased estimate of the integral:

<math>w(z) = \nabla\log\pi(a \mid s;\theta)\,\big[-\log\pi(a \mid s;\theta) - 1\big], \qquad w_t = \frac{1}{t+1}\sum_{i=0}^{t} w(z_i) \qquad (24)</math>

Finally, the maximum entropy parameter update becomes

<math>\theta_{t+1} = \theta_t + \beta\,\Delta\theta, \quad \text{where} \quad \Delta\theta = U_t\alpha_t + w_t \ \text{(conventional gradient)} \quad \text{or} \quad \Delta\theta = \hat{G}_t^{-1}U_t\alpha_t + w_t \ \text{(natural gradient)}. \qquad (25)</math>

===4 Implementation===

====4.1 Softmax Policy for Discrete Actions====

Since we are taking the gradient of the policy, the policy must be differentiable. Moreover, the gradient of the log of the policy must also be differentiable, and ideally all of these derivatives should play well together. For the case of discrete actions, the softmax policy fits the bill.

<math>\pi(a_i \mid s) = \frac{\exp(\phi(s,a_i)^\top\theta)}{\sum_{a_j \in \mathcal{A}} \exp(\phi(s,a_j)^\top\theta)}, \qquad \frac{\nabla\pi(a_i \mid s)}{\phi(s,a_j)} = \begin{cases} \pi(a_i \mid s)\,[1-\pi(a_j \mid s)] & i = j \\ -\pi(a_i \mid s)\,\pi(a_j \mid s) & i \neq j \end{cases}, \qquad \frac{\nabla\log\pi(a_i \mid s)}{\phi(s,a_j)} = \begin{cases} 1-\pi(a_i \mid s) & i = j \\ -\pi(a_j \mid s) & i \neq j \end{cases} \qquad (26)</math>

where <math>\phi(s,a)</math> is the state-action feature vector; each case gives the coefficient of <math>\phi(s,a_j)</math> in the gradient with respect to θ.

====4.2 Technologies====

The GPTD [5] and BAC [7] algorithms were re-implemented from scratch, primarily using the NumPy [19] scientific Python library. They were evaluated on the OpenAI Gym [2] CartPole environment, which has a discrete action space and a Box observation space. The sparsification procedure defined and used in the literature led to no learning at all, so it was removed in this implementation.
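As an illustrative NumPy sketch of the softmax policy and its log-gradient from equation (26): the feature matrix here is a hypothetical placeholder whose rows are φ(s, a_i), and the max-subtraction is a standard numerical-stability trick rather than part of the original implementation.

<pre>
import numpy as np

def softmax_policy(phi_sa, theta):
    """Action probabilities pi(a_i|s) = exp(phi(s,a_i)^T theta) / sum_j exp(phi(s,a_j)^T theta).

    phi_sa: (n_actions, n_features) matrix whose rows are the feature vectors phi(s, a_i).
    """
    logits = phi_sa @ theta
    logits = logits - logits.max()        # subtract the max for numerical stability (avoids overflow)
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

def grad_log_softmax(phi_sa, theta, action):
    """Gradient of log pi(a|s; theta) w.r.t. theta: phi(s,a) - sum_j pi(a_j|s) phi(s,a_j)."""
    probs = softmax_policy(phi_sa, theta)
    return phi_sa[action] - probs @ phi_sa
</pre>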
===5 Results===

Fig. 1: Cumulative return on the CartPole-v0 environment over time t, for the hard (standard) and soft (maximum entropy) variants.

As can be seen in the plot of cumulative return (Fig. 1), neither method performs particularly well. Though it cannot be seen in this plot, it was found that the algorithms either converged to the optimal policy almost immediately or did not learn at all. Interestingly, a common occurrence was for the agent to improve over the first couple hundred episodes, followed by a "collapse" in which everything learned up to that point seemed to be thrown away. This is most likely due to the choice of the softmax policy causing numerical overflow in NumPy, resulting in essentially random gradient updates and therefore random behavior. In the future, more effort could be put into finding a parameterization or policy such that this "collapse" does not happen.

===6 Discussion===

At least for the Cartpole environment, the algorithm either does not learn or converges to a deterministic policy almost immediately, for both the hard version and the soft version. This identical behavior makes sense: a deterministic policy has zero entropy, so the entropy term does not factor in, and the two methods behave identically. Unfortunately, these results therefore do not give any insight into whether MEBAC is a feasible reinforcement learning model.

===7 Conclusion===

====7.1 Problems & Limitations====

The main problem with the implementation was that the gradient was usually so large as to move the policy almost immediately to a deterministic policy. This makes the addition of the entropy term useless and does not allow us to investigate the efficacy of the method. One option, as noted above, would be a better parameterization or policy; another would be to evaluate on a different, more difficult environment with a larger state-action space, which would likely reduce the rate of gradient change and allow the entropy term to have some impact.

Many different methods were attempted to slow down the gradient change beyond altering the temperature coefficient. Normalizing the gradient succeeded in slowing down the growth, but simply resulted in poorer performance overall, since the gradient is generally moving in the right direction. Normalizing and exponentiating the state vector was also attempted, which had little to no effect on the growth, as the state vector is relatively small in magnitude to begin with. Gradient clipping is another method that might slow down the gradient change, but it was not used here; a simple version is sketched below.

In the end, the state-action feature vector φ(s, a) was purposely naïve, meaning it did not use any domain knowledge to create features and was simply dotted with the parameter vector. This was intended to keep the method general, so that it could be tested against multiple environments; however, it seems to have backfired. Using domain knowledge to create a feature vector would most likely result in more interesting results and better performance.
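Gradient clipping, mentioned above but not tried, could look like the following sketch; the clipping threshold and the usage shown in the comment are arbitrary illustrative choices, not part of the reported implementation.

<pre>
import numpy as np

def clip_by_norm(delta_theta, max_norm=1.0):
    """Rescale the parameter update so its L2 norm never exceeds max_norm, limiting destructive steps."""
    norm = np.linalg.norm(delta_theta)
    if norm > max_norm:
        delta_theta = delta_theta * (max_norm / norm)
    return delta_theta

# Hypothetical usage inside the update rule: theta = theta + beta * clip_by_norm(delta_theta)
</pre>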
====7.2 Future Work====

Though the results of the implementation were inconclusive, the derivation of the Maximum Entropy Bayesian Actor Critic model seems ripe for future research, and there are a few avenues of inquiry that could be investigated going forward. As already mentioned, using domain knowledge to create an environment-specific feature vector may yield more interesting results on more difficult environments. Next, by opening up to a continuous action space, one could maintain a Gaussian prior distribution over each action parameter, which yields a closed-form solution for the gradient and log-gradient analogous to the softmax derivation. Finally, instead of using hand-crafted feature vectors, one could learn the feature vector through any number of methods, including deep RL. This is an interesting avenue because it maintains generality while circumventing the feature-vector problem.

===References===

1. Bellman, R.: Dynamic programming. Science 153(3731), 34–37 (1966)
2. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
3. Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: Advances in Neural Information Processing Systems, pp. 2249–2257 (2011)
4. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: International Conference on Machine Learning, pp. 1329–1338 (2016)
5. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 201–208. ACM (2005)
6. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Advances in Neural Information Processing Systems, pp. 457–464 (2007)
7. Ghavamzadeh, M., Engel, Y., Valko, M.: Bayesian policy gradient and actor-critic algorithms. The Journal of Machine Learning Research 17(1), 2319–2371 (2016)
8. Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al.: Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning 8(5-6), 359–483 (2015)
9. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018)
10. Kennedy, R.A., Sadeghi, P.: Hilbert Space Methods in Signal Processing. Cambridge University Press (2013)
11. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems, pp. 1008–1014 (2000)
12. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
13. O'Hagan, A.: Bayes-Hermite quadrature. Journal of Statistical Planning and Inference 29(3), 245–260 (1991)
14. Rasmussen, C.E.: Gaussian processes in machine learning. In: Summer School on Machine Learning, pp. 63–71. Springer (2003)
15. Rasmussen, C.E., Ghahramani, Z.: Bayesian Monte Carlo. In: Advances in Neural Information Processing Systems, pp. 505–512 (2003)
16. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
17. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (2018)
18. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems, pp. 1057–1063 (2000)
19. Van Der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13(2), 22 (2011)
20. Ziebart, B.D.: Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Ph.D. thesis, Carnegie Mellon University (2010)