       Maximum Entropy Bayesian Actor Critic

                         Steven Homer [0000-0002-1515-1093]

                AI Lab at Vrije Universiteit Brussel, 1050 Brussels, BE
                      info@ai.vub.ac.be https://ai.vub.ac.be



        Abstract. In recent years Deep Reinforcement Learning [12] has achieved
        human-like performance or better on a variety of benchmarks such as
        the Atari Arcade [2]; however, Deep RL often suffers from sample
        inefficiency and convergence brittleness. That is, to learn even the
        simplest tasks, Deep RL requires a huge number of samples, and it will
        only converge if the parameters are tuned just right [4]. This paper
        seeks to ameliorate these problems of sample inefficiency and convergence
        brittleness with the combination of two different reinforcement learning
        paradigms: Bayesian RL and Maximum Entropy RL.
        Bayesian reinforcement learning [8] utilizes Bayesian statistics to model
        the confidence in a given model, which has been shown to greatly in-
        crease sample efficiency [3]. Maximum entropy RL is a technique that
        modifies the standard reward to promote more exploration in the agent
        [20]. Hopefully, combining the two will retain the best of both of these
        properties and avoid the problems faced in deep RL altogether.
        This paper first derives a soft policy gradient that introduces an
        entropy-weighted term to the standard policy gradient, and then applies
        this to the Bayesian actor critic paradigm, augmenting the parameter
        update rule to account for the entropy-weighted value function. After
        determining a closed-form solution of the gradient with the softmax
        policy, the method was implemented and evaluated on the CartPole
        environment; the results signal that there are avenues ripe for future
        research in this area.

        Keywords: Bayesian Reinforcement Learning · Maximum Entropy Reinforcement Learning · Gaussian Processes · Bayesian Quadrature · Bayesian Actor Critic

        Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1     Introduction

This paper is structured as follows. The Background section provides the exposition of Gaussian processes and Bayesian quadrature necessary to understand the Bayesian Actor Critic model. After explaining the details of the Bayesian actor critic and maximum entropy RL, the main contribution of the paper, Maximum Entropy Bayesian Actor Critic (MEBAC), is discussed. After the explication of MEBAC and a few implementation choices, the Results, Discussion, and Conclusion sections empirically analyze the model.


2     Background

2.1   Policy Gradient and Actor Critic Methods

When formulating the objective of the reinforcement learning problem, the goal
is always to maximize the expected return. The way different methods formulate
this objective determines the way in which the problem is approached. For in-
stance, the Bellman equations [1] define the optimal state-value and action-value
functions in terms of maximizing the expected return, resulting in value-based
methods. Alternatively, one can formulate the objective in terms of the parameterization of a policy, and the result is a family of methods that operate directly on the parameters of the policy instead of estimating value functions.
    The policy gradient theorem [18] formulates the objective in terms of the pa-
rameters and finds the gradient of this objective. By moving along this gradient,
the policy improves by directly adjusting its parameters. A baseline function
can be added to the value function of the policy gradient to greatly reduce the
variance of the gradient without affecting the expectation as long as it does not
depend on the actions. By setting this baseline function to be the current estimate of the value function, we arrive at the actor-critic paradigm [11]. That is, the policy parameterized by θ is the 'actor' and the estimate of the value function is the 'critic'.
\[
\begin{aligned}
\nabla J(\theta) &\propto \sum_{s} \mu(s) \sum_{a} q_\pi(s,a)\, \nabla \pi(a|s;\theta), \\
\text{meaning}\quad \theta_{t+1} &= \theta_t + \beta\, \delta_t\, \nabla \log \pi(A_t|S_t;\theta_t), \\
\text{where}\quad \delta_t &= R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})
\end{aligned} \tag{1}
\]
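As a concrete illustration of the update rule in Eq. (1), the sketch below applies one actor-critic step for a single transition. It assumes a linear critic v̂(s) = w·φ(s) and a precomputed score vector for the chosen action; the function name and its arguments are hypothetical choices, not taken from the text.

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, phi_s_next, reward, grad_log_pi,
                      alpha=0.1, beta=0.01, gamma=0.99):
    """One actor-critic update in the spirit of Eq. (1), assuming a linear
    critic v(s) = w . phi(s) and grad_log_pi = grad_theta log pi(A_t|S_t; theta)."""
    delta = reward + gamma * (w @ phi_s_next) - (w @ phi_s)  # TD error delta_t
    w = w + alpha * delta * phi_s                            # critic update
    theta = theta + beta * delta * grad_log_pi               # actor update
    return theta, w
```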


2.2   Bayesian Reinforcement Learning

Bayesian reinforcement learning is a catch-all term for any reinforcement learning
method that utilizes Bayes' rule in some capacity. In general, the model maintains a prior distribution, representing the current approximation, which is updated to a posterior when new data is seen. As more data is seen, the variance of the posterior tends to decrease, signalling greater confidence in the mean value of the distribution. The problem with this approach is that, unlike many other optimization techniques, which operate using derivatives, Bayesian updates tend to contain integrals, making them analytically intractable unless the proper priors are chosen.

\[
p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)} \tag{2}
\]
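As a minimal illustration of why the choice of prior matters, the sketch below performs the conjugate update for a Gaussian prior over an unknown mean with Gaussian observation noise; in this case the posterior in Eq. (2) stays Gaussian and no integral has to be computed numerically. The function and its arguments are illustrative assumptions only.

```python
import numpy as np

def gaussian_posterior(mu0, var0, data, noise_var):
    """Conjugate Bayesian update: Gaussian prior N(mu0, var0) over an unknown
    mean, Gaussian likelihood with variance noise_var. The posterior is again
    Gaussian, so Eq. (2) has a closed form and its variance shrinks with data."""
    data = np.asarray(data)
    n = data.size
    post_var = 1.0 / (1.0 / var0 + n / noise_var)
    post_mu = post_var * (mu0 / var0 + data.sum() / noise_var)
    return post_mu, post_var
```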

2.3     Bayesian Expectation
Many core algorithms in reinforcement learning must calculate some kind of expectation. This can be taken over any distribution, though often the expectation is taken over reward trajectories or value functions. In fact, value functions and the Bellman equation [1], perhaps the most fundamental formalisms in all of reinforcement learning, are posed as an expectation over returns.

\[
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a], \\
v_\pi(s) &= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]
\end{aligned} \tag{3}
\]

    The primary difference between Bayesian Reinforcement Learning and other
types of reinforcement learning is that other methods tend to use a frequentist,
Monte Carlo approach to calculating the expectation. That is, by averaging over
many samples of the value in question, Monte Carlo approaches can find unbiased
approximations of the expectation of that value. This method has the benefit of
being conceptually and practically simple, but may take many, many values to
converge to a good approximation. In contrast, Bayesian reinforcement learning
utilizes Bayesian methods to approximate this expectation. By performing a
posterior update of a prior distribution whenever new data is seen, Bayesian
methods can be much more sample efficient than Monte Carlo methods.

2.4     Gaussian Processes
Since in reinforcement learning we are often concerned with estimating functions, like state-value or action-value functions, Bayesian RL methods tend to place Gaussian processes over those functions in order to approximate them. In the same way that a Gaussian distribution is a probability distribution over points, a Gaussian process can be thought of as a distribution over functions [14]. By thinking of functions in terms of Hilbert spaces [10], one can equivalently think of a Gaussian process as a multivariate Gaussian distribution over infinitely many dimensions. A Gaussian process is fully defined by its mean and covariance, such that $f(\cdot) \sim \mathcal{N}(\bar{f}(\cdot), k(\cdot,\cdot))$, where $\bar{f}$ is the mean function and $k$ is the kernel function.

\[
\mathbb{E}[f(x)] = \bar{f}(x) \quad\text{and}\quad \mathrm{Cov}[f(x), f(x')] = k(x, x') \tag{4}
\]
    With posterior moments,
\[
\begin{aligned}
\mathbb{E}[f(x) \mid \mathcal{D}_M] &= \bar{f}(x) + \mathbf{k}(x)^\top C\, (\mathbf{y} - \bar{\mathbf{f}}), \\
\mathrm{Cov}[f(x), f(x') \mid \mathcal{D}_M] &= k(x, x') - \mathbf{k}(x)^\top C\, \mathbf{k}(x'),
\end{aligned} \tag{5}
\]
    where
\[
\begin{aligned}
\bar{\mathbf{f}} &= (\bar{f}(x_1), \ldots, \bar{f}(x_M))^\top, \qquad \mathbf{y} = (y(x_1), \ldots, y(x_M))^\top, \\
\mathbf{k}(x) &= (k(x_1, x), \ldots, k(x_M, x))^\top, \qquad [K]_{i,j} = k(x_i, x_j), \qquad C = (K + \Sigma)^{-1}.
\end{aligned} \tag{6}
\]
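The posterior moments in Eqs. (5)-(6) translate almost directly into code. The sketch below assumes scalar inputs, a homoscedastic noise term Σ = σ²I, and user-supplied kernel and mean callables; all names are hypothetical.

```python
import numpy as np

def gp_posterior(x_star, X, y, kernel, mean_fn, noise_var=1e-2):
    """Posterior mean and variance of a GP at a query point x_star (Eqs. 5-6),
    assuming Sigma = noise_var * I. `kernel` and `mean_fn` are callables."""
    M = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # [K]_ij = k(x_i, x_j)
    k_vec = np.array([kernel(xi, x_star) for xi in X])         # k(x)
    f_bar = np.array([mean_fn(xi) for xi in X])                # prior mean at the data
    C = np.linalg.inv(K + noise_var * np.eye(M))               # C = (K + Sigma)^-1
    mean = mean_fn(x_star) + k_vec @ C @ (np.asarray(y) - f_bar)
    var = kernel(x_star, x_star) - k_vec @ C @ k_vec
    return mean, var
```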

2.5     Bayesian Quadrature
Bayesian Quadrature [13] is a method aimed at approximating integrals of the form:
\[
\rho = \int f(x)\, g(x)\, dx \tag{7}
\]

    One can immediately see that if either f(x) or g(x) is a probability density function, then this integral is just an expectation. Bayesian quadrature aims to approximate this integral by modeling one of the component functions as a Gaussian process and performing a posterior update when new data is seen. It has been shown that this method can achieve much more accurate approximations of expectations than Monte Carlo techniques [15].
    Taking f (x) as the random function to be modeled, the posterior moments
of this integral are given by:
\[
\begin{aligned}
\mathbb{E}[\rho \mid \mathcal{D}_M] &= \int \mathbb{E}[f(x) \mid \mathcal{D}_M]\, g(x)\, dx, \\
\mathrm{Var}[\rho \mid \mathcal{D}_M] &= \iint \mathrm{Cov}[f(x), f(x') \mid \mathcal{D}_M]\, g(x)\, g(x')\, dx\, dx'.
\end{aligned} \tag{8}
\]

      Rewriting,


\[
\mathbb{E}[\rho \mid \mathcal{D}_M] = \rho_0 + \mathbf{b}^\top C\, (\mathbf{y} - \bar{\mathbf{f}}) \quad\text{and}\quad \mathrm{Var}[\rho \mid \mathcal{D}_M] = b_0 - \mathbf{b}^\top C\, \mathbf{b}, \tag{9}
\]
    where
\[
\rho_0 = \int \bar{f}(x)\, g(x)\, dx, \qquad \mathbf{b} = \int \mathbf{k}(x)\, g(x)\, dx, \qquad b_0 = \iint k(x, x')\, g(x)\, g(x')\, dx\, dx'. \tag{10}
\]

    The only issue remaining is to ensure that these integrals are analytically tractable, which can be done by choosing $\bar{f}(x)$, $g(x)$, and $k(x, x')$ appropriately.
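Once ρ₀, b, and b₀ have been obtained in closed form for the chosen prior mean, kernel, and g(x), the posterior moments in Eq. (9) reduce to a few lines of linear algebra, as in this sketch under that assumption; the function name is hypothetical.

```python
import numpy as np

def bq_posterior(rho0, b, b0, C, y, f_bar):
    """Bayesian quadrature posterior over the integral rho (Eq. 9), given the
    analytically integrated quantities rho0, b, and b0 from Eq. (10)."""
    mean = rho0 + b @ C @ (np.asarray(y) - f_bar)
    var = b0 - b @ C @ b
    return mean, var
```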

2.6     Bayesian Policy Gradient
By formulating the Policy Gradient Theorem in terms of Bayesian quadrature, a
Bayesian policy gradient method [6] can be derived as follows. First, we specify
the policy gradient theorem for the continuous case, choosing modeled random
function f (x) and known function g(x) from Bayesian quadrature as follows:
\[
\nabla J(\theta) = \int_{\mathcal{Z}} dz\; \underbrace{\mu^\pi(z;\theta)\, \nabla \log \pi(a|s;\theta)}_{g(x)}\; \underbrace{q_\pi(z;\theta)}_{f(x)} \tag{11}
\]
where $\mu^\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t P_t(s,a)$ is the discounted state-action occupancy for policy $\pi$, and $z \in \mathcal{Z}$ are state-action pairs $(s,a)$. This results in the posterior moments,



\[
\begin{aligned}
\mathbb{E}[\nabla J(\theta) \mid \mathcal{D}_t] &= \int_{\mathcal{Z}} dz\; g(z;\theta)\, \mathbb{E}[q_\pi(z;\theta) \mid \mathcal{D}_t], \\
\mathrm{Cov}[\nabla J(\theta) \mid \mathcal{D}_t] &= \iint_{\mathcal{Z}^2} dz\, dz'\; g(z;\theta)\, \mathrm{Cov}[q_\pi(z), q_\pi(z') \mid \mathcal{D}_t]\, g(z';\theta)^\top
\end{aligned} \tag{12}
\]

  Therefore, to calculate this integral, we need the posterior moments of the
Gaussian process modelling qπ .


2.7     Gaussian Process Temporal Difference Learning

One way of approximating the action-value function is Gaussian Process Tempo-
ral Difference Learning (GPTD) [5]. This method takes the standard temporal
difference learning paradigm, and instead of performing max-updates like Q-
learning or expected updates like Expected Sarsa [17], GPTD places a Gaussian
process over the value function v or q and performs posterior updates using the
reward at each time step to approximate it with v̂ or q̂ respectively. As more data
is seen, the variance of the estimate decreases, directly showing the confidence
in the estimation.
    By modelling the relationship between the reward signal r and the action-
value function qπ as a temporal difference between processes with noise N , the
posterior moments of the action-value process can be found, providing an ap-
proximation for qπ .

\[
\begin{aligned}
r(z_t) &= q_\pi(z_t) - \gamma\, q_\pi(z_{t+1}) + N(z_t, z_{t+1}), \\
\hat{q}_\pi(z) &= \mathbb{E}[q_\pi(z) \mid \mathcal{D}_t] = \mathbf{k}_t(z)^\top \boldsymbol{\alpha}_t, \\
\hat{s}_\pi(z, z') &= \mathrm{Cov}[q_\pi(z), q_\pi(z') \mid \mathcal{D}_t] = k(z, z') - \mathbf{k}_t(z)^\top C_t\, \mathbf{k}_t(z')
\end{aligned} \tag{13}
\]

      Where,

\[
\begin{aligned}
\boldsymbol{\alpha}_t &= H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} \mathbf{r}_{t-1}, \\
C_t &= H_t^\top (H_t K_t H_t^\top + \Sigma_t)^{-1} H_t,
\end{aligned} \tag{14}
\]

with $H_t$ representing the discount matrix and $\Sigma_t = \sigma^2 H_t H_t^\top$ representing the noise covariance.
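A rough sketch of the GPTD coefficients in Eq. (14) follows, assuming that the kernel matrix over visited state-action pairs, the discount matrix, and the observed rewards have already been assembled; the helper names are illustrative and not the reference implementation of [5].

```python
import numpy as np

def gptd_coefficients(K, H, r, sigma=1.0):
    """GPTD posterior coefficients alpha_t and C_t (Eq. 14), with the noise
    covariance Sigma = sigma^2 H H^T as described in the text."""
    Sigma = sigma**2 * (H @ H.T)
    A = np.linalg.inv(H @ K @ H.T + Sigma)
    alpha = H.T @ A @ r
    C = H.T @ A @ H
    return alpha, C

def q_hat(k_vec, alpha):
    """Posterior mean estimate of q at a query point: k_t(z)^T alpha_t (Eq. 13)."""
    return k_vec @ alpha
```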


2.8     Bayesian Actor Critic

Now, to move to a Bayesian actor critic [7], we take the Bayesian policy gradient
and estimate the action-value function using Gaussian Process Temporal Differ-
ence Learning. GPTD provides the posterior moments for the Bayesian policy
gradient, and by utilizing the Fisher information kernel [16], we get a closed-form
solution to the integral, resulting in the Bayesian Actor Critic.


\[
\begin{aligned}
\mathbb{E}[\nabla J(\theta) \mid \mathcal{D}_t] &= \int_{\mathcal{Z}} dz\; g(z;\theta)\, \mathbf{k}_t(z)^\top \boldsymbol{\alpha}_t, \\
\mathrm{Cov}[\nabla J(\theta) \mid \mathcal{D}_t] &= \iint_{\mathcal{Z}^2} dz\, dz'\; g(z;\theta)\, \big[k(z, z') - \mathbf{k}_t(z)^\top C_t\, \mathbf{k}_t(z')\big]\, g(z';\theta)^\top.
\end{aligned} \tag{15}
\]

      Rewriting,


\[
\mathbb{E}[\nabla J(\theta) \mid \mathcal{D}_t] = B_t \boldsymbol{\alpha}_t \quad\text{and}\quad \mathrm{Cov}[\nabla J(\theta) \mid \mathcal{D}_t] = B_0 - B_t C_t B_t^\top, \tag{16}
\]
    where
\[
B_t = \int_{\mathcal{Z}} dz\; g(z;\theta)\, \mathbf{k}_t(z)^\top, \qquad B_0 = \iint_{\mathcal{Z}^2} dz\, dz'\; g(z;\theta)\, k(z, z')\, g(z';\theta)^\top. \tag{17}
\]
    To make these tractable, the kernel $k = k_f + k_s$ becomes a combination of the state kernel $k_s(s, s')$ and the Fisher information kernel $k_f(z, z') = u(z;\theta)^\top G^{-1} u(z';\theta)$, with $u(z;\theta)$ the score function and $G$ the Fisher information matrix, where

\[
\begin{aligned}
u(z;\theta) &= \nabla \log \pi(a|s;\theta), \qquad U_t = [u(z_0), \ldots, u(z_t)], \\
\text{estimating}\quad \hat{G}_t &= \frac{1}{t+1}\, U_t U_t^\top, \\
\text{meaning}\quad B_t &= U_t \quad\text{and}\quad B_0 = G
\end{aligned} \tag{18}
\]
   Finally, this leads to the Bayesian actor critic policy update rule, using the
conventional or natural gradient [7].

\[
\begin{aligned}
\theta_{t+1} &= \theta_t + \beta\, \Delta\theta, \quad\text{where} \\
\Delta\theta &= U_t \boldsymbol{\alpha}_t \quad \text{(conventional gradient)} \\
\Delta\theta &= \hat{G}_t^{-1} U_t \boldsymbol{\alpha}_t \quad \text{(natural gradient)}
\end{aligned} \tag{19}
\]
    Therefore, the Bayesian actor critic improves its policy parameters by iteratively moving along the gradient, applying the following three steps each cycle, where a cycle is M steps long (a sketch of the resulting update follows the list):
    – Use GPTD to estimate q̂π, returning αt and Ct.
    – Compute Ut and estimate Ĝt.
    – Calculate ∆θ and update the parameters θ.
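As a sketch of the update step just described, the code below combines the GPTD coefficients with the score matrix, under the assumption that the scores are stacked as columns; the pseudoinverse and the function name are illustrative choices, not prescribed by [7].

```python
import numpy as np

def bac_update(theta, U, alpha, beta=0.01, natural=True):
    """One Bayesian actor-critic parameter update (Eqs. 18-19). U stacks the
    scores u(z_i) = grad log pi(a_i|s_i; theta) as columns; alpha comes from GPTD."""
    t_plus_1 = U.shape[1]
    if natural:
        G_hat = U @ U.T / t_plus_1                    # Fisher information estimate
        delta = np.linalg.pinv(G_hat) @ (U @ alpha)   # natural gradient G^-1 U alpha
    else:
        delta = U @ alpha                             # conventional gradient U alpha
    return theta + beta * delta
```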

2.9     Maximum Entropy Reinforcement Learning
The standard reinforcement learning objective is to maximize the expected return. Maximum entropy reinforcement learning changes this core goal: instead of solely maximizing the return, the maximum entropy objective [20] seeks to maximize both the return and the policy entropy.


\[
J^s(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big] \tag{20}
\]

    By simultaneously seeking high rewards and high entropy (that is, acting as randomly as possible) the agent explores more. In addition, if there are multiple optimal or near-optimal policies, the agent can give equal weight to all of them instead of assigning all the probability mass to a single policy in the set. In practice, it has been shown [9] that this objective also improves the learning rate and stability of the algorithms.
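For a discrete policy the entropy bonus in Eq. (20) is straightforward to compute; the short sketch below shows the per-step entropy-augmented reward, with the temperature value chosen arbitrarily for illustration.

```python
import numpy as np

def policy_entropy(pi):
    """Entropy H(pi(.|s)) of a discrete action distribution."""
    pi = np.asarray(pi)
    return -np.sum(pi * np.log(pi + 1e-12))  # small epsilon guards against log(0)

def soft_reward(reward, pi, temperature=0.2):
    """Entropy-augmented reward r + alpha * H(pi(.|s)) from Eq. (20)."""
    return reward + temperature * policy_entropy(pi)
```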


3     Maximum Entropy Bayesian Actor Critic
The goal of Maximum Entropy Bayesian Actor Critic (MEBAC) is to take the maximum entropy objective function and alter the Bayesian actor critic machinery to work with it efficiently. Though the Soft Actor Critic [9] algorithm was the initial inspiration for utilizing the maximum entropy objective, the only things MEBAC takes from it are that objective and the soft value function. The soft value function alters the standard value function with an entropy-like term that gives weight to more stochastic policies.
\[
v_\pi^s(s_t) = \sum_{a} \pi(a \mid s_t)\, \big[ q_\pi(s_t, a) - \log \pi(a \mid s_t) \big] \tag{21}
\]


3.1       Soft Policy Gradient
In the same way that the policy gradient theorem is derived from the standard value function, we begin with the soft value function $v_\pi^s$ and derive the soft policy gradient $\nabla J^s(\theta)$:

\[
\begin{aligned}
\nabla v_\pi^s(s) &= \nabla \sum_{a} \pi(a|s)\,\big[ q_\pi(s,a) - \log \pi(a|s) \big] \\
&= \sum_{a} \Big( \nabla\pi(a|s)\,\big[ q_\pi(s,a) - \log \pi(a|s) \big] + \pi(a|s)\, \nabla\big[ q_\pi(s,a) - \log \pi(a|s) \big] \Big) \\
&= \sum_{a} \Big( \nabla\pi(a|s)\,\big[ q_\pi(s,a) - \log \pi(a|s) \big] + \pi(a|s)\, \nabla q_\pi(s,a) - \nabla\pi(a|s) \Big) \\
&= \sum_{a} \Big( \nabla\pi(a|s)\,\big[ q_\pi(s,a) - \log \pi(a|s) - 1 \big] + \pi(a|s)\, \nabla q_\pi(s,a) \Big) \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_{a} \nabla\pi(a|x)\,\big[ q_\pi(x,a) - \log \pi(a|x) - 1 \big] \\
\nabla J^s(\theta) = \nabla v_\pi^s(s_0) &\propto \sum_{s} \mu(s) \sum_{a} \nabla\pi(a|s)\,\big[ q_\pi(s,a) \underbrace{{}- \log \pi(a|s) - 1}_{\text{entropy ``baseline''}} \big]
\end{aligned} \tag{22}
\]

    It is clear that this entropy term occupies the same location as a baseline; however, since it is a function of both the state and the actions, it is not a true baseline and therefore has an effect on the mean. This "baseline" term nudges the gradient in the direction of higher entropy, which matches the intuition of using a maximum entropy objective.


3.2     Max Entropy Bayesian Actor Critic

Converting the soft policy gradient to the integral form,
\[
\begin{aligned}
\nabla J^s(\theta) &= \int_{\mathcal{Z}} dz\; \mu^\pi(s,a)\, \nabla \log \pi(a|s)\, \big[ q_\pi(s,a) - \log \pi(a|s) - 1 \big] \\
&= \int_{\mathcal{Z}} dz\; \mu^\pi(s,a)\, \nabla \log \pi(a|s)\, q_\pi(s,a) \\
&\quad + \int_{\mathcal{Z}} dz\; \mu^\pi(s,a)\, \nabla \log \pi(a|s)\, \big( -\log \pi(a|s) - 1 \big)
\end{aligned} \tag{23}
\]

    The first integral is the same as in the regular Bayesian Actor Critic, updating the gradient in response to newly observed rewards, whereas the second integral is the contribution of the policy entropy to the gradient. This means we can utilize the same machinery as discussed earlier for the regular Bayesian Actor Critic and augment that term with the entropy contribution to the gradient.
    We could utilize the Bayesian quadrature approach to approximate the entropy contribution integral, which would likely result in more accurate estimates of the integral; however, here we instead elect to estimate it with a simple unweighted average. Similar to the estimate of the Fisher kernel $\hat{G}$, we define an augmented score function $w(z)$ and compute an unbiased estimate of the integral.

\[
\begin{aligned}
w(z) &= \nabla \log \pi(a|s;\theta)\, \big[ -\log \pi(a|s;\theta) - 1 \big], \\
w_t &= \frac{1}{t+1} \sum_{i=0}^{t} w(z_i)
\end{aligned} \tag{24}
\]

      Finally, the maximum entropy parameter update becomes,

\[
\begin{aligned}
\theta_{t+1} &= \theta_t + \beta\, \Delta\theta, \quad\text{where} \\
\Delta\theta &= U_t \boldsymbol{\alpha}_t + w_t \quad \text{(conventional gradient)} \\
\Delta\theta &= \hat{G}_t^{-1} U_t \boldsymbol{\alpha}_t + w_t \quad \text{(natural gradient)}
\end{aligned} \tag{25}
\]
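Putting the pieces together, a sketch of the MEBAC update in Eq. (25) differs from the plain BAC update only by the averaged augmented score of Eq. (24). The matrix stacking the w(z_i) vectors as columns, and the function name, are assumptions made for illustration.

```python
import numpy as np

def mebac_update(theta, U, alpha, W, beta=0.01, natural=True):
    """Maximum entropy BAC update (Eqs. 24-25). U stacks the scores u(z_i) and
    W stacks the augmented scores w(z_i) = grad log pi * (-log pi - 1) as columns."""
    t_plus_1 = U.shape[1]
    w_bar = W.mean(axis=1)                            # unweighted average w_t (Eq. 24)
    if natural:
        G_hat = U @ U.T / t_plus_1                    # Fisher information estimate
        delta = np.linalg.pinv(G_hat) @ (U @ alpha) + w_bar
    else:
        delta = U @ alpha + w_bar                     # conventional gradient + entropy term
    return theta + beta * delta
```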


4      Implementation

4.1     Softmax Policy for Discrete Actions

Since we are taking the gradient of the policy, it is necessary to have a differentiable policy. Moreover, we need the gradient of the log of the policy to also be differentiable, and would ideally like all of those derivatives to play well together. For the case of discrete actions, the softmax policy fits the bill.

\[
\begin{aligned}
\pi(a_i \mid s) &= \frac{\exp\!\big(\phi(s, a_i)^\top \theta\big)}{\sum_{a_j \in \mathcal{A}} \exp\!\big(\phi(s, a_j)^\top \theta\big)}, \\
\frac{\nabla \pi(a_i \mid s)}{\phi(s, a_j)} &=
\begin{cases}
\pi(a_i \mid s)\,\big[1 - \pi(a_j \mid s)\big] & \text{if } i = j \\
-\pi(a_i \mid s)\, \pi(a_j \mid s) & \text{if } i \neq j
\end{cases} \\
\frac{\nabla \log \pi(a_i \mid s)}{\phi(s, a_j)} &=
\begin{cases}
1 - \pi(a_i \mid s) & \text{if } i = j \\
-\pi(a_j \mid s) & \text{if } i \neq j
\end{cases}
\end{aligned} \tag{26}
\]
where $\phi(s, a)$ is the state-action feature vector, and the fractions denote the coefficient of $\phi(s, a_j)$ in the gradient with respect to $\theta$.
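In code, the softmax policy and its score function take the following form; subtracting the maximum logit before exponentiating is a standard numerical-stability trick (and one hedge against the overflow issue discussed in the Results), not something prescribed by the text. The feature-matrix layout is an assumption.

```python
import numpy as np

def softmax_policy(theta, phi):
    """Softmax policy over discrete actions (Eq. 26). phi has one row per
    action, containing the state-action features phi(s, a_i)."""
    logits = phi @ theta
    logits = logits - logits.max()        # stabilize exp() against overflow
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi, action):
    """Score grad_theta log pi(a|s; theta): phi(s, a) minus the
    policy-weighted average feature vector (follows from Eq. 26)."""
    pi = softmax_policy(theta, phi)
    return phi[action] - pi @ phi
```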

4.2              Technologies
The GPTD [5] and BAC [7] algorithms were re-implemented from scratch, primarily using the NumPy [19] scientific Python library. They were evaluated on the OpenAI Gym [2] CartPole environment, which has a Discrete action space and a Box observation space. The sparsification procedure defined and used in the literature led to no learning at all, so it was removed in this implementation.


5               Results



[Figure: average return over time steps t = 0 to 1000 for the 'hard' and 'soft' variants.]

                         Fig. 1. Cumulative Return for the CartPole-v0 Environment


    As can be seen in the plot of cumulative return, neither method performs particularly well. Though it cannot be seen in this plot, it was found that the algorithms either converged to the optimal policy almost immediately or did not learn at all. Interestingly, a common occurrence was to improve for the first couple of hundred episodes, followed by a "collapse" in which everything learned up to that point seemed to be thrown away. This is most likely due to the softmax policy overflowing NumPy's floating-point range, resulting in essentially random gradient updates and therefore random behavior. In the future, more effort could be put into finding a parameterization or policy for which this "collapse" does not happen.


6     Discussion

At least for the CartPole environment, the algorithm either does not learn or converges to a deterministic policy almost immediately, for both the hard version and the soft version. This identical behavior makes sense: a deterministic policy has zero entropy, so the entropy term does not factor in and the two methods behave the same. Unfortunately, these results therefore give no insight as to whether MEBAC is a feasible reinforcement learning model.


7     Conclusion

7.1   Problems & Limitations

The main problem with the implementation was that the gradient was usually so large that it moved the policy almost immediately to a deterministic policy. This renders the addition of the entropy term useless and prevents us from investigating the efficacy of the method. One option would be to evaluate on a different, more difficult environment with a larger state-action space, which would likely reduce the rate of gradient change and allow the entropy term to have some impact.
    Many different methods were attempted to slow down the gradient change beyond altering the temperature coefficient. Normalizing the gradient succeeded in slowing down the growth, but simply resulted in poorer performance overall, since the gradient is generally moving in the right direction. Normalizing and exponentiating the state vector was also attempted, which had little to no effect on the growth, as the state vector is relatively small in magnitude to begin with. Gradient clipping is another method that might be used to slow down the gradient change, but it was not utilized here.
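For completeness, a minimal sketch of the global-norm gradient clipping mentioned above, with an arbitrary threshold; this was not part of the reported experiments.

```python
import numpy as np

def clip_gradient(delta, max_norm=1.0):
    """Rescale the parameter update so its Euclidean norm never exceeds max_norm."""
    norm = np.linalg.norm(delta)
    if norm > max_norm:
        delta = delta * (max_norm / norm)
    return delta
```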
    In the end, the state-action feature vector φ(s, a) was purposely naïve, meaning it did not use any domain knowledge to create features and was simply dotted with the parameter vector. This was intended to keep the method general, so that it could be tested against multiple environments; however, it seems to have backfired. Using domain knowledge to create a feature vector would most likely yield more interesting results and better performance.

7.2   Future Work
Though the results of the implementation were inconclusive, the derivation of the
Maximum Entropy Bayesian Actor Critic model seems ripe for future research.
There are a few avenues of inquiry that could be investigated going forward. As
already mentioned, using domain knowledge to create an environment-specific
feature vector may yield more interesting results on more difficult environments.
Next, by opening up to a continuous action space, one could maintain a Gaussian prior distribution for each action parameter, which yields a nice closed-form solution for the gradient and log-gradient, analogous to the softmax derivation. Finally, instead of using hand-crafted feature vectors, one could learn the feature vector through any number of methods, including deep RL. This is an interesting avenue because it maintains generality while circumventing the feature vector problem.


References
 1. Bellman, R.: Dynamic programming. Science 153(3731), 34–37 (1966)
 2. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J.,
    Zaremba, W.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016)
 3. Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: Advances
    in neural information processing systems. pp. 2249–2257 (2011)
 4. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep
    reinforcement learning for continuous control. In: International Conference on Ma-
    chine Learning. pp. 1329–1338 (2016)
 5. Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes.
    In: Proceedings of the 22nd international conference on Machine learning. pp. 201–
    208. ACM (2005)
 6. Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Advances
    in neural information processing systems. pp. 457–464 (2007)
 7. Ghavamzadeh, M., Engel, Y., Valko, M.: Bayesian policy gradient and actor-critic
    algorithms. The Journal of Machine Learning Research 17(1), 2319–2371 (2016)
 8. Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al.: Bayesian reinforce-
    ment learning: A survey. Foundations and Trends® in Machine Learning 8(5-6),
    359–483 (2015)
 9. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maxi-
    mum entropy deep reinforcement learning with a stochastic actor. arXiv preprint
    arXiv:1801.01290 (2018)
10. Kennedy, R.A., Sadeghi, P.: Hilbert space methods in signal processing. Cambridge
    University Press (2013)
11. Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. In: Advances in neural in-
    formation processing systems. pp. 1008–1014 (2000)
12. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
13. O'Hagan, A.: Bayes-Hermite quadrature. Journal of Statistical Planning and Infer-
    ence 29(3), 245–260 (1991)
14. Rasmussen, C.E.: Gaussian processes in machine learning. In: Summer School on
    Machine Learning. pp. 63–71. Springer (2003)
15. Rasmussen, C.E., Ghahramani, Z.: Bayesian Monte Carlo. Advances in neural in-
    formation processing systems pp. 505–512 (2003)

16. Shawe-Taylor, J., Cristianini, N., et al.: Kernel methods for pattern analysis. Cam-
    bridge university press (2004)
17. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT press
    (2018)
18. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods
    for reinforcement learning with function approximation. In: Advances in neural
    information processing systems. pp. 1057–1063 (2000)
19. Van Der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for
    efficient numerical computation. Computing in Science & Engineering 13(2), 22–30
    (2011)
20. Ziebart, B.D.: Modeling purposeful adaptive behavior with the principle of maxi-
    mum causal entropy. PhD thesis, Carnegie Mellon University (2010)