<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploration in Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ludvig Killingberg</string-name>
          <email>ludvig.killingberg@ntnu.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helge Langseth</string-name>
          <email>helge.langseth@ntnu.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Høgskoleringen 1, 7034 Trondheim</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Use permitted under Creative Commons License Attribution 4.0</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Posterior sampling of value functions can give eficient exploration for value-based reinforcement learning algorithms. We introduce BayesianExplore (BE), a posterior sampling-based method for reinforcement learning based on Bayesian function-space variational inference over stochastic processes. This Bayesian formalism allows us to formalize domain knowledge as a prior over the value function. Our approach, therefore, provides an alternative to reward shaping with the added benefit that the algorithm keeps seeking an optimal policy in the original environment instead of the one with altered rewards. We show that BE produces state-of-the-art eficiency in exploration with flat priors, and that it is easy to significantly improve performance by incorporating domain knowledge using simple priors.</p>
      </abstract>
      <kwd-group>
        <kwd>Bayesian deep learning</kwd>
        <kwd>reinforcement learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>0,  ⟩, where  is the state space and 
    ︀] ,
where the expectation is taken over the policy, transition, and reward distributions. An
eficient agent must be able to learn from the data it collects, but since the data is
dependent on the policy, it must also prioritize exploring states and actions that the
∼  *(·|)
agent can learn from.</p>
      <p>Related to   is the  -function,   (0,  0), defined as the expected reward of
taking action  0 in state 0 and then following policy 
thereafter:   (0,  0) :=
−
E ︀[ ∑︀  =0</p>
      <p>∞     |0 = 0,  0 =  ]︀ . Q-learning amounts to learning the  -function from ℋ  .</p>
      <sec id="sec-1-1">
        <title>The regret of a policy  is   *</title>
        <p>, the loss in the expected reward obtained by following
 instead of following the optimal policy  *. Now, a learning algorithm’s eficiency in
exploration can be measured by its cumulative regret over time.
( ) has an added perturbation
( ). After initializing  and

or selects  uniformly from 
provably ineficient.</p>
        <p>One exploration strategy often employed in Q-learning algorithms to date is the 
greedy approach, where one chooses  * = arg max ∈   (,  ) with probability 1 − 
with probability  . While  -greedy exploration ensures
exploration of the domain, its regret bound grows linearly with time, and is therefore</p>
        <p>
          A very simple test-bed for exploration strategies in reinforcement learning is the
multi-armed bandit problem. The state is void in this problem formulation, rendering  ,
 ,  0 vacuous and   a function only of   . There are several (asymptotically) optimal
algorithms for this problem, and one of the simplest ones is Thompson sampling [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Thompson sampling approximates a posterior distribution of the mean reward for each
action. The next action   is decided by sampling rewards for each action from these
posterior distributions and selecting the action that gave the highest sampled reward.
Posterior sampling methods have also been shown to behave eficiently with respect to
cumulative regret [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on general MDPs.
        </p>
        <p>
          Fortunato et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduced NoisyNet as a means for balancing exploration and
exploitation. They used a neural network to represent the  -function and extended their
 such that the network has suficient stochasticity for exploration, both parameters are
learned using standard backpropagation. Fortunato et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] apply NoisyNet to three
reinforcement learning algorithms: DQN, Dueling DQN, and A3C, and show improved
performance on all of them. Later, NoisyNet was used in the Rainbow algorithm [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
a combination of six extensions to the DQN algorithms [
          <xref ref-type="bibr" rid="ref3 ref5 ref6 ref7 ref8 ref9">3, 5, 6, 7, 8, 9</xref>
          ], that shows
state-of-the-art performance across 57 Atari games.
        </p>
        <p>
          A limitation of NoisyNet is that the initial uncertainty in the value or policy function is
crucial for exploration. If the uncertainty is too high, the algorithm will struggle to learn
anything, while if the uncertainty is too low, there is nothing incentivizing exploration
and the algorithm will not explore new trajectories. The learning approach in NoisyNet
is similar to variational inference schemes such as Bayes by backprop [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], where the
weights of a neural network model are assumed to be normally distributed with mean

( ) and standard deviation
        </p>
        <p>
          ( ). However, while the objective of Bayes by backprop is
to approximate  (| ), the posterior distribution over the weights after seeing a fixed
dataset  , the weights in NoisyNet are not given a prior distribution, and the learning,
therefore, does not result in a posterior distribution over . Consequently, NoisyNet does
not have the same optimality guarantee on total regret as methods that approximate a
posterior over the value functions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and the exploration could stop prematurely if the
standard deviations
        </p>
        <p>decline too quickly during learning. In posterior sampling-based
methods, we can fit the initial uncertainty to a prior distribution. This would mean
that with an appropriate prior, network parameter initialization is not as critical to
performance. Another advantage of posterior sampling is that it can provide a natural
way to incorporate domain knowledge. Prior knowledge can be used to create informative
prior distributions for value or policy functions.</p>
        <p>In this paper, we will therefore introduce BayesianExplore (BE), a fully Bayesian
extension of NoisyNet. The key idea is to use a Bayesian deep network to represent the
posterior distribution over  (,  ) given the history ℋ  , and thereafter use Thompson
sampling as a means to eficiently balance exploration and exploitation. To allow prior
knowledge to be eficiently encoded, we use</p>
        <p>function-space variational inference, meaning
that the model does not learn the posterior distribution over the parameters, but rather
the posterior process over the output of the model. BE relates to Q-learning in this
paper, but we note that the key idea would also apply to policy-based methods.</p>
        <p>We summarize our contributions as follows:
• We introduce BayesianExplore, a fully Bayesian Q-learning method for reinforcement
learning;</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>• We give initial results comparing BE to NoisyNet, showing competitive results;
• We show how simple heuristics can be eficiently encoded as functional priors;
• We show that these priors can significantly improve learning eficiency.
Before we delve into the background, we need to define some notation. Most of the
theory will be based on stochastic processes. The stochastic process we are interested
in generates value functions for an MDP and exists on the associated probability space.
It can be written as { (, ) : (, ) ∈  × }
. For any sample 
∈ Ω,  (·, ·,  ) is a
sample function mapping  ×  →
we will denote (, ) ∈  ×</p>
      <sec id="sec-2-1">
        <title>R. To simplify notation in the following subsections</title>
        <p>by  ∈  , the sample functions as  :
 →</p>
        <p>R, and the
associated process as ℱ . We will use f1: for a collection of 

points { ∈ X}, respectively. The marginal process at the set X = { } =1 ∈ 
f () and {f (),  ∈ X} denote the process evaluated at single point  and the set of
 is
with a slight abuse of notation denoted by ℱ (X), and we evaluate a likelihood using
sample-functions. Next,
the notation  (|ℱ (X)). For instance, if ℱ
and covariance  (, ′), f1: is a collection of 
is a Gaussian process (GP) with mean  ()
realizations from that Gaussian process,
the likelihood of  under the Gaussian jointly defined by ℱ (X).
ℱ (X) is the multivariate Gaussian obtained at X with associated parameters, f () is a
univariate Gaussian with given mean and standard deviation, and  (|ℱ (X)) evaluates</p>
        <sec id="sec-2-1-1">
          <title>2.1. Noisy Networks</title>
          <p>NoisyNet can be viewed as a stochastic process represented by a neural network with
stochastic weights. We will define its sampling distribution as 
function</p>
          <p>
            now realizes a neural network to represent the Q-function  (,  ). This
means that NoisyNet can model stochastic policies, and Fortunato et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] show through
empirical analysis that the NoisyNet policy sometimes converges to a non-deterministic
policy. Nevertheless, they also point out that there always exists a deterministic optimal
policy for the mean squared error loss in DQN. Deep neural networks used as function
approximations in Q-learning were labeled DQN by Mnih et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Later, Mnih et al.
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] made a significant improvement to the learning stability of their original DQNs by
∼  , where a sample
          </p>
          <p>BayesianExplore</p>
          <p>NoisyNet
200
introducing a target network. The target network has the same structure as the regular
network, and the weights are copied over from the regular network every  − timesteps.
The target network is used to calculate the target Q-value for the temporal diference
(TD) error. Having a more stationary target is shown to improve the stability of training.
This was a substantial improvement, as training the DQN was previously unstable.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2. Functional Variational Bayesian Neural Networks</title>
          <p>
            Consider a supervised learning problem, where we desire a parameterized function
  :  →  to map an input  ∈  to a target  ∈  . If we train a Bayesian neural
network with stochastic weights  to represent  , the standard procedure is to assume
that the dataset  with datapoints (, ) is given, and proceed by defining a prior
distribution  () over the weights [
            <xref ref-type="bibr" rid="ref10 ref13 ref14">13, 10, 14, 15</xref>
            ]. After defining  () we can use
variational inference methods to approximate  (| ) =  (| ) ()/ ( ) and use that
to realize  . The disadvantage of this approach is that prior knowledge we might have
about the domain typically relates to the behaviour of the function  (), which is
very dificult to encode at the level of the individual weights in . Consequently, the
prior  () will in essence only act as a regularizer, and is not a suitable medium for
incorporating informative a priori knowledge.
          </p>
          <p>Functional variational Bayesian neural networks [16] is a variational inference method
for neural networks that approximates the posterior distribution in function space. This
means that our prior will be a distribution over functions, i.e., a stochastic process, and as
part of the evaluation of the evidence lower bound (ELBO) we will need to calculate the
KL-divergence from one process to another. Sun et al. [16] show that for two stochastic
processes  and  , the KL-divergence from  to  is the supremum of the marginal
KL-divergences over all finite measurement sets. Let  (X) (resp.  (X)) be the marginal
distribution of function values from the process  (resp  ) at some set of points X ∈   ,
then:</p>
          <p>KL[‖
] =</p>
          <p>sup
 ∈N,X∈</p>
          <p>KL [ (X)‖ (X)] ,
(1)
KL-divergence by using finite measurement sets
Note that as the KL-divergence between two processes is a supremum over the marginals
on the right-hand side of Equation 1, it holds for any given X that KL[‖
KL [ (X)‖ (X)]. In the following, we will nevertheless approximate the functional
 , acknowledging the fact that
we may underestimate the true KL-divergence between the two stochastic processes.
From now on we will therefore be talking about the KL-divergence between marginal
X ∈ 
define the sample function.
distributions of function values instead of between processes.</p>
          <p>Now, let the stochastic processes be represented by a neural network  . A priori we
will assume 
∼  , and use variational inference to find the posterior process 
which is parameterized by . We will think of the generative process   as follows: We
sample a vector  from, e.g., a standard Gaussian and populate a neural network with
weights   =   +   ·   , where  = ( ,  ) is the collection of parameters required to</p>
          <p>In our notation, Sun et al. [16] showed that the gradient of the KL-divergence for
functions marginalized at the measurement set X is
∇KL [ ({f ()}∈X)‖ ({f ()}∈X)] =</p>
          <p>E [︀ ∇{f ()}∈X(∇f log  ({f ()}∈X)
− ∇f log  ({f ()}∈X))]︀ . (2)</p>
          <p>Here we have used that the expected value of the score function is zero. The dificult
part in Equation (2) is to estimate ∇f log  ({f ()}∈X) and ∇f log  {f ()}∈X). The
entropy derivative ∇f log  ({f ()}∈X) is generally intractable. Depending on how we
define the prior, however,</p>
          <p>∇f log  ({f ()}∈X) can be easy to compute analytically. To
reduce variance in the gradients, we use tractable priors in this paper.</p>
          <p>To estimate the gradient of the log-density under  , ∇f log  ({f ()}∈X), Sun et al.
[16] use the Spectral Stein Gradient Estimator (SSGE) [17]. Shi et al. [17] show that, for
a diferentiable density  and positive definite kernel  (·, ·) in the Stein class of  , we can
approximate the gradient ∇f log  ({f ()}∈X). Given 
approximation [18, 19] is used to calculate the first  eigenfunctions of  ,  ^ 1, . . . ,  ^  . It
samples from  , the Nyström
follows that
where
∇f log  ({f ()}∈X) ≈
∇f    ^  ({f ()}∈X),</p>
          <p>^
︁∑

 =1
^
  = −
1 ∑︁ 
  =1</p>
          <p>^
∇f   (f ).</p>
          <p>In short, this is a method for estimating the gradient function of implicit distributions
using approximations to eigenfunctions of a kernel-based operator. We will follow Shi et al.
[17] and use the RBF kernel in all experiments. This brings three hyper-parameters to the
algorithm: the number of samples</p>
          <p>from the implicit distribution used to approximate
the gradient, the number of eigenvectors  used to approximate the gradient, and  , a
regularisation parameter that smooths the gradient function.</p>
          <p>The full objective for the functional Bayesian neural network becomes
ℰ =
1</p>
          <p>︁∑
|  | (, )∈ 
log  ( |f ()) −  · KL [  ({f ()}∈X) ‖ ({f ()}∈X)] ,
(3)
using Monte Carlo sampling: We generate f1: () with 
where we use  ( |f ()) to denote the likelihood of the observation  under the stochastic
process evaluated at  (e.g., f () could be a univariate Gaussian with given mean and
standard deviation). Note here that we approximate log  ( |f ()) in the implementation
∼  , use these to approximate
the local model f (), and thereby also approximate the log-likelihood of the observation
 under the generative process  .</p>
          <p>In order for ℰ in Equation (3) to match the functional ELBO and be a proper lower
bound for log  ( ),  should be set to  =
a lower bound for the true KL divergence b||etween the processes, so a larger value for 
1 . Sun et al. [16] note, however, that ℰ uses
is likely necessary to maintain properly calibrated posterior uncertainty. They, therefore,
use one over the batch size instead,  =</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>1 , a strategy that we will also follow.</title>
        <p>|  |
3.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>Osband and Van Roy [20] prove that posterior sampling of Q-values for reinforcement
learning in finite horizon MDPs has at least a near-optimal regret bound. They further
conjecture that their bound can be improved to show optimal regret. Additionally, a
posterior sampling-based reinforcement learning algorithm can be made to utilize domain
knowledge through an appropriate prior. This should improve the policy convergence
rate. Combined with the computational eficiency of posterior sampling, this motivates
the development of a Bayesian reinforcement learning algorithm with functional priors.</p>
      <p>We will present a method based on functional variational Bayesian neural networks [16]
that allows eficient exploration, and the inclusion of domain knowledge.</p>
      <p>
        This can be achieved by modeling posterior distributions either over the policy function
or a value function, then sample an action directly from that posterior (in case of policy
focus) or use greedy action-selection based on samples from the value-function posterior.
In this paper, we will approximate the posterior distribution of the Q-value function
using DQN [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>We use the functional variational Bayesian neural network (FVBNN) [16] framework
discussed previously and compare our approach to NoisyNet. Note that when NoisyNet
uses one sample from  for each optimization step we will instead use 
samples from  .</p>
      <p>This is needed to use the FVBNN loss defined in Equation
(3) rather than the temporal
diference used in NoisyNet. The Bayesian formalism allows us to incorporate a priori
domain knowledge through the prior  , and will encourage exploration with (close to)
optimal regret [20].</p>
      <p>Recall that the learning objective in Equation (3) requires the evaluation of the
marginal {f ()}∈X. Here, the measurement set X should contain representative samples
from  . In our setting  will be state-action pairs, and 
=  × 

  consists of  examples of state-action pairs combined with the predicted  -value,
. The data-set
Penultimate Layer</p>
      <p>BayesianExplore
  = {( ,   ,   )</p>
      <p>} =1; the subscript  is used to denote the version of the target net
used to generate the   -values. In the following the set X is defined as the set of all
state-action pairs for states we have already explored:</p>
      <p>X = {( ,   ) | ∀ ∈  , ∀  ∈ } .
a Gaussian process, ℱ ∼
output vectors: the mean 
X will eventually have full support in  × 
if the MDP is ergodic.</p>
      <p>To calculate log  (  |f ( ,   )) we will assume that the underlying stochastic process is
GP. The network architecture for  is defined to produce two
and the log standard deviation 
= log  .</p>
      <p>This gives the following loss function when evaluated on a sub-sample   :
1 |∑︁  |
|  |  =1
ℒ =  · KL[ ‖ ] +</p>
      <p>+ (  −   )2 exp(−  ).</p>
      <p>
        Algorithm 1 shows a general DQN-update iteration shared by both NoisyNet and
BE where the agent interacts with the environment. The diference between NoisyNet
and BE is how the neural network is updated when Algorithm 1 makes the call to
UpdateNet in line 10. Algorithm 2 and Algorithm 3 provide two diferent definitions for
the UpdateNet function. Algorithm 2 details the procedure for updating NoisyNet. It
begins by sampling one value function  and one target function  ′. The target function
is used to calculate the target-value for the value function in line 8. After that, the
network is updated by minimising the temporal diference error. Algorithm 3 shows the
pseudo-code for BE. The first diference to NoisyNet is that the network has two outputs
for each action, the mean   and standard deviation   . Note that we use subscript 
when we are only interested in the mean (line 7 and 9) and no subscript when we fetch
both results (line 8). The next diference is that we sample 
functions from the network
(line 3) and target network (line 4). Instead of the temporal diference error, a Gaussian
log-likelihood loss is used instead (line 10). The gradient KL-loss term is calculated using
the spectral Stein gradient estimator as outlined above (call to SSGE in line 13).
Algorithm 1 DQN-update
1: function UpdateNet(, ,  )
◁ Update value network
◁ Update target network
4. Experiments
5:
6:
7:
8:
9:
10:
11:
12:
13:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
Our method most closely resembles NoisyNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], so all experiments will mainly compare
the performance of BE to that baseline.
      </p>
      <sec id="sec-3-1">
        <title>4.1. Details and Hyper-parameters</title>
        <p>
          We compare our method with NoisyNet [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] on three diferent environments in the OpenAI
Gym [21]: Cartpole, MountainCar, and LunarLander. Whenever possible we will use
the hyper-parameters employed by Han et al. [22]. Additionally, our method has
hyperparameters related to the calculation of the functional KL-divergence. These are reported
in Table 1. Note the relatively low number of eigenvectors  and relatively high value for
 . Both were chosen to smooth the gradient estimates. Note also that we have used the
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15: end function
Algorithm 3 Functional Bayesian Update
  ∼    = 1, . . . , 
  ′ ∼    = 1, . . . , 
  ← arg max   (,  )  = 1, . . . , 
(   ,    ) ←   (,   )  = 1, . . . ,
        </p>
        <p>= 1, . . . , 
pre-activations rather than the weights themselves, which results in significant speedup.
This is especially important for BE, which does 
times as many samples per iteration.</p>
        <p>LRT also reduces the variance of the gradient, which stabilizes training.</p>
        <p>All experiments have been implemented in Julia, and the source code is available at
https://github.com/XXX.1</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Flat Prior</title>
        <p>First, we will examine how BE compares to NoisyNet when we do not encode a priori
knowledge about an environment into the model. To investigate this we employ a “flat”
prior, namely an improper prior with constant probability on R . An improper prior is
not a probability density (since it does not integrate to 1), but this is not problematic as
BE only requires the gradients and does not need to evaluate the probabilities themselves.
1The URL to the repository containing all the source code including scripts to reproduce our results will
be made available once the article is accepted for publication.</p>
        <p>Prior
 = 1,  = 1
 = 1,  = 10
 = 1,  = 50</p>
        <p>Flat prior
 = −1,  = 50
 = −1,  = 20
 = −1,  = 10</p>
        <p>The training curves for BE and standard NoisyNet-DQN on the three selected OpenAI
Gym environments are shown in Figure 1. While both methods can solve all environments,
BE uses considerably fewer frames to find an optimal policy for Cartpole. The methods
are comparable in the two other environments.</p>
        <p>
          Fortunato et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] noted that the learned variance in their weights increased during
exploration in some environments despite there existing an optimal deterministic solution,
and the loss provides no incentive to maintain uncertainty. Figure 2 shows the mean
standard deviation for the penultimate and final layer for BE and NoisyNet evaluated on
Cartpole. It is interesting to see that the last layer’s standard deviation for NoisyNet
continues to decrease throughout the training while BE’s uncertainty initially decreases
faster but then stabilises at a higher degree of uncertainty than NoisyNet’s. Figure 1
shows that BE has found a near-optimal solution after approximately 5k frames and
an optimal policy after 10k frames. Interestingly, the standard deviation of the weight
parameters for BE in the last layer stops decreasing after 5k iterations. This seems to
indicate that BE is satisfied that is has found a stable policy, where optimising further
would be overfitting to noise. Given that the gradient in the last layer is suficiently
small, the nature of the chain rule will cause the gradient in the penultimate layer to be
smaller, which can explain why the mean standard deviation in the penultimate layer
decreases more slowly.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Informative Prior</title>
        <p>One of the benefits of a functional prior is that we can incorporate domain knowledge
to get more eficient exploration. We will now see how our method can utilise domain
knowledge to improve sample eficiency. To measure the efectiveness of priors we will
use a distribution that has a higher concentration of large Q-values for actions that we
want to incentivize.</p>
        <p>We purposefully do not do an extensive search for a “good” prior distributions. Rather,
we are interested in the efect a “simple” prior can have on performance. The results
reported in Table 2 will reveal the efectiveness of the prior distributions. To this end,
we have chosen to define the prior as a Gaussian process with the following mean and
kernel function:
 (,  ) =  1 +  ·  2 · A⊺,
 (, ′) =  2 ( = ′),
where we use the notation that  ( ) = 1 if  is true and 0 otherwise. Values for  1,
 2, and   for each environment can be found in Table 3.  is either +1, indicating a
“helpful” prior, or −1, indicating an “unhelpful” prior. A was selected based on vague
information such as “In Cartpole, it is good to move left if the pole is leaning to the left,
and vice versa” and “In MountainCar, it is better to move left if your cart is already
moving to the left, and vice versa”.  1 and  2 were set so that the prior mean for the
Q-values would be roughly at the true mean, though we suspect this could be a restricting
factor of our prior. Experimental results for varied values of  and  , can be seen in
Table 2.</p>
        <p>An alternative approach to defining the prior could be to focus on smoothness, i.e., use
 (, ′) to incorporate that states that are similar also are likely to have similar Q-values.
This would also be a prior that does not necessarily need much domain knowledge to be
efective.</p>
        <p>
          For Cartpole, strong helpful priors result in a substantial benefit in the number of
episodes to solve the environment, and even a weak unhelpful prior outperforms NoisyNet
here. MountainCar benefits from a strong and helpful prior, solving the environment
in as little as 100 episodes. However, a vague and presumably helpful prior appears to
be harmful in this environment. For LunarLander, a strong helpful prior prevented the
algorithm from solving the environment, yet a more vague prior was beneficial. This seems
to indicate that the prior used in that environment was not very precise. Unsurprisingly,
strong unhelpful priors prevented the algorithm from solving any environment, with runs
terminated at 30k iterations for Cartpole, 500k iterations for MountainCar, and 2mill
iterations for LunarLander. We observe that the strongest priors (both “helpful” and
“unhelpful”) may restrict the exploration too much, and unless the prior is focused on an
optimal strategy, the environment is not solved. Overall, the results show that some efort
has to be put into creating efective priors for certain environments, but that domain
knowledge can be extremely valuable if it is available. Finally, BE with an appropriately
defined prior outperformed NoisyNet [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] on all environments. We conclude that the
Bayesian formulation combined with well-functioning priors can be an alternative to
other strategies to provide domain knowledge, like reward shaping.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion and Discussion</title>
      <p>This paper presents BayesianExplore (BE), a fully Bayesian reinforcement learning
algorithm. This is valuable because it is known that posterior sampling of Q-values
for reinforcement learning in finite horizon MDPs has (close to) optimal regret bound
[20]. Initial experiments show that BE is comparable to NoisyNet in well-known test
environments.</p>
      <p>Next, since we utilized recent breakthroughs in function-space variational inference
[16] to formulate the model as a stochastic process, we have the opportunity to encode
domain knowledge into prior information that can lead to faster learning. BE with an
informative prior outperforms NoisyNet in all environments.</p>
      <p>One interesting avenue for future work is to extend the approach to methods other
than the standard DQN. We hypothesise that BE can be adapted and used to improve
exploration in any algorithm that is compatible with NoisyNet. A functional Bayesian
approach for policy evaluation, where the Gaussian process we used in this paper would
be replaced by a Dirichlet process, which would permit prior distributions in policy
space rather than in value space. This can be a more intuitive representation of a priori
knowledge in many situations.
[15] W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, A. G. Wilson, A Simple Baseline
for Bayesian Uncertainty in Deep Learning (2019). URL: https://arxiv.org/abs/1902.
02476v2.
[16] S. Sun, G. Zhang, J. Shi, R. Grosse, Functional Variational Bayesian Neural
Networks, arXiv:1903.05779 [cs, stat] (2019). URL: http://arxiv.org/abs/1903.05779,
arXiv: 1903.05779.
[17] J. Shi, S. Sun, J. Zhu, A Spectral Approach to Gradient Estimation for Implicit
Distributions, arXiv:1806.02925 [cs, stat] (2018). URL: http://arxiv.org/abs/1806.
02925, arXiv: 1806.02925.
[18] E. J. Nyström, Über die praktische auflösung von integralgleichungen mit
anwendungen auf randwertaufgaben, Acta Mathematica 54 (1933) 185–204.
[19] C. Williams, M. Seeger, Using the nyström method to speed up kernel machines, in:
T. Leen, T. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing
Systems 13 (NIPS 2000), MIT Press, 2001, pp. 682–688.
[20] I. Osband, B. Van Roy, Why is Posterior Sampling Better than Optimism for
Reinforcement Learning?, arXiv:1607.00215 [cs, stat] (2017). URL: http://arxiv.org/
abs/1607.00215, arXiv: 1607.00215.
[21] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang,
W. Zaremba, OpenAI Gym, arXiv:1606.01540 [cs] (2016). URL: http://arxiv.org/
abs/1606.01540, arXiv: 1606.01540.
[22] S. Han, W. Zhou, J. Liu, S. Lü, NROWAN-DQN: A Stable Noisy Network with
Noise Reduction and Online Weight Adjustment for Exploration, arXiv:2006.10980
[cs, stat] (2020). URL: http://arxiv.org/abs/2006.10980, arXiv: 2006.10980.
[23] D. P. Kingma, T. Salimans, M. Welling, Variational dropout and the local
reparameterization trick, 2015. arXiv:1506.02557.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples</article-title>
          ,
          <source>Biometrika</source>
          <volume>25</volume>
          (
          <year>1933</year>
          )
          <fpage>285</fpage>
          -
          <lpage>294</lpage>
          . URL: https://www.jstor.org/stable/2332286. doi:
          <volume>10</volume>
          .2307/2332286, publisher: [Oxford University Press, Biometrika Trust].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Osband</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Van</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          , Generalization and Exploration via Randomized Value Functions, arXiv:
          <fpage>1402</fpage>
          .0635 [cs, stat] (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1402. 0635, arXiv:
          <fpage>1402</fpage>
          .
          <fpage>0635</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fortunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Azar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Menick</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Osband</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pietquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          , Noisy Networks for Exploration, arXiv:
          <fpage>1706</fpage>
          .10295 [cs, stat] (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1706. 10295, arXiv:
          <fpage>1706</fpage>
          .
          <fpage>10295</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Modayil</surname>
          </string-name>
          , H. van Hasselt,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          , G. Ostrovski,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dabney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Horgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Azar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          , Rainbow: Combining Improvements in Deep Reinforcement Learning, arXiv:
          <fpage>1710</fpage>
          .02298 [cs] (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1710.02298, arXiv:
          <fpage>1710</fpage>
          .
          <fpage>02298</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dabney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Distributional</given-names>
            <surname>Perspective</surname>
          </string-name>
          on Reinforcement Learning, arXiv:
          <fpage>1707</fpage>
          .06887 [cs, stat] (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1707. 06887, arXiv:
          <fpage>1707</fpage>
          .
          <fpage>06887</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hessel</surname>
          </string-name>
          , H. van Hasselt,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          , N. de Freitas,
          <article-title>Dueling Network Architectures for Deep Reinforcement Learning</article-title>
          , arXiv:
          <fpage>1511</fpage>
          .06581 [cs] (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1511.06581, arXiv:
          <fpage>1511</fpage>
          .
          <fpage>06581</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>H. van Hasselt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Deep Reinforcement Learning with Double Q-learning</article-title>
          ,
          <source>arXiv:1509</source>
          .06461 [cs] (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1509.06461, arXiv:
          <fpage>1509</fpage>
          .
          <fpage>06461</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          , Prioritized Experience Replay, arXiv:
          <fpage>1511</fpage>
          .05952 [cs] (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1511.05952, arXiv:
          <fpage>1511</fpage>
          .
          <fpage>05952</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <article-title>Asynchronous Methods for Deep Reinforcement Learning</article-title>
          , arXiv:
          <fpage>1602</fpage>
          .01783 [cs] (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1602.01783, arXiv:
          <fpage>1602</fpage>
          .
          <fpage>01783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cornebise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Weight Uncertainty in Neural Network</article-title>
          , in: F.
          <string-name>
            <surname>Bach</surname>
          </string-name>
          , D. Blei (Eds.),
          <source>Proceedings of the 32nd International Conference on Machine Learning</source>
          , volume
          <volume>37</volume>
          <source>of Proceedings of Machine Learning Research</source>
          , PMLR, Lille, France,
          <year>2015</year>
          , pp.
          <fpage>1613</fpage>
          -
          <lpage>1622</lpage>
          . URL: http://proceedings.mlr. press/v37/blundell15.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Playing Atari with Deep Reinforcement Learning</article-title>
          , arXiv:
          <fpage>1312</fpage>
          .5602 [cs] (
          <year>2013</year>
          ). URL: http://arxiv.org/abs/1312.5602, arXiv:
          <fpage>1312</fpage>
          .
          <fpage>5602</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          , G. Ostrovski,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadik</surname>
          </string-name>
          , I. Antonoglou,
          <string-name>
            <given-names>H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Rezende</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Stochastic Backpropagation and Approximate Inference in Deep Generative Models</article-title>
          , arXiv:
          <fpage>1401</fpage>
          .4082 [cs, stat] (
          <year>2014</year>
          ). URL: http://arxiv.org/abs/1401.4082, arXiv:
          <fpage>1401</fpage>
          .
          <fpage>4082</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Botev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Scalable</given-names>
            <surname>Laplace Approximation for Neural Networks</surname>
          </string-name>
          ,
          <year>2018</year>
          . URL: https://openreview.net/forum?id=
          <fpage>Skdvd2xAZ</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>