
Exploration in Deep Reinforcement Learning

Ludvig Killingberg

ludvig.killingberg@ntnu.no

Helge Langseth

helge.langseth@ntnu.no

Norwegian University of Science and Technology, Høgskoleringen 1, 7034 Trondheim, Norway

Use permitted under Creative Commons License Attribution 4.0

Posterior sampling of value functions can give efficient exploration for value-based reinforcement learning algorithms. We introduce BayesianExplore (BE), a posterior sampling-based method for reinforcement learning based on Bayesian function-space variational inference over stochastic processes. This Bayesian formalism allows us to formalize domain knowledge as a prior over the value function. Our approach, therefore, provides an alternative to reward shaping with the added benefit that the algorithm keeps seeking an optimal policy in the original environment instead of the one with altered rewards. We show that BE produces state-of-the-art efficiency in exploration with flat priors, and that it is easy to significantly improve performance by incorporating domain knowledge using simple priors.

Bayesian deep learning, reinforcement learning

1. Introduction

[...], where $\mathcal{S}$ is the state space and [...], where the expectation is taken over the policy, transition, and reward distributions. An efficient agent must be able to learn from the data it collects, but since the data depends on the policy, it must also prioritize exploring states and actions that the agent can learn from.

Related to this is the $Q$-function, $Q^{\pi}(s_0, a_0)$, defined as the expected reward of taking action $a_0$ in state $s_0$ and then following policy $\pi$ thereafter: $Q^{\pi}(s_0, a_0) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0, a_0\right]$. Q-learning amounts to learning the $Q$-function from $\mathcal{H}$.

The regret of a policy $\pi$ is the loss in the expected reward obtained by following $\pi$ instead of following the optimal policy $\pi^*$. Now, a learning algorithm's efficiency in exploration can be measured by its cumulative regret over time.

One exploration strategy often employed in Q-learning algorithms to date is the $\epsilon$-greedy approach, where one chooses $a^* = \arg\max_{a \in \mathcal{A}} Q(s, a)$ with probability $1 - \epsilon$, or selects uniformly from $\mathcal{A}$ with probability $\epsilon$. While $\epsilon$-greedy exploration ensures exploration of the domain, its regret bound grows linearly with time, and it is therefore provably inefficient.

A very simple test-bed for exploration strategies in reinforcement learning is the multi-armed bandit problem. The state is void in this problem formulation, so the state-related components of the MDP are vacuous and the reward is a function only of the action. There are several (asymptotically) optimal algorithms for this problem, and one of the simplest ones is Thompson sampling [1]. Thompson sampling approximates a posterior distribution of the mean reward for each action. The next action is decided by sampling rewards for each action from these posterior distributions and selecting the action that gave the highest sampled reward. Posterior sampling methods have also been shown to behave efficiently with respect to cumulative regret [2] on general MDPs.
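The Thompson sampling procedure described above can be illustrated with a minimal Python sketch (not the authors' implementation). It assumes Bernoulli rewards and a uniform Beta(1, 1) prior per arm; the function name `thompson_sampling` and the arm means used in the example are hypothetical.

```python
import random


def thompson_sampling(true_means, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # number of observed rewards equal to 1
    failures = [0] * n_arms   # number of observed rewards equal to 0
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Sample one plausible mean reward per arm from its Beta posterior.
        # Beta(1, 1) is the (assumed) uniform prior before any data is seen.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical three-armed bandit; the last arm is the best one.
    print(thompson_sampling([0.2, 0.5, 0.8], n_rounds=2000))
```

As the posteriors of the poorer arms concentrate, their sampled means rarely exceed that of the best arm, so exploration tapers off without an explicit schedule.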

Fortunato et al. [3] introduced NoisyNet as a means for balancing exploration and exploitation. They used a neural network to represent the $Q$-function and extended it so that each weight has an added perturbation. After initializing the weight and perturbation parameters such that the network has sufficient stochasticity for exploration, both sets of parameters are learned using standard backpropagation. Fortunato et al. [3] apply NoisyNet to three reinforcement learning algorithms: DQN, Dueling DQN, and A3C, and show improved performance on all of them. Later, NoisyNet was used in the Rainbow algorithm [4], a combination of six extensions to the DQN algorithm [3, 5, 6, 7, 8, 9], that shows state-of-the-art performance across 57 Atari games.

A limitation of NoisyNet is that the initial uncertainty in the value or policy function is crucial for exploration. If the uncertainty is too high, the algorithm will struggle to learn anything, while if the uncertainty is too low, there is nothing incentivizing exploration and the algorithm will not explore new trajectories. The learning approach in NoisyNet is similar to variational inference schemes such as Bayes by backprop [10], where the weights of a neural network model are assumed to be normally distributed with mean $\mu$ and standard deviation $\sigma$. However, while the objective of Bayes by backprop is to approximate $p(w \mid \mathcal{D})$, the posterior distribution over the weights after seeing a fixed dataset $\mathcal{D}$, the weights in NoisyNet are not given a prior distribution, and the learning, therefore, does not result in a posterior distribution over $w$. Consequently, NoisyNet does not have the same optimality guarantee on total regret as methods that approximate a posterior over the value functions [2], and the exploration could stop prematurely if the standard deviations $\sigma$ decline too quickly during learning. In posterior sampling-based methods, we can fit the initial uncertainty to a prior distribution. This would mean that with an appropriate prior, network parameter initialization is not as critical to performance. Another advantage of posterior sampling is that it can provide a natural way to incorporate domain knowledge. Prior knowledge can be used to create informative prior distributions for value or policy functions.

In this paper, we will therefore introduce BayesianExplore (BE), a fully Bayesian extension of NoisyNet. The key idea is to use a Bayesian deep network to represent the posterior distribution over $Q(s, a)$ given the history $\mathcal{H}$, and thereafter use Thompson sampling as a means to efficiently balance exploration and exploitation. To allow prior knowledge to be efficiently encoded, we use function-space variational inference, meaning that the model does not learn the posterior distribution over the parameters, but rather the posterior process over the output of the model. BE relates to Q-learning in this paper, but we note that the key idea would also apply to policy-based methods.

We summarize our contributions as follows:
• We introduce BayesianExplore, a fully Bayesian Q-learning method for reinforcement learning;
• We give initial results comparing BE to NoisyNet, showing competitive results;
• We show how simple heuristics can be efficiently encoded as functional priors;
• We show that these priors can significantly improve learning efficiency.

2. Background

Before we delve into the background, we need to define some notation. Most of the theory will be based on stochastic processes. The stochastic process we are interested in generates value functions for an MDP and exists on the associated probability space. It can be written as $\{F(s, a) : (s, a) \in \mathcal{S} \times \mathcal{A}\}$. For any sample $\omega \in \Omega$, $F(\cdot, \cdot, \omega)$ is a sample function mapping $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$. To simplify notation in the following subsections we will denote $(s, a) \in \mathcal{S} \times \mathcal{A}$ by $x \in \mathcal{X}$, the sample functions as $f : \mathcal{X} \to \mathbb{R}$, and the associated process as $\mathcal{F}$. We will use $f_{1:k}$ for a collection of $k$ sample functions. Next, the notation $\mathbf{f}(x)$ and $\{\mathbf{f}(x), x \in \mathbf{X}\}$ denote the process evaluated at a single point $x$ and at the set of points $\{x \in \mathbf{X}\}$, respectively. The marginal process at the set $\mathbf{X} = \{x_i\}_{i=1}^{n} \in \mathcal{X}^{n}$ is, with a slight abuse of notation, denoted by $\mathcal{F}(\mathbf{X})$, and we evaluate a likelihood using the notation $p(y \mid \mathcal{F}(\mathbf{X}))$. For instance, if $\mathcal{F}$ is a Gaussian process (GP) with mean $m(x)$ and covariance $k(x, x')$, $f_{1:k}$ is a collection of $k$ realizations from that Gaussian process, $\mathcal{F}(\mathbf{X})$ is the multivariate Gaussian obtained at $\mathbf{X}$ with associated parameters, $\mathbf{f}(x)$ is a univariate Gaussian with given mean and standard deviation, and $p(y \mid \mathcal{F}(\mathbf{X}))$ evaluates the likelihood of $y$ under the Gaussian jointly defined by $\mathcal{F}(\mathbf{X})$.

2.1. Noisy Networks

NoisyNet can be viewed as a stochastic process represented by a neural network with stochastic weights. We will define its sampling distribution as $f \sim q_\theta$, where a sample function $f$ now realizes a neural network to represent the Q-function $Q(s, a)$. This means that NoisyNet can model stochastic policies, and Fortunato et al. [3] show through empirical analysis that the NoisyNet policy sometimes converges to a non-deterministic policy. Nevertheless, they also point out that there always exists a deterministic optimal policy for the mean squared error loss in DQN.

Deep neural networks used as function approximators in Q-learning were labeled DQN by Mnih et al. [11]. Later, Mnih et al. [12] made a significant improvement to the learning stability of their original DQNs by introducing a target network. The target network has the same structure as the regular network, and the weights are copied over from the regular network at a fixed interval of timesteps. The target network is used to calculate the target Q-value for the temporal difference (TD) error. Having a more stationary target is shown to improve the stability of training. This was a substantial improvement, as training the DQN was previously unstable.
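The two mechanisms described in this subsection, stochastic weights and a periodically synchronised target network, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Q-function is linear in hypothetical state features, each weight is drawn as mean plus scale times standard Gaussian noise, and the names `NoisyQ`, `sync_target`, and `period` are invented for the example.

```python
import copy
import random


def q_values(weights, features):
    """Linear Q-function: one weight vector per action."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


class NoisyQ:
    """Q-function whose weights are w = mu + sigma * eps, eps ~ N(0, 1)."""

    def __init__(self, n_actions, n_features, sigma0=0.5):
        self.mu = [[0.0] * n_features for _ in range(n_actions)]
        self.sigma = [[sigma0] * n_features for _ in range(n_actions)]

    def sample(self, rng):
        """Draw one concrete weight matrix, i.e. one sample function."""
        return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_a, sig_a)]
                for mu_a, sig_a in zip(self.mu, self.sigma)]


def sync_target(online, target, step, period):
    """Copy the online parameters into the target net every `period` steps."""
    if step % period == 0:
        target.mu = copy.deepcopy(online.mu)
        target.sigma = copy.deepcopy(online.sigma)
        return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    online, target = NoisyQ(2, 3), NoisyQ(2, 3)
    online.mu[0][0] = 1.0  # pretend a gradient step changed the online net
    for step in range(1, 21):
        sync_target(online, target, step, period=10)
    print(q_values(target.sample(rng), [1.0, 0.0, 0.0]))
```

Between synchronisations the target parameters stay fixed, which is what makes the TD target more stationary than one computed from the continually updated online network.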

2.2. Functional Variational Bayesian Neural Networks

Consider a supervised learning problem, where we desire a parameterized function $f_w : \mathcal{X} \to \mathcal{Y}$ to map an input $x \in \mathcal{X}$ to a target $y \in \mathcal{Y}$. If we train a Bayesian neural network with stochastic weights $w$ to represent $f$, the standard procedure is to assume that the dataset $\mathcal{D}$ with datapoints $(x, y)$ is given, and proceed by defining a prior distribution $p(w)$ over the weights [10, 13, 14, 15]. After defining $p(w)$ we can use variational inference methods to approximate $p(w \mid \mathcal{D}) = p(\mathcal{D} \mid w)\, p(w) / p(\mathcal{D})$ and use that to realize $f$. The disadvantage of this approach is that prior knowledge we might have about the domain typically relates to the behaviour of the function $f(x)$, which is very difficult to encode at the level of the individual weights in $w$. Consequently, the prior $p(w)$ will in essence only act as a regularizer, and is not a suitable medium for incorporating informative a priori knowledge.

Functional variational Bayesian neural networks [16] is a variational inference method for neural networks that approximates the posterior distribution in function space. This means that our prior will be a distribution over functions, i.e., a stochastic process, and as part of the evaluation of the evidence lower bound (ELBO) we will need to calculate the KL-divergence from one process to another. Sun et al. [16] show that for two stochastic processes $q$ and $p$, the KL-divergence from $q$ to $p$ is the supremum of the marginal KL-divergences over all finite measurement sets. Let $q(\mathbf{X})$ (resp. $p(\mathbf{X})$) be the marginal distribution of function values from the process $q$ (resp. $p$) at some set of points $\mathbf{X} \in \mathcal{X}^{n}$, then:

$$\mathrm{KL}[q \,\|\, p] = \sup_{n \in \mathbb{N},\, \mathbf{X} \in \mathcal{X}^{n}} \mathrm{KL}[q(\mathbf{X}) \,\|\, p(\mathbf{X})]. \qquad (1)$$

Note that as the KL-divergence between two processes is a supremum over the marginals on the right-hand side of Equation (1), it holds for any given $\mathbf{X}$ that $\mathrm{KL}[q \,\|\, p] \geq \mathrm{KL}[q(\mathbf{X}) \,\|\, p(\mathbf{X})]$. In the following, we will nevertheless approximate the functional KL-divergence by using finite measurement sets $\mathbf{X} \in \mathcal{X}^{n}$, acknowledging the fact that we may underestimate the true KL-divergence between the two stochastic processes. From now on we will therefore be talking about the KL-divergence between marginal distributions of function values instead of between processes.

Now, let the stochastic processes be represented by a neural network $f$. A priori we will assume $f \sim p$, and use variational inference to find the posterior process $q_\theta$, which is parameterized by $\theta$. We will think of the generative process as follows: We sample a vector $\epsilon$ from, e.g., a standard Gaussian and populate a neural network with weights $w = \mu + \sigma \cdot \epsilon$, where $\theta = (\mu, \sigma)$ is the collection of parameters required to define the sample function.
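To make the finite-measurement-set approximation of the functional KL-divergence concrete, the following Python sketch computes the KL-divergence between the marginals of two processes at a measurement set and shows that enlarging the set cannot decrease it. It assumes, purely for illustration, that function values at different points are independent Gaussians under both processes; `q_process` and `p_process` are hypothetical marginals, not quantities from the paper.

```python
import math


def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """KL[N(mu_q, sd_q^2) || N(mu_p, sd_p^2)] for univariate Gaussians."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2) - 0.5)


def marginal_kl(points, q, p):
    """KL between the marginals of q and p at a finite measurement set.

    q and p map a point x to (mean, std). Function values at different
    points are assumed independent, so the joint KL is a sum of terms.
    """
    return sum(kl_gauss(*q(x), *p(x)) for x in points)


def q_process(x):
    """Hypothetical variational marginal at x."""
    return (math.sin(x), 0.5)


def p_process(x):
    """Hypothetical prior marginal at x."""
    return (0.0, 1.0)


if __name__ == "__main__":
    small = [0.0, 1.0]
    large = small + [2.0, 3.0]
    # Each added point contributes a non-negative term, so the KL on the
    # larger set is at least as large as on the smaller one.
    print(marginal_kl(small, q_process, p_process))
    print(marginal_kl(large, q_process, p_process))
```

Any single measurement set therefore gives a lower bound on the supremum, which is why the approximation may underestimate the divergence between the processes.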

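In our notation, Sun et al. [16] showed that the gradient of the KL-divergence for functions marginalized at the measurement set $\mathbf{X}$ is

$$\nabla_\theta \mathrm{KL}\left[q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}}) \,\|\, p(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})\right] = \mathbb{E}_{\epsilon}\left[\nabla_\theta \{\mathbf{f}(x)\}_{x \in \mathbf{X}} \left(\nabla_{\mathbf{f}} \log q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}}) - \nabla_{\mathbf{f}} \log p(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})\right)\right]. \qquad (2)$$

Here we have used that the expected value of the score function is zero. The difficult part in Equation (2) is to estimate $\nabla_{\mathbf{f}} \log q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})$ and $\nabla_{\mathbf{f}} \log p(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})$. The entropy derivative $\nabla_{\mathbf{f}} \log q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})$ is generally intractable. Depending on how we define the prior, however, $\nabla_{\mathbf{f}} \log p(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})$ can be easy to compute analytically. To reduce variance in the gradients, we use tractable priors in this paper.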

To estimate the gradient of the log-density under $q$, $\nabla_{\mathbf{f}} \log q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})$, Sun et al. [16] use the Spectral Stein Gradient Estimator (SSGE) [17]. Shi et al. [17] show that, for a differentiable density $q$ and positive definite kernel $k(\cdot, \cdot)$ in the Stein class of $q$, we can approximate the gradient $\nabla_{\mathbf{f}} \log q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})$. Given $M$ samples from $q$, the Nyström approximation [18, 19] is used to calculate the first $J$ eigenfunctions of $k$, $\hat{\psi}_1, \ldots, \hat{\psi}_J$. It follows that

$$\nabla_{\mathbf{f}} \log q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}}) \approx \sum_{j=1}^{J} \hat{\beta}_j \hat{\psi}_j(\{\mathbf{f}(x)\}_{x \in \mathbf{X}}), \quad \text{where} \quad \hat{\beta}_j = -\frac{1}{M} \sum_{m=1}^{M} \nabla_{\mathbf{f}} \hat{\psi}_j(\mathbf{f}^{m}).$$

In short, this is a method for estimating the gradient function of implicit distributions using approximations to eigenfunctions of a kernel-based operator. We will follow Shi et al. [17] and use the RBF kernel in all experiments. This brings three hyper-parameters to the algorithm: the number of samples $M$ from the implicit distribution used to approximate the gradient, the number of eigenvectors $J$ used to approximate the gradient, and $\eta$, a regularisation parameter that smooths the gradient function.

The full objective for the functional Bayesian neural network becomes

$$\mathcal{E} = \frac{1}{|\mathcal{D}_s|} \sum_{(x, y) \in \mathcal{D}_s} \log p(y \mid \mathbf{f}(x)) - \lambda \cdot \mathrm{KL}\left[q(\{\mathbf{f}(x)\}_{x \in \mathbf{X}}) \,\|\, p(\{\mathbf{f}(x)\}_{x \in \mathbf{X}})\right], \qquad (3)$$

where we use $p(y \mid \mathbf{f}(x))$ to denote the likelihood of the observation $y$ under the stochastic process evaluated at $x$ (e.g., $\mathbf{f}(x)$ could be a univariate Gaussian with given mean and standard deviation). Note here that we approximate $\log p(y \mid \mathbf{f}(x))$ in the implementation using Monte Carlo sampling: We generate $f_{1:k}(x)$ with $f_i \sim q$, use these to approximate the local model $\mathbf{f}(x)$, and thereby also approximate the log-likelihood of the observation under the generative process $q$.

In order for $\mathcal{E}$ in Equation (3) to match the functional ELBO and be a proper lower bound for $\log p(\mathcal{D})$, $\lambda$ should be set to $\lambda = \frac{1}{|\mathcal{D}|}$. Sun et al. [16] note, however, that $\mathcal{E}$ uses a lower bound for the true KL-divergence between the processes, so a larger value for $\lambda$ is likely necessary to maintain properly calibrated posterior uncertainty. They, therefore, use one over the batch size instead, $\lambda = \frac{1}{|\mathcal{D}_s|}$, a strategy that we will also follow.

3. Method

Osband and Van Roy [20] prove that posterior sampling of Q-values for reinforcement learning in finite horizon MDPs has at least a near-optimal regret bound. They further conjecture that their bound can be improved to show optimal regret. Additionally, a posterior sampling-based reinforcement learning algorithm can be made to utilize domain knowledge through an appropriate prior. This should improve the policy convergence rate. Combined with the computational efficiency of posterior sampling, this motivates the development of a Bayesian reinforcement learning algorithm with functional priors.

We will present a method based on functional variational Bayesian neural networks [16] that allows efficient exploration and the inclusion of domain knowledge.

This can be achieved by modeling posterior distributions over either the policy function or a value function, and then either sampling an action directly from that posterior (in the case of a policy focus) or using greedy action-selection based on samples from the value-function posterior. In this paper, we will approximate the posterior distribution of the Q-value function using DQN [12].

We use the functional variational Bayesian neural network (FVBNN) [16] framework discussed previously and compare our approach to NoisyNet. Note that whereas NoisyNet uses one sample from $q_\theta$ for each optimization step, we will instead use $k$ samples from $q_\theta$.

This is needed to use the FVBNN loss defined in Equation (3) rather than the temporal difference used in NoisyNet. The Bayesian formalism allows us to incorporate a priori domain knowledge through the prior $p$, and will encourage exploration with (close to) optimal regret [20].
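Greedy action-selection on a sample from the value-function posterior, as described above, can be sketched as follows. This is a simplified Python illustration, not the authors' implementation: it assumes the posterior over each action's Q-value at the current state is an independent Gaussian with a given mean and standard deviation, and all function names and numbers are hypothetical.

```python
import random


def sample_q_values(means, stds, rng):
    """Draw one Q-value per action from independent Gaussian posteriors."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


def select_action(means, stds, rng):
    """Thompson-style selection: act greedily on one posterior sample."""
    q = sample_q_values(means, stds, rng)
    return max(range(len(q)), key=lambda a: q[a])


def selection_frequencies(means, stds, n_draws, seed=0):
    """Fraction of draws in which each action is selected."""
    rng = random.Random(seed)
    counts = [0] * len(means)
    for _ in range(n_draws):
        counts[select_action(means, stds, rng)] += 1
    return [c / n_draws for c in counts]


if __name__ == "__main__":
    # Action 0 looks better on average, but action 1 is very uncertain,
    # so it is still tried in a sizeable fraction of the draws.
    print(selection_frequencies([1.0, 0.0], [0.01, 5.0], n_draws=2000))
    # With no posterior uncertainty the rule reduces to plain greedy.
    print(selection_frequencies([1.0, 0.0], [0.0, 0.0], n_draws=100))
```

Exploration here is driven entirely by posterior uncertainty: once the standard deviations shrink, the rule becomes greedy without any separate exploration parameter.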

Recall that the learning objective in Equation (3) requires the evaluation of the marginal $\{\mathbf{f}(x)\}_{x \in \mathbf{X}}$. Here, the measurement set $\mathbf{X}$ should contain representative samples from $\mathcal{X}$. In our setting $x$ will be state-action pairs, and the data-set consists of examples of state-action pairs combined with the predicted $Q$-value, $y$: $\mathcal{D}_{\theta'} = \{(s_i, a_i, y_i)\}_{i=1}^{N}$; the subscript $\theta'$ is used to denote the version of the target net used to generate the $Q$-values. In the following the set $\mathbf{X}$ is defined as the set of all state-action pairs for states we have already explored:

$$\mathbf{X} = \{(s, a) \mid \forall s \in \mathcal{D}_{\theta'}, \forall a \in \mathcal{A}\}.$$

$\mathbf{X}$ will eventually have full support in $\mathcal{S} \times \mathcal{A}$ if the MDP is ergodic.

To calculate $\log p(y \mid \mathbf{f}(s, a))$ we will assume that the underlying stochastic process is a Gaussian process, $\mathcal{F} \sim \mathcal{GP}$. The network architecture for $f$ is defined to produce two output vectors: the mean $\mu$ and the log standard deviation $\rho = \log \sigma$.

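This gives the following loss function when evaluated on a sub-sample $\mathcal{D}_s$:

$$\mathcal{L} = \lambda \cdot \mathrm{KL}[q \,\|\, p] + \frac{1}{|\mathcal{D}_s|} \sum_{i=1}^{|\mathcal{D}_s|} \rho_i + (y_i - \mu_i)^2 \exp(-\rho_i).$$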
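The structure of this loss, a Gaussian negative log-likelihood over a minibatch plus a weighted KL term, can be sketched in Python as below. The sketch is an assumption-laden illustration rather than the authors' code: it uses the exact Gaussian negative log-likelihood for a network that outputs a mean and a log standard deviation (so constants and scalings may differ from the expression in the text), and it takes the KL term as a given number, whereas the method described here only estimates its gradient. The names `gaussian_nll`, `be_style_loss`, and `lam` are hypothetical.

```python
import math


def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of y under N(mu, exp(log_sigma)^2)."""
    return (log_sigma
            + 0.5 * (y - mu) ** 2 * math.exp(-2.0 * log_sigma)
            + 0.5 * math.log(2.0 * math.pi))


def be_style_loss(targets, means, log_sigmas, kl_estimate, lam):
    """Mean Gaussian NLL over a minibatch plus a weighted KL penalty."""
    n = len(targets)
    nll = sum(gaussian_nll(y, m, s)
              for y, m, s in zip(targets, means, log_sigmas)) / n
    return nll + lam * kl_estimate


if __name__ == "__main__":
    targets = [1.0, 2.0, 0.5]
    means = [0.8, 2.1, 0.0]
    log_sigmas = [0.0, -0.5, 0.2]
    # lam = 1 / batch size, following the choice discussed in Section 2.2.
    print(be_style_loss(targets, means, log_sigmas,
                        kl_estimate=3.0, lam=1.0 / len(targets)))
```

Predicting the log standard deviation rather than the standard deviation itself keeps the scale positive without constraining the network output.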

Algorithm 1 shows a general DQN-update iteration shared by both NoisyNet and BE where the agent interacts with the environment. The difference between NoisyNet and BE is how the neural network is updated when Algorithm 1 makes the call to UpdateNet in line 10. Algorithm 2 and Algorithm 3 provide two different definitions for the UpdateNet function.

Algorithm 2 details the procedure for updating NoisyNet. It begins by sampling one value function $f$ and one target function $f'$. The target function is used to calculate the target-value for the value function in line 8. After that, the network is updated by minimising the temporal difference error.

Algorithm 3 shows the pseudo-code for BE. The first difference to NoisyNet is that the network has two outputs for each action, the mean $\mu$ and standard deviation $\sigma$. Note that we use subscript $\mu$ when we are only interested in the mean (line 7 and 9) and no subscript when we fetch both results (line 8). The next difference is that we sample $k$ functions from the network (line 3) and target network (line 4). Instead of the temporal difference error, a Gaussian log-likelihood loss is used (line 10). The gradient of the KL-loss term is calculated using the spectral Stein gradient estimator as outlined above (call to SSGE in line 13).

[Algorithm 1: DQN-update]

4. Experiments

Our method most closely resembles NoisyNet [3], so all experiments will mainly compare the performance of BE to that baseline.

4.1. Details and Hyper-parameters

[Algorithm 3: Functional Bayesian Update]

We compare our method with NoisyNet [3] on three different environments in the OpenAI Gym [21]: Cartpole, MountainCar, and LunarLander. Whenever possible we will use the hyper-parameters employed by Han et al. [22]. Additionally, our method has hyper-parameters related to the calculation of the functional KL-divergence. These are reported in Table 1. Note the relatively low number of eigenvectors and relatively high value for $\eta$. Both were chosen to smooth the gradient estimates. Note also that we have used the local reparameterization trick (LRT) [23], which samples the pre-activations rather than the weights themselves, which results in significant speedup. This is especially important for BE, which does $k$ times as many samples per iteration. LRT also reduces the variance of the gradient, which stabilizes training.
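The idea behind sampling pre-activations instead of weights can be sketched for a single linear layer. The Python sketch below assumes independent Gaussian weights with means `mu[i][j]` and standard deviations `sigma[i][j]` (hypothetical names); under that assumption each pre-activation is itself Gaussian, so one noise draw per output unit suffices instead of one per weight.

```python
import math
import random


def preactivation_moments(x, mu, sigma):
    """Mean and variance of b_j = sum_i x_i * w_ij, w_ij ~ N(mu_ij, sigma_ij^2)."""
    n_in, n_out = len(x), len(mu[0])
    mean = [sum(x[i] * mu[i][j] for i in range(n_in)) for j in range(n_out)]
    var = [sum((x[i] * sigma[i][j]) ** 2 for i in range(n_in))
           for j in range(n_out)]
    return mean, var


def sample_preactivations(x, mu, sigma, rng):
    """Sample the layer output directly: one Gaussian draw per output unit."""
    mean, var = preactivation_moments(x, mu, sigma)
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    mu = [[0.5, -1.0], [0.25, 0.0]]
    sigma = [[0.1, 0.2], [0.3, 0.0]]
    print(preactivation_moments(x, mu, sigma))
    print(sample_preactivations(x, mu, sigma, random.Random(0)))
```

For a layer with many inputs this replaces a noise draw per weight with a noise draw per output, which is where the speedup comes from when several sample functions are needed per update.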

All experiments have been implemented in Julia, and the source code is available at https://github.com/XXX.¹

¹ The URL to the repository containing all the source code, including scripts to reproduce our results, will be made available once the article is accepted for publication.

4.2. Flat Prior

First, we will examine how BE compares to NoisyNet when we do not encode a priori knowledge about an environment into the model. To investigate this we employ a “flat” prior, namely an improper prior with constant probability on $\mathbb{R}$. An improper prior is not a probability density (since it does not integrate to 1), but this is not problematic as BE only requires the gradients and does not need to evaluate the probabilities themselves.

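[Table 2, prior settings compared: $d = 1, \sigma = 1$; $d = 1, \sigma = 10$; $d = 1, \sigma = 50$; flat prior; $d = -1, \sigma = 50$; $d = -1, \sigma = 20$; $d = -1, \sigma = 10$]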

The training curves for BE and standard NoisyNet-DQN on the three selected OpenAI Gym environments are shown in Figure 1. While both methods can solve all environments, BE uses considerably fewer frames to find an optimal policy for Cartpole. The methods are comparable in the two other environments.

Fortunato et al. [3] noted that the learned variance in their weights increased during exploration in some environments despite there existing an optimal deterministic solution, and the loss provides no incentive to maintain uncertainty. Figure 2 shows the mean standard deviation for the penultimate and final layer for BE and NoisyNet evaluated on Cartpole. It is interesting to see that the last layer’s standard deviation for NoisyNet continues to decrease throughout the training while BE’s uncertainty initially decreases faster but then stabilises at a higher degree of uncertainty than NoisyNet’s. Figure 1 shows that BE has found a near-optimal solution after approximately 5k frames and an optimal policy after 10k frames. Interestingly, the standard deviation of the weight parameters for BE in the last layer stops decreasing after 5k iterations. This seems to indicate that BE is satisfied that it has found a stable policy, where optimising further would be overfitting to noise. Given that the gradient in the last layer is sufficiently small, the nature of the chain rule will cause the gradient in the penultimate layer to be smaller, which can explain why the mean standard deviation in the penultimate layer decreases more slowly.

4.3. Informative Prior

One of the benefits of a functional prior is that we can incorporate domain knowledge to get more efficient exploration. We will now see how our method can utilise domain knowledge to improve sample efficiency. To measure the effectiveness of priors we will use a distribution that has a higher concentration of large Q-values for actions that we want to incentivize.

We purposefully do not do an extensive search for a “good” prior distribution. Rather, we are interested in the effect a “simple” prior can have on performance. The results reported in Table 2 will reveal the effectiveness of the prior distributions. To this end, we have chosen to define the prior as a Gaussian process with the following mean and kernel function: $m(s, a) = c_1 + d \cdot c_2 \cdot s\mathbf{A}^{\top}$, $k(x, x') = \sigma^2 \mathbb{1}(x = x')$, where we use the notation that $\mathbb{1}(e) = 1$ if $e$ is true and 0 otherwise. Values for $c_1$, $c_2$, and $\mathbf{A}$ for each environment can be found in Table 3. $d$ is either +1, indicating a “helpful” prior, or −1, indicating an “unhelpful” prior. $\mathbf{A}$ was selected based on vague information such as “In Cartpole, it is good to move left if the pole is leaning to the left, and vice versa” and “In MountainCar, it is better to move left if your cart is already moving to the left, and vice versa”. $c_1$ and $c_2$ were set so that the prior mean for the Q-values would be roughly at the true mean, though we suspect this could be a restricting factor of our prior. Experimental results for varied values of $d$ and $\sigma$ can be seen in Table 2.
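A prior of the kind described above, an offset plus a signed linear term in the state for the mean and an indicator kernel, can be sketched in Python as follows. Everything specific in the sketch is a hypothetical assumption for illustration: the matrix `A_CARTPOLE`, the state layout, the action ordering, and the values of `c1`, `c2`, `d`, and `sigma` are not taken from the paper's tables.

```python
def prior_mean(state, A, c1, c2, d):
    """Prior mean Q-value per action: c1 + d * c2 * (state . A[a])."""
    return [c1 + d * c2 * sum(s * w for s, w in zip(state, row)) for row in A]


def prior_kernel(x, x_prime, sigma):
    """Indicator kernel: sigma^2 if the two inputs coincide, else 0."""
    return sigma ** 2 if x == x_prime else 0.0


# Hypothetical Cartpole heuristic. Assumed state layout: (position,
# velocity, pole angle, angular velocity); rows are the actions
# (left, right). Only the pole angle is used: a negative angle (assumed
# to mean "leaning left") favours moving left, and vice versa.
A_CARTPOLE = [
    [0.0, 0.0, -1.0, 0.0],  # left
    [0.0, 0.0, 1.0, 0.0],   # right
]


if __name__ == "__main__":
    leaning_left = [0.0, 0.0, -0.1, 0.0]
    # d = +1 ("helpful"): the prior mean prefers moving left.
    print(prior_mean(leaning_left, A_CARTPOLE, c1=10.0, c2=5.0, d=1))
    # d = -1 ("unhelpful"): the preference is reversed.
    print(prior_mean(leaning_left, A_CARTPOLE, c1=10.0, c2=5.0, d=-1))
    print(prior_kernel((0, 1), (0, 1), sigma=10.0))
```

A small `sigma` makes the prior strong, pulling sampled Q-values towards the heuristic, while a large `sigma` leaves more room for the data to override it.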

An alternative approach to defining the prior could be to focus on smoothness, i.e., use $k(x, x')$ to encode that states that are similar are also likely to have similar Q-values. This would also be a prior that does not necessarily need much domain knowledge to be effective.

For Cartpole, strong helpful priors result in a substantial benefit in the number of episodes to solve the environment, and even a weak unhelpful prior outperforms NoisyNet here. MountainCar benefits from a strong and helpful prior, solving the environment in as little as 100 episodes. However, a vague and presumably helpful prior appears to be harmful in this environment. For LunarLander, a strong helpful prior prevented the algorithm from solving the environment, yet a more vague prior was beneficial. This seems to indicate that the prior used in that environment was not very precise. Unsurprisingly, strong unhelpful priors prevented the algorithm from solving any environment, with runs terminated at 30k iterations for Cartpole, 500k iterations for MountainCar, and 2 million iterations for LunarLander. We observe that the strongest priors (both “helpful” and “unhelpful”) may restrict the exploration too much, and unless the prior is focused on an optimal strategy, the environment is not solved. Overall, the results show that some effort has to be put into creating effective priors for certain environments, but that domain knowledge can be extremely valuable if it is available. Finally, BE with an appropriately defined prior outperformed NoisyNet [3] on all environments. We conclude that the Bayesian formulation combined with well-functioning priors can be an alternative to other strategies to provide domain knowledge, like reward shaping.

5. Conclusion and Discussion

This paper presents BayesianExplore (BE), a fully Bayesian reinforcement learning algorithm. This is valuable because it is known that posterior sampling of Q-values for reinforcement learning in finite horizon MDPs has a (close to) optimal regret bound [20]. Initial experiments show that BE is comparable to NoisyNet in well-known test environments.

Next, since we utilized recent breakthroughs in function-space variational inference [16] to formulate the model as a stochastic process, we have the opportunity to encode domain knowledge into prior information that can lead to faster learning. BE with an informative prior outperforms NoisyNet in all environments.

One interesting avenue for future work is to extend the approach to methods other than the standard DQN. We hypothesise that BE can be adapted and used to improve exploration in any algorithm that is compatible with NoisyNet. Another is a functional Bayesian approach for policy evaluation, where the Gaussian process we used in this paper would be replaced by a Dirichlet process, which would permit prior distributions in policy space rather than in value space. This can be a more intuitive representation of a priori knowledge in many situations.

References

[1] W. R. Thompson, On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika 25 (1933) 285–294. URL: https://www.jstor.org/stable/2332286. doi:10.2307/2332286, publisher: Oxford University Press, Biometrika Trust.

[2] I. Osband, B. Van Roy, Z. Wen, Generalization and Exploration via Randomized Value Functions, arXiv:1402.0635 [cs, stat] (2016). URL: http://arxiv.org/abs/1402.0635, arXiv: 1402.0635.

[3] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, S. Legg, Noisy Networks for Exploration, arXiv:1706.10295 [cs, stat] (2019). URL: http://arxiv.org/abs/1706.10295, arXiv: 1706.10295.

[4] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, D. Silver, Rainbow: Combining Improvements in Deep Reinforcement Learning, arXiv:1710.02298 [cs] (2017). URL: http://arxiv.org/abs/1710.02298, arXiv: 1710.02298.

[5] M. G. Bellemare, W. Dabney, R. Munos, A Distributional Perspective on Reinforcement Learning, arXiv:1707.06887 [cs, stat] (2017). URL: http://arxiv.org/abs/1707.06887, arXiv: 1707.06887.

[6] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, N. de Freitas, Dueling Network Architectures for Deep Reinforcement Learning, arXiv:1511.06581 [cs] (2016). URL: http://arxiv.org/abs/1511.06581, arXiv: 1511.06581.

[7] H. van Hasselt, A. Guez, D. Silver, Deep Reinforcement Learning with Double Q-learning, arXiv:1509.06461 [cs] (2015). URL: http://arxiv.org/abs/1509.06461, arXiv: 1509.06461.

[8] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized Experience Replay, arXiv:1511.05952 [cs] (2016). URL: http://arxiv.org/abs/1511.05952, arXiv: 1511.05952.

[9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous Methods for Deep Reinforcement Learning, arXiv:1602.01783 [cs] (2016). URL: http://arxiv.org/abs/1602.01783, arXiv: 1602.01783.

[10] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight Uncertainty in Neural Network, in: F. Bach, D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, PMLR, Lille, France, 2015, pp. 1613–1622. URL: http://proceedings.mlr.press/v37/blundell15.html.

[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with Deep Reinforcement Learning, arXiv:1312.5602 [cs] (2013). URL: http://arxiv.org/abs/1312.5602, arXiv: 1312.5602.

[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.

[13] D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic Backpropagation and Approximate Inference in Deep Generative Models, arXiv:1401.4082 [cs, stat] (2014). URL: http://arxiv.org/abs/1401.4082, arXiv: 1401.4082.

[14] H. Ritter, A. Botev, D. Barber, A Scalable Laplace Approximation for Neural Networks, 2018. URL: https://openreview.net/forum?id=Skdvd2xAZ.

[15] W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, A. G. Wilson, A Simple Baseline for Bayesian Uncertainty in Deep Learning (2019). URL: https://arxiv.org/abs/1902.02476v2.

[16] S. Sun, G. Zhang, J. Shi, R. Grosse, Functional Variational Bayesian Neural Networks, arXiv:1903.05779 [cs, stat] (2019). URL: http://arxiv.org/abs/1903.05779, arXiv: 1903.05779.

[17] J. Shi, S. Sun, J. Zhu, A Spectral Approach to Gradient Estimation for Implicit Distributions, arXiv:1806.02925 [cs, stat] (2018). URL: http://arxiv.org/abs/1806.02925, arXiv: 1806.02925.

[18] E. J. Nyström, Über die praktische auflösung von integralgleichungen mit anwendungen auf randwertaufgaben, Acta Mathematica 54 (1933) 185–204.

[19] C. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, in: T. Leen, T. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems 13 (NIPS 2000), MIT Press, 2001, pp. 682–688.

[20] I. Osband, B. Van Roy, Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, arXiv:1607.00215 [cs, stat] (2017). URL: http://arxiv.org/abs/1607.00215, arXiv: 1607.00215.

[21] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, arXiv:1606.01540 [cs] (2016). URL: http://arxiv.org/abs/1606.01540, arXiv: 1606.01540.

[22] S. Han, W. Zhou, J. Liu, S. Lü, NROWAN-DQN: A Stable Noisy Network with Noise Reduction and Online Weight Adjustment for Exploration, arXiv:2006.10980 [cs, stat] (2020). URL: http://arxiv.org/abs/2006.10980, arXiv: 2006.10980.

[23] D. P. Kingma, T. Salimans, M. Welling, Variational dropout and the local reparameterization trick, 2015. arXiv:1506.02557.