Deep Quality-Value (DQV) Learning

Matthia Sabatelli¹, Gilles Louppe¹, Pierre Geurts¹, and Marco A. Wiering²

¹ Montefiore Institute, Department of Electrical Engineering and Computer Science, Université de Liège, Belgium
² Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, The Netherlands

Abstract. We present Deep Quality-Value Learning (DQV), a novel model-free Deep Reinforcement Learning (DRL) algorithm which learns an approximation of the state-value function (V) alongside an approximation of the state-action value function (Q). We empirically show that simultaneously learning both value functions results in faster and better learning when compared to DRL methods which only learn an approximation of the Q function.

Keywords: Deep Reinforcement Learning · Model-Free Deep Reinforcement Learning · Temporal Difference Learning · Function Approximators

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Preliminaries

We formulate a Reinforcement Learning (RL) setting as a Markov Decision Process (MDP) consisting of a finite set of states $S = \{s_1, s_2, \ldots, s_n\}$, actions $A$, and a time-counter variable $t$ [5]. In each state $s_t \in S$, the RL agent can perform an action $a_t \in A(s_t)$, after which it transits to the next state according to the transition probability distribution $p(s_{t+1} \mid s_t, a_t)$. At each transition from $s_t$ to $s_{t+1}$ the agent receives a reward signal $r_t$ coming from the reward function $\Re(s_t, a_t, s_{t+1})$. The actions of the agent are selected according to its policy $\pi : S \rightarrow A$, which maps each state to a particular action. For every state $s \in S$, the value function under policy $\pi$ is defined as
$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, s_t = s, \pi\Big],$$
while the state-action value function is defined as
$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, s_t = s, a_t = a, \pi\Big].$$
Both functions are computed with respect to the discount factor $\gamma \in [0, 1]$. The goal of an RL agent is to find a policy $\pi^{*}$ that realizes the optimal expected return, $V^{*}(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in S$, and the optimal Q value function, $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A$. Both value functions satisfy the Bellman optimality equation, given by
$$V^{*}(s_t) = \max_{a_t} \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t)\Big[\Re(s_t, a_t, s_{t+1}) + \gamma V^{*}(s_{t+1})\Big]$$
for the state-value function, and by
$$Q^{*}(s_t, a_t) = \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t)\Big[\Re(s_t, a_t, s_{t+1}) + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})\Big]$$
for the state-action value function. In what follows we show how to learn an approximation of both value functions with deep learning methods [2].

2 The Deep Quality-Value (DQV) Learning Algorithm

Deep Quality-Value (DQV) Learning learns an approximation of the V function alongside an approximation of the Q function. This is done with two neural networks, parametrized by $\Phi$ and $\theta$ respectively, and two objective functions that can be minimized by gradient descent. These objectives adapt two tabular RL update rules presented in [7] and result in the following losses for learning the Q and V functions:
$$L(\theta) = \mathbb{E}_{\langle s_t, a_t, r_t, s_{t+1}\rangle \sim U(D)}\Big[\big(r_t + \gamma V(s_{t+1}; \Phi^{-}) - Q(s_t, a_t; \theta)\big)^{2}\Big],$$
$$L(\Phi) = \mathbb{E}_{\langle s_t, a_t, r_t, s_{t+1}\rangle \sim U(D)}\Big[\big(r_t + \gamma V(s_{t+1}; \Phi^{-}) - V(s_t; \Phi)\big)^{2}\Big].$$
Both losses are computed with respect to the same target, $r_t + \gamma V(s_{t+1}; \Phi^{-})$, which uses an older copy of the V network (with parameters $\Phi^{-}$) to compute the temporal-difference errors. Both objectives are minimized over mini-batches of transitions $\langle s_t, a_t, r_t, s_{t+1}\rangle$ sampled uniformly at random from an experience-replay memory buffer $D$, as sketched below.
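To make the two coupled updates concrete, the following PyTorch-style sketch shows one possible way of implementing a single DQV training step. It is a minimal illustration under our own assumptions (network interfaces, optimizers, and replay-buffer layout are hypothetical); it is not the authors' reference implementation.

```python
# Minimal sketch of one DQV training step (illustrative, not the authors' reference code).
import torch
import torch.nn.functional as F

def dqv_training_step(q_net, v_net, v_target_net, q_optim, v_optim, batch, gamma=0.99):
    """One gradient step on both the Q-network (theta) and the V-network (Phi).

    batch: tensors (states, actions, rewards, next_states, dones) sampled
           uniformly at random from the replay buffer D.
    v_target_net: older copy of the V-network (parameters Phi^-), held fixed here.
    """
    states, actions, rewards, next_states, dones = batch

    # Shared temporal-difference target r_t + gamma * V(s_{t+1}; Phi^-),
    # with bootstrapping disabled on terminal transitions.
    with torch.no_grad():
        next_v = v_target_net(next_states).squeeze(-1)
        td_target = rewards + gamma * next_v * (1.0 - dones)

    # L(theta): regress Q(s_t, a_t; theta) towards the shared target.
    q_values = q_net(states).gather(1, actions.long().unsqueeze(-1)).squeeze(-1)
    q_loss = F.mse_loss(q_values, td_target)
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()

    # L(Phi): regress V(s_t; Phi) towards the same target.
    v_loss = F.mse_loss(v_net(states).squeeze(-1), td_target)
    v_optim.zero_grad()
    v_loss.backward()
    v_optim.step()

    return q_loss.item(), v_loss.item()
```

In such a sketch the target parameters $\Phi^{-}$ would be refreshed periodically, e.g. by copying the weights of the online V network every fixed number of steps (v_target_net.load_state_dict(v_net.state_dict())), mirroring the target-network mechanism popularized by DQN [3].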
Our results [4], presented in Fig. 1, show that DQV learns significantly faster and better than DQN [3] and DDQN [6] on several DRL test-beds from the OpenAI Gym environment [1]. This empirically highlights the benefits of learning two value functions simultaneously and establishes DQV as a new, faster synchronous value-based DRL algorithm.

Fig. 1. Reward per episode obtained by DQV, DQN and DDQN on several RL environments (Acrobot, CartPole, Pong, Enduro and Boxing) from the OpenAI Gym benchmark [1]. DQV learns significantly faster than DQN and DDQN on all test-beds. Results adapted from [4].

References

1. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
2. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
3. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
4. Matthia Sabatelli, Gilles Louppe, Pierre Geurts, and Marco Wiering. Deep Quality-Value (DQV) learning. In Advances in Neural Information Processing Systems, Deep Reinforcement Learning Workshop, Montreal, 2018.
5. Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
6. Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016.
7. Marco A. Wiering. QV(λ)-learning: A new on-policy reinforcement learning algorithm. In Proceedings of the 7th European Workshop on Reinforcement Learning, pages 17–18, 2005.