Deep Quality-Value (DQV) Learning

Matthia Sabatelli¹, Gilles Louppe¹, Pierre Geurts¹, and Marco A. Wiering²

¹ Montefiore Institute, Department of Electrical Engineering and Computer Science, Université de Liège, Belgium
² Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, The Netherlands

Abstract. We present Deep Quality-Value Learning (DQV), a novel model-free Deep Reinforcement Learning (DRL) algorithm which learns an approximation of the state-value function (V) alongside an approximation of the state-action value function (Q). We empirically show that simultaneously learning both value functions results in faster and better learning when compared to DRL methods which only learn an approximation of the Q function.

Keywords: Deep Reinforcement Learning · Model-Free Deep Reinforcement Learning · Temporal Difference Learning · Function Approximators

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Preliminaries

We formulate a Reinforcement Learning (RL) setting as a Markov Decision Process (MDP) consisting of a finite set of states $S = \{s_1, s_2, \ldots, s_n\}$, actions $A$, and a time-counter variable $t$ [5]. In each state $s_t \in S$, the RL agent can perform an action $a_t \in A(s_t)$, after which it transits to the next state according to the transition probability distribution $p(s_{t+1} \mid s_t, a_t)$. At each transition from $s_t$ to $s_{t+1}$ the agent receives a reward signal $r_t$ coming from the reward function $\Re(s_t, a_t, s_{t+1})$. The actions of the agent are selected according to its policy $\pi : S \rightarrow A$, which maps each state to a particular action. For every state $s \in S$, the value function under policy $\pi$ is defined as
$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, s_t = s, \pi\Big],$$
while the state-action value function is defined as
$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\Big|\, s_t = s, a_t = a, \pi\Big].$$
Both functions are computed with respect to the discount factor $\gamma \in [0, 1]$. The goal of an RL agent is to find a policy $\pi^{*}$ that realizes the optimal expected return, $V^{*}(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in S$, and the optimal Q value function, $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A$. Both value functions satisfy the Bellman optimality equation, given by
$$V^{*}(s_t) = \max_{a_t} \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t)\Big[\Re(s_t, a_t, s_{t+1}) + \gamma V^{*}(s_{t+1})\Big]$$
for the state-value function, and by
$$Q^{*}(s_t, a_t) = \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t)\Big[\Re(s_t, a_t, s_{t+1}) + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})\Big]$$
for the state-action value function. In what follows we show how to learn an approximation of both value functions with deep learning methods [2].

2 The Deep Quality-Value (DQV) Learning Algorithm

Deep Quality-Value (DQV) Learning learns an approximation of the V function alongside an approximation of the Q function. This is done with two neural networks, parametrized by $\Phi$ and $\theta$ respectively, and two objective functions that can be minimized by gradient descent. These objectives adapt two tabular RL update rules presented in [7] and result in the following losses for learning the Q and V functions:
$$L(\theta) = \mathbb{E}_{\langle s_t, a_t, r_t, s_{t+1}\rangle \sim U(D)}\Big[\big(r_t + \gamma V(s_{t+1}; \Phi^{-}) - Q(s_t, a_t; \theta)\big)^{2}\Big],$$
$$L(\Phi) = \mathbb{E}_{\langle s_t, a_t, r_t, s_{t+1}\rangle \sim U(D)}\Big[\big(r_t + \gamma V(s_{t+1}; \Phi^{-}) - V(s_t; \Phi)\big)^{2}\Big].$$
Both losses are computed with respect to the same target, $r_t + \gamma V(s_{t+1}; \Phi^{-})$, which uses an older copy of the V network (with parameters $\Phi^{-}$) to compute the temporal-difference errors. Both objectives are minimized over mini-batches of transitions $\langle s_t, a_t, r_t, s_{t+1}\rangle$ sampled uniformly at random from an experience-replay memory buffer $D$, as sketched below.
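To make the two coupled updates concrete, the following PyTorch-style sketch shows one possible way of implementing a single DQV training step. It is a minimal illustration under our own assumptions (network interfaces, optimizers, and replay-buffer layout are hypothetical); it is not the authors' reference implementation.

```python
# Minimal sketch of one DQV training step (illustrative, not the authors' reference code).
import torch
import torch.nn.functional as F

def dqv_training_step(q_net, v_net, v_target_net, q_optim, v_optim, batch, gamma=0.99):
    """One gradient step on both the Q-network (theta) and the V-network (Phi).

    batch: tensors (states, actions, rewards, next_states, dones) sampled
           uniformly at random from the replay buffer D.
    v_target_net: older copy of the V-network (parameters Phi^-), held fixed here.
    """
    states, actions, rewards, next_states, dones = batch

    # Shared temporal-difference target r_t + gamma * V(s_{t+1}; Phi^-),
    # with bootstrapping disabled on terminal transitions.
    with torch.no_grad():
        next_v = v_target_net(next_states).squeeze(-1)
        td_target = rewards + gamma * next_v * (1.0 - dones)

    # L(theta): regress Q(s_t, a_t; theta) towards the shared target.
    q_values = q_net(states).gather(1, actions.long().unsqueeze(-1)).squeeze(-1)
    q_loss = F.mse_loss(q_values, td_target)
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()

    # L(Phi): regress V(s_t; Phi) towards the same target.
    v_loss = F.mse_loss(v_net(states).squeeze(-1), td_target)
    v_optim.zero_grad()
    v_loss.backward()
    v_optim.step()

    return q_loss.item(), v_loss.item()
```

In such a sketch the target parameters $\Phi^{-}$ would be refreshed periodically, e.g. by copying the weights of the online V network every fixed number of steps (v_target_net.load_state_dict(v_net.state_dict())), mirroring the target-network mechanism popularized by DQN [3].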
Our results [4], presented in Fig. 1, show that DQV learns significantly faster and better than DQN [3] and DDQN [6] on several DRL test-beds from the OpenAI Gym environment [1]. This empirically highlights the benefits of learning two value functions simultaneously and establishes DQV as a new, faster synchronous value-based DRL algorithm.

Fig. 1. Reward per episode obtained by DQV, DQN and DDQN on several RL environments (Acrobot, CartPole, Pong, Enduro and Boxing) from the OpenAI Gym benchmark [1]. DQV learns significantly faster than DQN and DDQN on all test-beds. Results adapted from [4].

References

1. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
2. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
3. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
4. Matthia Sabatelli, Gilles Louppe, Pierre Geurts, and Marco Wiering. Deep Quality-Value (DQV) learning. In Advances in Neural Information Processing Systems, Deep Reinforcement Learning Workshop, Montreal, 2018.
5. Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
6. Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016.
7. Marco A. Wiering. QV(λ)-learning: A new on-policy reinforcement learning algorithm. In Proceedings of the 7th European Workshop on Reinforcement Learning, pages 17–18, 2005.