<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Continuous versus discrete action spaces for deep reinforcement learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julius Stopforth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deshendran Moodley</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Artificial Intelligence Research</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Cape Town</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Reinforcement learning problems may have either a discrete or a continuous action space, and this choice strongly affects which algorithms can be applied. Deep reinforcement learning (DRL) algorithms have already been applied to both discrete and continuous action spaces. In this work we compare the performance of two well-established model-free DRL algorithms on the same RL problem, the LunarLander: Deep Q-Network for discrete action spaces and its continuous action space counterpart, Deep Deterministic Policy Gradient. Furthermore, we investigate to what extent Experience Replay affects the comparative performance of both algorithms under limited training times.</p>
      </abstract>
      <kwd-group>
        <kwd>reinforcement learning</kwd>
        <kwd>continuous control</kwd>
        <kwd>deep neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this work we compare the effect of discrete and continuous action
spaces on the training of a deep reinforcement learning (DRL) agent. Specifically, we
examine the performance of the well-established Deep Q-Network (DQN)
algorithm[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] against its continuous action space variant, the Deep Deterministic
Policy Gradient (DDPG) algorithm[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The research aims to determine if, and when, there are distinct advantages
to using discrete or continuous action spaces when designing new DRL problems
and algorithms. In this work we present preliminary results for both the DQN
and DDPG algorithms applied to the known RL problem of the LunarLander, using OpenAI
Gym[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. By comparing the performance of these algorithms in a
known environment, we hope to gain insight into how the difference between
continuous and discrete action spaces affects their training and performance.
The LunarLander environment from OpenAI Gym already provides two variants, one with
a discrete and one with a continuous action space, and was used without modification.
      </p>
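      <p>As a minimal illustration of how the two variants are obtained, the sketch below loads both environments through the standard Gym API (the environment IDs and version suffixes follow the Gym registry and are an assumption, not taken from the original text):</p>
      <preformat>
import gym

# Discrete variant: 4 actions (do nothing, fire left, fire main, fire right engine)
discrete_env = gym.make("LunarLander-v2")

# Continuous variant: a 2-dimensional thrust vector, each component in [-1, 1]
continuous_env = gym.make("LunarLanderContinuous-v2")

# Both variants expose the same 8-dimensional observation vector
print(discrete_env.action_space, continuous_env.action_space)
print(discrete_env.observation_space, continuous_env.observation_space)
      </preformat>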
      <p>The LunarLander environment is considered "solved" when the algorithm achieves an
average reward of 200 points over 100 independent trials.</p>
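      <p>For concreteness, the following is a minimal evaluation sketch of this criterion, assuming a trained agent exposed as a callable policy mapping observations to actions (the helper name and structure are illustrative, not the authors' code; the classic Gym step API is assumed):</p>
      <preformat>
def evaluate(env, policy, n_trials=100):
    """Average undiscounted return over n_trials independent episodes."""
    returns = []
    for _ in range(n_trials):
        state = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            action = policy(state)  # greedy action from the trained agent
            state, reward, done, _ = env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / len(returns)

# The environment counts as "solved" once this average reaches 200 points.
      </preformat>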
      <p>Each algorithm is given 100, 200, and 500 episodes to train before measuring
the average reward over 100 independent trials. The experiments were repeated
10 times each in order to rule out the possibility of a singularly excellent result
and to facilitate a comparative analysis between the two algorithms.</p>
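      <p>A hypothetical sketch of this protocol is given below; make_agent, train, and evaluate are placeholders for the DQN or DDPG implementation rather than the authors' code:</p>
      <preformat>
import gym

TRAINING_BUDGETS = [100, 200, 500]
REPEATS = 10
EVAL_TRIALS = 100

def run_experiment(make_agent, train, evaluate, env_id="LunarLander-v2"):
    """Train a fresh agent per budget, evaluate it, and repeat the whole run."""
    env = gym.make(env_id)
    results = {}
    for budget in TRAINING_BUDGETS:
        scores = []
        for _ in range(REPEATS):
            agent = make_agent(env)                   # placeholder constructor
            train(agent, env, n_episodes=budget)      # placeholder training loop
            scores.append(evaluate(env, agent, n_trials=EVAL_TRIALS))
        results[budget] = sum(scores) / len(scores)   # mean score per budget
    return results
      </preformat>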
      <p>Both algorithms were implemented with the same network structure: a single
fully connected hidden layer of 10 nodes. The networks used ReLU
activations and the RMSProp optimiser, and Huber loss was used for both
algorithms. The learning rate and greediness of both algorithms were also kept
the same.</p>
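      <p>A minimal sketch of such a network for the DQN side is shown below, assuming PyTorch and an illustrative learning rate (neither is specified in the original text); the DDPG actor and critic would use analogous layers:</p>
      <preformat>
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Single fully connected hidden layer of 10 units with ReLU, as described above."""
    def __init__(self, obs_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 10),
            nn.ReLU(),
            nn.Linear(10, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)  # learning rate is illustrative
loss_fn = nn.SmoothL1Loss()  # Huber loss
      </preformat>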
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>
        In comparison to the DQN algorithm, the DDPG algorithm performed worse
over 500 training episodes.
The preliminary results presented in this work align with the results obtained
with the HEDGER algorithm[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and suggest that DQN outperforms DDPG
when the number of training episodes is limited. However, the results presented
here remain limited and inconclusive.
      </p>
      <p>Ongoing work includes extending the number of training episodes as well as
increasing the complexity of the deep network structures used, in order to
gain deeper insight into the performance of the algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Brockman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettersson</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>OpenAI Gym</article-title>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunt</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pritzel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heess</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tassa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Continuous control with deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1509.02971</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidjeland</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ),
          <fpage>529</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Smart</surname>
            ,
            <given-names>W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>Practical reinforcement learning in continuous spaces</article-title>
          .
          <source>In: ICML</source>
          . pp.
          <fpage>903</fpage>
          -
          <lpage>910</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>