    Continuous versus discrete action spaces for
           deep reinforcement learning

                  Julius Stopforth1,2 and Deshendran Moodley1,2
                            1 University of Cape Town
                    2 Center for Artificial Intelligence Research



       Abstract. Reinforcement learning problems may have either a discrete
       or a continuous action space, a property that strongly affects the
       choice of algorithm. Deep reinforcement learning (DRL) algorithms have
       already been applied to both discrete and continuous action spaces. In
       this work we compare the performance of two well-established model-free
       DRL algorithms on the same RL problem, the LunarLander: the Deep
       Q-Network for discrete action spaces and its continuous action space
       variant, the Deep Deterministic Policy Gradient. Furthermore, we
       investigate to what extent experience replay affects the comparative
       performance of the two algorithms under limited training times.

       Keywords: reinforcement learning, continuous control, deep neural net-
       works


1    Introduction

In this work, we compare the effect of discrete and continuous action spaces
on the training of a deep reinforcement learning (DRL) agent. Specifically, we
compare the performance of the well-established Deep Q-Network (DQN) algo-
rithm [3] with that of its continuous action space variant, the Deep Determin-
istic Policy Gradient (DDPG) algorithm [2].
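    To make the distinction concrete, the following sketch (illustrative only,
not the authors' implementation) contrasts how the two algorithms turn a net-
work output into a LunarLander action; the numerical values are invented and
the action layouts assume the standard Gym LunarLander variants.

import numpy as np

# DQN: the network outputs one Q-value per discrete action and the agent
# takes the index of the largest one (greedy action selection).
q_values = np.array([0.1, -0.4, 0.7, 0.2])   # Q(s, a) for the 4 discrete actions
discrete_action = int(np.argmax(q_values))   # e.g. action index 2

# DDPG: the actor network outputs a real-valued action vector directly,
# clipped to the environment's action bounds.
actor_output = np.array([0.35, -0.8])        # raw actor output (invented values)
continuous_action = np.clip(actor_output, -1.0, 1.0)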
    The research aims to determine whether, and if so when, there are distinct
advantages to using discrete or continuous action spaces when designing new
DRL problems and algorithms. We present preliminary results from applying
both the DQN and DDPG algorithms to a well-known RL problem, the Lunar-
Lander environment from OpenAI Gym [1]. By comparing the performance of
the two algorithms in a known environment, we hope to gain insight into how
the difference between continuous and discrete action spaces affects their train-
ing and performance.


2    Experiments

The LunarLander environment provided by OpenAI Gym already offers two vari-
ants, one with a discrete and one with a continuous action space, and both were
used without modification.
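    As an illustration, both unmodified variants can be obtained directly from
Gym; the environment IDs below reflect the Gym release available at the time
and are stated here as an assumption rather than taken from the paper.

import gym

# Discrete variant: four actions (do nothing, fire left, fire main, fire right).
discrete_env = gym.make("LunarLander-v2")
print(discrete_env.action_space)       # Discrete(4)

# Continuous variant: a 2-dimensional action in [-1, 1] controlling the
# main engine and the side engines.
continuous_env = gym.make("LunarLanderContinuous-v2")
print(continuous_env.action_space)     # a 2-dimensional Box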



Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)

    The LunarLander task is considered “solved” when the algorithm achieves an
average reward of 200 points over 100 independent trials.
    Each algorithm is given 100, 200, and 500 episodes of training before the
average reward over 100 independent trials is measured. Each experiment was
repeated 10 times in order to rule out singularly fortunate runs and to support
a fair comparative analysis between the two algorithms.
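    A minimal sketch of this evaluation protocol is given below, assuming the
classic Gym reset/step API of the time and a generic policy callable; neither is
taken from the authors' code.

import numpy as np

def evaluate(env, policy, n_trials=100):
    """Average undiscounted return over n_trials independent episodes."""
    returns = []
    for _ in range(n_trials):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                     # greedy DQN action or DDPG actor output
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Called after training budgets of 100, 200 and 500 episodes, with the whole
# experiment repeated 10 times, as described above.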
    Both algorithms were implemented with the same network structure: a single
fully connected hidden layer of 10 nodes. The networks used ReLU activations
and the RMSProp optimiser, and both algorithms were trained with the Huber
loss. The learning rate and the greediness of the exploration policy were also
kept the same for both algorithms.
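    A PyTorch sketch of this shared network structure is shown below; the input
and output sizes (an 8-dimensional observation and 4 discrete actions for the
DQN case) and the learning rate are assumptions, since the exact implementa-
tion details are not reproduced here.

import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    """A single fully connected hidden layer of 10 nodes with ReLU."""

    def __init__(self, obs_dim=8, n_actions=4, hidden=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per discrete action
        )

    def forward(self, obs):
        return self.net(obs)

q_net = QNetwork()
optimiser = optim.RMSprop(q_net.parameters(), lr=1e-3)   # learning rate is an assumed value
loss_fn = nn.SmoothL1Loss()                              # Huber loss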


3   Results

The DDPG algorithm performed worse than the DQN algorithm at every training
budget of up to 500 episodes, as shown in Tables 1 and 2.


    Table 1. Average reward for 100 trials for the DQN with experience replay

                      No. training episodes    100    200    500
                      Average reward           -430   -420   -420




    Table 2. Average reward for 100 trials for the DDPG with experience replay

                      No. training episodes    100    200    500
                      Average reward           -733   -687   -718




4   Discussion

The preliminary results presented in this work align with those obtained with
the HEDGER algorithm [4] and suggest that DQN outperforms DDPG when the
number of training episodes is limited. However, the results presented here are
still limited and inconclusive.
   Ongoing work includes extending the number of training episodes as well as
increasing the complexity of the deep network architectures used, in order to
gain deeper insight into the performance of the two algorithms.

References
1. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J.,
   Zaremba, W.: OpenAI Gym (2016)
2. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D.,
   Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint
   arXiv:1509.02971 (2015)
3. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G.,
   Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level con-
   trol through deep reinforcement learning. Nature 518(7540), 529 (2015)
4. Smart, W.D., Kaelbling, L.P.: Practical reinforcement learning in continuous spaces.
   In: ICML. pp. 903–910 (2000)