<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparison between Deep Q-Networks and Deep Symbolic Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aimoré R. R. Dutra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur S. d'Avila Garcez</string-name>
          <email>a.garcez@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City, University of London</institution>
          ,
          <addr-line>London, EC1V 0HB</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Deep Reinforcement Learning (DRL) has had several breakthroughs, from helicopter control and Atari games to the AlphaGo success. Despite these successes, DRL still lacks several important features of human intelligence, such as transfer learning, planning and interpretability. We compare two DRL approaches at learning and generalization: Deep Q-Networks (DQN) and Deep Symbolic Reinforcement Learning (DSRL). We implement simplified versions of these algorithms and propose two simple problems. Results indicate that although the symbolic approach shows promise at generalization and learns faster in one of the problems, it can fail systematically in the other, very similar problem. Keywords: Deep Reinforcement Learning, Deep Q-Networks, Neural-Symbolic Integration.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The combination of classical Reinforcement Learning with Deep Neural Networks
achieved human level capabilities at solving some difficult problems, especially in
games with Deep Q-Networks (DQNs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. There is no doubt that Deep Reinforcement
Learning (DRL) has offered new perspectives for the areas of automation and AI. But
why are these methods so successful? And why are they still unable to solve many
problems that seem so simple for humans? Despite its success, DRL has several
drawbacks. First, DRL methods need large training sets and hence learn slowly. Second,
they are very task-specific: a network trained to perform well on one task often performs
very poorly on another, even very similar, task. Third, it is difficult to extract a
human-comprehensible chain of reasons for the action choices that the system makes.
      </p>
      <p>
Several authors have attempted to address these shortcomings by adding
prior knowledge to the system, using model-based architectures and other AI concepts
[
        <xref ref-type="bibr" rid="ref2">2</xref>
]. One approach claims to address all of these
shortcomings at once by combining neural-network learning with aspects of symbolic AI, called Deep
Symbolic Reinforcement Learning (DSRL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. In this paper, to better
understand the advantages of a symbolic approach to Reinforcement Learning, we
implement and compare simplified versions of DQN and DSRL at learning a simple
video-game policy.
      </p>
      <p>Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.</p>
    </sec>
    <sec id="sec-2">
      <title>The Video Game</title>
<p>The Deep Q-Network (DQN) was reduced to a simple Q-Learning algorithm by
removing its convolutional and function-approximation layers. These layers do not seem to
play a major role in how an agent makes its decisions; they essentially reduce the
dimensionality of the states. For Deep Symbolic Reinforcement Learning (DSRL), we
ignored the first, low-level feature-extraction part: instead, the location and type of
each object are sent directly to the agent. In addition, only a spatial representation is
considered, since the game has no complex temporal dynamics. The simplified versions
of DQN and DSRL were implemented in Python 3.5.</p>
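<p>The simplified DQN described above reduces to tabular Q-Learning. The following is a minimal sketch of such an agent; the hyperparameter values and state encoding are illustrative assumptions, not the exact choices made in our implementation:</p>

```python
import random
from collections import defaultdict

# Tabular Q-Learning: the simplified DQN with the convolutional and
# function-approximation layers removed. States index a table directly.
# Hyperparameter values below are illustrative assumptions.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["up", "down", "left", "right"]

Q = defaultdict(float)  # maps (state, action) -> estimated value

def choose_action(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One Q-Learning backup:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

<p>Because the table is indexed by the full state, an unvisited state carries no learned value at test time, which is relevant to the results discussed below.</p>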
<p>Fig. 1 shows three initial configurations of the proposed game. The star-shaped
object is the Agent, the negative sign denotes a Trap, and the positive sign is the Goal.
The agent can move up, down, left and right, and it stays in place when it tries to move
into a wall. The reward is increased by 1 whenever the Agent's position coincides with
the Goal, and decreased by 10 whenever it coincides with the Trap. The game only
restarts when the Agent reaches the Goal.</p>
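<p>The dynamics described above can be sketched as a single step function; the grid size and coordinate convention here are assumptions for illustration:</p>

```python
# Minimal sketch of the game dynamics: the Agent moves up/down/left/right,
# stays put when it hits a wall, receives +1 at the Goal (which restarts
# the game) and -10 at the Trap. Grid size is an illustrative assumption.
WIDTH, HEIGHT = 5, 5
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(agent, goal, trap, action, start):
    """Apply one action; return (new_agent_pos, reward_delta, restarted)."""
    dx, dy = MOVES[action]
    x = min(max(agent[0] + dx, 0), WIDTH - 1)   # clamp at the walls
    y = min(max(agent[1] + dy, 0), HEIGHT - 1)
    pos = (x, y)
    if pos == goal:
        return start, 1, True    # reward +1; only the Goal restarts the game
    if pos == trap:
        return pos, -10, False   # reward -10; the game continues
    return pos, 0, False
```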
<p>The environment is fully-observable,
sequential, static, discrete, unknown, infinite,
stationary and deterministic. Two toy examples are proposed to evaluate how DQN
and DSRL apply their learned knowledge in a new, similar situation, namely, training
in configuration 1 and testing in 2 (c.f. Fig. 1), and training in 2 and testing in 3.</p>
      <p>Fig. 1. Three initial game configurations</p>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
<p>In the first example (trained in conf. 1 and tested in conf. 2), the DSRL Agent
assumed that the best action should remain the same (move right). The position of the
Trap did not have any influence on the Agent's decision because the algorithm treats
different types of objects independently. In other words, the DSRL Agent does not
know what rewards to expect from a Trap in a new location.</p>
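<p>One way to make this failure mode concrete is the following hypothetical sketch of a per-object-type value representation, keyed by each object's position relative to the Agent; the names and structure are our illustration, not the original DSRL code:</p>

```python
from collections import defaultdict

# Hypothetical sketch: each object type (Goal, Trap) gets its own Q-table
# keyed by the object's offset relative to the Agent, and the per-type
# values are combined independently. A Trap at an offset never seen in
# training therefore contributes nothing to the decision.
ACTIONS = ["up", "down", "left", "right"]
q_tables = {"goal": defaultdict(float), "trap": defaultdict(float)}

def relative(agent, obj):
    """Object position expressed relative to the Agent."""
    return (obj[0] - agent[0], obj[1] - agent[1])

def action_value(agent, objects, action):
    """Sum per-object-type values; unseen offsets default to 0."""
    return sum(q_tables[kind][(relative(agent, pos), action)]
               for kind, pos in objects.items())
```

<p>Because each type is scored independently, a Trap moved to an unseen relative offset contributes the default value of 0 and so cannot deter the Agent, as observed in the first example.</p>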
<p>In the second example (trained in conf. 2 and tested in conf. 3), the situation is quite
different, as Fig. 3 shows. DSRL learns how to make the right decision, and thus has
good performance during testing. DQN flatlines as a result of not knowing the states in
the test phase. It is interesting to note that DSRL avoided the Trap during testing
because it has learned how to translate from conf. 2 to conf. 3 (but not how to reflect
from conf. 1 to 2 (c.f. Fig. 2), or to rotate a configuration, which should produce
similar results as Fig. 2 for obvious reasons). Such an ability to generalize to new
situations is very important, as it allows an agent to learn from similar states without
having to experience them all. In the case of DSRL, generalizations bring faster
learning, but seem limited to translations of configurations.</p>
      <p>Fig. 3. Trained in conf. 2 and tested in conf. 3 (reward vs. steps for DSRL Train/Test, DQN Train/Test and Random)</p>
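<p>The translation-only generalization can be made concrete: offsets relative to the Agent are unchanged when a whole configuration is shifted, but change under reflection. A small illustrative check (the coordinates and grid width are assumptions):</p>

```python
def relative(agent, obj):
    """Object position expressed relative to the Agent."""
    return (obj[0] - agent[0], obj[1] - agent[1])

def translate(pos, dx, dy):
    """Shift a position by a fixed offset."""
    return (pos[0] + dx, pos[1] + dy)

def reflect_x(pos, width):
    """Mirror a position horizontally on a grid of the given width."""
    return (width - 1 - pos[0], pos[1])

agent, goal = (1, 1), (3, 1)
# Translating Agent and Goal together preserves the relative offset,
# so values learned over offsets still apply (as in conf. 2 -> conf. 3).
assert relative(agent, goal) == relative(translate(agent, 1, 2), translate(goal, 1, 2))
# Reflection changes the offset, so the learned values no longer match
# (as in conf. 1 -> conf. 2).
assert relative(agent, goal) != relative(reflect_x(agent, 5), reflect_x(goal, 5))
```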
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
<p>We have compared two model-free RL approaches, DQN and DSRL, on their
generalization capacity using two toy examples. Both have limitations at learning "the rules of
the game" needed to succeed in different configurations. One key finding is that
transforming pixels into symbols can serve not only to reduce the state space, but also
to enable rules between objects to be created. These rules offer a way of generalizing
states, and could guide an agent during exploration. Assisted by high-level rules, an
agent should learn faster by exploring its environment more efficiently. Thus, as future
work, we shall consider combining model-free and model-based approaches,
with symbolic rules being used for faster and hopefully more effective learning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Marta</given-names>
            <surname>Garnelo</surname>
          </string-name>
          , Kai Arulkumaran, and
          <string-name>
            <given-names>Murray</given-names>
            <surname>Shanahan</surname>
          </string-name>
          .
          <article-title>Towards deep symbolic reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1609.05518</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
<given-names>Yann</given-names>
<surname>LeCun</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          ,
          <volume>521</volume>
          (
          <issue>7553</issue>
          ):
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          , Koray Kavukcuoglu, David Silver et al.
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          ,
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>