A Comparison between Deep Q-Networks and Deep Symbolic Reinforcement Learning

Aimoré R. R. Dutra and Artur S. d'Avila Garcez
City, University of London, London, EC1V 0HB, UK
aimorerrd@hotmail.com, a.garcez@city.ac.uk

Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.

Abstract. Deep Reinforcement Learning (DRL) has had several breakthroughs, from helicopter control and Atari games to the AlphaGo success. Despite these successes, DRL still lacks several important features of human intelligence, such as transfer learning, planning and interpretability. We compare two DRL approaches at learning and generalization: Deep Q-Networks and Deep Symbolic Reinforcement Learning. We implement simplified versions of these algorithms and propose two simple problems. Results indicate that although the symbolic approach is promising at generalizing and learns faster in one of the problems, it can fail systematically in the other, very similar problem.

Keywords: Deep Reinforcement Learning, Deep Q-Networks, Neural-Symbolic Integration.

1 Introduction

The combination of classical Reinforcement Learning with Deep Neural Networks achieved human-level capabilities at solving some difficult problems, especially in games with Deep Q-Networks (DQNs) [3]. There is no doubt that Deep Reinforcement Learning (DRL) has offered new perspectives for the areas of automation and AI. But why are these methods so successful? And why are they still unable to solve many problems that seem so simple for humans?

Despite their success, DRL methods have several drawbacks. First, they need large training sets and hence learn slowly. Second, they are very task-specific: a trained network that performs well on one task often performs very poorly on another, even very similar, task. Third, it is difficult to extract from them a human-comprehensible chain of reasons for the action choices that the system makes.

Some authors have been trying to address some of the above shortcomings by adding prior knowledge to the system, using model-based architectures and other AI concepts [2]. One approach claims to have designed an architecture that addresses all these shortcomings at once by combining neural-network learning with aspects of symbolic AI, called Deep Symbolic Reinforcement Learning (DSRL) [1]. In this paper, in an attempt to understand better the advantages of a symbolic approach to Reinforcement Learning, we implement and compare two simplified versions of DQN and DSRL at learning a simple video game policy.

2 The Video Game

The Deep Q-Network (DQN) was reduced to a simple Q-Learning algorithm by removing its convolutional and function approximation layers. These layers do not seem to play a major role in how an agent makes its decisions; they basically reduce the dimensionality of the states. In Deep Symbolic Reinforcement Learning (DSRL), we ignored the first, low-level extraction stage: in our implementation, we skip it by sending the location and type of each object directly to the agent. In addition, only a spatial representation is considered, since there are no complex dynamics relating to time in the game. The simplified versions of DQN and DSRL were implemented in Python 3.5.

Fig. 1 shows three initial configurations of the proposed game. The star-shaped object is the Agent, the negative sign denotes a Trap, and the positive sign is the Goal. The agent can move up, left, right and down, and it stays in the same place when it tries to move into the wall. A minimal sketch of this setting with a tabular Q-Learning agent is given below.
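As an illustration of these simplifications, the following sketch shows a tabular Q-Learning agent of the kind that replaces the full DQN, interacting with a small grid world of this type. The grid size, object positions and hyperparameters (learning rate, discount factor, epsilon-greedy exploration) are assumptions chosen for illustration rather than the exact settings used in our experiments; the reward scheme (+1 at the Goal, -10 at the Trap) follows the description below.

import random
from collections import defaultdict

# Illustrative grid world: layout and size are assumptions (see Fig. 1 for
# the actual configurations used in the experiments).
ACTIONS = ['up', 'down', 'left', 'right']
MOVES = {'up': (0, -1), 'down': (0, 1), 'left': (-1, 0), 'right': (1, 0)}
GRID_W, GRID_H = 4, 4                        # assumed grid size
START, GOAL, TRAP = (0, 3), (3, 0), (3, 2)   # assumed object positions

def step(pos, action):
    """One environment step: move, stay in place at walls, return reward."""
    dx, dy = MOVES[action]
    nx = min(max(pos[0] + dx, 0), GRID_W - 1)
    ny = min(max(pos[1] + dy, 0), GRID_H - 1)
    if (nx, ny) == GOAL:
        return (nx, ny), 1, True      # +1 at the Goal, game restarts
    if (nx, ny) == TRAP:
        return (nx, ny), -10, False   # -10 at the Trap, game continues
    return (nx, ny), 0, False

# Tabular Q-Learning: the DQN with its convolutional and function
# approximation layers removed, so the state is just the Agent's position.
Q = defaultdict(float)                   # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # assumed hyperparameters

def choose_action(state):
    if random.random() < epsilon:        # epsilon-greedy exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

state = START
for t in range(10000):
    action = choose_action(state)
    next_state, reward, done = step(state, action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = START if done else next_state

In the full DQN, the table Q would be replaced by a convolutional network mapping raw pixels to action values; removing those layers leaves exactly this kind of update rule.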
Fig. 1. Three initial game configurations.

The reward is increased by 1 and decreased by 10 whenever the Agent's position coincides with the Goal or the Trap, respectively. The game only restarts when the Agent's position is the Goal. The environment is fully observable, sequential, static, discrete, unknown, infinite, stationary and deterministic. Two toy examples are proposed to evaluate how DQN and DSRL apply their learned knowledge in a new, similar situation, namely, training in configuration 1 and testing in 2 (cf. Fig. 1), and training in 2 and testing in 3.

3 Results and Discussion

Fig. 2 shows that both algorithms (DQN and DSRL) learn well during the training phase. In the test phase, however, DQN behaves similarly to random, while DSRL always falls into the Trap before reaching the Goal. This shows that DQN could not learn from conf. 1 what to do in conf. 2, while DSRL learned something completely wrong for conf. 2 (always move to the right). It is as if any prior knowledge in DSRL had to be indefeasible, which is an unrealistic constraint. DQN, by contrast, had never seen the test states during training and thus followed what amounts to a random policy.

Fig. 2. Trained in conf. 1 and tested in conf. 2 (rewards over steps; curves: DSRL Train, DSRL Test, DQN Train, DQN Test, Random).

The reason why DSRL obtains a very low reward is that the Goal's location did not change from training to test, so our DSRL Agent assumed that the best action should remain the same (move right). The position of the Trap did not have any influence on the Agent's decision because the algorithm treats different types of objects independently. In other words, the DSRL Agent does not know what rewards to expect from a Trap in a new location.

In the second example (trained in conf. 2 and tested in conf. 3), the situation is quite different, as Fig. 3 shows. DSRL learns how to make the right decision and thus performs well during testing. DQN flat-lines as a result of not knowing the states in the test phase. It is interesting to note that DSRL avoided the Trap during testing because it had learned how to translate from conf. 2 to conf. 3 (but not how to reflect from conf. 1 to conf. 2 (cf. Fig. 2), or how to rotate a configuration, which should produce results similar to Fig. 2 for obvious reasons). Such an ability to generalize to new situations is very important, as it allows an agent to learn from similar states without having to experience them all. In the case of DSRL, generalization brings faster learning, but seems limited to translations of configurations.

Fig. 3. Trained in conf. 2 and tested in conf. 3 (rewards over steps; curves: DSRL Train, DSRL Test, DQN Train, DQN Test, Random).

4 Conclusion

We have compared two model-free RL approaches, DQN and DSRL, on their generalization capacity using two toy examples. Both have limitations at learning "the rules of the game" for succeeding in different configurations. One key finding is that transforming pixels into symbols can become a channel not only for reducing the state space, but also for enabling rules between objects to be created. These rules offer a way of generalizing states, and could guide an agent during exploration. Assisted by high-level rules, an agent should learn faster by exploring its environment more efficiently. A sketch of such an object-centric representation is given below.
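To make the idea of object-based rules more concrete, the sketch below shows one possible, purely hypothetical object-centric value table in the spirit of our simplified DSRL agent: values are indexed by the Agent's offset to each object type rather than by absolute coordinates, and each object type is treated independently. The names and the exact representation are illustrative assumptions, not the implementation used in the experiments above.

from collections import defaultdict

# Hypothetical object-centric table: one value per
# (object type, offset from the Agent to that object, action).
# Each object type is treated independently and contributions are summed.
Q_sym = defaultdict(float)

def offset(agent_pos, obj_pos):
    """Offset from the Agent to an object; invariant under translation."""
    return (obj_pos[0] - agent_pos[0], obj_pos[1] - agent_pos[1])

def action_value(agent_pos, objects, action):
    """objects: dict mapping an object type ('goal', 'trap') to its position."""
    return sum(Q_sym[(obj_type, offset(agent_pos, pos), action)]
               for obj_type, pos in objects.items())

# Suppose the agent has learned that moving right is good when the Goal
# lies two cells to the right.
Q_sym[('goal', (2, 0), 'right')] = 1.0

# The same relative layout at a different absolute position indexes the same
# entry, so the knowledge transfers (as from conf. 2 to conf. 3) ...
assert action_value((0, 0), {'goal': (2, 0)}, 'right') == 1.0
assert action_value((3, 2), {'goal': (5, 2)}, 'right') == 1.0   # translated

# ... whereas a mirrored layout indexes a different entry, so nothing learned
# before applies there (cf. the failure observed in Fig. 2).
assert action_value((5, 0), {'goal': (3, 0)}, 'right') == 0.0   # reflected

Under such a representation, generalization across translations comes for free, while reflected or rotated configurations index entirely new entries and must be learned from scratch, which is consistent with the results in Figs. 2 and 3.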
Thus, as future work, we shall consider the combination of model-free and model-based approaches, with symbolic rules being used for faster and, hopefully, more effective learning.

References

[1] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.