A Virtual Maze Game to Explain Reinforcement Learning

Youri Coppens1,2[0000-0003-1124-0731], Eugenio Bargiacchi1, and Ann Nowé1

1 Vrije Universiteit Brussel, Brussels, Belgium
2 Université Libre de Bruxelles, Brussels, Belgium
yocoppen@ai.vub.ac.be

Abstract. We demonstrate how Virtual Reality can explain the basic concepts of Reinforcement Learning through an interactive maze game. A player takes the role of an autonomous learning agent and must learn the shortest path to a hidden treasure through experience. This application visualises the learning process of Watkins' Q(λ), one of the fundamental algorithms in the field. A video can be found at https://youtu.be/sLJRiUBhQqM.

Keywords: Reinforcement Learning · Education · Virtual Reality

We present a Virtual Reality (VR) treasure hunt game that teaches the basic concepts behind Reinforcement Learning (RL) in an engaging way, without the need for mathematical formulas or hands-on programming sessions. RL tackles the problem of sequential decision-making within an environment, where an agent must act in order to maximise the reward collected over time. Immersive VR allows us to put the playing user in the shoes of an RL agent, demonstrating through direct experience how new knowledge is acquired and processed. The user's perspective is aligned with that of the learning agent as much as possible, creating a sense of presence in the RL environment through the head-mounted display.

The game puts the player in a foggy maze with the task of finding a hidden treasure. The fog restricts the player's vision to that of an RL agent, namely its current position (state) and available actions (Figure 1). The treasure allows the player to intuitively grasp the concept of reward in a standard RL process. The user can freely select actions and decide where to explore based on the available information. All information collected through this exploration is fed to an RL algorithm, Q(λ) [2], which then displays the results of the learning back to the user via colours and numeric values.

The player's task is to find a treasure chest hidden in a grid-world maze (Figure 2). The maze additionally contains multiple empty chests to incentivise exploration. The player is paired with a Q(λ) learning agent which computes Q-values: values associated with each state-action pair that estimate the expected future reward resulting from executing a particular action in a particular state.

Fig. 1. Available actions and respective Q-values in a cell of the maze from the player's point of view.

Fig. 2. Top view of the virtual maze. After finding the treasure, the agent receives a reward and updates the cell values, shown here in green shades.

As the player explores, Q(λ) updates the Q-values for each state-action pair and displays them on the ground. The highest Q-value for each state is visualised by shading the floor of each cell in green. Additionally, the eligibility traces of Q(λ), a mechanism that propagates knowledge about experienced reward over the Q-values, are shown to the user as a trail of floating arrows. The player obtains a reward of 10 when the hidden treasure chest is found. Afterwards, under the guise of 'practice makes perfect', the user is asked to repeat the task from another starting position, as Q(λ) learns over repeated episodes.
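For readers who want to see these mechanics spelled out, the sketch below is a minimal tabular implementation of Watkins' Q(λ) on a toy grid-world maze. It is written in plain C# to match the demo's implementation language, but it is not the authors' Unity code: the grid size, learning rate, discount factor, trace-decay parameter and exploration rate are illustrative assumptions; only the treasure reward of 10 comes from the demo. The array Q corresponds to the values printed on the maze floor, and the array E to the eligibility traces visualised as floating arrows.

```csharp
// A minimal, self-contained sketch of tabular Watkins' Q(lambda) on a toy grid maze.
// This is NOT the authors' Unity implementation; grid size and hyperparameters are
// illustrative assumptions, only the treasure reward of 10 is taken from the demo.
using System;

class QLambdaMazeSketch
{
    const int Width = 5, Height = 5;            // toy grid-world maze
    const int Goal = Width * Height - 1;        // treasure hidden in the bottom-right cell
    const double Alpha = 0.1;                   // learning rate (assumed)
    const double Gamma = 0.9;                   // discount factor: reward fades with distance
    const double Lambda = 0.8;                  // trace-decay parameter (assumed)
    const double Epsilon = 0.1;                 // exploration rate (assumed)
    const double TreasureReward = 10.0;         // reward for finding the treasure (from the demo)

    static readonly double[,] Q = new double[Width * Height, 4];  // Q-values per state-action
    static readonly double[,] E = new double[Width * Height, 4];  // eligibility traces
    static readonly Random Rng = new Random(0);

    // Deterministic grid transitions; actions: 0=up, 1=down, 2=left, 3=right.
    static int Step(int s, int a)
    {
        int x = s % Width, y = s / Width;
        if (a == 0 && y > 0) y--;
        if (a == 1 && y < Height - 1) y++;
        if (a == 2 && x > 0) x--;
        if (a == 3 && x < Width - 1) x++;
        return y * Width + x;
    }

    static int GreedyAction(int s)
    {
        int best = 0;
        for (int a = 1; a < 4; a++) if (Q[s, a] > Q[s, best]) best = a;
        return best;
    }

    static void Main()
    {
        for (int episode = 0; episode < 300; episode++)
        {
            Array.Clear(E, 0, E.Length);                    // fresh traces every episode
            int s = Rng.Next(Width * Height - 1);           // 'practice makes perfect': new random start
            while (s != Goal)
            {
                // Epsilon-greedy action selection.
                int a = Rng.NextDouble() < Epsilon ? Rng.Next(4) : GreedyAction(s);

                // Watkins' Q(lambda): traces are cut whenever a non-greedy action is taken,
                // because the greedy return estimate no longer applies to earlier pairs.
                if (Q[s, a] < Q[s, GreedyAction(s)]) Array.Clear(E, 0, E.Length);

                int s2 = Step(s, a);
                double r = s2 == Goal ? TreasureReward : 0.0;

                // One-step TD error towards the greedy successor value (Q-learning target).
                double delta = r + Gamma * Q[s2, GreedyAction(s2)] - Q[s, a];
                E[s, a] += 1.0;                             // mark the visited pair as eligible

                for (int i = 0; i < Width * Height; i++)
                    for (int j = 0; j < 4; j++)
                    {
                        Q[i, j] += Alpha * delta * E[i, j]; // propagate the error along the trace
                        E[i, j] *= Gamma * Lambda;          // traces fade with every step
                    }
                s = s2;
            }
        }

        // The demo's green floor shading corresponds to max_a Q(s, a) per cell.
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
                Console.Write($"{Q[y * Width + x, GreedyAction(y * Width + x)],6:F2} ");
            Console.WriteLine();
        }
    }
}
```

Running this sketch for a few hundred episodes yields per-cell maximum Q-values that decrease with distance from the treasure, which is precisely the effect described next: because the reward is discounted by γ at every step, the learned values form a gradient pointing towards the goal.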
After several trials, the maze starts to show a colour gradient towards the hidden treasure, which in turn helps the user to select the optimal direction to move. This allows the user to understand that reward is discounted over time. As the task is repeated, the player also needs less time to reach parts of the maze that were visited before and thus already contain updated Q-values.

Our demonstration has the potential to educate a broad audience on the dynamics of Reinforcement Learning [1]. A moderator directs the demonstration to ensure the game progresses smoothly and to keep the spectating audience involved. The moderator enhances the user experience by explaining the game's purpose and the mechanisms of Q(λ) at a level adapted to the audience present.

The demonstration was developed in C# using the Unity3D engine, the SteamVR plugin and the VRTK software framework. The user plays the game through an HTC Vive VR system, on a computer with a VR-ready GPU. The play area requires a minimum surface of 2 by 2 meters.

References

1. Coppens, Y., Bargiacchi, E., Nowé, A.: Reinforcement learning 101 with a virtual reality game. In: Proceedings of the 1st International Workshop on Education in Artificial Intelligence K-12 (August 2019)
2. Watkins, C.J.C.H.: Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge, United Kingdom (May 1989)