1. Introduction

Using Episodic Memories

Sam Blakeman

samrobertallan.blakeman@sony.com 1

Denis Mareschal

d.mareschal@bbk.ac.uk 0

Malet Street

HX UK

0 Centre for Brain and Cognitive Development, Department of Psychological Sciences , Birkbeck , University of London 1 Sony AI , Wiesenstrasse 5, Schlieren, 8952 , Switzerland

2022

Deep Reinforcement Learning (RL) involves the use of Deep Neural Networks (DNNs) to make sequential decisions in order to maximize reward. For many tasks the resulting sequence of actions produced by a Deep RL policy can be long and dificult to understand for humans. A crucial component of human explanations is selectivity, whereby only key decisions and causes are recounted. Imbuing Deep RL agents with such an ability would make their resulting policies easier to understand from a human perspective and generate a concise set of instructions to aid the learning of future agents. To this end we use a Deep RL agent with an episodic memory system to identify and recount key decisions during policy execution. We show that these decisions form a short, human readable explanation that can also be used to speed up the learning of naive Deep RL agents.

Memories

1. Introduction

The ability to explain how to solve a task allows humans to share learnt knowledge and speed up the collective learning process. A naive approach to generating an explanation would be to recall every decision made during the task. However, this is often undesirable because it leads to prohibitively long and complex explanations that cannot be easily understood by the recipient. It is therefore crucial that any explanation generating process is able to identify a small subset of key decisions that are fundamental for solving the task [ 1 ]. In the social sciences this is referred to as explanation selection and refers to the fact that human explanations are biased to only a few important events or causes [ 2 ]. Current approaches to generating explanations in Deep RL algorithms typically operate at the level of individual decisions, for example by computing saliency scores for all input features. They therefore do not produce selective task-level explanations and fundamentally lack selectivity in their explanations [ 3 ].

To address this question we build on our previous work that outlined a framework for imbuing Deep RL algorithms with a hippocampal learning system [ 4 ]. The approach, termed Complementary Temporal Diference Learning (CTDL), was inspired by the theory of Complementary Learning Systems (CLS) [ 5 ]. CLS theory states that the brain relies upon the complementary properties of the neocortex and the hippocampus to perform complex behaviour. The explicit communication between the neocortical and hippocampal learning system in CTDL is of interest for generating selective task-level explanations because it provides a mechanism for identifying key decisions based on the current task.

We propose that the content of the hippocampal learning system (represented as a SelfOrganizing Map (SOM)) can be used to generate partial explanations of how to solve the current task in terms of which action to select when. After learning the task, the memories that the agent uses from the SOM can be stored as a short ordered list of key states and actions. This list can then be interpreted as a partial explanation of how to solve the task and can be given to other agents to speed up their learning process. We demonstrate the eficacy of this approach in both the grid world and continuous mountain car domains. We visually explore the quality of the generated explanations and also perform a quantitative assessment by measuring the improvement in performance when a naive agent receives the explanation.

2. Methods

We generate partial explanations at the task-level by selecting a subset of the memories stored in the hippocampal learning system (i.e. the SOM) of CTDL and presenting them as a temporal sequence (Figure 1). In order to make this selection, we ask the agent to perform a test trial at the end of learning. During this test trial no further learning occurs and we keep a list of every memory that was used from the SOM along with its associated tabular value. We also keep a record of the action that was taken and the calculated weighting value (β) for each memory. β is calculated at each time-step using the Euclidean distance between the current state and the closest matching memory in the SOM. We use this β to reduce the length of the list post-hoc so that the explanation is more concise and understandable based on a pre-defined threshold value (Figure 1).

After generating an explanation from CTDL, the list of memories, values and actions can be provided to other agents to improve the eficiency of their learning. In order to utilise the list the receiving agent simply needs to calculate the weighting value between the current state of the environment and the memories in the list on each time-step. If the weighting is greater than a predefined threshold (e.g. 0.5) for a memory in the list then the agent’s current action and value estimate can be set to that memory’s action and value. If multiple memories have a weighting greater than the threshold then the one with the highest value is used. The benefits of this simple mechanism are two-fold; (1) the policy is guided towards critical actions early on and (2) RL algorithms that use a value function can use the associated values to bootstrap value estimates during learning. While this mechanism of providing explanations can be used for any RL algorithm, we can enhance it further if the explanation is being provided to CTDL. In this case, the list of memories, values and actions can be used to randomly initialise the entries of the SOM. These entries are fixed throughout the course of learning so that they are not overwritten.

3. Results

The implementational details of all the simulations can be found in Blakeman and Mareschal [ 6 ]. For the grid world experiments, we trained 12 agents for 1000 episodes on a grid world and then generated an explanation after every 200 episodes. Figure 2A shows an explanation extracted from a best performing agent after 1000 episodes of training. The best agent was the one that achieved the most reward on a test trial after training. If a tie existed then the agent with the highest training reward was chosen. Since an explanation is simply a list of state-action pairings it can easily be inspected and qualitatively assessed. From visual inspection, the explanation includes the essential decisions needed to solve each grid world. Crucially, the explanation does not include every action taken by the agent, which demonstrates that the explanation mechanism is able to select only the most important state-action pairings. For the continuous mountain car task, 50 agents were trained for 1000 episodes and explanations were generated at the very end of training. Figure 2B shows an explanation extracted from the best performing agent. As with the grid worlds, the explanation does not involve every decision made by the agent but instead represent key decisions for solving the task.

Figure 3 compares the performance of agents on the continuous mountain car task that received an explanation from CTDL vs. those that did not. Agents that received an explanation achieved higher levels of reward on average than those that did not. Importantly, the provision of an explanation did not appear to lead to the discovery of a better overall policy since the best performing agents in both cases reached a similar level of performance (see the dashed lines in Figure 3). This is to be expected given that the provided explanations describe the strategies learnt by the agents without an explanation and so in both cases the policies should be qualitatively similar. The provision of an explanation therefore appears to increase the probability of an agent finding a previously learnt policy rather than discovering a new optimal policy. As the explanations are generated from the best original agent, the agents receiving the explanation benefit from the increased probability of finding this best policy and so the average performance of the overall population increases.

4. Discussion

Explanations of how to solve a task often involve a summary of the key decisions required to complete it, an ability referred to as selectivity [ 2, 3 ]. Classic Deep Reinforcement Learning (RL) approaches lack this ability because all actions are reported when executing a policy. Recently, Complementary Temporal Diference Learning (CTDL) has been proposed which uses a Deep Neural Network (DNN) and a Self-Organizing Map (SOM) to solve the RL problem. Importantly, CTDL uses the errors produced by the DNN to update the contents of the SOM. In efect this results in the SOM storing episodic memories of states and actions that led to the largest errors during learning. We therefore use the contents of the SOM to generate task-level explanations as they provide an intuitive summary of most the important state-action pairs for solving the task at hand.

From a qualitative perspective, the explanations generated from CTDL appeared to capture the critical structure of the current task. In the grid world experiments, the sequence of states and actions can be followed in order to trace a route to the goal location but they do not exhaustively cover the whole trajectory. Similarly, in the continuous mountain car task, the basic strategy of gaining momentum can be easily seen from the generated explanation without the need to report every action taken. The explanations therefore gave us a condensed view of the strategy learnt by the agent in an understandable and human-readable format. In order to obtain a quantitative assessment of the explanations generated from CTDL, we also provided them to naive agents at the start of learning to see whether they improved performance. In the case of both the grid worlds and the continuous mountain car task, we saw better average performance, faster learning and increased robustness when an explanation was provided to the agent.

Acknowledgments

This work was funded by a Human-Like Computing Network kick-start award (EPSRC, UK). We thank NVIDIA for a hardware grant that provided the Graphics Processing Unit (GPU) used to run the simulations.

[1]

Amir ,

Doshi-Velez ,

Sarne , Summarizing agent strategies, Autonomous Agents and Multi-Agent Systems 33 ( 2019 ) 628 - 644 .

[2]

Miller , Explanation in artificial intelligence: Insights from the social sciences , Artificial intelligence 267 ( 2019 ) 1 - 38 .

[3]

Alvarez-Melis , H. Daumé

III

J. W.

Vaughan ,

Wallach , Weight of evidence as a basis for human-oriented explanations , arXiv preprint arXiv: 1910 . 13503 ( 2019 ).

[4]

Blakeman ,

Mareschal , A complementary learning systems approach to temporal diference learning , Neural Networks 122 ( 2020 ) 218 - 230 .

[5]

J. L.

McClelland ,

B. L.

McNaughton , R. C. O'Reilly , Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory ., Psychological review 102 ( 1995 ) 419 .

[6]

Blakeman ,

Mareschal , Generating explanations from deep reinforcement learning using episodic memory , arXiv preprint arXiv:2205.08926 ( 2022 ).