Generating Explanations from Deep Reinforcement Learning Using Episodic Memories

Sam Blakeman1,∗, Denis Mareschal2

1 Sony AI, Wiesenstrasse 5, Schlieren, 8952, Switzerland
2 Centre for Brain and Cognitive Development, Department of Psychological Sciences, Birkbeck, University of London, Malet Street, WC1E 7HX, UK

∗ Corresponding author: samrobertallan.blakeman@sony.com (S. Blakeman); d.mareschal@bbk.ac.uk (D. Mareschal)

Abstract

Deep Reinforcement Learning (RL) involves the use of Deep Neural Networks (DNNs) to make sequential decisions in order to maximize reward. For many tasks the resulting sequence of actions produced by a Deep RL policy can be long and difficult for humans to understand. A crucial component of human explanations is selectivity, whereby only key decisions and causes are recounted. Imbuing Deep RL agents with such an ability would make their policies easier to understand from a human perspective and would generate a concise set of instructions to aid the learning of future agents. To this end, we use a Deep RL agent with an episodic memory system to identify and recount key decisions during policy execution. We show that these decisions form a short, human-readable explanation that can also be used to speed up the learning of naive Deep RL agents.

Keywords

Deep Reinforcement Learning, Explanation, Complementary Learning Systems, Episodic Memory

1. Introduction

The ability to explain how to solve a task allows humans to share learnt knowledge and speed up the collective learning process. A naive approach to generating an explanation would be to recall every decision made during the task. However, this is often undesirable because it leads to prohibitively long and complex explanations that cannot be easily understood by the recipient. It is therefore crucial that any explanation-generating process can identify a small subset of key decisions that are fundamental for solving the task [1]. In the social sciences this is known as explanation selection: human explanations are biased towards only a few important events or causes [2]. Current approaches to generating explanations for Deep RL algorithms typically operate at the level of individual decisions, for example by computing saliency scores for all input features. They therefore fail to produce selective, task-level explanations [3].

To address this limitation we build on our previous work that outlined a framework for imbuing Deep RL algorithms with a hippocampal learning system [4]. The approach, termed Complementary Temporal Difference Learning (CTDL), was inspired by the theory of Complementary Learning Systems (CLS) [5]. CLS theory states that the brain relies upon the complementary properties of the neocortex and the hippocampus to perform complex behaviour. The explicit communication between the neocortical and hippocampal learning systems in CTDL is of interest for generating selective task-level explanations because it provides a mechanism for identifying key decisions based on the current task.
We propose that the content of the hippocampal learning system (represented as a Self-Organizing Map (SOM)) can be used to generate partial explanations of how to solve the current task, in terms of which action to select and when. After learning the task, the memories that the agent uses from the SOM can be stored as a short, ordered list of key states and actions. This list can then be interpreted as a partial explanation of how to solve the task and can be given to other agents to speed up their learning. We demonstrate the efficacy of this approach in both the grid world and continuous mountain car domains. We visually explore the quality of the generated explanations and also perform a quantitative assessment by measuring the improvement in performance when a naive agent receives an explanation.

2. Methods

We generate partial explanations at the task level by selecting a subset of the memories stored in the hippocampal learning system (i.e. the SOM) of CTDL and presenting them as a temporal sequence (Figure 1). To make this selection, we ask the agent to perform a test trial at the end of learning. During this test trial no further learning occurs and we keep a list of every memory that was used from the SOM, along with its associated tabular value. We also record the action that was taken and the calculated weighting value (β) for each memory. β is calculated at each time-step using the Euclidean distance between the current state and the closest matching memory in the SOM. We use β to prune the list post-hoc, based on a pre-defined threshold, so that the explanation is more concise and understandable (Figure 1).

After generating an explanation from CTDL, the list of memories, values and actions can be provided to other agents to improve the efficiency of their learning. To utilise the list, the receiving agent simply needs to calculate the weighting value between the current state of the environment and each memory in the list on every time-step. If the weighting for a memory in the list is greater than a pre-defined threshold (e.g. 0.5), then the agent's current action and value estimate can be set to that memory's action and value. If multiple memories have a weighting greater than the threshold, then the one with the highest value is used. The benefits of this simple mechanism are two-fold: (1) the policy is guided towards critical actions early on, and (2) RL algorithms that use a value function can use the associated values to bootstrap their value estimates during learning.

While this mechanism for providing explanations can be used with any RL algorithm, it can be enhanced further if the explanation is provided to CTDL. In this case, the list of memories, values and actions can be used to randomly initialise the entries of the SOM. These entries are fixed throughout the course of learning so that they are not overwritten.

Figure 1: Process of generating explanations from Complementary Temporal Difference Learning (CTDL). Step 1: After training an agent via CTDL, a test trial is performed. During the test trial, an ordered list is kept of all the memories used from the Self-Organizing Map (SOM). In addition to the memory (m), the value associated with that memory (v), the degree to which the value was used (β) and the action taken (a) are also recorded. Step 2: After the test trial has been completed, the list is pruned to provide a partial explanation of how to solve the task. For each unique memory in the list, only the row with the highest value of β is kept. This ensures that each memory has only a single associated value and action. In addition, all rows where β < 0.5 are removed, as they formed the minority of the value prediction and so were not heavily relied upon by the agent.
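To make the procedure in Figure 1 concrete, the sketch below illustrates the recording and pruning steps in Python. The agent and environment interfaces, the exponential form of the distance-based weighting and the ExplanationRow record are illustrative assumptions rather than the published implementation; the full details of CTDL and the simulations are given in Blakeman and Mareschal [6].

# Minimal sketch of explanation generation from a trained CTDL agent.
# The SOM interface (closest_memory), the form of beta and the
# environment's (state, reward, done) step interface are assumptions
# made for illustration only.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ExplanationRow:
    memory: np.ndarray   # SOM memory (state prototype) m
    value: float         # tabular value v associated with that memory
    beta: float          # weighting placed on the memory's value
    action: int          # action a taken at that time-step

def compute_beta(state: np.ndarray, memory: np.ndarray, scale: float = 1.0) -> float:
    # Weighting derived from the Euclidean distance between the current
    # state and the memory (assumed form: closer memories -> beta near 1).
    return float(np.exp(-scale * np.linalg.norm(state - memory)))

def record_test_trial(env, agent) -> List[ExplanationRow]:
    # Step 1: run a single test trial with learning switched off and log
    # every SOM memory that was used, with its value, beta and action.
    rows, state, done = [], env.reset(), False
    while not done:
        memory, value = agent.som.closest_memory(state)  # best-matching unit and its value
        beta = compute_beta(state, memory)
        action = agent.act(state, learning=False)
        rows.append(ExplanationRow(memory, value, beta, action))
        state, _, done = env.step(action)
    return rows

def prune(rows: List[ExplanationRow], threshold: float = 0.5) -> List[ExplanationRow]:
    # Step 2: keep a single row per unique memory (the one with the
    # highest beta) and drop rows with beta below the threshold.
    best = {}
    for row in rows:                      # dict preserves first-use order
        key = row.memory.tobytes()
        if key not in best or row.beta > best[key].beta:
            best[key] = row
    return [row for row in best.values() if row.beta >= threshold]

Running record_test_trial followed by prune yields the short, ordered list of (m, v, β, a) entries that constitutes the explanation.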
3. Results

The implementational details of all the simulations can be found in Blakeman and Mareschal [6]. For the grid world experiments, we trained 12 agents for 1000 episodes on a grid world and generated an explanation after every 200 episodes. Figure 2A shows an explanation extracted from the best performing agent after 1000 episodes of training. The best agent was the one that achieved the most reward on a test trial after training; ties were broken by choosing the agent with the highest training reward. Since an explanation is simply a list of state-action pairings, it can easily be inspected and qualitatively assessed. From visual inspection, the explanation includes the essential decisions needed to solve each grid world. Crucially, the explanation does not include every action taken by the agent, which demonstrates that the explanation mechanism is able to select only the most important state-action pairings.

For the continuous mountain car task, 50 agents were trained for 1000 episodes and explanations were generated at the very end of training. Figure 2B shows an explanation extracted from the best performing agent. As with the grid worlds, the explanation does not involve every decision made by the agent but instead represents key decisions for solving the task.

Figure 2: Example explanations generated from Complementary Temporal Difference Learning (CTDL). The explanation is represented by stars, which correspond to memories extracted from the Self-Organizing Map (SOM). (A) Grid world environment: The agent starts on the yellow square and has to move to the green square, which is associated with a reward of +1. The dark blue squares are associated with a reward of -1 and every action incurs a reward of -0.05. (B) Continuous mountain car environment: The agent has to gather momentum in order to escape from the valley and reach the flag for a reward of +100. The x-axis represents the position of the car and the y-axis represents the velocity. The dashed line indicates the trajectory of the car for a single test trial after learning.

Figure 3 compares the performance of agents on the continuous mountain car task that received an explanation from CTDL with those that did not. Agents that received an explanation achieved higher levels of reward on average than those that did not. Importantly, the provision of an explanation did not appear to lead to the discovery of a better overall policy, since the best performing agents in both cases reached a similar level of performance (see the dashed lines in Figure 3). This is to be expected, given that the provided explanations describe the strategies learnt by the agents without an explanation, and so in both cases the policies should be qualitatively similar. The provision of an explanation therefore appears to increase the probability of an agent finding a previously learnt policy rather than discovering a new optimal policy. As the explanations are generated from the best original agent, the agents receiving the explanation benefit from the increased probability of finding this best policy, and so the average performance of the overall population increases.
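For completeness, the sketch below shows how a receiving agent can consume an explanation on each time-step, following the mechanism described in Section 2: memories whose weighting exceeds the threshold override the agent's action and value estimate, with ties resolved in favour of the highest value. It reuses the ExplanationRow record from the earlier sketch; the agent interface and the set_value_target hook are assumptions for illustration.

# Sketch of how a naive agent can use a provided explanation on each
# time-step. ExplanationRow is the record introduced in the Section 2
# sketch; the agent interface below is an assumption for illustration.

from typing import List, Optional
import numpy as np

def match_explanation(state: np.ndarray,
                      explanation: List["ExplanationRow"],
                      threshold: float = 0.5,
                      scale: float = 1.0) -> Optional["ExplanationRow"]:
    # Return the explanation entry whose memory matches the current state
    # with a weighting above the threshold; if several match, the one
    # with the highest value wins. Returns None if nothing matches.
    matches = []
    for row in explanation:
        weight = float(np.exp(-scale * np.linalg.norm(state - row.memory)))
        if weight > threshold:
            matches.append(row)
    return max(matches, key=lambda r: r.value) if matches else None

def act_with_explanation(agent, state, explanation):
    # Override the agent's own choice whenever the explanation applies;
    # the matched value can also be used to bootstrap the value estimate.
    row = match_explanation(state, explanation)
    if row is not None:
        agent.set_value_target(state, row.value)  # assumed bootstrapping hook
        return row.action
    return agent.act(state)

When the explanation is provided to CTDL itself, the same entries can instead be written into the SOM at initialisation and held fixed, as described in Section 2.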
4. Discussion

Explanations of how to solve a task often involve a summary of the key decisions required to complete it, an ability referred to as selectivity [2, 3]. Classic Deep Reinforcement Learning (RL) approaches lack this ability because every action is reported when executing a policy. Recently, Complementary Temporal Difference Learning (CTDL) has been proposed, which uses a Deep Neural Network (DNN) and a Self-Organizing Map (SOM) to solve the RL problem. Importantly, CTDL uses the errors produced by the DNN to update the contents of the SOM. In effect, this results in the SOM storing episodic memories of the states and actions that led to the largest errors during learning. We therefore use the contents of the SOM to generate task-level explanations, as they provide an intuitive summary of the most important state-action pairs for solving the task at hand.

From a qualitative perspective, the explanations generated from CTDL appeared to capture the critical structure of the current task. In the grid world experiments, the sequence of states and actions can be followed in order to trace a route to the goal location, but they do not exhaustively cover the whole trajectory. Similarly, in the continuous mountain car task, the basic strategy of gaining momentum can easily be seen from the generated explanation without the need to report every action taken. The explanations therefore gave us a condensed view of the strategy learnt by the agent in an understandable, human-readable format.

Figure 3: The performance on the continuous mountain car task of the original agents (No Explanation), the agents that received an explanation generated from the best original agent at the end of learning (Explanation), and the agents that received a random sample of the memories generated by the best original agent at the end of learning (Shuffled Explanation). 50 agents were trained on the continuous mountain car task for 1000 episodes. The agent with the highest total reward on the final episode was chosen to provide explanations. Explanations were generated by running the chosen agent on 20 test episodes after training. The explanations were then used to train 50 new agents, with each new agent picking one at random. Solid lines indicate the average performance over 50 agents. Dashed lines indicate the best performing agent for each group. (A) Performance on the first 50 episodes of training. (B) Performance on all 1000 episodes of training.

In order to obtain a quantitative assessment of the explanations generated from CTDL, we also provided them to naive agents at the start of learning to see whether they improved performance. In the case of both the grid worlds and the continuous mountain car task, we saw better average performance, faster learning and increased robustness when an explanation was provided to the agent.
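The error-gated episodic update described above can be illustrated with a short schematic: the magnitude of the DNN's temporal-difference error decides whether the current state is written into the SOM, so that the SOM comes to hold the decisions the network found hardest to predict. The thresholded gating rule and the update constants below are assumptions made for illustration; the actual CTDL update is specified in Blakeman and Mareschal [4].

# Schematic, assumed form of an error-gated write to the SOM: only
# states with a large TD error in the DNN are stored. The threshold
# and learning rate are illustrative, not the published values.

import numpy as np

def maybe_store_in_som(som_weights: np.ndarray,
                       state: np.ndarray,
                       td_error: float,
                       error_threshold: float = 0.1,
                       learning_rate: float = 0.5) -> np.ndarray:
    # Move the best-matching SOM unit towards the current state when the
    # DNN's TD error is large; small errors leave the SOM untouched.
    if abs(td_error) < error_threshold:
        return som_weights                      # nothing surprising: no write
    distances = np.linalg.norm(som_weights - state, axis=1)
    winner = int(np.argmin(distances))          # best-matching unit
    som_weights[winner] += learning_rate * (state - som_weights[winner])
    return som_weights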
Acknowledgments

This work was funded by a Human-Like Computing Network kick-start award (EPSRC, UK). We thank NVIDIA for a hardware grant that provided the Graphics Processing Unit (GPU) used to run the simulations.

References

[1] O. Amir, F. Doshi-Velez, D. Sarne, Summarizing agent strategies, Autonomous Agents and Multi-Agent Systems 33 (2019) 628–644.
[2] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artificial Intelligence 267 (2019) 1–38.
[3] D. Alvarez-Melis, H. Daumé III, J. W. Vaughan, H. Wallach, Weight of evidence as a basis for human-oriented explanations, arXiv preprint arXiv:1910.13503 (2019).
[4] S. Blakeman, D. Mareschal, A complementary learning systems approach to temporal difference learning, Neural Networks 122 (2020) 218–230.
[5] J. L. McClelland, B. L. McNaughton, R. C. O'Reilly, Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory, Psychological Review 102 (1995) 419.
[6] S. Blakeman, D. Mareschal, Generating explanations from deep reinforcement learning using episodic memory, arXiv preprint arXiv:2205.08926 (2022).