<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Episodic Memories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sam Blakeman</string-name>
          <email>samrobertallan.blakeman@sony.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Mareschal</string-name>
          <email>d.mareschal@bbk.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Brain and Cognitive Development, Department of Psychological Sciences</institution>
          ,
          <institution>Birkbeck, University of London</institution>
          ,
          <addr-line>Malet Street</addr-line>
          ,
          <country country="GB">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sony AI</institution>
          ,
          <addr-line>Wiesenstrasse 5, Schlieren, 8952</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Deep Reinforcement Learning (RL) involves the use of Deep Neural Networks (DNNs) to make sequential decisions in order to maximize reward. For many tasks the resulting sequence of actions produced by a Deep RL policy can be long and difficult for humans to understand. A crucial component of human explanations is selectivity, whereby only key decisions and causes are recounted. Imbuing Deep RL agents with such an ability would make their resulting policies easier to understand from a human perspective and generate a concise set of instructions to aid the learning of future agents. To this end we use a Deep RL agent with an episodic memory system to identify and recount key decisions during policy execution. We show that these decisions form a short, human-readable explanation that can also be used to speed up the learning of naive Deep RL agents.</p>
      </abstract>
      <kwd-group>
        <kwd>Memories</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ability to explain how to solve a task allows humans to share learnt knowledge and speed
up the collective learning process. A naive approach to generating an explanation would be
to recall every decision made during the task. However, this is often undesirable because it
leads to prohibitively long and complex explanations that cannot be easily understood by the
recipient. It is therefore crucial that any explanation generating process is able to identify a
small subset of key decisions that are fundamental for solving the task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the social sciences
this is referred to as explanation selection and refers to the fact that human explanations
are biased to only a few important events or causes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Current approaches to generating
explanations in Deep RL algorithms typically operate at the level of individual decisions, for
example by computing saliency scores for all input features. They therefore lack the selectivity
needed to produce task-level explanations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        To address this limitation we build on our previous work that outlined a framework for imbuing
Deep RL algorithms with a hippocampal learning system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The approach, termed
Complementary Temporal Difference Learning (CTDL), was inspired by the theory of Complementary
Learning Systems (CLS) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. CLS theory states that the brain relies upon the complementary
properties of the neocortex and the hippocampus to perform complex behaviour. The explicit
communication between the neocortical and hippocampal learning systems in CTDL is of interest
for generating selective task-level explanations because it provides a mechanism for identifying
key decisions based on the current task.
      </p>
      <p>We propose that the content of the hippocampal learning system (represented as a
Self-Organizing Map (SOM)) can be used to generate partial explanations of how to solve the current
task in terms of which action to select when. After learning the task, the memories that the
agent uses from the SOM can be stored as a short ordered list of key states and actions. This list
can then be interpreted as a partial explanation of how to solve the task and can be given to
other agents to speed up their learning process. We demonstrate the efficacy of this approach
in both the grid world and continuous mountain car domains. We visually explore the quality
of the generated explanations and also perform a quantitative assessment by measuring the
improvement in performance when a naive agent receives the explanation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>We generate partial explanations at the task-level by selecting a subset of the memories stored
in the hippocampal learning system (i.e. the SOM) of CTDL and presenting them as a temporal
sequence (Figure 1). In order to make this selection, we ask the agent to perform a test trial at
the end of learning. During this test trial no further learning occurs and we keep a list of every
memory that was used from the SOM along with its associated tabular value. We also keep a
record of the action that was taken and the calculated weighting value (β) for each memory. β
is calculated at each time-step using the Euclidean distance between the current state and the
closest matching memory in the SOM. After the trial, we use β together with a pre-defined
threshold value to shorten the list, so that the explanation is more concise and understandable
(Figure 1).</p>
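      <p>As a minimal sketch of this selection step, the Python fragment below filters the memories recorded during a test trial by their weighting. Several details are assumptions made purely for illustration and are not specified above: the container <monospace>MemoryRecord</monospace>, the function names, the temperature <monospace>tau</monospace>, the exponential mapping from Euclidean distance to β, and the choice to keep memories whose β exceeds the threshold.</p>
      <preformat>
```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MemoryRecord:
    """One SOM memory used during the test trial (hypothetical container)."""
    memory: Sequence[float]  # contents of the closest matching SOM unit
    action: int              # action taken at this time-step
    value: float             # tabular value associated with the memory
    beta: float              # weighting computed at this time-step


def weighting(state: Sequence[float], memory: Sequence[float], tau: float = 1.0) -> float:
    """Assumed form of beta: exp(-d / tau), with d the Euclidean distance."""
    return math.exp(-math.dist(state, memory) / tau)


def select_key_decisions(records: List[MemoryRecord], threshold: float) -> List[MemoryRecord]:
    """Keep records whose beta exceeds the threshold, preserving temporal order."""
    return [record for record in records if record.beta > threshold]


# Toy test trial in a 2-D state space (values chosen for illustration only).
visited = [(0.0, 0.1), (1.0, 1.0), (2.0, 0.2)]
closest = [(0.0, 0.0), (1.0, 3.0), (2.0, 0.0)]
actions = [1, 0, 2]
values = [0.5, 0.1, 0.8]
trial = [
    MemoryRecord(m, a, v, weighting(s, m))
    for s, m, a, v in zip(visited, closest, actions, values)
]
explanation = select_key_decisions(trial, threshold=0.5)
print([record.action for record in explanation])  # prints [1, 2]
```
      </preformat>
      <p>In this toy trial the second time-step is dropped because the current state lies far from its closest memory, so only two of the three decisions remain in the explanation.</p>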
      <p>After generating an explanation from CTDL, the list of memories, values and actions can be
provided to other agents to improve the efficiency of their learning. In order to utilise the list,
the receiving agent simply needs to calculate the weighting value between the current state
of the environment and the memories in the list on each time-step. If the weighting is greater
than a predefined threshold (e.g. 0.5) for a memory in the list then the agent’s current action
and value estimate can be set to that memory’s action and value. If multiple memories have a
weighting greater than the threshold then the one with the highest value is used. The benefits
of this simple mechanism are two-fold: (1) the policy is guided towards critical actions early
on and (2) RL algorithms that use a value function can use the associated values to bootstrap
value estimates during learning. While this mechanism of providing explanations can be used
for any RL algorithm, we can enhance it further if the explanation is being provided to CTDL.
In this case, the list of memories, values and actions can be used to randomly initialise the
entries of the SOM. These entries are fixed throughout the course of learning so that they are
not overwritten.</p>
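      <p>The lookup performed by a receiving agent can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the simulations: the <monospace>Entry</monospace> tuple layout, the function names, the temperature <monospace>tau</monospace> and the exponential form of the weighting are hypothetical, and "highest value" is read here as the highest stored value estimate among the matching memories.</p>
      <preformat>
```python
import math
from typing import List, Optional, Sequence, Tuple

# One entry of a received explanation: (memory state, action, value).
Entry = Tuple[Sequence[float], int, float]


def weighting(state: Sequence[float], memory: Sequence[float], tau: float = 1.0) -> float:
    """Assumed form of the weighting: exp(-d / tau), d = Euclidean distance."""
    return math.exp(-math.dist(state, memory) / tau)


def consult_explanation(
    state: Sequence[float],
    explanation: List[Entry],
    threshold: float = 0.5,
    tau: float = 1.0,
) -> Optional[Tuple[int, float]]:
    """Return the (action, value) suggested by the explanation, or None.

    Every entry whose weighting exceeds the threshold is a candidate; if
    several match, the one with the highest stored value is returned.
    """
    matches = [
        (action, value)
        for memory, action, value in explanation
        if weighting(state, memory, tau) > threshold
    ]
    if not matches:
        return None  # the agent falls back on its own policy
    return max(matches, key=lambda match: match[1])


# Toy explanation with three entries (values chosen for illustration only).
explanation = [((0.0, 0.0), 1, 0.3), ((0.1, 0.0), 2, 0.9), ((5.0, 5.0), 0, 2.0)]
print(consult_explanation((0.0, 0.0), explanation))    # prints (2, 0.9)
print(consult_explanation((10.0, 10.0), explanation))  # prints None
```
      </preformat>
      <p>When the function returns <monospace>None</monospace>, no memory in the list is close enough to the current state and the agent acts according to its own policy and value estimates.</p>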
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        The implementational details of all the simulations can be found in Blakeman and Mareschal [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
For the grid world experiments, we trained 12 agents for 1000 episodes on a grid world and then
generated an explanation after every 200 episodes. Figure 2A shows an explanation extracted
from the best-performing agent after 1000 episodes of training. The best agent was the one that
achieved the most reward on a test trial after training. If a tie existed then the agent with the
highest training reward was chosen. Since an explanation is simply a list of state-action pairings
it can easily be inspected and qualitatively assessed. From visual inspection, the explanation
includes the essential decisions needed to solve each grid world. Crucially, the explanation
does not include every action taken by the agent, which demonstrates that the explanation
mechanism is able to select only the most important state-action pairings. For the continuous
mountain car task, 50 agents were trained for 1000 episodes and explanations were generated
at the very end of training. Figure 2B shows an explanation extracted from the best-performing
agent. As with the grid worlds, the explanation does not involve every decision made by the
agent but instead represents the key decisions for solving the task.
      </p>
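      <p>The rule used to pick the best agent (highest test-trial reward, with ties broken by training reward) can be written as a lexicographic comparison. The Python sketch below is illustrative only: the <monospace>AgentResult</monospace> type, its field names and the use of a single scalar training reward per agent are assumptions, since the way training reward is aggregated is not specified here.</p>
      <preformat>
```python
from typing import List, NamedTuple


class AgentResult(NamedTuple):
    """Summary of one trained agent (hypothetical record type)."""
    name: str
    test_reward: float      # reward obtained on the test trial after training
    training_reward: float  # assumed scalar summary of reward during training


def best_agent(results: List[AgentResult]) -> AgentResult:
    """Highest test reward wins; ties are broken by training reward."""
    return max(results, key=lambda r: (r.test_reward, r.training_reward))


# Toy population (values chosen for illustration only).
population = [
    AgentResult("agent_a", test_reward=1.0, training_reward=10.0),
    AgentResult("agent_b", test_reward=1.0, training_reward=12.0),
    AgentResult("agent_c", test_reward=0.5, training_reward=50.0),
]
print(best_agent(population).name)  # prints agent_b
```
      </preformat>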
      <p>Figure 3 compares the performance of agents on the continuous mountain car task that
received an explanation from CTDL vs. those that did not. Agents that received an explanation
achieved higher levels of reward on average than those that did not. Importantly, the provision
of an explanation did not appear to lead to the discovery of a better overall policy since the
best performing agents in both cases reached a similar level of performance (see the dashed
lines in Figure 3). This is to be expected given that the provided explanations describe the
strategies learnt by the agents without an explanation and so in both cases the policies should
be qualitatively similar. The provision of an explanation therefore appears to increase the
probability of an agent finding a previously learnt policy rather than discovering a new optimal
policy. As the explanations are generated from the best original agent, the agents receiving the
explanation benefit from the increased probability of finding this best policy and so the average
performance of the overall population increases.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>
        Explanations of how to solve a task often involve a summary of the key decisions required to
complete it, an ability referred to as selectivity [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Classic Deep Reinforcement Learning (RL)
approaches lack this ability because all actions are reported when executing a policy. Recently,
Complementary Temporal Difference Learning (CTDL) has been proposed, which uses a Deep
Neural Network (DNN) and a Self-Organizing Map (SOM) to solve the RL problem. Importantly,
CTDL uses the errors produced by the DNN to update the contents of the SOM. In effect, this
results in the SOM storing episodic memories of states and actions that led to the largest errors
during learning. We therefore use the contents of the SOM to generate task-level explanations
as they provide an intuitive summary of the most important state-action pairs for solving the
task at hand.
      </p>
      <p>From a qualitative perspective, the explanations generated from CTDL appeared to capture
the critical structure of the current task. In the grid world experiments, the sequence of states
and actions can be followed in order to trace a route to the goal location but it does not
exhaustively cover the whole trajectory. Similarly, in the continuous mountain car task, the
basic strategy of gaining momentum can be easily seen from the generated explanation without
the need to report every action taken. The explanations therefore gave us a condensed view of
the strategy learnt by the agent in an understandable and human-readable format. In order to
obtain a quantitative assessment of the explanations generated from CTDL, we also provided
them to naive agents at the start of learning to see whether they improved performance. In
the case of both the grid worlds and the continuous mountain car task, we saw better average
performance, faster learning and increased robustness when an explanation was provided to
the agent.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was funded by a Human-Like Computing Network kick-start award (EPSRC, UK).
We thank NVIDIA for a hardware grant that provided the Graphics Processing Unit (GPU) used
to run the simulations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Amir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sarne</surname>
          </string-name>
          ,
          <article-title>Summarizing agent strategies</article-title>
          ,
          <source>Autonomous Agents and Multi-Agent Systems</source>
          <volume>33</volume>
          (
          <year>2019</year>
          )
          <fpage>628</fpage>
          -
          <lpage>644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Explanation in artificial intelligence: Insights from the social sciences</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>267</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Alvarez-Melis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé</surname>
            <suffix>III</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <article-title>Weight of evidence as a basis for human-oriented explanations</article-title>
          ,
          <source>arXiv preprint arXiv:1910.13503</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Blakeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mareschal</surname>
          </string-name>
          ,
          <article-title>A complementary learning systems approach to temporal difference learning</article-title>
          ,
          <source>Neural Networks</source>
          <volume>122</volume>
          (
          <year>2020</year>
          )
          <fpage>218</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>McClelland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>McNaughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>O'Reilly</surname>
          </string-name>
          ,
          <article-title>Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory</article-title>
          ,
          <source>Psychological Review</source>
          <volume>102</volume>
          (
          <year>1995</year>
          )
          <fpage>419</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Blakeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mareschal</surname>
          </string-name>
          ,
          <article-title>Generating explanations from deep reinforcement learning using episodic memory</article-title>
          ,
          <source>arXiv preprint arXiv:2205.08926</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>