<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combating Stagnation in Reinforcement Learning Through 'Guided Learning' With 'Taught-Response Memory'</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Trinity College Dublin, School of Computer Science and Statistics, Artificial Intelligence Discipline, ADAPT Centre</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We present the concept of Guided Learning, a framework that allows a Reinforcement Learning agent to effectively 'ask for help' when it encounters stagnation. Either a human or an expert agent supervisor can then optionally 'guide' the agent on how to progress beyond the point of stagnation. This guidance is encoded in a novel way using a separately trained neural network, referred to as a 'Taught Response Memory', that can be recalled when a 'similar' situation arises in the future. This paper shows how Guided Learning is algorithm independent and can be applied in any Reinforcement Learning context. With minimal guidance, Guided Learning outperformed the agent's non-guided counterpart, achieving, on average, increases of 136% and 112% in the rate of progression of the champion and average genomes respectively. This is because Guided Learning allows the agent to exploit more information, which reduces the agent's need for exploration.</p>
      </abstract>
      <kwd-group>
        <kwd>Active learning</kwd>
        <kwd>Agent teaching</kwd>
        <kwd>Evolutionary algorithms</kwd>
        <kwd>Interactive adaptive learning</kwd>
        <kwd>Stagnation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the primary problems with training any kind of modern AI in a
Reinforcement Learning environment is stagnation. Stagnation occurs when the agent
ceases to make progress in solving the current task before either the goal or the
agent's maximum effectiveness has been reached. Reducing stagnation is
important for reducing training times and for increasing overall performance
in cases where training times are limited.</p>
      <p>This paper presents a method to reduce stagnation and defines a framework
for a kind of interactive teaching/guidance where either a human or an expert agent
supervisor can guide a learning agent past stagnation.</p>
      <p>This publication emanated from research conducted with the financial support of
Science Foundation Ireland (SFI) under Grant Number 13/RC/2106.
© 2019 for this paper by its authors. Use permitted under CC BY 4.0.</p>
    </sec>
    <sec id="sec-2">
      <title>Guided Learning: K. Tunstead and J. Beel</title>
      <p>
        In terms of related work, we briefly discuss Teaching and Interactive
Adaptive Learning. The concept of Teaching [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] encompasses agent-to-agent [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
agent-to-human [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and human-to-agent teaching [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Guided Learning is a form of
Teaching that can take advantage of both human-to-agent and agent-to-agent teaching.
Interactive Adaptive Learning is defined as a combination of Active Learning,
a type of Machine Learning in which the algorithm is allowed to query some
information source in order to obtain the desired outputs, and Adaptive Stream
Mining, which concerns itself with how the algorithm should adapt when dealing
with time-changing data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Guided Learning</title>
        <p>Guided Learning encodes guidance using what we refer to as Taught Response
Memories (TRMs), which we define as: a memory of a series of actions that an
agent has been taught in response to specific stimuli. A TRM is an abstract
concept, but its representation must allow for some plasticity so that the
memory can adapt over time. This allows a TRM to tend towards a more optimal solution
for a single stimulus or towards more general applicability to other stimuli.
In this paper we represent TRMs as separately trained feed-forward neural
networks. TRMs may consist of multiple actions, which can cause non-convergence
when conflicting actions are presented. We therefore define a special case of TRM,
referred to as a Single Action TRM (SATRM). Using SATRMs, multiple actions
can be split into their single-action components, thereby removing any
conflicting actions. Due to their independence from the underlying algorithm, TRMs (and
consequently Guided Learning) can be used with any Reinforcement Learning
algorithm.</p>
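        <p>One plausible reading of the SATRM split can be sketched as follows. This is only an illustration under stated assumptions: the guidance layout (pairs of a stimulus and a set of action names), the inclusion of negative examples, and the name <monospace>split_into_single_actions</monospace> are all hypothetical and are not taken from the authors' implementation.</p>
        <preformat>
```python
from collections import defaultdict


def split_into_single_actions(guidance, all_actions):
    """Split multi-action guidance into one binary training set per action.

    guidance: list of (stimulus, pressed) pairs, where pressed is the set of
    action names taught for that stimulus (hypothetical layout).
    Returns {action: [(stimulus, 0.0 or 1.0), ...]} so that each
    single-action memory only learns whether its own action is active.
    """
    per_action = defaultdict(list)
    for stimulus, pressed in guidance:
        for action in all_actions:
            target = 1.0 if action in pressed else 0.0
            per_action[action].append((stimulus, target))
    return dict(per_action)


# Made-up guidance: two stimuli, taught with different button combinations.
guidance = [((0.1, 0.9), {"right", "jump"}), ((0.2, 0.8), {"right"})]
datasets = split_into_single_actions(guidance, ["right", "jump"])
print(datasets["jump"])  # [((0.1, 0.9), 1.0), ((0.2, 0.8), 0.0)]
```
        </preformat>
        <p>Because every training set concerns exactly one action, two taught actions can no longer pull the same network output in opposite directions.</p>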
        <p>
          The ideal implementation of Guided Learning is best described using an
example. In the game Super Mario Bros, when a reinforcement agent stagnates at
the first green pipe (see Fig. 1 in Appendix A), the agent can request guidance
from a supervisor. If no guidance is received within a given time period, the
algorithm continues as normal. Any guidance received is encoded as a new
TRM. The TRM can then be ‘recalled’ in order to attempt to jump over not only
the first green pipe but also the second, the third and so on. A TRM is ‘recalled’
if the current stimulus falls within a certain ‘similarity threshold’ t of the
stimulus for which the TRM was trained, i.e. if θ &lt; t, where
          <inline-formula><tex-math>\theta = \arccos\left(\frac{a \cdot b}{|a|\,|b|}\right)</tex-math></inline-formula>
          and a and b are
the stimulus vectors. Because each TRM is plastic, it can become
better either at jumping over that one specific green pipe or at jumping over
multiple green pipes. This also helps in cases where guidance is sub-optimal. A
full implementation of Guided Learning can recall the TRM not only in the
first level or in other levels of the game, but also in other games entirely with similar
mechanics to the original game (i.e. another platform or ‘jump and run’ based
game, where the agent is presented with a barrier in front of it). For more
information please refer to the extended version of this manuscript [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
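        <p>The recall test can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names, the list-of-pairs storage for TRMs, the handling of zero vectors, and the choice to return the closest TRM when several qualify are assumptions made for illustration.</p>
        <preformat>
```python
import math


def angle_between(a, b):
    """Angle in radians between stimulus vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as orthogonal to everything.
        return math.pi / 2
    # Clamp so that rounding error cannot push the cosine outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def recall_trm(stimulus, trms, threshold):
    """Return the TRM trained on the closest stimulus, or None.

    trms: list of (trained_stimulus, trm) pairs (hypothetical storage).
    A TRM qualifies only if its angle is strictly below the threshold.
    """
    best, best_angle = None, threshold
    for trained_stimulus, trm in trms:
        theta = angle_between(stimulus, trained_stimulus)
        if best_angle > theta:
            best, best_angle = trm, theta
    return best


# Made-up stimuli: one stored memory, queried with a near and a far stimulus.
stored = [((1.0, 0.0), "jump-over-pipe")]
print(recall_trm((1.0, 0.05), stored, threshold=0.2))  # jump-over-pipe
print(recall_trm((0.0, 1.0), stored, threshold=0.2))   # None
```
        </preformat>
        <p>A smaller threshold makes recall more conservative (fewer false activations), while a larger one lets a TRM fire in more loosely related situations.</p>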
      </sec>
      <sec id="sec-2-2">
        <title>Methodology</title>
        <p>
          The effectiveness of a limited implementation of Guided Learning<sup>1</sup> is
measured using the first level of the game Super Mario Bros<sup>2</sup>. The underlying
Reinforcement Learning algorithm used was NeuroEvolution of Augmenting
Topologies (NEAT) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. NEAT was chosen firstly due to its applicability as a
Reinforcement Learning algorithm and secondly due to NEAT’s nature as an
Evolutionary Algorithm. The original intent was to reuse TRMs across multiple
genomes. While this worked to an extent (see the Avg Fitness metric in Fig. 3 in
Appendix B.1), it was not as successful as originally hoped. This is because
different genomes tend to progress in distinct ways; future work remains
with regard to TRM reuse. Stagnation was defined as evaluating 4 generations
without the champion genome making progress.
        </p>
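        <p>The stagnation criterion above can be sketched as a small counter. The class name <monospace>StagnationDetector</monospace> and the interpretation of "progress" as a strict increase in champion fitness are assumptions for illustration; the fitness values are made up.</p>
        <preformat>
```python
class StagnationDetector:
    """Flags stagnation after `patience` generations without improvement."""

    def __init__(self, patience=4):
        self.patience = patience
        self.best_fitness = float("-inf")
        self.generations_without_progress = 0

    def update(self, champion_fitness):
        """Record one generation; return True if the agent is stagnating."""
        if champion_fitness > self.best_fitness:
            self.best_fitness = champion_fitness
            self.generations_without_progress = 0
        else:
            self.generations_without_progress += 1
        return self.generations_without_progress >= self.patience


detector = StagnationDetector(patience=4)
history = [10.0, 12.0, 12.0, 12.0, 11.5, 12.0, 15.0]  # made-up fitness values
flags = [detector.update(fitness) for fitness in history]
print(flags)  # [False, False, False, False, False, True, False]
```
        </preformat>
        <p>In this example the flag is raised at the sixth generation, after four consecutive generations in which the champion failed to exceed 12.0, and it clears again as soon as progress resumes.</p>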
        <p>To evaluate Guided Learning, a baseline was created that consisted only of the
NEAT algorithm. The stimulus was represented as raw pixel data with some
dimensionality reduction (see Fig. 2 in Appendix A). The Guided Learning
implementation takes the baseline and makes the following changes: 1) it
allows the agent to ‘ask for help’ from a human supervisor when stagnation is
encountered; 2) it encodes received guidance as SATRMs; 3) it activates SATRMs
as ‘similar’ situations are encountered.</p>
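        <p>Since guidance is optional, the request in step 1 has to tolerate a supervisor who does not answer. A minimal sketch of such a request is given below, assuming that guidance is delivered through a thread-safe queue; the function <monospace>request_guidance</monospace> and the queue-based transport are hypothetical and only illustrate the time-out behaviour.</p>
        <preformat>
```python
import queue


def request_guidance(inbox, timeout_seconds):
    """Wait up to timeout_seconds for guidance; return None if none arrives.

    inbox: a queue.Queue that the supervisor's interface writes to
    (hypothetical transport). Returning None lets the caller carry on
    with the unmodified learning algorithm.
    """
    try:
        return inbox.get(timeout=timeout_seconds)
    except queue.Empty:
        return None


inbox = queue.Queue()
print(request_guidance(inbox, timeout_seconds=0.01))  # None (nobody answered)

inbox.put(["right", "jump"])  # made-up guidance from the supervisor
print(request_guidance(inbox, timeout_seconds=0.01))  # ['right', 'jump']
```
        </preformat>
        <p>A supervisor who ignores a request therefore costs the agent at most the time-out, after which training continues as in the baseline.</p>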
        <p>Both the baseline and Guided Learning algorithms were evaluated 50 times,
each to the 150th generation. ‘Best Fitness’ and ‘Average Fitness’ results refer
to the fitness of the champion genome and the average fitness of the population at
each generation respectively, where ‘fitness’ is defined as the distance the agent
moves across the level.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Results &amp; Discussion</title>
        <p>For Guided Learning, an average of 10 interventions were given over an average
period of about 8 hours. Interventions were not given at every opportunity
presented and were instead applied lazily, averaging 1 intervention for every 3
requests. The run-time of Guided Learning was mostly hindered by the overhead
of checking for stimulus similarity; this resulted in an extra run-time of about
2x the baseline. This run-time can be substantially improved with some future
work.</p>
        <p>Guided Learning achieved improvements of 136% and 112% in the regression slopes
of the Mean Best Fitness and Mean Average Fitness respectively (see Fig.
3 in Appendix B.1). We also looked at the best and worst performing cases. These
results can be seen in Fig. 4 and Table 2 in Appendix B.2.</p>
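        <p>A slope comparison of this kind can be sketched as follows. The sketch assumes that the slope is an ordinary least-squares fit of mean fitness against generation index and that the improvement is the relative increase of the guided slope over the baseline slope; both the function names and the numbers are illustrative and are not the data reported here.</p>
        <preformat>
```python
def regression_slope(values):
    """Least-squares slope of values against generation index 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def percent_improvement(baseline_slope, guided_slope):
    """Relative increase of the guided slope over the baseline, in percent."""
    return 100.0 * (guided_slope - baseline_slope) / baseline_slope


baseline = [0.0, 1.0, 2.0, 3.0]  # made-up mean fitness per generation
guided = [0.0, 2.0, 4.0, 6.0]    # made-up values, not the paper's data
improvement = percent_improvement(regression_slope(baseline),
                                  regression_slope(guided))
print(improvement)  # 100.0
```
        </preformat>
        <p>Under this definition, an improvement of 100% means that fitness grows twice as fast per generation as in the baseline.</p>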
        <sec id="sec-2-3-1">
          <title>Footnote 1</title>
          <p>https://github.com/BeelGroup/Guided-Learning</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Footnote 2</title>
          <p>Disclaimer: The ROM used during the creation of this work was created as an archival
backup from a genuine NES cartridge and was NOT downloaded/distributed over
the internet.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and Future Work</title>
      <p>The results obtained show good promise for Guided Learning’s potential, as
they were obtained with only a partial implementation and much future work
still remains.</p>
      <p>Some of the limitations of Guided Learning include the need for some kind
of supervisor, its current run-time and its domain dependence, i.e. a TRM for
‘jump and run’ games would not work in other games with different mechanics
or in other reinforcement scenarios.</p>
      <p>
        Future work will include: 1) Building Guided Learning using more state-of-the-art
Reinforcement Learning algorithms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. 2) Using a more generalized encoding of
the stimulus to allow TRMs to be re-used more readily while still balancing the
false-negative and false-positive activation trade-off (i.e. feeding raw pixel data
into a trained classifier). 3) Implementing TRM adaptation. 4) Taking advantage
of poorly performing TRMs as a method of showing the agent what not to do
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. 5) Run-time optimization by offloading the similarity check and guidance
request to separate threads. This would mean that the agent no longer has to
wait for input, and that TRM selection predictions can be made as the current
stimulus converges towards a valid TRM stimulus.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Appendix B.2: Best &amp; Worst Case Results</title>
      <p>[Figure content not recovered; surviving panel titles: Best Fitness (Highest Slope), Best Fitness (Lowest Slope), Avg Fitness (Highest Slope), Avg Fitness (Lowest Slope).]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hussein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elyan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaber</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jayne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Deep reward shaping from demonstrations</article-title>
          .
          <source>In: 2017 International Joint Conference on Neural Networks (IJCNN)</source>
          . pp.
          <fpage>510</fpage>
          -
          <lpage>517</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kottke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Interactive adaptive learning</article-title>
          .
          <uri>http://www.daniel.kottke.eu/2018/tutorial-interactive-adaptive-learning</uri>
          (
          <year>2018</year>
          ), [Online; accessed June 18, 2019]
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          :
          <article-title>Self-improving reactive agents based on reinforcement learning, planning and teaching</article-title>
          .
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <issue>3-4</issue>
          )
          ,
          <fpage>293</fpage>
          -
          <lpage>321</lpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidjeland</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ),
          <fpage>529</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stanley</surname>
            ,
            <given-names>K.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miikkulainen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Evolving neural networks through augmenting topologies</article-title>
          .
          <source>Evolutionary Computation</source>
          <volume>10</volume>
          (
          <issue>2</issue>
          )
          ,
          <fpage>99</fpage>
          -
          <lpage>127</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carboni</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fachantidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torrey</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning agents providing advice in complex video games</article-title>
          .
          <source>Connection Science</source>
          <volume>26</volume>
          (
          <issue>1</issue>
          ),
          <fpage>45</fpage>
          -
          <lpage>63</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tunstead</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Combating stagnation in reinforcement learning through 'guided learning' with 'taught-response memory' [extended version]</article-title>
          .
          <source>arXiv</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fachantidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          :
          <article-title>Agents teaching humans in reinforcement learning tasks</article-title>
          .
          <source>In: Proceedings of the Adaptive and Learning Agents Workshop (AAMAS)</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>