<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Similarity-based Approaches to Learning from Demonstration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brandon Packard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science, Drexel University</institution>
          ,
          <addr-line>Philadelphia, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>282</fpage>
      <lpage>286</lpage>
      <abstract>
        <p>Learning from Demonstration (LfD) is the process of learning to accomplish a task by observing a demonstrator performing that task. My research so far has focused on three problems related to LfD: similarity measures for structured data, feature selection/construction for structured data, and active learning. A description of each area, what I have explored so far, and a brief statement of results are provided. My plans for future research are also stated and briefly discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2016 Workshops, Atlanta, Georgia, United States of America.</p>
      <p>
        My research so far has focused on two main problems. The first is the use of
case-based approaches to LfD; more specifically, examining similarity measures
for structured, symbolic data and feature selection methods for these case-based
approaches. The second problem, and the one that I am currently focusing on, is
that of active-learning strategies for LfD. Since LfD violates the i.i.d. assumption
of standard supervised learning, standard supervised approaches are not suitable
(they lead to quadratic error growth over time [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Therefore, I am investigating
dataset-collection algorithms that address this issue.
      </p>
      <p>The rest of this paper is organized as follows: Section 2 discusses the use
of structured similarity measures in LfD, and Section 3 discusses feature
construction and selection in LfD. Next, Section 4 discusses my current work in
active learning and some basic preliminary results. Finally, Section 5 wraps up
the paper with conclusions and then a discussion of my intended future work.
</p>
    </sec>
    <sec id="sec-2">
      <title>Structured Similarity Measures for LfD</title>
      <p>
        Structured representations of data work well for LfD problems, since in many
domains of interest it is more natural to represent the world state using these
representations rather than a more traditional propositional feature-vector
approach. For example, in Super Mario, one of our evaluation domains, there can
be an arbitrary number of enemies surrounding the player character. With a
feature vector approach, it is tough to represent the information pertaining to this
changing number of enemies. By using a representation such as Horn clauses [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
however, the world state is easy to encode regardless of the number of enemies.
Cases are built from information about the current world state and past actions,
and we assume that the cases provided by the demonstrator are of good quality.
      </p>
      <p>In nearest neighbor (and many other CBR approaches), one or more
similarity measures are used to determine which world state from the case base most
closely matches the currently encountered case. For propositional data, this often
means directly calculating the distance between numbers using some
mathematical formula, but for symbolic data different methods need to be employed.
The similarity measures that I have experimented with are:
- Jaccard: Given two clauses (represented as sets of terms), we define the
Jaccard similarity as the size of their intersection over the size of their union,
which can be characterized by the following equation: J(X, Y) = |X ∩ Y| / |X ∪ Y|,
where X ∩ Y is a clause that contains only those predicates present in both
X and Y, and X ∪ Y is a clause that contains the union of the predicates in
X and Y. |·| denotes the number of predicates in a clause. Notice that
for a predicate to be in the intersection, it must be a perfect match.
- Levenshtein: Given two clauses (which are unordered sets of terms), the
Levenshtein distance counts the number of edit operations that we need to
perform on one of the clauses in order to convert it into the other. The edit
operations that we consider are adding a value, removing a value, and
substituting one value for another. However, true edit distance has an exponential
time complexity, so a matrix approximation with a polynomial cost
was employed instead.
- Weighted Levenshtein: This measure is simply the Levenshtein distance with
weights on the values. For example, given three items in Minecraft (a wooden
sword, an iron sword, and a block of dirt), Levenshtein would consider a
wooden sword no closer to an iron sword than it is to a dirt block, while
Weighted Levenshtein would recognize that it is certainly more similar to
the former than the latter.</p>
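      <p>As a concrete illustration, the Jaccard measure above can be sketched in a few lines of Python. The predicate strings are hypothetical examples, not our actual world-state encoding:</p>

```python
# Jaccard similarity over clauses represented as sets of ground
# predicates. A predicate must match exactly to count toward the
# intersection, as described above.

def jaccard_similarity(x: set, y: set) -> float:
    """J(X, Y) = |X intersect Y| / |X union Y|."""
    if not x and not y:
        return 1.0  # two empty clauses are considered identical
    return len(x & y) / len(x | y)

# Example clauses from a hypothetical Mario-like state encoding:
a = {"enemy(goomba)", "near(pit)", "has(mushroom)"}
b = {"enemy(goomba)", "near(pit)", "has(fireflower)"}

print(jaccard_similarity(a, b))  # → 0.5 (2 shared predicates, 4 total)
```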
      <p>
        These methods, along with two baselines (a random agent and an agent that
just picks the overall most likely action from the case base regardless of the
current world state) and a Euclidean-distance-based method that ran on a
propositional version of the data, were empirically investigated in the domain of
Minecraft [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It was found that Weighted Levenshtein yielded the overall best
results, followed by Levenshtein and the propositional method (which performed
about the same), and then Jaccard. It is also interesting to note that although all
four methods performed significantly better than both baselines, no method
performed statistically significantly better than any of the other three.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Feature Selection</title>
      <p>Often, when a domain is being encoded into a structured representation,
irrelevant data tends to get encoded together with the important data, primarily
because it is not always clear what the machine learning algorithm needs in order
to perform well. This confuses machine learning algorithms and can
result in less effective learning. Feature selection and construction can therefore
help extract the data that is most important, and thereby bolster learning.</p>
      <p>
        Features can be roughly defined as the aspects of the data that are used to
determine which case from the case base is closest to the currently encountered
case. In my research, I have examined various strategies for feature selection,
most of which are based on the well-known "wrapper" methods:
- Additive Wrappers: Wrappers which start with an empty subset of features
and greedily select the best feature at each iteration.
- Subtractive Wrappers: Wrappers which start with the full set of features and
greedily dispose of the features whose removal provides the best results.
- Filter Methods: Methods which test each feature individually and discard
all but the top X of them (in my experiments, values of 10%, 20%, and
30% were used for X).
- Filtered Wrappers: Methods which first apply a filter, and then run an
additive wrapper on the remaining set of features.
- Naive Sampling Feature Selection (NSFS): An alternative, wrapper-inspired
method which searches for the best feature subset via Monte
Carlo sampling, using Naive Sampling [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
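      <p>To make the wrapper idea concrete, a minimal additive wrapper can be sketched as below. The black-box <code>evaluate</code> function (e.g. cross-validated accuracy of the nearest-neighbor learner) and the toy scores are illustrative assumptions, not our actual evaluation procedure:</p>

```python
# Greedy additive wrapper: start from the empty subset and repeatedly
# add the single feature that improves the evaluation score the most.

def additive_wrapper(features, evaluate):
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:
            break  # no remaining feature improves the subset; stop
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected

# Toy evaluation: "x" and "y" help, "noise" hurts.
weights = {"x": 2.0, "y": 1.0, "noise": -0.5}
evaluate = lambda subset: sum(weights[f] for f in subset)
print(additive_wrapper(["noise", "x", "y"], evaluate))  # → ['x', 'y']
```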
      <p>All of these methods were empirically tested in the domains of Super Mario
and Minecraft. In both domains, all of these methods performed better than
using all the features. The best (though not statistically
significantly so) methods were subtractive wrappers, filtered wrappers, and NSFS.
The methods were also compared in terms of the number of calls to the
similarity measure they required: the filtered wrapper approach made
by far the fewest calls, and the subtractive wrappers by far the
most, leading to the reasonable conclusion that filtered wrappers were overall
the "best" method when accounting for both how well the feature selection
performed and how long it took to execute. In terms of feature construction, tests
in these domains with either type of feature construction improved
results, and tests that included both types of feature construction performed
significantly better than those which did not include any feature construction.</p>
    </sec>
    <sec id="sec-4">
      <title>Active Learning for LfD</title>
      <p>One of the major open problems in LfD is that when a learning agent moves out
of the space for which it has demonstration data, the error starts to compound
quadratically. One way to account for this is to attempt to give the learning
agent demonstration data for situations in which it has none. Active learning is
one technique for attempting to do so, by having the agent determine when it is
stuck and then requesting more data from the demonstrator for that situation.</p>
      <p>
        One of the cutting edge algorithms for this is DAgger [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. DAgger is an
iterative algorithm in which the demonstrator controls the agent 100% of the
time in the first iteration, then exponentially less over subsequent iterations,
and the demonstrator's action is stored for each world state encountered.
      </p>
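      <p>Schematically, one DAgger iteration can be sketched as follows. The environment and agent interfaces, the exponential decay parameter, and the step count are illustrative assumptions rather than the original formulation:</p>

```python
import random

# One DAgger iteration: at iteration i the demonstrator acts with
# probability beta = decay**i, so iteration 0 is 100% demonstrator
# control and later iterations are exponentially less. The expert's
# action is recorded for every visited state regardless of who acted.

def dagger_iteration(i, env, learner, demonstrator, dataset,
                     decay=0.5, steps=100):
    beta = decay ** i  # fraction of demonstrator control this iteration
    state = env.reset()
    for _ in range(steps):
        expert_action = demonstrator.act(state)
        dataset.append((state, expert_action))  # always store expert label
        # Execute a mixture of the expert's and the learner's policies.
        action = expert_action if random.random() < beta else learner.act(state)
        state = env.step(action)
    return dataset
```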
      <p>However, DAgger requires a lot of demonstrator interaction. In order to lessen
how often the demonstrator needs to be consulted, I have developed an
alternative active LfD algorithm denoted SALT. SALT is an iterative algorithm,
like DAgger, but uses three different kinds of policies to control when and for how long
the demonstrator is queried. Basically, SALT runs the demonstrator for one
iteration to get a starting set of training data, and then lets the learner take over.
The learner uses one of a set of policies to determine whether it has moved out
of the state space for which it has demonstrations (that is to say, whether its
current reward differs significantly from what the demonstrator's would have
been), and calls the demonstrator for data when it has. The reward the
demonstrator would have obtained after a certain amount of time
is estimated by analyzing past demonstrations. There is also a set of policies
which determine how long the demonstrator should have control (for example,
for a certain number of time steps, or until the end of the trajectory), and another
which determines how far the world state should be backed up (either a certain
number of time steps, or to the start of the trajectory) before the demonstrator
takes control. In addition, there are two flavors of SALT:
- SALTr (SALT with relabeling): This version of SALT has the demonstrator
relabel the training data afterwards, so that at every time step we know
what the demonstrator would have done (much like in DAgger).
- SALTn (SALT without relabeling): This version of SALT does not have the
demonstrator relabel the training data afterwards, so only data from when
the demonstrator is in control is added to the case base.</p>
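      <p>A highly schematic sketch of one SALT iteration (in the SALTn flavor) is given below. The reward-estimation, takeover, and backup policies shown are simplified placeholders for the policy sets described above, and every interface name is an assumption:</p>

```python
# Schematic SALT iteration: the learner acts until its reward falls
# significantly below the estimated demonstrator reward, at which point
# the world state is backed up and the demonstrator takes control for a
# fixed window, with its state-action pairs added to the case base.

def salt_iteration(env, learner, demonstrator, case_base,
                   expected_reward, threshold, takeover_steps, backup_steps):
    state = env.reset()
    history = [state]
    while not env.done():
        # Has the learner left the demonstrated state space?
        if env.reward() < expected_reward(env.time()) - threshold:
            # Back the world state up a fixed number of steps...
            state = history[max(0, len(history) - backup_steps)]
            env.restore(state)
            # ...then hand control to the demonstrator for a window,
            # recording its decisions (SALTn: only this data is kept).
            for _ in range(takeover_steps):
                action = demonstrator.act(state)
                case_base.append((state, action))
                state = env.step(action)
        else:
            state = env.step(learner.act(state))
        history.append(state)
    return case_base
```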
      <p>So far, SALT has been compared to DAgger in the domain of Super Mario.
Preliminary results seem to indicate that it trains agents which are not
significantly worse than those DAgger trains, but requires much less demonstrator
interaction than DAgger, making it more feasible for human demonstrators.
</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In conclusion, my research work so far has focused on three key ideas. The
first of these is analyzing various structural similarity measures and how well
they perform on LfD tasks. The second key idea is that of feature selection and
feature construction, or removing/adding features to the data in order to let
the learner more effectively learn the demonstrator's behavior. Finally, active
learning, or letting the learner decide when to ask for more training data from
the demonstrator, has been my most recent area of focus.</p>
      <p>In the future, I would like to keep exploring active learning and improving
upon SALT to more effectively tackle the open LfD problem of leaving the state
space for which the learner has demonstration data. In addition, I would
like to extend my work to a second evaluation domain (a grid-based puzzle game
known as "Thermometers") as well as to a human demonstrator. Finally, I would
like to explore the idea of weighting the data in the case base based either on
who was performing the task at the time or on which iteration of the algorithm
added it to the learner.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Heyes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Motor learning by observation: Evidence from a serial reaction time task</article-title>
          .
          <source>The Quarterly Journal of Experimental Psychology: Section A</source>
          <volume>55</volume>
          (
          <issue>2</issue>
          ),
          <fpage>593</fpage>
          -
          <lpage>607</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Nienhuys-Cheng</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Wolf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Foundations of Inductive Logic Programming</source>
          . Springer-Verlag New York, Inc., Secaucus, NJ, USA (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The combinatorial multi-armed bandit problem and its application to real-time strategy games</article-title>
          .
          <source>In: Ninth Artificial Intelligence and Interactive Digital Entertainment Conference</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Packard</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Ontañón, S.:
          <article-title>Learning behavior from demonstration in Minecraft via symbolic similarity measures</article-title>
          .
          <source>In: Proceedings of the Foundations of Digital Games conference (FDG)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagnell</surname>
          </string-name>
          , D.:
          <article-title>Efficient reductions for imitation learning</article-title>
          .
          <source>In: International Conference on Artificial Intelligence and Statistics</source>
          . pp.
          <fpage>661</fpage>
          -
          <lpage>668</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagnell</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>A reduction of imitation learning and structured prediction to no-regret online learning</article-title>
          .
          <source>arXiv preprint arXiv:1011.0686</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schaal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>Learning from demonstration</article-title>
          .
          <source>Advances in neural information processing systems</source>
          pp.
          <fpage>1040</fpage>
          -
          <lpage>1046</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>