<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Embodiment Adaptation from Interactive Trajectory Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Walton</string-name>
          <email>michael.walton@navy.mil</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben Migliori</string-name>
          <email>benjamin.migliori@navy.mil</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Reeder</string-name>
          <email>john.d.reeder@navy.mil</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Space and Naval Warfare Systems Center Pacific</institution>
        </aff>
      </contrib-group>
      <fpage>95</fpage>
      <lpage>97</lpage>
      <abstract>
        <p>Imitation learning provides an attractive approach to communicate complex goals to autonomous systems in domains where explicit reward functions are unavailable, tedious to specify, or reliant on substantial or high-cost expert knowledge. Standard imitation learning implicitly assumes that the embodiments of the learning agent and the teacher are either the same or intuitively compatible from the perspective of the demonstrator. In this work, we consider control tasks which violate these assumptions and propose a framework for estimating embodiment adaptors using human feedback expressed through pairwise preferences over control trajectories.</p>
      </abstract>
      <kwd-group>
        <kwd>Imitation Learning</kwd>
        <kwd>Preference Learning</kwd>
        <kwd>Reinforcement Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent advances in reinforcement learning (RL) have largely been driven by scaling algorithms well understood in simple task domains to complex, high-dimensional problems using deep neural networks for value function approximation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and policy learning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In the standard formulation of a reinforcement learning problem, often posed as a Markov Decision Process (MDP), one assumes access to a reward function R : S × A → ℝ which associates a scalar reward with agent actions a ∈ A taken in states s ∈ S. The agent's objective, therefore, is to maximize its cumulative reward. In many well-posed control tasks, this objective may be straightforward to specify: the score of a game, the goal configuration in a robotic manipulation task, or forward velocity for walking or crawling.
      </p>
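      <p>To make the objective concrete, the following minimal Python sketch computes the cumulative discounted return of a single trajectory; the reward sequence and discount factor are illustrative values, not taken from any experiment in this paper:</p>

```python
# Cumulative discounted return G = sum_t gamma^t * r_t for one trajectory,
# computed by a backward pass over the reward sequence. The rewards and
# discount factor below are illustrative only.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]          # sparse reward at the final step
print(discounted_return(rewards))  # 0.99**2 * 1.0 = 0.9801
```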
      <p>
        Complementary to RL, Imitation Learning provides an approach for learning a control policy without an explicit reward function. This approach is desirable in problem domains where a concise goal statement may be challenging to express [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Prior work has also explored imitation learning to improve the sample efficiency of reinforcement learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Conventional approaches to imitation learning, however, fundamentally rely on the availability of demonstrations of expert control in the form of (observation, action) tuples. Demonstration data may be acquired through teleoperation, in which the demonstrator directly controls the agent while it records the selected actions for imitation, or through kinesthetic teaching, in which the demonstrator physically manipulates an embodied agent by applying force to its effectors (the demonstration then consisting, for instance, of the resultant torques on the joints of a robotic arm). In the former case, the imitator and the demonstrator are assumed to have the same embodiment, i.e., their state and action spaces are assumed to be consistent. In the latter, the demonstrator must inhabit the same physical space as the embodied agent and must be able to efficiently pose and manipulate its effectors.
      </p>
      <p>Many complex control tasks may exhibit incompatibilities between the embodiments of the demonstrator and the imitating agent. Consider, for instance, a robotic arm we may wish to train to perform household tasks such as preparing food; pose estimates of a human demonstrator's arm will yield sequences of actions with different degrees of freedom and dynamics than the imitating arm.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        Our proposed approach takes two stages. In the first stage, the human demonstrator provides undirected feedback to the agent to optimize a policy πα : A_H → A_ℓ which translates between the demonstrator's action space A_H and the agent's action space A_ℓ. This is achieved through trajectory preference learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; however, in our formulation preferences are assigned to the trajectory that best matched the demonstrator's desired action. Formally, we state that a trajectory τ¹ is preferred to τ², denoted τ¹ ≻ τ², under a reward function r known only to the demonstrator if
      </p>
      <p>τ¹ ≻ τ² ≡ Σ_t r(a_t¹, πα(a_t¹)) &gt; Σ_t r(a_t², πα(a_t²))   (1)</p>
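      <p>In code, this preference test reduces to comparing the summed rewards of the translated actions along each trajectory. The sketch below uses toy stand-ins for the demonstrator's hidden reward r and the adaptor πα, purely for illustration:</p>

```python
# Pairwise preference between two trajectories: tau1 is preferred to tau2
# when the demonstrator's (hidden) reward r, evaluated on each action and
# its translation pi_alpha(a), sums higher over tau1. Both r and pi_alpha
# are toy stand-ins for illustration.

def prefers(tau1, tau2, r, pi_alpha):
    score1 = sum(r(a, pi_alpha(a)) for a in tau1)
    score2 = sum(r(a, pi_alpha(a)) for a in tau2)
    return score1 > score2

# Toy example: the hidden reward penalizes translation error, and the
# adaptor simply halves each demonstrator action.
pi_alpha = lambda a: 0.5 * a
r = lambda a, a_translated: -abs(a - a_translated)

tau1 = [1.0, 1.0]   # small actions: small translation error
tau2 = [4.0, 4.0]   # large actions: large translation error
print(prefers(tau1, tau2, r, pi_alpha))  # True
```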
      <p>
        After each interaction, a pairwise preference is assigned between the two trajectories and a reward function approximation r̂ is estimated using the method specified in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The embodiment adaptation policy is then trained to maximize r̂ using standard reinforcement learning. After learning an embodiment adaptation policy, the second phase uses this mechanism to learn a behavior policy πβ from translated demonstrations using (for instance) behavioral cloning. In this simple formulation, the optimal policy given expert demonstrations D is the policy that minimizes the divergence between πβ and the expert actions translated by πα; assuming continuous actions, we may define this objective in terms of the quadratic loss
      </p>
      <p>πβ* = arg min_{πβ ∈ Πβ} E_{(s,a)∼D}[(πα(a) − πβ(s))²]   (2)</p>
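      <p>The reward estimate r̂ in [<xref ref-type="bibr" rid="ref1">1</xref>] is fit by treating each pairwise preference as a Bradley–Terry comparison of summed predicted rewards. The sketch below performs one such fit for a linear reward model in numpy; the trajectory features, model form, and step size are illustrative assumptions, not the implementation used in this work:</p>

```python
import numpy as np

# Bradley-Terry preference loss for a linear reward model r_hat(x) = w . x.
# P(tau1 preferred over tau2) = sigmoid(R1 - R2), where Ri is the summed
# predicted reward over trajectory i. Features and step size are invented
# for illustration.

def preference_grad_step(w, tau1_feats, tau2_feats, prefer_first, lr=0.1):
    r1 = tau1_feats @ w                              # per-step rewards, tau1
    r2 = tau2_feats @ w                              # per-step rewards, tau2
    p1 = 1.0 / (1.0 + np.exp(r2.sum() - r1.sum()))   # P(tau1 preferred)
    # Gradient of the negative log-likelihood of the observed preference.
    target = 1.0 if prefer_first else 0.0
    grad = (p1 - target) * (tau1_feats.sum(axis=0) - tau2_feats.sum(axis=0))
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(3)
tau1 = rng.normal(size=(5, 3))  # 5 steps, 3 features per step
tau2 = rng.normal(size=(5, 3))
for _ in range(100):
    w = preference_grad_step(w, tau1, tau2, prefer_first=True)
# After training, the model scores the preferred trajectory higher.
print((tau1 @ w).sum() > (tau2 @ w).sum())  # True
```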
      <p>
        We propose two proof-of-concept embodiment translation tasks to demonstrate the utility of our method: a classic gridworld with discrete state and action spaces, and the continuous control problem Lunar Lander. In the Lunar Lander task, for instance, the human demonstrator must select thrust directions using the up, left and right keys; it is observed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] that humans tend to fail on this task. Distinct from previous work, we hypothesize that this is an unintuitive interface for a human operator to demonstrate correct behavior. A more natural interface, perhaps, is a joystick. We apply our method to learn an embodiment adaptor policy πα which translates continuous forces applied to a joystick into sequences of discrete thruster pulses which are compatible with the imitator's embodiment.
      </p>
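      <p>As a point of reference for what πα must learn, a fixed hand-coded discretization can map joystick deflections to Lunar Lander's discrete actions. This is an illustrative baseline only, not the learned adaptor; the dead-zone threshold and dominant-axis rule are assumptions, and the action numbering follows the common Gymnasium LunarLander convention (0 = no-op, 1 = left engine, 2 = main engine, 3 = right engine):</p>

```python
# Hand-coded baseline: map a continuous joystick deflection (x, y) in
# [-1, 1]^2 to one of the four discrete Lunar Lander actions. The dead
# zone and dominant-axis rule are illustrative design choices, not the
# learned adaptor pi_alpha.

def joystick_to_action(x, y, dead_zone=0.2):
    if abs(x) < dead_zone and y < dead_zone:
        return 0                      # no-op inside the dead zone
    if y >= abs(x):
        return 2                      # upward push dominates: main engine
    return 1 if x < 0 else 3          # lateral push: left or right engine

print(joystick_to_action(0.0, 0.9))   # 2 (main engine)
print(joystick_to_action(-0.8, 0.1))  # 1 (left engine)
print(joystick_to_action(0.05, 0.0))  # 0 (no-op)
```

The learned adaptor would replace this fixed rule with a policy trained against r̂, but the input/output interface is the same.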
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Christiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leike</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>T.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Deep reinforcement learning from human preferences (</article-title>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hadfield-Menell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dragan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The off-switch game (</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hester</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vecerik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietquin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanctot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horgan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sendonaris</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dulac-Arnold</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osband</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agapiou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibo</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruslys</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep q-learning from demonstrations (</article-title>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hester</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vecerik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietquin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanctot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sendonaris</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dulac-Arnold</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osband</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agapiou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibo</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruslys</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Learning from demonstrations for real world reinforcement learning (</article-title>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunt</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pritzel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heess</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tassa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Continuous control with deep reinforcement learning (</article-title>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Playing atari with deep reinforcement learning (</article-title>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dragan</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Shared autonomy via deep reinforcement learning (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>