=Paper=
{{Paper
|id=Vol-2192/ialatecml_paper10
|storemode=property
|title=Embodiment Adaptation from Interactive Trajectory Preferences
|pdfUrl=https://ceur-ws.org/Vol-2192/ialatecml_paper10.pdf
|volume=Vol-2192
|authors=Michael Walton,Ben Migliori,John Reeder
|dblpUrl=https://dblp.org/rec/conf/pkdd/WaltonMR18
}}
==Embodiment Adaptation from Interactive Trajectory Preferences==
Michael Walton, Ben Migliori, John Reeder

Space and Naval Warfare Systems Center Pacific

http://www.public.navy.mil/spawar/Pacific

{michael.walton, benjamin.migliori, john.d.reeder}@navy.mil

Keywords: Imitation Learning · Preference Learning · Reinforcement Learning

===1 Introduction===

Imitation learning provides an attractive approach for communicating complex goals to autonomous systems in domains where explicit reward functions are unavailable, tedious to specify, or reliant on substantial or high-cost expert knowledge. Standard imitation learning implicitly assumes that the embodiments of the learning agent and the teacher are either the same or intuitively compatible from the perspective of the demonstrator. In this work, we consider control tasks which violate these assumptions and propose a framework for estimating embodiment adaptors using human feedback expressed through pairwise preferences over control trajectories.

===2 Background===

Recent advances in reinforcement learning (RL) have largely been driven by scaling algorithms well understood in simple task domains to complex, high-dimensional problems using deep neural networks for value function approximation [6] and policy learning [5]. In the standard formulation of a reinforcement learning problem, often posed as a Markov Decision Process (MDP), one assumes access to a reward function R : S × A → ℝ which associates a scalar reward with agent actions a ∈ A taken in states s ∈ S. The agent's objective, therefore, is to maximize its cumulative reward. In many well-posed control tasks this objective may be straightforward to specify: the score of a game, the goal configuration in a robotic manipulation task, or forward velocity for walking or crawling. Complementary to RL, imitation learning provides an approach for learning a control policy without an explicit reward function.
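The cumulative-reward objective above can be sketched in a few lines. This is a minimal illustration, not from the paper: the per-step rewards and the discount factor gamma are hypothetical stand-ins (the text speaks only of cumulative reward; discounting is the standard convention).

```python
# Sketch of the standard RL objective: an agent acting in an MDP collects
# scalar rewards r(s, a) along a trajectory and seeks to maximize their
# (here, discounted) sum. The rewards and gamma below are illustrative.

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward of one trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A toy trajectory of per-step rewards:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.0 + 0.0 + 0.5 = 1.5
```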
This approach is desirable in problem domains where a concise goal statement may be challenging to express [1], [2]. Prior work has also explored imitation learning to improve the sample efficiency of reinforcement learning [3], [4]. Conventional approaches to imitation learning, however, fundamentally rely on the availability of demonstrations of expert control in the form of (observation, action) tuples. Demonstration data may be acquired through teleoperation¹ or kinesthetic teaching². In the former case, the imitator and the demonstrator are assumed to have the same embodiment, e.g. their state and action spaces are assumed to be consistent. In the latter, the demonstrator must inhabit the same physical space as the embodied agent and must be able to efficiently pose and manipulate its effectors.

Many complex control tasks may exhibit incompatibilities between the embodiments of the demonstrator and the imitating agent. Consider, for instance, a robotic arm we may wish to train to perform household tasks such as preparing food; pose estimates of a human demonstrator's arm will yield sequences of actions with different degrees of freedom and dynamics than the imitating arm.

===3 Methods===

Our proposed approach takes two stages. In the first stage, the human demonstrator provides undirected feedback to the agent to optimize a policy π_α : A_H → A_ℓ which translates between the demonstrator's action space A_H and the agent's action space A_ℓ. This is achieved through trajectory preference learning [1]; in our formulation, however, preferences are assigned to the trajectory that best matches the demonstrator's desired action.
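The preference assignment described above can be sketched as follows. The reward `r` (known only to the demonstrator) and the adaptor `pi_alpha` are hypothetical callables standing in for the paper's learned models; the concrete lambdas in the example are illustrative only.

```python
# Sketch of assigning a pairwise preference between two trajectories:
# the preferred trajectory is the one whose actions, translated by the
# adaptor pi_alpha, score higher under the demonstrator's reward r.

def prefers(traj1, traj2, r, pi_alpha):
    """True if traj1 is preferred to traj2 under reward r and adaptor pi_alpha."""
    score1 = sum(r(a, pi_alpha(a)) for a in traj1)
    score2 = sum(r(a, pi_alpha(a)) for a in traj2)
    return score1 > score2

# Toy example (hypothetical models):
pi_alpha = lambda a: a           # stand-in adaptor: identity map
r = lambda a, b: -abs(b - 3.0)   # stand-in reward: translated action near 3
print(prefers([3.0, 3.0], [0.0, 6.0], r, pi_alpha))  # True
```

In the paper's setting, `r` is never observed directly; only the preference labels it induces are, which is what makes the reward-approximation step below necessary.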
Formally, we state that a trajectory τ¹ is preferred to τ², denoted τ¹ ≻ τ², under a reward function r known only to the demonstrator if:

τ¹ ≻ τ² ≡ ∑_t r(a¹_t, π_α(a¹_t)) > ∑_t r(a²_t, π_α(a²_t))    (1)

After each interaction, a pairwise preference is assigned between the two trajectories and a reward function approximation r̂ is estimated using the method specified in [1]. The embodiment adaptation policy is then trained to maximize r̂ using standard reinforcement learning. After learning an embodiment adaptation policy, the second phase uses this mechanism to learn a behavior policy π_β from translated demonstrations using (for instance) behavioral cloning. In this simple formulation, the optimal policy given expert demonstrations D is the policy that minimizes the divergence between π_β and the expert actions translated by π_α; assuming continuous actions, we may define this objective in terms of the quadratic loss:

π_β* = argmin_{π_β ∈ Π_β} 𝔼_{(s,a)∼D}[(π_α(a) − π_β(s))²]    (2)

We propose two proof-of-concept embodiment translation tasks to demonstrate the utility of our method: a classic gridworld with discrete state and action spaces, and the continuous control problem lunar lander. In the lunar lander task, for instance, the human demonstrator must select thrust directions using the up, left and right keys; it is observed in [7] that humans tend to fail at this task. Distinct from previous work, we hypothesize that this is an unintuitive interface for a human operator to demonstrate correct behavior.

¹ The demonstrator directly controls the agent, which records action selections for imitation.

² The demonstrator physically manipulates an embodied agent by applying force to its effectors; demonstrations in these scenarios may be, for instance, resultant torques on the joints of a robotic arm.
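The quadratic behavioral-cloning objective of Eq. (2) can be sketched as an empirical mean squared error over demonstration pairs. The concrete policies below are hypothetical stand-ins; in practice both π_α and π_β would be learned function approximators.

```python
# Sketch of the empirical form of Eq. (2): mean squared error between
# expert actions translated by the adaptor pi_alpha and the cloned
# policy pi_beta evaluated at the demonstrated states.

def cloning_loss(demos, pi_alpha, pi_beta):
    """MSE over a dataset of (state, action) demonstration pairs."""
    errs = [(pi_alpha(a) - pi_beta(s)) ** 2 for (s, a) in demos]
    return sum(errs) / len(errs)

# Toy demonstrations: (state, demonstrator-space action) pairs.
demos = [(0.0, 2.0), (1.0, 4.0)]
pi_alpha = lambda a: a / 2.0   # stand-in adaptor: halve the action
pi_beta = lambda s: s + 1.0    # stand-in cloned policy

print(cloning_loss(demos, pi_alpha, pi_beta))  # 0.0: a perfect clone of the translated demos
```

Minimizing this loss over π_β ∈ Π_β recovers the argmin in Eq. (2); here the toy clone already achieves zero loss.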
A more natural interface, perhaps, may be a joystick. We apply our method to learn an embodiment adaptor policy π_α which translates continuous forces applied to a joystick into sequences of discrete thruster pulses compatible with the imitator's embodiment.

===References===

1. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences (2017)

2. Hadfield-Menell, D., Dragan, A., Abbeel, P., Russell, S.: The off-switch game (2016)

3. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., Leibo, J.Z., Gruslys, A.: Deep Q-learning from demonstrations (2017)

4. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., Leibo, J.Z., Gruslys, A.: Learning from demonstrations for real world reinforcement learning (2017)

5. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning (2015)

6. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning (2013)

7. Reddy, S., Dragan, A.D., Levine, S.: Shared autonomy via deep reinforcement learning (2018)