=Paper=
{{Paper
|id=Vol-2192/ialatecml_paper10
|storemode=property
|title=Embodiment Adaptation from Interactive Trajectory Preferences
|pdfUrl=https://ceur-ws.org/Vol-2192/ialatecml_paper10.pdf
|volume=Vol-2192
|authors=Michael Walton,Ben Migliori,John Reeder
|dblpUrl=https://dblp.org/rec/conf/pkdd/WaltonMR18
}}
==Embodiment Adaptation from Interactive Trajectory Preferences==
Michael Walton, Ben Migliori, John Reeder

Space and Naval Warfare Systems Center Pacific

http://www.public.navy.mil/spawar/Pacific

{michael.walton, benjamin.migliori, john.d.reeder}@navy.mil

Keywords: Imitation Learning · Preference Learning · Reinforcement Learning

===1 Introduction===

Imitation learning provides an attractive approach for communicating complex goals to autonomous systems in domains where explicit reward functions are unavailable, tedious to specify, or reliant on substantial or high-cost expert knowledge. Standard imitation learning implicitly assumes that the embodiments of the learning agent and the teacher are either the same or intuitively compatible from the perspective of the demonstrator. In this work, we consider control tasks which violate these assumptions and propose a framework for estimating embodiment adaptors using human feedback expressed through pairwise preferences over control trajectories.

===2 Background===

Recent advances in reinforcement learning (RL) have largely been driven by scaling algorithms well understood in simple task domains to complex, high-dimensional problems using deep neural networks for value function approximation [6] and policy learning [5]. In the standard formulation of a reinforcement learning problem, often posed as a Markov Decision Process (MDP), one assumes access to a reward function R : S × A → ℝ which associates a scalar reward with agent actions a ∈ A taken in states s ∈ S. The agent's objective, therefore, is to maximize its cumulative reward. In many well-posed control tasks this objective may be straightforward to specify: the score of a game, the goal configuration in a robotic manipulation task, or forward velocity for walking or crawling. Complementary to RL, imitation learning provides an approach for learning a control policy without an explicit reward function.
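The cumulative-reward objective above can be sketched in a few lines. This is a minimal illustration, not from the paper: the per-step rewards and the discount factor gamma are hypothetical stand-ins (the text speaks only of cumulative reward; discounting is the standard convention).

```python
# Sketch of the standard RL objective: an agent acting in an MDP collects
# scalar rewards r(s, a) along a trajectory and seeks to maximize their
# (here, discounted) sum. The rewards and gamma below are illustrative.

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward of one trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A toy trajectory of per-step rewards:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.0 + 0.0 + 0.5 = 1.5
```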
This approach is desirable in problem domains where a concise goal statement may be challenging to express [1], [2]. Prior work has also explored imitation learning to improve the sample efficiency of reinforcement learning [3], [4]. Conventional approaches to imitation learning, however, fundamentally rely on the availability of demonstrations of expert control in the form of (observation, action) tuples. Demonstration data may be acquired through teleoperation¹ or kinesthetic teaching². In the former case, the imitator and the demonstrator are assumed to have the same embodiment, e.g. their state and action spaces are assumed to be consistent. In the latter, the demonstrator must inhabit the same physical space as the embodied agent and must be able to efficiently pose and manipulate its effectors.

Many complex control tasks may exhibit incompatibilities between the embodiments of the demonstrator and the imitating agent. Consider, for instance, a robotic arm we may wish to train to perform household tasks such as preparing food; pose estimates of a human demonstrator's arm will yield sequences of actions with different degrees of freedom and dynamics than the imitating arm.

===3 Methods===

Our proposed approach takes two stages. In the first stage, the human demonstrator provides undirected feedback to the agent to optimize a policy π_α : A_H → A_ℓ which translates between the demonstrator's action space A_H and the agent's action space A_ℓ. This is achieved through trajectory preference learning [1]; in our formulation, however, preferences are assigned to the trajectory that best matches the demonstrator's desired action.
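The preference assignment described above can be sketched as follows. The reward `r` (known only to the demonstrator) and the adaptor `pi_alpha` are hypothetical callables standing in for the paper's learned models; the concrete lambdas in the example are illustrative only.

```python
# Sketch of assigning a pairwise preference between two trajectories:
# the preferred trajectory is the one whose actions, translated by the
# adaptor pi_alpha, score higher under the demonstrator's reward r.

def prefers(traj1, traj2, r, pi_alpha):
    """True if traj1 is preferred to traj2 under reward r and adaptor pi_alpha."""
    score1 = sum(r(a, pi_alpha(a)) for a in traj1)
    score2 = sum(r(a, pi_alpha(a)) for a in traj2)
    return score1 > score2

# Toy example (hypothetical models):
pi_alpha = lambda a: a           # stand-in adaptor: identity map
r = lambda a, b: -abs(b - 3.0)   # stand-in reward: translated action near 3
print(prefers([3.0, 3.0], [0.0, 6.0], r, pi_alpha))  # True
```

In the paper's setting, `r` is never observed directly; only the preference labels it induces are, which is what makes the reward-approximation step below necessary.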
Formally, we state that a trajectory τ¹ is preferred to τ², denoted τ¹ ≻ τ², under a reward function r known only to the demonstrator if:

τ¹ ≻ τ² ≡ ∑_t r(a¹_t, π_α(a¹_t)) > ∑_t r(a²_t, π_α(a²_t))    (1)

After each interaction, a pairwise preference is assigned between the two trajectories and a reward function approximation r̂ is estimated using the method specified in [1]. The embodiment adaptation policy is then trained to maximize r̂ using standard reinforcement learning. After learning an embodiment adaptation policy, the second phase uses this mechanism to learn a behavior policy π_β from translated demonstrations using (for instance) behavioral cloning. In this simple formulation, the optimal policy given expert demonstrations D is the policy that minimizes the divergence between π_β and the expert actions translated by π_α; assuming continuous actions, we may define this objective in terms of the quadratic loss:

π_β* = argmin_{π_β ∈ Π_β} 𝔼_{(s,a)∼D}[(π_α(a) − π_β(s))²]    (2)

We propose two proof-of-concept embodiment translation tasks to demonstrate the utility of our method: a classic gridworld with discrete state and action spaces, and the continuous control problem lunar lander. In the lunar lander task, for instance, the human demonstrator must select thrust directions using the up, left and right keys; it is observed in [7] that humans tend to fail at this task. Distinct from previous work, we hypothesize that this is an unintuitive interface for a human operator to demonstrate correct behavior.

¹ The demonstrator directly controls the agent, which records action selections for imitation.

² The demonstrator physically manipulates an embodied agent by applying force to its effectors; demonstrations in these scenarios may be, for instance, resultant torques on the joints of a robotic arm.
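The quadratic behavioral-cloning objective of Eq. (2) can be sketched as an empirical mean squared error over demonstration pairs. The concrete policies below are hypothetical stand-ins; in practice both π_α and π_β would be learned function approximators.

```python
# Sketch of the empirical form of Eq. (2): mean squared error between
# expert actions translated by the adaptor pi_alpha and the cloned
# policy pi_beta evaluated at the demonstrated states.

def cloning_loss(demos, pi_alpha, pi_beta):
    """MSE over a dataset of (state, action) demonstration pairs."""
    errs = [(pi_alpha(a) - pi_beta(s)) ** 2 for (s, a) in demos]
    return sum(errs) / len(errs)

# Toy demonstrations: (state, demonstrator-space action) pairs.
demos = [(0.0, 2.0), (1.0, 4.0)]
pi_alpha = lambda a: a / 2.0   # stand-in adaptor: halve the action
pi_beta = lambda s: s + 1.0    # stand-in cloned policy

print(cloning_loss(demos, pi_alpha, pi_beta))  # 0.0: a perfect clone of the translated demos
```

Minimizing this loss over π_β ∈ Π_β recovers the argmin in Eq. (2); here the toy clone already achieves zero loss.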
A more natural interface, perhaps, may be a joystick. We apply our method to learn an embodiment adaptor policy π_α which translates continuous forces applied to a joystick into sequences of discrete thruster pulses compatible with the imitator's embodiment.

===References===

1. Christiano, P., Leike, J., Brown, T.B., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences (2017)

2. Hadfield-Menell, D., Dragan, A., Abbeel, P., Russell, S.: The off-switch game (2016)

3. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., Leibo, J.Z., Gruslys, A.: Deep Q-learning from demonstrations (2017)

4. Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., Leibo, J.Z., Gruslys, A.: Learning from demonstrations for real world reinforcement learning (2017)

5. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning (2015)

6. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning (2013)

7. Reddy, S., Dragan, A.D., Levine, S.: Shared autonomy via deep reinforcement learning (2018)