<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Control Policies for Virtual Grasping Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sheldon Andrews</string-name>
          <email>sheldon.andrews@mail.mcgill.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Kry</string-name>
          <email>kry@cs.mcgill.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, McGill University</institution>
        </aff>
      </contrib-group>
      <fpage>12</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Figure 1: Grasps synthesized by our reinforcement learning framework, including two at right on objects not included in the training episodes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Human grasping is one of the most challenging problems in
computer animation. Physically based grasp synthesis involves
coordination and contact, and depends on many variables such as the
shape, size, texture, and physical properties of the target object.
Traditional approaches originated in the robotics
community and employ a combination of motion planning (to move within
reach of a target) and contact planning (to achieve a stable grasp
configuration). For example, the GraspIt! platform [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has been
used to develop such solutions to the grasping problem.
      </p>
      <p>
        Grasping in computer animation has typically focused on
control algorithms that involve the solution of carefully designed
optimization problems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or in other cases only addresses the reuse
of examples under specific initial conditions [5]. In contrast, we
present a novel application of reinforcement learning (RL) to grasp
synthesis in a physically based virtual environment. A set of basis
controllers is used to move the hand along coordinated joint
trajectories and a control policy is learned for choosing among them
with the goal of automatically synthesizing grasps. Our approach
is straightforward and preliminary results show success in learning
stable grasps that generalize across objects (see Figure 1).
      </p>
    </sec>
    <sec id="sec-2">
      <title>EXPERIMENTAL SETUP</title>
      <p>We use a hand model (depicted in Figure 2) that approximates the
shape and kinematics of a human hand. Each finger has 3 phalanges
and 3 joints, corresponding to finger segments and “knuckles”,
respectively. For simplicity, the visual and physical representations
use capsules to model the finger segments and a box for the palm.
The joint connecting the first segment of each finger to the palm is
modelled as a universal joint, whereas the others are rotary joints
with a single degree of freedom.</p>
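      <p>The state representation is relative to a coordinate system
centred at the palm and aligned with the hand (the coordinate system
denoted <inline-formula><tex-math>O_H</tex-math></inline-formula> in Figure 2). A palm-centric frame is used because the
objective is stable grasp synthesis and we assume that the goal is
always to grasp a target object. The state consists of 29 continuous
state variables:</p>
      <list list-type="simple">
        <list-item><p><inline-formula><tex-math>\vec{p}</tex-math></inline-formula>: 3D position of the target object in frame <inline-formula><tex-math>O_H</tex-math></inline-formula></p></list-item>
        <list-item><p><inline-formula><tex-math>\vec{v}</tex-math></inline-formula>: 3D linear velocity of the target object in frame <inline-formula><tex-math>O_H</tex-math></inline-formula></p></list-item>
        <list-item><p><inline-formula><tex-math>\vec{o}</tex-math></inline-formula>: 3D orientation of the target object in frame <inline-formula><tex-math>O_H</tex-math></inline-formula>, represented by Euler angles</p></list-item>
        <list-item><p><inline-formula><tex-math>q_{i,j}</tex-math></inline-formula>: hinge angle of the <inline-formula><tex-math>j</tex-math></inline-formula>th joint on the <inline-formula><tex-math>i</tex-math></inline-formula>th finger</p></list-item>
        <list-item><p><inline-formula><tex-math>\phi_i</tex-math></inline-formula>: abduction angle of finger <inline-formula><tex-math>i</tex-math></inline-formula></p></list-item>
      </list>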
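      <p>As an illustration only, the state variables listed above can be packed into a single vector. The sketch below assumes five fingers with three hinge angles each, which is consistent with the count of 29; the class and field names are hypothetical and are not taken from the implementation used in our experiments.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_FINGERS = 5        # assumption: five fingers, consistent with 29 variables
JOINTS_PER_FINGER = 3  # from the hand model: three joints per finger

Vec3 = Tuple[float, float, float]


@dataclass
class GraspState:
    """Hypothetical container for the palm-frame state."""
    p: Vec3               # object position in the palm frame
    v: Vec3               # object linear velocity in the palm frame
    o: Vec3               # object orientation (Euler angles) in the palm frame
    q: List[List[float]]  # q[i][j]: hinge angle of joint j on finger i
    phi: List[float]      # phi[i]: abduction angle of finger i

    def to_vector(self) -> List[float]:
        """Flatten to one list: 3 + 3 + 3 + 15 + 5 = 29 entries."""
        flat = list(self.p) + list(self.v) + list(self.o)
        for finger in self.q:
            flat.extend(finger)
        flat.extend(self.phi)
        return flat


state = GraspState(
    p=(0.0, 0.1, 0.0),
    v=(0.0, 0.0, 0.0),
    o=(0.0, 0.0, 0.0),
    q=[[0.0] * JOINTS_PER_FINGER for _ in range(NUM_FINGERS)],
    phi=[0.0] * NUM_FINGERS,
)
print(len(state.to_vector()))
```
]]></preformat>
      <p>A flat vector of this kind is convenient for the distance computations needed by a nearest neighbour value function.</p>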
      <p>The hand and graspable objects are simulated using a physics
engine (CMLabs Vortex). Hand posture is managed by a PD controller
(with stiffness resembling that of a human hand) that actuates joints
according to coordinated motions of joint angles or joint velocities,
e.g., opening, closing, or adducting the fingers, or forming a pinching pose.</p>
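      <p>For readers unfamiliar with PD control, the following minimal sketch shows the standard proportional-derivative law applied per joint. The gains and the function names are hypothetical; the stiffness values used with the physics engine in our experiments are not reproduced here.</p>
      <preformat><![CDATA[
```python
from typing import List


def pd_torque(q: float, q_dot: float, q_target: float,
              kp: float, kd: float, q_dot_target: float = 0.0) -> float:
    """Standard PD law: torque from the angle error and the velocity error."""
    return kp * (q_target - q) + kd * (q_dot_target - q_dot)


def pd_torques(angles: List[float], velocities: List[float],
               targets: List[float], kp: float, kd: float) -> List[float]:
    """Apply the same (hypothetical) gains to every joint of the hand."""
    return [pd_torque(q, qd, qt, kp, kd)
            for q, qd, qt in zip(angles, velocities, targets)]


# A basis controller such as "close" would supply the target angles.
torques = pd_torques([0.0, 0.2], [0.0, 0.0], [1.0, 1.0], kp=2.0, kd=0.1)
print(torques)
```
]]></preformat>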
    </sec>
    <sec id="sec-3">
      <title>REINFORCEMENT LEARNING</title>
      <p>We use a combination of value iteration and SARSA to explore the
state-action space and compute an approximation of the optimal control
policy (see [6] for details). The value iteration algorithm begins
by selecting a state from a user-defined initiation set
and exhaustively explores all possible
state-action pairs by running the agent until a maximum number of steps
n is reached. The initiation set allows the user
to direct the agent toward relevant regions of the state space.
The pseudo-code for the recursive function used to perform
value iteration is given in Algorithm 1. Here, <inline-formula><tex-math>\vec{s}</tex-math></inline-formula> represents the
current state of the task, as described in Section 2. An action, a, is
represented by a PD controller that moves the hand in some
coordinated motion, also described in Section 2.
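<disp-formula><tex-math>
\begin{array}{l}
\textbf{Algorithm 1}\quad \mathrm{valueIteration}(\vec{s}, a, n) \\
\textbf{if } n &gt; 0 \textbf{ then} \\
\quad \mathrm{save}(\vec{s}) \\
\quad \textbf{for all } a' \in A \textbf{ do} \\
\qquad \vec{s}\,' \leftarrow \mathrm{nextState}(\vec{s}, a) \\
\qquad R \leftarrow \mathrm{reward}(\vec{s}, a) \\
\qquad v \leftarrow \mathrm{valueIteration}(\vec{s}\,', a', n - 1) \\
\qquad Q_{\vec{s},a} \leftarrow R + \gamma v
\end{array}
</tex-math></disp-formula></p>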
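      <p>One possible reading of Algorithm 1 is sketched below on a toy problem. Several details are assumptions made for illustration and are not stated in the pseudo-code: the one-dimensional integer state, the toy dynamics and reward, the value returned by the recursion, and backing up against the best follow-up action. It should not be read as the implementation used in our experiments.</p>
      <preformat><![CDATA[
```python
ACTIONS = (-1, +1)  # hypothetical toy actions: step left or right on a line


def next_state(s, a):
    """Toy dynamics (assumption): the state is an integer position."""
    return s + a


def reward(s, a):
    """Toy reward (assumption): penalise distance from the origin after acting."""
    return -abs(next_state(s, a))


def value_iteration(s, a, n, Q, gamma=0.9):
    """Depth-limited recursive backup of Q[(s, a)], in the spirit of Algorithm 1.

    Returns the backed-up value so that the caller can use it as v.
    """
    if n <= 0:
        return 0.0  # assumption: no value is credited beyond the step budget
    s_next = next_state(s, a)
    r = reward(s, a)
    # Assumption: back up against the best follow-up action.
    v = max(value_iteration(s_next, a_next, n - 1, Q, gamma)
            for a_next in ACTIONS)
    Q[(s, a)] = r + gamma * v  # storing the entry also plays the role of save(s)
    return Q[(s, a)]


Q = {}
value_iteration(1, -1, 2, Q, gamma=0.5)
print(Q[(1, -1)])
```
]]></preformat>
      <p>The number of recursive calls grows exponentially with n, which is one reason to keep the step budget small and to restrict the starting states to a small initiation set.</p>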
      <p>This value iteration step is followed by training episodes of
SARSA with ε-greedy exploration, essentially “rounding out” the
value function. The capacity to inject strategies into the learned
policy is useful for leading the policy toward grasping behaviours
that are desired in a given scenario. We provide the user with this
ability by making direct changes to the value function, which are
further refined in subsequent training episodes leading to desirable
behaviours in nearby states.</p>
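      <p>The SARSA refinement step can be illustrated with the textbook update rule and ε-greedy action selection described in [6]. The sketch below uses a plain dictionary as the value function and string labels for states and actions; these choices, the learning rate alpha, and the function names are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """With probability epsilon explore at random, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])


actions = ["open", "close"]      # two of the basis controllers named in Section 2
Q = defaultdict(float)           # unseen state-action pairs default to 0
Q[("s1", "close")] = 2.0         # e.g., a value injected directly by the user
rng = random.Random(0)

a_next = epsilon_greedy(Q, "s1", actions, epsilon=0.0, rng=rng)
sarsa_update(Q, "s0", "open", r=1.0, s_next="s1", a_next=a_next,
             alpha=0.5, gamma=0.9)
print(a_next, Q[("s0", "open")])
```
]]></preformat>
      <p>Because the update bootstraps from neighbouring entries, a value written directly into the table propagates to the states that lead to it over subsequent episodes.</p>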
      <p>
        The value function, <inline-formula><tex-math>Q_{\vec{s},a}</tex-math></inline-formula>, is represented using a nearest
neighbour function approximator, constructed incrementally during
learning (see Algorithm 1). The current state is added to the value
function if its distance to the nearest neighbour exceeds a novel
state threshold <inline-formula><tex-math>\eta</tex-math></inline-formula> (similar to the trusted state policy used by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
Otherwise, the current state is aliased by its nearest neighbour, and
the value of its neighbour is updated. At each step, the current state
is stored and used to query the agent for an action. The new action
is given control of the hand and the simulation is advanced using
forward dynamics. Post step, a reward (or penalty) is given to the
agent based on the new state.
      </p>
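      <p>A minimal sketch of such an incrementally built nearest neighbour value function is given below. It assumes a Euclidean distance metric, a linear scan over stored states, and zero-initialized values, none of which are specified above; the class and method names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


class NearestNeighbourQ:
    """Hypothetical nearest neighbour table with a novel-state threshold eta."""

    def __init__(self, eta, num_actions):
        self.eta = eta
        self.num_actions = num_actions
        self.states = []  # stored state vectors
        self.values = []  # one list of action values per stored state

    def _nearest(self, s):
        """Linear scan for the closest stored state (assumed Euclidean metric)."""
        best_index, best_dist = None, math.inf
        for index, stored in enumerate(self.states):
            dist = math.dist(s, stored)
            if dist < best_dist:
                best_index, best_dist = index, dist
        return best_index, best_dist

    def lookup(self, s):
        """Index of the stored state that represents s; s is added if novel."""
        index, dist = self._nearest(s)
        if index is None or dist > self.eta:
            self.states.append(tuple(s))
            self.values.append([0.0] * self.num_actions)
            return len(self.states) - 1
        return index  # s is aliased by its nearest neighbour

    def get(self, s, a):
        return self.values[self.lookup(s)][a]

    def set(self, s, a, value):
        self.values[self.lookup(s)][a] = value


table = NearestNeighbourQ(eta=0.5, num_actions=2)
table.set((0.0, 0.0), 1, 3.0)
print(table.get((0.1, 0.0), 1), len(table.states))
```
]]></preformat>
      <p>A smaller threshold stores more states and gives a finer value function at the cost of slower queries; a spatial index could replace the linear scan when the table grows large.</p>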
      <p>
        As recommended by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the reward function, <inline-formula><tex-math>\mathrm{reward}(\vec{s}, a)</tex-math></inline-formula>, is a
combination of immediate and delayed rewards, chosen to
accelerate the learning process. Immediate rewards encourage the agent
to take actions that will likely lead to good grasping configurations
in the short term; delayed rewards, which are given more sparsely,
are large bonuses given to the agent when it finally achieves some
milestone.
      </p>
      <p>Immediate rewards. An immediate reward, <inline-formula><tex-math>R_p</tex-math></inline-formula>, is given when
the agent moves toward the position of the target object and is
calculated as the decrease in Euclidean distance between the target
object and grasping centre compared to the previous time step. The
reward <inline-formula><tex-math>R_c</tex-math></inline-formula> is given when the agent chooses an action that increases
the number of finger segments in contact with the target object; a
penalty is given if the number of finger segments in contact is reduced. The
value is simply equal to the number of contacts. Typically, a large
relative velocity between the hand and object will not lead to
successful grasping, so a penalty, <inline-formula><tex-math>R_v</tex-math></inline-formula>, proportional to the magnitude of
the linear velocity of the target is given. Another negative reward,
<inline-formula><tex-math>R_e</tex-math></inline-formula>, equal to the effort required to perform an action, is used to
dissuade the agent from choosing irrelevant actions. The effort is
estimated as the sum of the magnitudes of the torques at each joint
integrated over the simulation time step.</p>
      <p>Delayed rewards. A very large delayed reward, <inline-formula><tex-math>R_q</tex-math></inline-formula>, is given
when the hand finally achieves a grasp, and is based on the
quality of the grasp. The reward is calculated as the minimum
distance from the origin of the wrench space to the surface of the
convex hull of the wrench vectors (combined force
and torque) due to contact between the hand and the target object.
The grasp is stable when the convex hull of the available contact
wrenches contains the origin of the wrench space; the ability of the
grasp to resist perturbations improves as the distance from the
origin to the surface of the hull increases. We filter transient grasps
by disregarding the grasp quality metric when the velocity of the
target object projected onto the palm-aligned axis (see Figure 2) is
positive.</p>
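      <p>The geometry of this quality measure can be illustrated in two dimensions. Contact wrenches are six-dimensional, so the planar sketch below is only a simplified stand-in: it builds a 2D convex hull with Andrew's monotone chain algorithm and returns the smallest distance from the origin to a hull edge, or zero when the origin is not strictly inside the hull. The function names are hypothetical, and this is not the convex hull routine used in our experiments.</p>
      <preformat><![CDATA[
```python
import math


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def grasp_quality_2d(wrenches):
    """Distance from the origin to the hull boundary; 0.0 if the origin is outside."""
    hull = convex_hull(wrenches)
    if len(hull) < 3:
        return 0.0  # a degenerate hull cannot enclose the origin
    quality = math.inf
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        ex, ey = x2 - x1, y2 - y1
        # Signed distance of the origin from the edge line; positive means
        # the origin lies on the inner (left) side of a counter-clockwise edge.
        signed = (ex * (0.0 - y1) - ey * (0.0 - x1)) / math.hypot(ex, ey)
        quality = min(quality, signed)
    return max(quality, 0.0)


square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
print(grasp_quality_2d(square))
```
]]></preformat>
      <p>For a point inside a convex polygon, the distance to the boundary equals the smallest perpendicular distance to any edge line, which is why a single pass over the edges suffices.</p>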
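      <p>The total reward R is calculated at every simulation time step as
<disp-formula><tex-math>R = w_p R_p + w_c R_c + w_q R_q - w_v R_v - w_e R_e,</tex-math></disp-formula>
where the w factors denote the weights of the reward components.
For our experiments, we used <inline-formula><tex-math>w_p = 0.1</tex-math></inline-formula>, <inline-formula><tex-math>w_c = 0.5</tex-math></inline-formula>, <inline-formula><tex-math>w_q = 1000</tex-math></inline-formula>,
<inline-formula><tex-math>w_v = 0.2</tex-math></inline-formula>, and <inline-formula><tex-math>w_e = 0.001</tex-math></inline-formula>. The weights for the reward function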
were manually adjusted until the agent performed well during
preliminary trials. This was largely a trial-and-error process based on
observation and required some intuition.</p>
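      <p>As a worked example of the weighted sum, the sketch below combines the five reward components using the weights reported above, subtracting the velocity and effort terms because they are penalties. The function name and the sample component values are hypothetical.</p>
      <preformat><![CDATA[
```python
# Weights reported in the text for the five reward components.
WEIGHTS = {"p": 0.1, "c": 0.5, "q": 1000.0, "v": 0.2, "e": 0.001}


def total_reward(r_p, r_c, r_q, r_v, r_e, w=WEIGHTS):
    """Weighted sum of rewards minus weighted penalties for one time step."""
    return (w["p"] * r_p + w["c"] * r_c + w["q"] * r_q
            - w["v"] * r_v - w["e"] * r_e)


# Hypothetical step: the hand closes 0.02 units on the object, three finger
# segments are in contact, no grasp yet, object speed 0.5, effort 10.
step_reward = total_reward(r_p=0.02, r_c=3, r_q=0.0, r_v=0.5, r_e=10.0)
print(step_reward)
```
]]></preformat>
      <p>With these weights, a grasp quality of only 0.01 already contributes 10 to the total, which dominates the immediate terms in this example and reflects the role of the delayed reward as a large bonus.</p>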
    </sec>
    <sec id="sec-4">
      <title>RESULTS</title>
      <p>Our experiments were performed using a 2.6 GHz Intel quad-core
processor and 4 GB of memory. The time required to run a
typical episode was 2 seconds. The simulation was allowed to run
faster than real time, achieving frame rates of 120–500 frames per
second, and a mean of 250 frames per second.</p>
      <p>The average time required to query the control policy was 2.9
ms, and involved performing a nearest neighbour search of about
10,000 states. The average time for computing the grasp quality
reward <inline-formula><tex-math>R_q</tex-math></inline-formula> was 21.4 ms.</p>
      <p>Figure 1 shows grasps synthesized using our method, including
two experiments (at right) where the policy was tested with objects
not seen in the training episodes. The agent was trained with an
initiation set of 6 states using the test object (green box). After
freezing the control policy, the agent was tested using a series of
unseen test objects, in this case, a coffee mug and a banana. The
agent was able to successfully grasp each of these objects; however,
a limitation of the approach is that the agent sometimes produces
grasps that are either marginally stable or aesthetically awkward.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>The preliminary results suggest that it is possible for an RL agent
to learn a control policy enabling a virtual hand to stably
grasp target objects. The agent successfully controlled the hand
to perform grasp synthesis, including grasping moving objects and
objects of unseen shape and size. While the learning process is
automated (the user need only select an initiation set), it also allows
for user interaction at different stages through the injection of
example strategies.</p>
      <p>Generating a robust control policy involves protracted
computing time. Our training times could be improved by increasing the
efficiency of the convex hull algorithm used to calculate the grasp
quality, and the value iteration and SARSA algorithms could be
parallelized and distributed across multiple CPUs. Learning the reward
function, e.g., by inverse reinforcement learning, may lead to
better trade-offs between grasp quality and other aspects of the
performance, and allow for variations on the grasping task. Finally,
to improve the aesthetic quality of the synthesized motion, we are
developing methods to generate controllers from a motion capture
corpus.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work was supported in part by grants from NSERC and
GRAND NCE.</p>
      <p>[5] N. S. Pollard and V. B. Zordan. Physically based grasping control from
example. In Proc. 2005 ACM SIGGRAPH/Eurographics Symposium on
Computer Animation, pages 311–318, 2005.</p>
      <p>[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An
Introduction. MIT Press, 1998.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Coros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Beaudoin</surname>
          </string-name>
          , and M. van de Panne.
          <article-title>Robust task-based control policies for physics-based characters</article-title>
          .
          <source>ACM Transactions on Graphics (Proc. SIGGRAPH Asia)</source>
          ,
          <volume>28</volume>
          (
          <issue>5</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Dextrous manipulation from a grasping pose</article-title>
          .
          <source>ACM Transactions on Graphics</source>
          ,
          <volume>28</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Mataric</surname>
          </string-name>
          .
          <article-title>Reward functions for accelerated learning</article-title>
          .
          <source>In Proc. 11th International Conference on Machine Learning</source>
          , pages
          <fpage>181</fpage>
          -
          <lpage>189</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Allen</surname>
          </string-name>
          .
          <article-title>Graspit! a versatile simulator for robotic grasping</article-title>
          .
          <source>IEEE Robotics and Automation Magazine</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ):
          <fpage>110</fpage>
          -
          <lpage>122</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>