<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solving the Real Robot Challenge Using Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robert McCarthy</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Roldan Sanchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiang Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Cordova Bulens</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin McGuinness</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noel O'Connor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen Redmond</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Insight SFI Research Centre for Data Analytics</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper details our winning submission to Phase 1 of the 2021 Real Robot Challenge, a challenge in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach which requires minimal expert knowledge of the robotic system or of robotic grasping in general. A sparse, goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the desired x and y coordinates. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the desired z coordinate. The policy is trained in simulation with domain randomisation before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best policy can successfully lift the real cube along goal trajectories via an effective pinching grasp. Our approach outperforms all other submissions, including those leveraging more traditional robotic control techniques, and is the first pure learning-based method to solve this challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Robotic Manipulation</kwd>
        <kwd>Deep Reinforcement Learning</kwd>
        <kwd>Real Robot Challenge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        Dexterous robotic manipulation is applicable in various industrial and domestic
settings. However, current state-of-the-art robotic control strategies generally
struggle in unstructured tasks which require high degrees of dexterity.
Data-driven learning methods are promising for these challenging manipulation tasks,
yet related research has been limited by the costly nature of real-robot
experimentation. In light of these issues, the Real Robot Challenge (RRC) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims to
advance the state-of-the-art in robotic manipulation by providing participants
with remote access to well-maintained robotic platforms, allowing for cheap and
easy real-robot experimentation. To further support easy experimentation, users
are also provided with a simulated version of the robotic setup (see Figure 1).
      </p>
      <fig id="fig1"><caption><p>Fig. 1: The robotic setup in (a) simulation and (b) reality.</p></caption></fig>
<p>The 2021 RRC consists of an initial qualifying Pre-Phase performed purely in
simulation, followed by independent Phases 1 and 2, both performed on the real
robot. Full details can be found in the ‘Protocol’ section of the RRC website.
This paper focuses solely on our approach to Phase 1 of the competition.</p>
      <p>
        In Phase 1, participants are tasked with solving the challenging ‘Move Cube
on Trajectory’ task. In this task, a cube must be carried along a goal trajectory
(which specifies the coordinates at which the cube should be positioned at each
time-step) using the provided TriFinger robotic platform [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For final Phase 1
evaluation, participants submit their developed control policy and receive a score
based on how closely it can follow several randomly sampled goal trajectories.
      </p>
      <p>
        ‘Move Cube on Trajectory’ requires a dexterous policy that can adapt to
the various goal and cube positions encountered during an evaluation episode.
Last year (2020), the winning solutions to this task consisted of structured
policies which relied heavily on inductive biases and task-specific engineering [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ].
We take an alternative approach, formulating the task as a pure reinforcement
learning (RL) problem. We then use RL to learn our control policy entirely in
simulation before transferring it to the real robot for final evaluation. Upon this
evaluation, our learned policy outperformed all other competing submissions,
winning Phase 1 of the 2021 RRC.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Traditional Robotic Manipulation</title>
        <p>
          Traditional robotic manipulation controllers often rely on solving inverse kinematic
equations [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The goal of this approach is to find the parameters needed to
position the end-effector of a robotic system (gripper, fingertips, etc.) into the
desired position and orientation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Because the solution to this problem is
not unique, motion primitives (i.e. a set of pre-computed movements that a
robot can take in a given environment) are typically introduced [
          <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
          ]. These
primitives can each have a defined cost, allowing the robot to avoid non-smooth
or non-desired transitions. Exteroceptive feedback in the form of sensors (RGB
cameras, depth/tactile sensors, etc.) is usually employed to help the robot achieve
the expected behaviour [
          <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
          ].
        </p>
        <p>
          Most successful approaches in previous editions of the Real Robot Challenge
make use of a combination of motion planning and motion primitives. The
winning team of the 2020 edition of the challenge [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] used a set of primitives to:
(i) align the cube to the target position and orientation while keeping it on the
ground, and then (ii) perform grasp planning using a Rapidly-exploring Random
Tree (RRT) algorithm [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. During grasp planning, they use force-control
feedback to ensure the fingertips apply enough force to lift the cube. Finally, they
improve their policy via (simulated) residual policy learning [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], a technique
which uses RL to learn corrective actions added to the output of the original
control policy. Contrary to these methods, we use a pure learning-based approach
which requires minimal task-specific engineering.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Reinforcement Learning for Robotic Manipulation</title>
<p>Deep RL methods promise to allow learning of sophisticated, dexterous robotic
manipulation strategies that would otherwise be impossible, or at least very
difficult, to hand-engineer. However, the data inefficiency of RL is a major
barrier to its application in real-world robotics: real robot data collection is
time-consuming and expensive. Thus, much RL research to date has focused on
resolving or bypassing these data-efficiency issues.</p>
        <p>
          Due to their generally improved sample complexity, off-policy RL methods
[
          <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
          ] are often preferred to on-policy methods [
          <xref ref-type="bibr" rid="ref15 ref16">15,16</xref>
          ]. Model-based RL methods,
which explicitly learn a model of their environment, have been proposed to further
improve sample complexity [
          <xref ref-type="bibr" rid="ref17 ref18 ref19">17,18,19</xref>
          ], and have seen success in real robot settings,
e.g., with in-hand object manipulation [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Offline RL techniques seek to leverage
previously collected data to accelerate learning [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and have learned dexterous
real-world skills such as drawer opening [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Imitation learning methods provide
the policy with expert demonstrations to learn from [
          <xref ref-type="bibr" rid="ref23 ref24">23,24</xref>
          ], enabling success in
real robot tasks such as peg insertion [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Finally, simulation-to-real (sim-to-real)
transfer methods train a policy quickly and cheaply in simulation before deploying
it on the real robot, and have notably been used to solve a Rubik’s cube with
a robot hand [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. To account for simulator modelling errors, and to improve
the policy's ability to generalize to the real robot, sim-to-real approaches often
employ domain randomisation [
          <xref ref-type="bibr" rid="ref27 ref28">27,28</xref>
          ] or domain adaptation [
          <xref ref-type="bibr" rid="ref29 ref30">29,30</xref>
          ] techniques.
Domain randomisation, which has been particularly effective [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], randomises
the physics parameters in simulation to learn a robust policy that can adapt to
the partially unknown physics of the real system.
        </p>
        <p>Provided with a simulated replica of the real robotic setup, but without access
to prior data or expert demonstrations, we use sim-to-real transfer to bypass
real-robot RL data-efficiency issues.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <sec id="sec-3-1">
<title>Goal-based Reinforcement Learning</title>
        <p>We frame the RRC robotic environments as a Markov decision process (MDP), defined by the tuple (S, A, G, p, r, γ, ρ0).
S, A, and G are the state, action, and goal spaces, respectively. The state
transition distribution is denoted as p(s′|s, a), the initial state distribution as ρ0(s),
and the reward function as r(s, g). γ ∈ (0, 1) discounts future rewards. The goal
of the RL agent is to find the optimal policy π∗ that maximizes the expected
sum of discounted rewards in this MDP: π∗ = argmaxπ Eπ[ Σ∞t=0 γt r(st, gt) ].</p>
      </sec>
      <sec id="sec-3-2">
<title>Deep Deterministic Policy Gradients (DDPG)</title>
        <p>DDPG [<xref ref-type="bibr" rid="ref13">13</xref>] is an off-policy RL algorithm which, in the goal-based RL setting, maintains the following
neural networks: a policy (actor) π : S × G → A, and an action-value function
(critic) Q : S × G × A → R. The critic is trained to minimise the loss Lc =
E[(Q(st, gt, at) − yt)²], where yt = rt + γQ(st+1, gt+1, π(st+1, gt+1)). To stabilize
the critic's training, the targets yt are produced using slowly updated,
polyak-averaged versions of the main networks. The actor is trained to minimise the
loss La = −Es[Q(s, g, π(s, g))], where gradients are computed by backpropagating
through the combined critic and actor networks. For these updates, the transition
tuples (st, gt, at, rt, st+1, gt+1) are sampled from a replay buffer which stores
previously collected experiences (i.e., off-policy data).</p>
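<p>As a toy illustration of the target computation above, the following sketch (our own, not the authors' code) replaces the neural networks with a linear critic; the dimensions, weights, and τ value are purely illustrative:</p>
<p>
```python
GAMMA = 0.98  # discount factor gamma from the MDP definition
TAU = 0.05    # polyak coefficient; our illustrative choice

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def q_value(w, s, g, a):
    # Toy linear critic Q(s, g, a) standing in for the neural network.
    return dot(w, list(s) + list(g) + list(a))

def critic_target(w_target, r, s_next, g_next, a_next):
    # y_t = r_t + gamma * Q_target(s_{t+1}, g_{t+1}, pi(s_{t+1}, g_{t+1})),
    # where a_next = pi(s_{t+1}, g_{t+1}) is supplied by the actor.
    return r + GAMMA * q_value(w_target, s_next, g_next, a_next)

def polyak_update(w_target, w_main, tau=TAU):
    # Slowly track the main weights so the targets y_t change smoothly.
    return [(1 - tau) * wt + tau * wm for wt, wm in zip(w_target, w_main)]
```
</p>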
        <p>
          Hindsight Experience Replay (HER). HER [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] can be used with any
off-policy RL algorithm in goal-based tasks, and is most effective when the reward
function is sparse and binary (e.g. equation 1). To improve learning in the sparse
reward setting, HER employs a simple trick when sampling previously collected
transitions for policy updates: a proportion of sampled transitions have their
goal g altered to g′, where g′ is a goal achieved later in the episode. The rewards
of these altered transitions are then recalculated with respect to g′, leaving the
altered transition tuples as (st, gt′, at, rt′, st+1, gt′+1). Even if the original episode
was unsuccessful, these altered transitions will teach the agent how to achieve g′,
thus accelerating its acquisition of skills.
        </p>
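<p>The relabelling trick can be sketched in a few lines; this is a minimal generic version (the transition-dict layout and the value k = 4 are our own illustrative choices):</p>
<p>
```python
import random

def sparse_reward(achieved, goal, tol=0.02):
    # 0 if the achieved goal lies within `tol` of the goal, -1 otherwise.
    dist = sum((x - y) ** 2 for x, y in zip(achieved, goal)) ** 0.5
    if dist > tol:
        return -1.0
    return 0.0

def her_relabel(episode, k=4, rng=random):
    """episode: list of transition dicts with keys 's', 'a', 'achieved_next', 'g'.
    For each transition, sample k goals achieved later in the episode,
    substitute them for the original goal, and recompute the sparse reward."""
    relabelled = []
    for t, tr in enumerate(episode):
        for _ in range(k):
            future = rng.randrange(t, len(episode))
            g_new = episode[future]["achieved_next"]
            relabelled.append({**tr, "g": g_new,
                               "r": sparse_reward(tr["achieved_next"], g_new)})
    return relabelled
```
</p>
<p>A transition relabelled with its own achieved goal always receives the reward 0, which is what lets an unsuccessful episode still produce useful learning signal.</p>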
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
      <p>
        We train our control policy in simulation with RL before transferring it to the
real robot for evaluation. This allows for quicker and easier data collection versus
real robot training. To compensate for modelling errors in the simulator, we
randomise the simulation dynamics [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. DDPG + HER is maintained as the RL
algorithm, modified slightly to suit our two-component reward system. We now
describe in detail our simulated environment, followed by our learning algorithm.
      </p>
      <sec id="sec-4-1">
        <title>Simulated Environment</title>
        <p>Actions and Observations. Pure torque control of the robot arms is employed
with an action frequency of 20 Hz (i.e. each time-step in the environment is 0.05
seconds). The robot has three arms, with three motorised joints in each arm;
thus the action space is 9-dimensional (and continuous). Observations include:
(i) robot joint positions, velocities, and torques; (ii) the provided estimate of the
cube’s pose (i.e. its estimated position and orientation), along with the difference
between the current and previous time-step’s pose; and (iii) the goal coordinates
at which the cube should currently be placed (i.e. the active goal of the trajectory).
In total, the observation space has 44 dimensions.</p>
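<p>The dimensions above can be accounted for as follows; the exact ordering and the use of a 7-D pose (3-D position plus 4-D orientation quaternion) are our assumptions, but the totals match the stated 44:</p>
<p>
```python
def make_observation(joint_pos, joint_vel, joint_torque,
                     cube_pose, cube_pose_delta, active_goal):
    """Concatenate the observation described in the text:
    3 arms x 3 joints -> 9 each for positions/velocities/torques (27),
    cube pose = 3-D position + 4-D quaternion = 7, pose delta = 7,
    active goal xyz = 3; total = 27 + 7 + 7 + 3 = 44."""
    obs = (list(joint_pos) + list(joint_vel) + list(joint_torque)
           + list(cube_pose) + list(cube_pose_delta) + list(active_goal))
    assert len(obs) == 44, f"expected a 44-D observation, got {len(obs)}"
    return obs
```
</p>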
        <p>Episodes. In each simulated training episode, the robot begins in its default
position and the cube is placed in a uniformly random position on the arena floor.
Episodes last for 90 time-steps, with the active goal of the randomly sampled
goal trajectory changing every 30 time-steps.</p>
        <p>Domain Randomisation. To help the learned policy generalize from an
inaccurate simulation to the real environment, we used some basic domain randomisation
(i.e., physics randomisation) during training. This includes uniformly sampling,
from a specified range, parameters of the simulation physics (e.g. robot mass,
restitution, damping, friction; see our code for more details) and cube properties
(mass and width) each episode. To account for noisy real-robot actuations and
observations, uncorrelated noise is added to actions and observations within
simulated episodes.
</p>
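<p>A per-episode randomisation loop of this kind can be sketched as follows; the parameter names and ranges here are illustrative placeholders, not the values used in our code:</p>
<p>
```python
import random

# Illustrative ranges only; see the released code for the actual values.
PARAM_RANGES = {
    "robot_mass_scale":    (0.8, 1.2),
    "joint_damping_scale": (0.8, 1.2),
    "friction":            (0.5, 1.2),
    "restitution":         (0.0, 0.3),
    "cube_mass":           (0.05, 0.12),   # kg
    "cube_width":          (0.060, 0.070), # m
}

def sample_dynamics(rng=random):
    # Resample every physics parameter uniformly at the start of each episode.
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

def add_noise(vector, sigma, rng=random):
    # Uncorrelated Gaussian noise applied to actions/observations each step.
    return [v + rng.gauss(0.0, sigma) for v in vector]
```
</p>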
      </sec>
      <sec id="sec-4-2">
        <title>Learning Algorithm</title>
        <p>
          The goal-based nature of the ‘Move Cube on Trajectory’ task makes HER a
natural fit; HER has excelled in similar goal-based robotic tasks [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] and obviates
the need for complex reward engineering. As such, we use DDPG + HER as our
RL algorithm. However, in our early experiments we observed that standard
DDPG + HER was slow in learning to lift the cube. To resolve this issue, we
slightly alter the HER process and incorporate an additional dense reward which
encourages cube-lifting behaviours, as is now described.
        </p>
<p>
          Rewards and HER. In our approach, the agent receives two reward components:
(i) a sparse reward based on the cube’s x-y coordinates, rxy, and (ii) a dense
reward based on the cube’s z coordinate, rz (the coordinate frame can be seen in
Figure 1 (a)).
        </p>
        <p>
          Our domain randomisation implementation is based on the benchmark code from
the 2020 RRC [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Our DDPG + HER implementation is taken from https://github.com/TianhongDai/hindsight-experience-replay, and uses hyperparameters largely based on [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ].
        </p>
<p>The sparse x-y reward is calculated as:</p>
        <p>rxy = 0 if ∥g′xy − gxy∥ ≤ 2 cm, and rxy = −1 otherwise, (1)</p>
        <p>where g′xy are the x-y coordinates of the achieved goal (the actual x-y coordinates
of the cube), and gxy are the x-y coordinates of the desired goal.</p>
<p>The dense z reward is defined as:</p>
        <p>rz = −a |zcube − zgoal| if zcube &lt; zgoal, and rz = −(a/2) |zcube − zgoal| if zcube &gt; zgoal, (2)</p>
        <p>where zcube and zgoal are the z-coordinates of the cube and goal, respectively,
and a is a parameter which weights rz relative to rxy (we use a = 20).</p>
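<p>Equations 1 and 2 translate directly into code; this is a sketch of our two reward components (where the cube being exactly at the goal height is treated as the "above" case, a detail the equations leave open):</p>
<p>
```python
def reward_xy(cube_xy, goal_xy, tol=0.02):
    # Sparse reward (equation 1): 0 within 2 cm of the goal in the x-y plane,
    # -1 otherwise.
    dist = ((cube_xy[0] - goal_xy[0]) ** 2 + (cube_xy[1] - goal_xy[1]) ** 2) ** 0.5
    if dist > tol:
        return -1.0
    return 0.0

def reward_z(z_cube, z_goal, a=20.0):
    # Dense reward (equation 2): full penalty below the goal height, half
    # penalty above it, so overshooting upwards is cheaper than undershooting,
    # which further encourages lifting.
    if z_cube >= z_goal:
        return -(a / 2.0) * abs(z_cube - z_goal)
    return -a * abs(z_cube - z_goal)
```
</p>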
<p>We only apply HER to the x-y coordinates of the goal; i.e., the x-y coordinates
of the goal can be altered in hindsight, but the z coordinate remains unchanged.
Thus, our HER-altered goals are ĝ = (g′x, g′y, gz), meaning only rxy is recalculated
after HER is applied to a transition sampled during policy updates. This reward
system is motivated by the following:</p>
        <list list-type="order">
          <list-item><p>Using rxy with HER allows the agent to learn to push the cube around in
the early stages of training, even if it cannot yet lift the cube to reach the
z-coordinate of the goal. As the agent learns to push the cube around in the
x-y plane of the arena floor, it can then more easily stumble upon actions
which lift it. Importantly, the rxy + HER approach requires no complicated
reward engineering.</p></list-item>
          <list-item><p>rz aims to explicitly teach the agent to lift the cube by encouraging
minimisation of the vertical distance between the cube and the goal. It is less
punishing when the cube is above the goal, serving to further encourage
lifting behaviours.</p></list-item>
          <list-item><p>In the early stages of training, the cube mostly remains on the floor. During
these stages, most g′ sampled by HER will be on the floor. Thus, applying
HER to rz could often lead to the agent being punished for briefly lifting the
cube. Since we only apply HER to the x-y coordinates of the goal, our HER-altered
goals, ĝ, maintain their original z height. This leaves more room for
the agent to be rewarded by rz for any cube lifting it performs.</p></list-item>
        </list>
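<p>The x-y-only relabelling is a one-line modification of standard HER; a sketch:</p>
<p>
```python
def relabel_goal_xy(goal, achieved_goal):
    # Hindsight goal keeps the original z-coordinate: g_hat = (g'_x, g'_y, g_z),
    # so only r_xy needs to be recomputed for the altered transition.
    gx, gy, _ = achieved_goal
    return (gx, gy, goal[2])
```
</p>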
        <p>Goal Trajectories. In each episode, the agent is faced with multiple goals;
it must move the cube from one goal to the next along a given trajectory. To
ensure the HER process remains meaningful in these multi-goal episodes, we only
sample future achieved goals, g′, (to replace g) from the period of time in which
g was active.</p>
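<p>This windowed sampling can be sketched as follows, assuming goals change on a fixed schedule (every 30 steps in our 90-step episodes); the function signature is our own:</p>
<p>
```python
import random

def sample_future_achieved_goal(achieved, t, goal_period=30, rng=random):
    """Sample an achieved goal from the future of step t, restricted to the
    window in which the goal active at step t applies (goals change every
    `goal_period` steps)."""
    window_end = min(((t // goal_period) + 1) * goal_period, len(achieved))
    return achieved[rng.randrange(t, window_end)]
```
</p>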
<p>In our implementation, the agent is unaware that it is dealing with trajectories:
when updating the policy with transitions (st, gt, at, rt, st+1, gt+1) we always set
gt+1 = gt, even if in reality gt+1 was different. Thus, the policy focuses solely
on achieving the current active goal and is unconcerned by any future changes in
the active goal.</p>
        <fig id="fig2"><caption><p>Fig. 2: Learned manipulation strategies: (a) pushing, (b) cradling, (c) pinching.</p></caption></fig>
        <p>
          Exploration vs Exploitation. We derive our DDPG + HER hyperparameters
from Plappert et al. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], who use a highly ‘exploratory’ policy when collecting
data in the environment: with probability 30% a random action is sampled
(uniformly) from the action-space, and when policy actions are chosen, Gaussian
noise is applied. This is beneficial for exploration in the early stages of training,
however, it can be limiting in the later stages when the policy must be
finetuned; we found that the exploratory policy repeatedly drops the cube due to
the randomly sampled actions and the injected action noise. To resolve this issue,
rather than slowly reducing the level of exploration each epoch - which would
require a degree of hyperparameter tuning, we make eficient use of evaluation
episodes (which are performed by the standard ‘exploiting’ policy) by adding
them to the replay bufer. Thus, 90% of rollouts added to the bufer are collected
with the exploratory policy, and the remaining 10% with the exploiting policy.
This addition was suficient to boost final success rates in simulation from 70-80%
to &gt;90% (where "success rate" is equivalent to that seen in Figure 3).
5
5.1
        </p>
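<p>The exploratory behaviour policy and the 90/10 rollout mix can be sketched as follows (the noise scale and schedule are our illustrative choices, not the exact hyperparameters of [32]):</p>
<p>
```python
import random

def behaviour_action(policy_action, low, high,
                     p_random=0.3, noise_sigma=0.2, rng=random):
    # Exploratory behaviour policy: with probability 30% take a uniformly
    # random action; otherwise perturb the policy's action with Gaussian
    # noise and clip it to the action bounds.
    if rng.random() >= p_random:  # 70%: noisy policy action
        return [min(max(a + rng.gauss(0.0, noise_sigma), lo), hi)
                for a, lo, hi in zip(policy_action, low, high)]
    return [rng.uniform(lo, hi) for lo, hi in zip(low, high)]  # 30%: random

def rollout_is_exploratory(episode_index, exploit_every=10):
    # 9 of every 10 buffered rollouts come from the exploratory policy; the
    # tenth is an evaluation episode run by the noise-free exploiting policy,
    # and both kinds are added to the replay buffer.
    return episode_index % exploit_every != 0
```
</p>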
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Simulation</title>
<p>Our method is highly effective in simulation. The algorithm can learn from scratch
to proficiently grasp the cube and lift it along goal trajectories. Figure 3 compares
the training performance of our final algorithm to that of standard HER (these
runs did not use domain randomisation; generally, we trained from scratch in the
standard simulation before fine-tuning in a domain-randomised simulation). Our
algorithm converges in roughly two-thirds of the time of standard HER, and is markedly
improved in the early stages of training; this allowed us to iteratively develop
our approach more quickly. Throughout different training runs, our policies
learned several different manipulation strategies, the most distinct of which
included: (i) ‘pinching’ the cube with two arm tips and supporting it with the
third, and (ii) ‘cradling’ the cube with all three of its forearms (see Figure 2).</p>
        <p>Interestingly, we found that exposing the agent (during updates) to transitions in
which gt+1 ≠ gt hurt performance significantly, perhaps due to the extra uncertainty
this introduces to the DDPG action-value estimates.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Self-reported scores of our policies under RRC Phase 1 evaluation conditions (higher is better).</p></caption>
          <table>
            <thead>
              <tr><th/><th>Pushing</th><th>Cradling</th><th>Pinching</th></tr>
            </thead>
            <tbody>
              <tr><td>Simulation</td><td>-20,399 ± 3,799</td><td>-6,349 ± 1,039</td><td>-6,198 ± 1,840</td></tr>
              <tr><td>Real robot</td><td>-22,137 ± 3,671</td><td>-14,207 ± 2,160</td><td>-11,489 ± 3,790</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-2">
        <title>Real Robot</title>
        <p>Our final policies transferred to the real robot with reasonable success. Table
1 displays the self-reported scores of our best pinching and cradling policies
under RRC Phase 1 evaluation conditions. As a baseline comparison, we trained
a simple ‘pushing’ policy which ignores the height component of the goal and
simply learns to push the cube along the floor to the goal’s x-y coordinates. The
pinching policy performed best on the real robot, and is capable of carrying
the cube along goal trajectories for extended periods of time, and of recovering
the cube when it is dropped. This policy was submitted for the official RRC
Phase 1 final evaluation round and obtained the winning score (see https://
real-robot-challenge.com/leaderboard, username ‘thriftysnipe’).</p>
        <p>The domain gap between simulation and reality was significant, and generally
led to inferior scores on the real robot. Policies often struggled to gain control of
the real cube, which appeared to slide more freely than in simulation. Additionally,
on the real robot, policies could become stuck with an arm-tip pressing the cube
into the wall. As a makeshift solution to this issue, we assumed the policy
was stuck whenever the cube had not reached the goal’s x-y coordinates for
50 consecutive steps, then uniformly sampled random actions for 7 steps in an
attempt to ‘free’ the policy from its stuck state.
</p>
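<p>The stuck-detection heuristic amounts to a small amount of bookkeeping per control step; a sketch (the class structure is ours, the 50-step and 7-step constants are from the text):</p>
<p>
```python
class StuckRecovery:
    """If the cube has not been at the goal's x-y coordinates for `patience`
    consecutive steps, emit `n_random` uniformly random actions to try to
    free the policy from its stuck state."""

    def __init__(self, patience=50, n_random=7):
        self.patience = patience
        self.n_random = n_random
        self.off_goal_steps = 0
        self.random_steps_left = 0

    def act_randomly(self, cube_on_goal_xy):
        """Call once per control step; returns True when a uniformly random
        action should replace the policy's action."""
        if self.random_steps_left > 0:
            self.random_steps_left -= 1
            return True
        if cube_on_goal_xy:
            self.off_goal_steps = 0
            return False
        self.off_goal_steps += 1
        if self.off_goal_steps >= self.patience:
            self.off_goal_steps = 0
            self.random_steps_left = self.n_random - 1
            return True
        return False
```
</p>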
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>
        Our relatively simple reinforcement learning approach fully solves the ‘Move Cube
on Trajectory’ task in simulation. Moreover, our learned policies can successfully
implement their sophisticated manipulation strategies on the real robot. Unlike
last year's benchmark solutions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], this was achieved with the use of minimal
domain-specific knowledge. We outperformed all competing submissions, including
those employing more classical robotic control techniques.
      </p>
      <p>
        Due to the large domain gap, our excellent performances in simulation were
not fully matched upon transfer to the real robot. Indeed, the main limitation
of our approach was the absence of any training on real-robot data. It is likely
that some fine-tuning of the policy on real data would greatly increase its
robustness in the real environment, and developing a technique which could do so
efficiently is one direction for future work. Similarly, the use of domain adaptation
techniques [
        <xref ref-type="bibr" rid="ref29 ref30">29,30</xref>
        ] could produce a policy more capable of adapting to the real
environment. However, ideally the policy could be learned from scratch on the
real system; a suitable simulator may not always be available. Although our
results in simulation were positive, the algorithm is somewhat sample inefficient,
taking roughly 10 million environment steps to converge (equivalent to 6 days of
simulated experience). Thus, another important direction for future work would
be to reduce sample complexity to increase the feasibility of real robot training;
perhaps achievable via a model-based reinforcement learning approach [
        <xref ref-type="bibr" rid="ref18 ref33">18,33</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This publication has emanated from research supported by Science Foundation
Ireland (SFI) under Grant Number SFI/12/RC/2289_P2, co-funded by the
European Regional Development Fund, by Science Foundation Ireland Future Research
Leaders Award (17/FRL/4832), and by China Scholarship Council (CSC). We
thank the Max Planck Institute for Intelligent Systems (Stuttgart, Germany) for
organizing the challenge and providing the necessary software and hardware to
run our experiments remotely on a real robot. We acknowledge the Research IT
HPC Service at University College Dublin for providing computational facilities
and support that contributed to the research results reported in this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <surname>Stefan</surname>
          </string-name>
          , et al.
          <article-title>"A Robot Cluster for Reproducible Research in Dexterous Manipulation."</article-title>
          <source>arXiv preprint arXiv:2109.10957</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Wüthrich</surname>
          </string-name>
          ,
          <string-name>
            <surname>Manuel</surname>
          </string-name>
          , et al.
          <article-title>"Trifinger: An open-source robot for learning dexterity." arXiv preprint arXiv:</article-title>
          <year>2008</year>
          .
          <volume>03596</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Funk</surname>
          </string-name>
          ,
          <string-name>
            <surname>Niklas</surname>
          </string-name>
          , et al.
          <article-title>"Benchmarking Structured Policies and Policy Optimization for Real-</article-title>
          <source>World Dexterous Object Manipulation." arXiv preprint arXiv:2105</source>
          .
          <year>02087</year>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Yoneda</surname>
          </string-name>
          ,
          <string-name>
            <surname>Takuma</surname>
          </string-name>
          , et al.
          <article-title>"Grasp and motion planning for dexterous manipulation for the real robot challenge</article-title>
          .
          <source>" arXiv preprint arXiv:2101.02842</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Liu,
          <string-name>
            <surname>Rongrong</surname>
          </string-name>
          , et al.
          <article-title>"Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review</article-title>
          .
          <source>" Robotics 10.1</source>
          (
          <year>2021</year>
          ):
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Wei</surname>
            , Hui,
            <given-names>Yijie</given-names>
          </string-name>
          <string-name>
            <surname>Bu</surname>
            , and
            <given-names>Ziyao</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>"Robotic arm controlling based on a spiking neural circuit and synaptic plasticity."</article-title>
          <source>Biomedical Signal Processing and Control</source>
          <volume>55</volume>
          (
          <year>2020</year>
          ):
          <fpage>101640</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Cohen,
          <string-name>
            <given-names>Benjamin J.</given-names>
            ,
            <surname>Sachin</surname>
          </string-name>
          <string-name>
            <surname>Chitta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maxim</given-names>
            <surname>Likhachev</surname>
          </string-name>
          .
          <article-title>"Search-based planning for manipulation with motion primitives</article-title>
          .
          <source>" 2010 IEEE International Conference on Robotics and Automation. IEEE</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><given-names>Freek</given-names> <surname>Stulp</surname></string-name>, et al.
          <article-title>"Learning motion primitive goals for robust manipulation."</article-title>
          <source>2011 IEEE/RSJ International Conference on Intelligent Robots and Systems</source>. IEEE, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><given-names>Andrés</given-names> <surname>Montaño</surname></string-name> and
          <string-name><given-names>Raúl</given-names> <surname>Suárez</surname></string-name>.
          <article-title>"Manipulation of unknown objects to improve the grasp quality using tactile information."</article-title>
          <source>Sensors</source>
          <volume>18</volume>.<issue>5</issue> (<year>2018</year>): <fpage>1412</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><given-names>Paolo</given-names> <surname>Franceschi</surname></string-name> and
          <string-name><given-names>Nicola</given-names> <surname>Castaman</surname></string-name>.
          <article-title>"Combining visual and force feedback for the precise robotic manipulation of bulky components."</article-title>
          <source>Proc. SPIE 11785, Multimodal Sensing and Artificial Intelligence: Technologies and Applications II</source>,
          <volume>1178510</volume> (20 June <year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><given-names>Steven M.</given-names> <surname>LaValle</surname></string-name>.
          <article-title>"Rapidly-exploring random trees: A new tool for path planning."</article-title>
          Technical Report TR <fpage>98</fpage>-<lpage>11</lpage> (<year>1998</year>).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><given-names>Tom</given-names> <surname>Silver</surname></string-name>, et al.
          <article-title>"Residual policy learning."</article-title>
          <source>arXiv preprint arXiv:1812.06298</source> (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name><given-names>Timothy P.</given-names> <surname>Lillicrap</surname></string-name>, et al.
          <article-title>"Continuous control with deep reinforcement learning."</article-title>
          <source>arXiv preprint arXiv:1509.02971</source> (<year>2015</year>).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name><given-names>Tuomas</given-names> <surname>Haarnoja</surname></string-name>, et al.
          <article-title>"Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor."</article-title>
          <source>International Conference on Machine Learning</source>. PMLR, <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name><given-names>John</given-names> <surname>Schulman</surname></string-name>, et al.
          <article-title>"Proximal policy optimization algorithms."</article-title>
          <source>arXiv preprint arXiv:1707.06347</source> (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name><given-names>John</given-names> <surname>Schulman</surname></string-name>, et al.
          <article-title>"Trust region policy optimization."</article-title>
          <source>International Conference on Machine Learning</source>. PMLR, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name><given-names>Marc</given-names> <surname>Deisenroth</surname></string-name> and
          <string-name><given-names>Carl E.</given-names> <surname>Rasmussen</surname></string-name>.
          <article-title>"PILCO: A model-based and data-efficient approach to policy search."</article-title>
          <source>Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source>. <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name><given-names>Michael</given-names> <surname>Janner</surname></string-name>, et al.
          <article-title>"When to trust your model: Model-based policy optimization."</article-title>
          <source>arXiv preprint arXiv:1906.08253</source> (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name><given-names>Danijar</given-names> <surname>Hafner</surname></string-name>, et al.
          <article-title>"Dream to control: Learning behaviors by latent imagination."</article-title>
          <source>arXiv preprint arXiv:1912.01603</source> (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name><given-names>Anusha</given-names> <surname>Nagabandi</surname></string-name>, et al.
          <article-title>"Deep dynamics models for learning dexterous manipulation."</article-title>
          <source>Conference on Robot Learning</source>. PMLR, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name><given-names>Sergey</given-names> <surname>Levine</surname></string-name>, et al.
          <article-title>"Offline reinforcement learning: Tutorial, review, and perspectives on open problems."</article-title>
          <source>arXiv preprint arXiv:2005.01643</source> (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name><given-names>Ashvin</given-names> <surname>Nair</surname></string-name>, et al.
          <article-title>"Accelerating online reinforcement learning with offline datasets."</article-title>
          <source>arXiv preprint arXiv:2006.09359</source> (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name><given-names>Peter</given-names> <surname>Pastor</surname></string-name>, et al.
          <article-title>"Learning and generalization of motor skills by learning from demonstration."</article-title>
          <source>2009 IEEE International Conference on Robotics and Automation</source>. IEEE, <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name><given-names>Edward</given-names> <surname>Johns</surname></string-name>.
          <article-title>"Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration."</article-title>
          <source>arXiv preprint arXiv:2105.06411</source> (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name><given-names>Mel</given-names> <surname>Vecerik</surname></string-name>, et al.
          <article-title>"Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards."</article-title>
          <source>arXiv preprint arXiv:1707.08817</source> (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name><given-names>Ilge</given-names> <surname>Akkaya</surname></string-name>, et al.
          <article-title>"Solving Rubik's Cube with a robot hand."</article-title>
          <source>arXiv preprint arXiv:1910.07113</source> (<year>2019</year>).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name><given-names>Xue Bin</given-names> <surname>Peng</surname></string-name>, et al.
          <article-title>"Sim-to-real transfer of robotic control with dynamics randomization."</article-title>
          <source>2018 IEEE International Conference on Robotics and Automation (ICRA)</source>. IEEE, <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name><given-names>Josh</given-names> <surname>Tobin</surname></string-name>, et al.
          <article-title>"Domain randomization for transferring deep neural networks from simulation to the real world."</article-title>
          <source>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>. IEEE, <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name><given-names>Karol</given-names> <surname>Arndt</surname></string-name>, et al.
          <article-title>"Meta reinforcement learning for sim-to-real domain adaptation."</article-title>
          <source>2020 IEEE International Conference on Robotics and Automation (ICRA)</source>. IEEE, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name><given-names>Benjamin</given-names> <surname>Eysenbach</surname></string-name>, et al.
          <article-title>"Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers."</article-title>
          <source>arXiv preprint arXiv:2006.13916</source> (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name><given-names>Marcin</given-names> <surname>Andrychowicz</surname></string-name>, et al.
          <article-title>"Hindsight experience replay."</article-title>
          <source>arXiv preprint arXiv:1707.01495</source> (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name><given-names>Matthias</given-names> <surname>Plappert</surname></string-name>, et al.
          <article-title>"Multi-goal reinforcement learning: Challenging robotics environments and request for research."</article-title>
          <source>arXiv preprint arXiv:1802.09464</source> (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name><given-names>Robert</given-names> <surname>McCarthy</surname></string-name> and
          <string-name><given-names>Stephen J.</given-names> <surname>Redmond</surname></string-name>.
          <article-title>"Imaginary Hindsight Experience Replay: Curious Model-based Learning for Sparse Reward Tasks."</article-title>
          <source>arXiv preprint arXiv:2110.02414</source> (<year>2021</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>