Solving the Real Robot Challenge Using Deep Reinforcement Learning

Robert McCarthy1*, Francisco Roldan Sanchez2,3, Qiang Wang1, David Cordova Bulens1, Kevin McGuinness2,3, Noel O'Connor2,3, and Stephen Redmond1,3

1 University College Dublin, Ireland
2 Dublin City University, Ireland
3 Insight SFI Research Centre for Data Analytics, Ireland
* robert.mccarthy@ucdconnect.ie

Abstract. This paper details our winning submission to Phase 1 of the 2021 Real Robot Challenge4, a challenge in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach which requires minimal expert knowledge of the robotic system or of robotic grasping in general. A sparse, goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the desired x and y coordinates. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the desired z coordinate. The policy is trained in simulation with domain randomisation before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best policy can successfully lift the real cube along goal trajectories via an effective pinching grasp. Our approach5 outperforms all other submissions, including those leveraging more traditional robotic control techniques, and is the first pure learning-based method to solve this challenge.

Keywords: Robotic Manipulation · Deep Reinforcement Learning · Real Robot Challenge.

4 https://real-robot-challenge.com
5 Code: https://github.com/RobertMcCarthy97/rrc_phase1. Videos: https://www.youtube.com/playlist?list=PLLJoWXUn8XplFszi16-VZMTDBhMQFuc5o

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Real Robot Challenge

Dexterous robotic manipulation is applicable in various industrial and domestic settings. However, current state-of-the-art robotic control strategies generally struggle in unstructured tasks which require high degrees of dexterity. Data-driven learning methods are promising for these challenging manipulation tasks, yet related research has been limited by the costly nature of real-robot experimentation. In light of these issues, the Real Robot Challenge (RRC) [1] aims to advance the state-of-the-art in robotic manipulation by providing participants with remote access to well-maintained robotic platforms, allowing for cheap and easy real-robot experimentation. To further support easy experimentation, users are also provided with a simulated version of the robotic setup (see Figure 1).

Fig. 1: (a) Simulation, (b) Reality. The simulated and real 'Move Cube on Trajectory' TriFinger robotic environments. The task is to bring the cube to specified 3-D goal coordinates, along a goal trajectory.

The 2021 RRC consists of an initial qualifying Pre-Phase performed purely in simulation, followed by independent Phases 1 and 2, both performed on the real robot. Full details can be found in the 'Protocol' section of the RRC website4. This paper focuses solely on our approach to Phase 1 of the competition. In Phase 1, participants are tasked with solving the challenging 'Move Cube on Trajectory' task.
In this task, a cube must be carried along a goal trajectory (which specifies the coordinates at which the cube should be positioned at each time-step) using the provided TriFinger robotic platform [2]. For the final Phase 1 evaluation, participants submit their developed control policy and receive a score based on how closely it can follow several randomly sampled goal trajectories.

'Move Cube on Trajectory' requires a dexterous policy that can adapt to the various goal and cube positions encountered during an evaluation episode. Last year (2020), the winning solutions to this task consisted of structured policies which relied heavily on inductive biases and task-specific engineering [3,4]. We take an alternative approach, formulating the task as a pure reinforcement learning (RL) problem. We then use RL to learn our control policy entirely in simulation before transferring it to the real robot for final evaluation. Upon this evaluation, our learned policy outperformed all other competing submissions, winning Phase 1 of the 2021 RRC.

2 Related Work

2.1 Traditional Robotic Manipulation

Traditional robotic manipulation controllers often rely on solving inverse kinematic equations [5]. The goal of this approach is to find the parameters needed to position the end-effector of a robotic system (gripper, fingertips, etc.) at the desired position and orientation [6]. Because the solution to this problem is not unique, motion primitives, i.e. a set of pre-computed movements that a robot can take in a given environment, are typically introduced [7,8]. These primitives can each have a defined cost, allowing the robot to avoid non-smooth or undesired transitions. Exteroceptive feedback in the form of sensors (RGB cameras, depth/tactile sensors, etc.) is usually employed to help the robot achieve the expected behaviour [9,10].

Most successful approaches in previous editions of the Real Robot Challenge make use of a combination of motion planning and motion primitives. The winning team of the 2020 edition of the challenge [4] used a set of primitives to: (i) align the cube to the target position and orientation while keeping it on the ground, and then (ii) perform grasp planning using a Rapidly-exploring Random Tree (RRT) algorithm [11]. During grasp planning, they used force-control feedback to ensure the fingertips apply enough force to lift the cube. Finally, they improved their policy via (simulated) residual policy learning [12], a technique which uses RL to learn corrective actions added to the output of the original control policy. In contrast to these methods, we use a pure learning-based approach which requires minimal task-specific engineering.

2.2 Reinforcement Learning for Robotic Manipulation

Deep RL methods promise to allow learning of sophisticated, dexterous robotic manipulation strategies that would otherwise be impossible, or at least very difficult, to hand-engineer. However, the data inefficiency of RL is a major barrier to its application in real-world robotics: real-robot data collection is time-consuming and expensive. Thus, much RL research to date has focused on resolving or bypassing these data-efficiency issues. Due to their generally improved sample complexity, off-policy RL methods [13,14] are often preferred to on-policy methods [15,16].
Model-based RL methods, which explicitly learn a model of their environment, have been proposed to further improve sample complexity [17,18,19], and have seen success in real-robot settings, e.g., with in-hand object manipulation [20]. Offline RL techniques seek to leverage previously collected data to accelerate learning [21], and have learned dexterous real-world skills such as drawer opening [22]. Imitation learning methods provide the policy with expert demonstrations to learn from [23,24], enabling success in real-robot tasks such as peg insertion [25]. Finally, simulation-to-real (sim-to-real) transfer methods train a policy quickly and cheaply in simulation before deploying it on the real robot, and have notably been used to solve a Rubik's cube with a robot hand [26]. To account for simulator modelling errors, and to improve the policy's ability to generalize to the real robot, sim-to-real approaches often employ domain randomisation [27,28] or domain adaptation [29,30] techniques. Domain randomisation, which has been particularly effective [26], randomises the physics parameters in simulation to learn a robust policy that can adapt to the partially unknown physics of the real system.

Provided with a simulated replica of the real robotic setup, but without access to prior data or expert demonstrations, we use sim-to-real transfer to bypass real-robot RL data-efficiency issues.

3 Background

Goal-based Reinforcement Learning. We frame the RRC robotic environments as a Markov decision process (MDP), defined by the tuple (S, A, G, p, r, γ, ρ_0). S, A, and G are the state, action, and goal spaces, respectively. The state transition distribution is denoted as p(s'|s, a), the initial state distribution as ρ_0(s), and the reward function as r(s, g). γ ∈ (0, 1) discounts future rewards. The goal of the RL agent is to find the optimal policy π* that maximizes the expected sum of discounted rewards in this MDP: π* = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(s_t, g_t) ].

Deep Deterministic Policy Gradients (DDPG). DDPG [13] is an off-policy RL algorithm which, in the goal-based RL setting, maintains the following neural networks: a policy (actor) π : S × G → A, and an action-value function (critic) Q : S × G × A → R. The critic is trained to minimise the loss L_c = E[(Q(s_t, g_t, a_t) − y_t)^2], where y_t = r_t + γ Q(s_{t+1}, g_{t+1}, π(s_{t+1}, g_{t+1})). To stabilize the critic's training, the targets y_t are produced using slowly updated, polyak-averaged versions of the main networks. The actor is trained to minimise the loss L_a = −E_s[Q(s, g, π(s, g))], where gradients are computed by backpropagating through the combined critic and actor networks. For these updates, the transition tuples (s_t, g_t, a_t, r_t, s_{t+1}, g_{t+1}) are sampled from a replay buffer which stores previously collected experiences (i.e., off-policy data).

Hindsight Experience Replay (HER). HER [31] can be used with any off-policy RL algorithm in goal-based tasks, and is most effective when the reward function is sparse and binary (e.g. Equation 1). To improve learning in the sparse-reward setting, HER employs a simple trick when sampling previously collected transitions for policy updates: a proportion of sampled transitions have their goal g altered to g', where g' is a goal achieved later in the episode. The rewards of these altered transitions are then recalculated with respect to g', leaving the altered transition tuples as (s_t, g'_t, a_t, r'_t, s_{t+1}, g'_{t+1}). Even if the original episode was unsuccessful, these altered transitions will teach the agent how to achieve g', thus accelerating the agent's acquisition of skills.
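To make the relabelling step concrete, the listing below is a minimal Python sketch of HER's 'future' goal-replacement strategy. The episode layout, the helper names (her_relabel, compute_reward), and the sparse 2 cm reward are illustrative assumptions rather than the exact implementation used in this work.

    import numpy as np

    def compute_reward(achieved_goal, goal):
        # Sparse, binary reward: 0 within 2 cm of the goal, -1 otherwise
        # (an illustrative stand-in for the task's goal-based reward).
        return 0.0 if np.linalg.norm(achieved_goal - goal) <= 0.02 else -1.0

    def her_relabel(episode, replay_k=4, rng=None):
        # episode: list of (s, g, a, r, s_next, achieved_next) tuples.
        # Returns extra transitions whose goals are replaced by goals that
        # were actually achieved later in the same episode ('future' strategy).
        rng = np.random.default_rng() if rng is None else rng
        relabelled = []
        T = len(episode)
        for t, (s, g, a, r, s_next, achieved_next) in enumerate(episode):
            for _ in range(replay_k):
                future_t = int(rng.integers(t, T))   # a later time-step
                g_new = episode[future_t][5]         # goal achieved there
                r_new = compute_reward(achieved_next, g_new)
                relabelled.append((s, g_new, a, r_new, s_next))
        return relabelled

In common implementations the relabelling is performed when sampling minibatches from the replay buffer rather than when storing episodes; the sketch above relabels eagerly only for brevity.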
4 Methods

We train our control policy in simulation with RL before transferring it to the real robot for evaluation. This allows for quicker and easier data collection than real-robot training. To compensate for modelling errors in the simulator, we randomise the simulation dynamics [28]. DDPG + HER is retained as the RL algorithm, modified slightly to suit our two-component reward system. We now describe in detail our simulated environment, followed by our learning algorithm.

4.1 Simulated Environment

Actions and Observations. Pure torque control of the robot arms is employed with an action frequency of 20 Hz (i.e. each time-step in the environment is 0.05 seconds). The robot has three arms, with three motorised joints in each arm; thus the action space is 9-dimensional (and continuous). Observations include: (i) robot joint positions, velocities, and torques; (ii) the provided estimate of the cube's pose (i.e. its estimated position and orientation), along with the difference between the current and previous time-step's pose; and (iii) the goal coordinates at which the cube should currently be placed (i.e. the active goal of the trajectory). In total, the observation space has 44 dimensions.

Episodes. In each simulated training episode, the robot begins in its default position and the cube is placed in a uniformly random position on the arena floor. Episodes last for 90 time-steps, with the active goal of the randomly sampled goal trajectory changing every 30 time-steps.

Domain Randomisation. To help the learned policy generalize from an inaccurate simulation to the real environment, we use some basic domain randomisation (i.e., physics randomisation) during training6. This includes uniformly sampling, from a specified range, parameters of the simulation physics (e.g. robot mass, restitution, damping, friction; see our code for more details) and cube properties (mass and width) at the start of each episode. To account for noisy real-robot actuations and observations, uncorrelated noise is added to actions and observations within simulated episodes.

6 Our domain randomization implementation is based on the benchmark code from the 2020 RRC [3].
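As a rough illustration of this per-episode randomisation, the sketch below samples a fresh set of physics parameters at the start of each episode and perturbs actions and observations with uncorrelated Gaussian noise. The parameter names, ranges, and noise levels are placeholders rather than the values used in our experiments (see the released code for those), and env.set_physics() stands in for whatever hook the simulator exposes.

    import numpy as np

    # Placeholder ranges; the actual ranges are defined in the released code.
    RANDOMISATION_RANGES = {
        "robot_mass_scale":    (0.8, 1.2),
        "joint_damping_scale": (0.8, 1.2),
        "friction":            (0.5, 1.2),
        "restitution":         (0.0, 0.4),
        "cube_mass_scale":     (0.8, 1.2),
        "cube_width_scale":    (0.95, 1.05),
    }

    def sample_physics(rng):
        # Uniformly sample one value per parameter for the upcoming episode.
        return {name: float(rng.uniform(lo, hi))
                for name, (lo, hi) in RANDOMISATION_RANGES.items()}

    def add_noise(x, std, rng):
        # Uncorrelated Gaussian noise applied to an action or observation.
        return x + rng.normal(0.0, std, size=np.shape(x))

    # Sketch of use inside the training loop (env.set_physics is hypothetical):
    #   params = sample_physics(rng)
    #   env.set_physics(params)
    #   obs = add_noise(env.reset(), std=0.01, rng=rng)
    #   action = add_noise(policy(obs, goal), std=0.05, rng=rng)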
4.2 Learning Algorithm

The goal-based nature of the 'Move Cube on Trajectory' task makes HER a natural fit; HER has excelled in similar goal-based robotic tasks [31] and obviates the need for complex reward engineering. As such, we use DDPG + HER as our RL algorithm7. However, in our early experiments we observed that standard DDPG + HER was slow to learn to lift the cube. To resolve this issue, we alter the HER process slightly and incorporate an additional dense reward which encourages cube-lifting behaviours, as is now described.

7 Our DDPG + HER implementation is taken from https://github.com/TianhongDai/hindsight-experience-replay, and uses hyperparameters largely based on [32].

Rewards and HER. In our approach, the agent receives two reward components: (i) a sparse reward based on the cube's x-y coordinates, r_xy, and (ii) a dense reward based on the cube's z coordinate, r_z (the coordinate frame can be seen in Figure 1 (a)). The sparse x-y reward is calculated as:

    r_xy = 0,    if ||g'_xy − g_xy|| ≤ 2 cm
    r_xy = −1,   otherwise                                                   (1)

where g'_xy are the x-y coordinates of the achieved goal (the actual x-y coordinates of the cube), and g_xy are the x-y coordinates of the desired goal. The dense z reward is defined as:

    r_z = −a |z_cube − z_goal|,       if z_cube < z_goal
    r_z = −(a/2) |z_cube − z_goal|,   if z_cube > z_goal                     (2)

where z_cube and z_goal are the z-coordinates of the cube and goal, respectively, and a is a parameter which weights r_z relative to r_xy (we use a = 20).

We only apply HER to the x-y coordinates of the goal; i.e., the x-y coordinates of the goal can be altered in hindsight, but the z coordinate remains unchanged. Thus, our HER-altered goals are ĝ = (g'_x, g'_y, g_z), meaning only r_xy is recalculated after HER is applied to a transition sampled during policy updates. This reward system is motivated by the following:

1. Using r_xy with HER allows the agent to learn to push the cube around in the early stages of training, even if it cannot yet lift the cube to reach the z-coordinate of the goal. As the agent learns to push the cube around in the x-y plane of the arena floor, it can then more easily stumble upon actions which lift it. Importantly, the r_xy + HER approach requires no complicated reward engineering.

2. r_z aims to explicitly teach the agent to lift the cube by encouraging minimisation of the vertical distance between the cube and the goal. It is less punishing when the cube is above the goal, serving to further encourage lifting behaviours.

3. In the early stages of training, the cube mostly remains on the floor. During these stages, most g' sampled by HER will be on the floor. Thus, applying HER to r_z could often lead to the agent being punished for briefly lifting the cube. Since we only apply HER to the x-y coordinates of the goal, our HER-altered goals, ĝ, maintain their original z height. This leaves more room for the agent to be rewarded by r_z for any cube lifting it performs.
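The two reward components and the restricted relabelling described above can be summarised in a short sketch. The helper names and array conventions (goals and cube positions as (x, y, z) arrays) are ours; the thresholds follow Equations 1 and 2 with a = 20.

    import numpy as np

    A = 20.0  # weighting of the dense z reward relative to the sparse x-y reward

    def reward_xy(cube_pos, goal, threshold=0.02):
        # Sparse x-y reward (Eq. 1): 0 within 2 cm of the goal in the
        # x-y plane, -1 otherwise.
        return 0.0 if np.linalg.norm(cube_pos[:2] - goal[:2]) <= threshold else -1.0

    def reward_z(cube_pos, goal, a=A):
        # Dense z reward (Eq. 2): distance penalty, halved when the cube is
        # above the goal so that lifting is punished less.
        dist = abs(cube_pos[2] - goal[2])
        return -a * dist if cube_pos[2] < goal[2] else -(a / 2.0) * dist

    def her_altered_goal(goal, future_cube_pos):
        # HER is applied to the x-y coordinates only: take the x-y position
        # achieved later in the episode, keep the original goal height.
        return np.array([future_cube_pos[0], future_cube_pos[1], goal[2]])

    def total_reward(cube_pos, goal):
        return reward_xy(cube_pos, goal) + reward_z(cube_pos, goal)

Only reward_xy needs to be recomputed for a relabelled transition, since the z component of the goal (and hence r_z) is unchanged by the relabelling.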
Goal Trajectories. In each episode, the agent is faced with multiple goals; it must move the cube from one goal to the next along a given trajectory. To ensure the HER process remains meaningful in these multi-goal episodes, we only sample future achieved goals g' (to replace g) from the period of time in which g was active. In our implementation, the agent is unaware that it is dealing with trajectories: when updating the policy with transitions (s_t, g_t, a_t, r_t, s_{t+1}, g_{t+1}) we always set g_{t+1} = g_t, even if in reality g_{t+1} was different8. Thus, the policy focuses solely on achieving the current active goal and is unconcerned by any future changes in the active goal.

8 Interestingly, we found that exposing the agent (during updates) to transitions in which g_{t+1} ≠ g_t hurt performance significantly, perhaps due to the extra uncertainty this introduces to the DDPG action-value estimates.

Exploration vs Exploitation. We derive our DDPG + HER hyperparameters from Plappert et al. [32], who use a highly 'exploratory' policy when collecting data in the environment: with probability 30% a random action is sampled (uniformly) from the action-space, and when policy actions are chosen, Gaussian noise is applied. This is beneficial for exploration in the early stages of training; however, it can be limiting in the later stages when the policy must be fine-tuned. We found that the exploratory policy repeatedly drops the cube due to the randomly sampled actions and the injected action noise. To resolve this issue, rather than slowly reducing the level of exploration each epoch (which would require a degree of hyperparameter tuning), we make efficient use of evaluation episodes (which are performed by the standard 'exploiting' policy) by adding them to the replay buffer. Thus, 90% of rollouts added to the buffer are collected with the exploratory policy, and the remaining 10% with the exploiting policy. This addition was sufficient to boost final success rates in simulation from 70-80% to >90% (where "success rate" is defined as in Figure 3).

5 Results

5.1 Simulation

Our method is highly effective in simulation. The algorithm can learn from scratch to proficiently grasp the cube and lift it along goal trajectories. Figure 3 compares the training performance of our final algorithm to that of standard HER9. Our algorithm converges in roughly 2/3 of the time of standard HER, and is markedly improved in the early stages of training; this allowed us to iteratively develop our approach more quickly. Throughout different training runs, our policies learned several different manipulation strategies, the most distinct of which included: (i) 'pinching' the cube with two arm tips and supporting it with the third, and (ii) 'cradling' the cube with all three of its forearms (see Figure 2).

9 These runs did not use domain randomization. Generally, we trained from scratch in the standard simulation before fine-tuning in a domain-randomized simulation.

Fig. 2: (a) Pushing, (b) Cradling, (c) Pinching. The various manipulation strategies learned by our approach.

Fig. 3: Success rate vs experience collected during simulated training (1 day ≈ 1.7 million environment steps). We compare training with: (i) HER applied to a standard sparse reward (blue), (ii) HER applied to both r_xy and r_z (orange), and (iii) our final method where HER is applied to r_xy but not to r_z. An episode is deemed successful if, when complete, the final goal of the trajectory has been achieved.

Table 1: Self-reported evaluation scores of our learned pushing, cradling, and pinching policies when deployed on the simulated and real robots (mean ± standard deviation score over 10 episodes). Scores are based on the cumulative position error of the cube during an episode: score = Σ_{t=0}^{n} −( (1/2) ||e^t_xy|| / d_xy + (1/2) |e^t_z| / d_z ), where e^t = (e^t_x, e^t_y, e^t_z) is the error between the cube and goal position at time-step t, d_xy is the arena range on the x-y plane, and d_z is the range on the z-axis.

              Pushing            Cradling           Pinching
Simulation    -20,399 ± 3,799    -6,349 ± 1,039     -6,198 ± 1,840
Real robot    -22,137 ± 3,671    -14,207 ± 2,160    -11,489 ± 3,790
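For reference, the cumulative score used in Table 1 can be computed as in the sketch below. The arena ranges d_xy and d_z are left as arguments because the official evaluation code defines their exact values; the function and argument names are ours.

    import numpy as np

    def episode_score(cube_positions, goal_positions, d_xy, d_z):
        # score = sum_t -( 0.5 * ||e_xy^t|| / d_xy + 0.5 * |e_z^t| / d_z ),
        # where e^t is the cube-goal position error at time-step t.
        # Less negative scores are better.
        score = 0.0
        for cube, goal in zip(cube_positions, goal_positions):
            e = np.asarray(cube) - np.asarray(goal)
            score -= 0.5 * np.linalg.norm(e[:2]) / d_xy + 0.5 * abs(e[2]) / d_z
        return score

Because the error is summed over every time-step of an evaluation episode, the score magnitudes in Table 1 run into the thousands.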
5.2 Real Robot

Our final policies transferred to the real robot with reasonable success. Table 1 displays the self-reported scores of our best pinching and cradling policies under RRC Phase 1 evaluation conditions. As a baseline comparison, we trained a simple 'pushing' policy which ignores the height component of the goal and simply learns to push the cube along the floor to the goal's x-y coordinates. The pinching policy performed best on the real robot, and is capable of carrying the cube along goal trajectories for extended periods of time, and of recovering the cube when it is dropped. This policy was submitted for the official RRC Phase 1 final evaluation round and obtained the winning score (see https://real-robot-challenge.com/leaderboard, username 'thriftysnipe').

The domain gap between simulation and reality was significant, and generally led to inferior scores on the real robot. Policies often struggled to gain control of the real cube, which appeared to slide more freely than in simulation. Additionally, on the real robot, policies could become stuck with an arm-tip pressing the cube into the wall. As a makeshift solution to this issue, we assumed the policy was stuck whenever the cube had not reached the goal's x-y coordinates for 50 consecutive steps, and then uniformly sampled random actions for 7 steps in an attempt to 'free' the policy from its stuck state.

6 Discussion

Our relatively simple reinforcement learning approach fully solves the 'Move Cube on Trajectory' task in simulation. Moreover, our learned policies can successfully implement their sophisticated manipulation strategies on the real robot. Unlike last year's benchmark solutions [3], this was achieved with the use of minimal domain-specific knowledge. We outperformed all competing submissions, including those employing more classical robotic control techniques.

Due to the large domain gap, our excellent performance in simulation was not fully matched upon transfer to the real robot. Indeed, the main limitation of our approach was the absence of any training on real-robot data. It is likely that some fine-tuning of the policy on real data would greatly increase its robustness in the real environment, and developing a technique which could do so efficiently is one direction for future work. Similarly, the use of domain adaptation techniques [29,30] could produce a policy more capable of adapting to the real environment. However, ideally the policy could be learned from scratch on the real system, since a suitable simulator may not always be available. Although our results in simulation were positive, the algorithm is somewhat sample inefficient, taking roughly 10 million environment steps to converge (equivalent to 6 days of simulated experience). Thus, another important direction for future work would be to reduce sample complexity to increase the feasibility of real-robot training, perhaps achievable via a model-based reinforcement learning approach [18,33].

Acknowledgments

This publication has emanated from research supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2, co-funded by the European Regional Development Fund, by a Science Foundation Ireland Future Research Leaders Award (17/FRL/4832), and by the China Scholarship Council (CSC). We thank the Max Planck Institute for Intelligent Systems (Stuttgart, Germany) for organizing the challenge and providing the necessary software and hardware to run our experiments remotely on a real robot. We acknowledge the Research IT HPC Service at University College Dublin for providing computational facilities and support that contributed to the research results reported in this paper.

References

1. Bauer, Stefan, et al. "A Robot Cluster for Reproducible Research in Dexterous Manipulation." arXiv preprint arXiv:2109.10957 (2021).
2. Wüthrich, Manuel, et al. "TriFinger: An open-source robot for learning dexterity." arXiv preprint arXiv:2008.03596 (2020).
3. Funk, Niklas, et al. "Benchmarking Structured Policies and Policy Optimization for Real-World Dexterous Object Manipulation." arXiv preprint arXiv:2105.02087 (2021).
4. Yoneda, Takuma, et al. "Grasp and motion planning for dexterous manipulation for the real robot challenge." arXiv preprint arXiv:2101.02842 (2021).
5. Liu, Rongrong, et al. "Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review." Robotics 10.1 (2021): 22.
6. Wei, Hui, Yijie Bu, and Ziyao Zhu. "Robotic arm controlling based on a spiking neural circuit and synaptic plasticity." Biomedical Signal Processing and Control 55 (2020): 101640.
7. Cohen, Benjamin J., Sachin Chitta, and Maxim Likhachev. "Search-based planning for manipulation with motion primitives." 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010.
8. Stulp, Freek, et al. "Learning motion primitive goals for robust manipulation." 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2011.
9. Montaño, Andrés, and Raúl Suárez. "Manipulation of unknown objects to improve the grasp quality using tactile information." Sensors 18.5 (2018): 1412.
10. Franceschi, Paolo, and Nicola Castaman. "Combining visual and force feedback for the precise robotic manipulation of bulky components." Proc. SPIE 11785, Multimodal Sensing and Artificial Intelligence: Technologies and Applications II, 1178510 (20 June 2021).
11. LaValle, Steven M. "Rapidly-exploring random trees: A new tool for path planning." (1998): 98-11.
12. Silver, Tom, et al. "Residual policy learning." arXiv preprint arXiv:1812.06298 (2018).
13. Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
14. Haarnoja, Tuomas, et al. "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor." International Conference on Machine Learning. PMLR, 2018.
15. Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
16. Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. PMLR, 2015.
17. Deisenroth, Marc, and Carl E. Rasmussen. "PILCO: A model-based and data-efficient approach to policy search." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
18. Janner, Michael, et al. "When to trust your model: Model-based policy optimization." arXiv preprint arXiv:1906.08253 (2019).
19. Hafner, Danijar, et al. "Dream to control: Learning behaviors by latent imagination." arXiv preprint arXiv:1912.01603 (2019).
20. Nagabandi, Anusha, et al. "Deep dynamics models for learning dexterous manipulation." Conference on Robot Learning. PMLR, 2020.
21. Levine, Sergey, et al. "Offline reinforcement learning: Tutorial, review, and perspectives on open problems." arXiv preprint arXiv:2005.01643 (2020).
22. Nair, Ashvin, et al. "Accelerating online reinforcement learning with offline datasets." arXiv preprint arXiv:2006.09359 (2020).
23. Pastor, Peter, et al. "Learning and generalization of motor skills by learning from demonstration." 2009 IEEE International Conference on Robotics and Automation. IEEE, 2009.
24. Johns, Edward. "Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration." arXiv preprint arXiv:2105.06411 (2021).
25. Vecerik, Mel, et al. "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards." arXiv preprint arXiv:1707.08817 (2017).
26. Akkaya, Ilge, et al. "Solving Rubik's cube with a robot hand." arXiv preprint arXiv:1910.07113 (2019).
27. Peng, Xue Bin, et al. "Sim-to-real transfer of robotic control with dynamics randomization." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
28. Tobin, Josh, et al. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
29. Arndt, Karol, et al. "Meta reinforcement learning for sim-to-real domain adaptation." 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.
30. Eysenbach, Benjamin, et al. "Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers." arXiv preprint arXiv:2006.13916 (2020).
31. Andrychowicz, Marcin, et al. "Hindsight experience replay." arXiv preprint arXiv:1707.01495 (2017).
32. Plappert, Matthias, et al. "Multi-goal reinforcement learning: Challenging robotics environments and request for research." arXiv preprint arXiv:1802.09464 (2018).
33. McCarthy, Robert, and Stephen J. Redmond. "Imaginary Hindsight Experience Replay: Curious Model-based Learning for Sparse Reward Tasks." arXiv preprint arXiv:2110.02414 (2021).