<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>One-Shot Learning for Robotic Manipulators: Rapid Replication of Human Activities from a Single Demonstration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaime Duque-Domingo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Caccavale</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Finzi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo Zalama</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaime Gómez-García-Bermejo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Tecnológico CARTIF</institution>
          ,
          <addr-line>Boecillo, 47151 Valladolid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Advanced Production Technologies - Department of Systems Engineering and Automatics (ITAP-DISA), School of Industrial Engineers, University of Valladolid</institution>
          ,
          <addr-line>Prado de la Magdalena 3-5, 47011</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>PRISMA Lab. Dipartimento di Ingegneria Elettrica e delle Tecnologie dell'Informazione. Università degli Studi di Napoli “Federico II”</institution>
          ,
          <addr-line>Napoli</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a One-Shot Learning framework able to process an RGB-D video of a human task demonstration and to perform the demonstrated task on a robot manipulator. Learning from a single human demonstration is one of the most interesting challenges in robotics. The aim is to allow a robot to reproduce an operator's activities after observing how they are performed just once. Although the work presented in this paper focuses on specific manipulation tasks, the proposed method can be extended to multi-stage operations carried out in different fields, both domestic and industrial. In the proposed approach, the demonstration is first segmented into primitives, which are then mapped into robot actions to be executed by a manipulator. This work also aims to ensure that the learning process is carried out rapidly. The paper provides an overview of the overall framework and illustrates the system at work in a use case.</p>
      </abstract>
      <kwd-group>
        <kwd>One-Shot Learning</kwd>
        <kwd>robot learning</kwd>
        <kwd>human demonstrations</kwd>
        <kwd>activity segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper introduces a one-shot learning framework capable of processing RGB-D video of a human
task demonstration involving known objects and replicating the task using a robotic manipulator.
One-shot learning allows a robot to imitate tasks or activities after only a single observation. This
problem is both relevant and challenging, as it offers the advantage of quickly acquiring new skills,
while requiring effective use of prior knowledge to generalize from just one example. In robotics,
one-shot learning represents a significant leap forward, in that it allows robots to quickly and efficiently
learn tasks that would otherwise require extensive training, mirroring the adaptive learning process of
humans.</p>
      <p>In this work, we propose a novel framework that exploits real-time object detection and assumptions
about manipulation actions to both segment human demonstrations and flexibly reproduce observed
tasks. Specifically, in the proposed approach, an RGB-D recorded human demonstration is first
segmented and then associated with action primitives, which are composed and adapted to be reproduced
by a robotic manipulator acting on the same target. The framework employs YOLO-based 3D object
segmentation, alongside human feature tracking (including key hand trajectories and gaze detection),
to monitor human-object interactions, enabling the isolation, interpretation, and replication of action
primitives. While our current proposal focuses on basic manipulation capabilities (e.g., grasp, drop,
carry, etc.), the framework is intended to incrementally observe and reproduce multi-step operations
across various domains, both domestic and industrial.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of related work</title>
      <p>
        In the field of collaborative robotics, significant research efforts aim to enhance our understanding
of environmental interactions and object manipulation, aspiring to replicate human dexterity. Many
studies highlight the integration of advanced perception tools, such as computer vision, which
empower robots to interpret their surroundings with precision [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, there is an increasing
focus on developing algorithms that mimic human proficiency in grasping and manipulating various
objects, addressing challenges like handling diverse shapes [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Techniques such as learning from
demonstrations [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], and reinforcement learning [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] are also employed to promote more natural
human-robot interactions during task learning. Collectively [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], these advancements enable robots to
better understand their environment and skillfully manage objects, approaching human-like abilities.
      </p>
      <p>
        One prominent robotic technique is reinforcement learning (RL), where robots autonomously develop
control strategies through iterative experimentation. Lobbezoo et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] combine traditional control
methods with RL in both virtual and physical environments, advocating for RL in conventional industrial
tasks such as reaching, grasping, and placing. In contrast to typical methods where robots identify
and perform grasps, Kalashnikov et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a vision-based, closed-loop control system. In this
approach, the robot continuously refines its grasp strategy based on new sensory data, optimising its
success rates. To tackle the challenge of identifying optimal grasp locations, Mahler et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] utilised
a synthetic dataset containing a wide range of point clouds, grasps, and analytical metrics to train a
predictive model for grasp success. Similarly, Guo et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] developed a dataset featuring real-world
manipulable objects, providing detailed pose information and affordance predictions. Another notable
approach utilises a multi-stage grasp detection algorithm for Kinova robots in cluttered environments
[13].
      </p>
      <p>In the area of robot learning from demonstrations, some studies focus on gesture recognition through
human skeletal data, leveraging neural networks and Markov models [14]. Others examine human
demonstrations across various contexts [15], promoting imitation learning frameworks [16] or
kinesthetic teaching of structured tasks [17]. A specialised approach explores robot eye-hand coordination
[18], where robots extract task-relevant information from human videos to guide real-time actions.
By integrating human demonstration data with RL, Sun et al. [19] advocate for continuous
robot-environment interaction to enhance skill acquisition. Similarly, Kamali et al. [20] utilise virtual reality to
guide robotic actions through hand gestures. Cabi et al. [21] develop policies for diverse manipulation
tasks using a variety of techniques, incorporating human preferences to refine task rewards.</p>
      <p>Differently from these methods, in this paper, we address the challenge of rapid one-shot learning [22,
23, 24]. In particular, we are interested in learning structured robotic manipulation tasks, demonstrated
through a single human demonstration captured by an RGB-D camera. In this respect, similarly to [22],
the proposed approach focuses on quickly adapting the human demonstration to enable direct task
replication, without the need for detailed or complex object models. This ensures broad adaptability
while avoiding the extensive training typically required by reinforcement learning (RL) methods [25] or
behavior cloning methods trained on a dataset of tasks [24, 26]. However, in contrast with [22], our approach
introduces a novel method that leverages object and action segmentation from RGB-D video, allowing
us to isolate and replicate manipulation primitives inferred from the demonstration.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed system is based on two main stages: activity recording and processing, and task reproduction.</p>
      <sec id="sec-3-1">
        <title>INPUT</title>
        <p>RGB-D
(RGB frame)</p>
        <p>RGB-D
(depth map)</p>
      </sec>
      <sec id="sec-3-2">
        <title>OUTPUT</title>
        <p>Categorise each primitive action based
on the movement of objects between the
frames corresponding to the candidate
primitives.</p>
        <p>Compute 3D
coordinates of the
trajectory of the hand</p>
        <p>and the face.</p>
        <p>Apply tracking filter
between frames</p>
        <p>Evaluate person’s gaze
Extraction of candidate primitive actions
based on the proximity to objects in 3D,
the hand speed and the evaluation of the
gaze on the object space.</p>
        <p>YOLO segmentation
of objects of</p>
        <p>interest
Align masks with
previous frames
Extract 2D centroids
of objects</p>
        <p>Compute 3D
centroids of object
Apply tracking filter
between frames</p>
        <sec id="sec-3-2-1">
          <title>3.1. Activity recording and processing</title>
          <p>In the first stage of the proposed pipeline, we collect and process an RGB-D video capturing a human
activity demonstration. The video is segmented to isolate primitive actions using several features, such
as the proximity of the user’s hands to relevant objects, the speed of hand movements, and the direction
of the user’s gaze. The segmentation process is outlined in Figure 1. Initially, RGB frames are analyzed
to perform object segmentation and extract key points from the user’s hands and face. Depth maps,
combined with filters, are then used to derive 3D trajectories of the hand and face points, as well as
the 3D centroids of detected objects. The orientation of the user’s gaze is subsequently evaluated to
identify potential target objects, which are then exploited to segment the human demonstration and
isolate candidate primitive actions as interpretations of those segments. The primitive actions are finally
classified based on the motion of the associated objects.</p>
          <p>The action segmentation process introduced above relies on the proximity and velocity of the
operator’s hand relative to the detected objects in the scene, with additional reinforcement from the
operator’s gaze direction. More specifically, segmentation is based on three thresholds.
These thresholds are used to determine the operator’s intention to interact with objects in the 3D space
through specific primitive manipulation actions. The first threshold defines the maximum distance
between the operator’s hand and the centroid of a detected object for an interaction to be considered. If
multiple objects fall within this distance, potential interactions are prioritized by proximity. The second
threshold specifies the maximum hand speed allowed for an interaction to be considered with a
nearby object. The third threshold sets the maximum allowable angle between the operator’s gaze
direction and a proximal object to consider a plausible intention to interact. The underlying assumption
is that the user gaze should be directed towards the target of a manipulation action. For each primitive
action extracted by the segmentation process, the system tracks and records key positions of the hand
trajectory with respect to the centroid of the objects participating in the action.</p>
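          <p>As a concrete illustration of this per-frame test, the following minimal Python sketch checks the three thresholds for a single frame; the numeric values, argument names, and data layout are hypothetical and serve only to make the logic explicit.</p>
          <preformat>
import numpy as np

def candidate_interaction(hand_pos, hand_speed, gaze_origin, gaze_dir, objects,
                          max_dist=0.15, max_speed=0.25, max_gaze_angle=np.deg2rad(30)):
    """Return the most plausible target object for this frame, or None.

    hand_pos, gaze_origin: (3,) positions in metres (camera frame).
    gaze_dir: unit (3,) gaze direction. objects: dict mapping object name to its (3,) centroid.
    The three thresholds (distance, speed, gaze angle) are illustrative values.
    """
    if hand_speed > max_speed:           # second threshold: hand must be slow enough
        return None
    best, best_dist = None, max_dist     # first threshold: maximum hand-object distance
    for name, centroid in objects.items():
        dist = np.linalg.norm(hand_pos - centroid)
        if dist > best_dist:
            continue
        to_obj = centroid - gaze_origin
        to_obj = to_obj / np.linalg.norm(to_obj)
        angle = np.arccos(np.clip(np.dot(gaze_dir, to_obj), -1.0, 1.0))
        if angle > max_gaze_angle:       # third threshold: gaze directed at the object
            continue
        best, best_dist = name, dist     # several candidates: prioritize by proximity
    return best
          </preformat>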
          <p>The overall pipeline described above is built on object segmentation, hand detection/tracking, and
gaze direction monitoring. Additional details about these modules are provided below.</p>
          <p>Object segmentation is performed using YOLOv8 [27], which exploits a deep convolutional neural
network architecture to process images, similar to that of YOLO [28], enhanced with additional layers
and a special branch to predict segmentation masks. The output of this module includes bounding
boxes, object classes, class probabilities, and segmentation masks.</p>
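          <p>A minimal sketch of obtaining this output with the Ultralytics Python API follows; the specific model weights and the returned dictionary layout are illustrative assumptions rather than the exact configuration of our system.</p>
          <preformat>
from ultralytics import YOLO
import numpy as np

model = YOLO("yolov8n-seg.pt")  # pretrained segmentation weights (illustrative choice)

def segment_objects(rgb_frame):
    """Run YOLOv8 segmentation on one RGB frame and collect per-object data."""
    result = model(rgb_frame, verbose=False)[0]
    detections = []
    if result.masks is None:
        return detections
    for box, cls, conf, mask in zip(result.boxes.xyxy, result.boxes.cls,
                                    result.boxes.conf, result.masks.data):
        mask_np = mask.cpu().numpy()                 # binary mask (model resolution)
        ys, xs = np.nonzero(mask_np)
        centroid_2d = (float(xs.mean()), float(ys.mean())) if xs.size else None
        detections.append({
            "class": result.names[int(cls)],
            "confidence": float(conf),
            "bbox": box.cpu().numpy(),
            "mask": mask_np,
            "centroid_2d": centroid_2d,              # later lifted to 3D with the depth map
        })
    return detections
          </preformat>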
          <p>Hand points detection and tracking are based on the MediaPipe Hand Landmarker [29], which operates
in real-time by first using a palm detection model to locate the hand and then predicting 21 key landmarks
on the hand, including finger joints, tips, and the wrist. While this method efficiently tracks multiple
hands and is optimized for gesture recognition, augmented reality (AR), and interactive applications,
we found that the 3D coordinates returned by this model did not yield satisfactory results within our
framework. Therefore, as we use a depth camera, we found it more reliable to directly leverage the
depth information provided by the RGB-D camera.</p>
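          <p>The sketch below shows one way to combine MediaPipe's 2D hand landmarks with the aligned depth map to obtain 3D key points; the use of the legacy solutions API, the chosen landmarks, and the pinhole back-projection with intrinsics fx, fy, cx, cy are simplifying assumptions.</p>
          <preformat>
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)
WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8        # MediaPipe hand landmark indices

def hand_points_3d(rgb, depth, fx, fy, cx, cy):
    """Return 3D camera-frame coordinates (m) of the wrist, thumb tip, and index tip.

    rgb: HxWx3 RGB image; depth: HxW depth map in metres, aligned to rgb.
    fx, fy, cx, cy: pinhole intrinsics of the RGB-D camera.
    """
    res = hands.process(rgb)
    if not res.multi_hand_landmarks:
        return None
    h, w = depth.shape
    points = {}
    for name, idx in (("wrist", WRIST), ("thumb", THUMB_TIP), ("index", INDEX_TIP)):
        lm = res.multi_hand_landmarks[0].landmark[idx]
        u = min(int(lm.x * w), w - 1)        # normalized image coordinates to pixels
        v = min(int(lm.y * h), h - 1)
        z = float(depth[v, u])               # depth taken from the RGB-D camera
        points[name] = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
    return points
          </preformat>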
          <p>Gaze orientation estimation is achieved using MediaPipe’s FaceMesh [30], a real-time facial landmark
tracking technology that detects 468 key points on the face using a standard camera. FaceMesh detects
the face, maps 2D landmarks, and estimates 3D coordinates for each point. Though it provides high
efficiency for facial feature tracking, FaceMesh does not directly return the center of the eyes. Thus,
interpolation is used to calculate this point. To determine the direction of the gaze, a vector is computed
between the center of the forehead and the eyes, giving a normal vector for the gaze direction. The
distance from the RGB-D camera to each facial point is used to construct the 3D model of the face.</p>
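          <p>A simplified version of this gaze computation is sketched below: the eye centres are interpolated from eye-corner landmarks, the facial points are back-projected with the depth map, and the gaze ray is approximated by the normal of the plane spanned by the eyes and a forehead point. The landmark indices and the sign convention are illustrative assumptions, not the exact procedure of the implemented system.</p>
          <preformat>
import mediapipe as mp
import numpy as np

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
LEFT_EYE, RIGHT_EYE, FOREHEAD = (33, 133), (362, 263), 10   # illustrative FaceMesh indices

def gaze_ray(rgb, depth, fx, fy, cx, cy):
    """Approximate the gaze as an origin (eye centre) and a direction (face-plane normal)."""
    res = face_mesh.process(rgb)
    if not res.multi_face_landmarks:
        return None
    h, w = depth.shape
    lms = res.multi_face_landmarks[0].landmark

    def to_3d(idx):
        u, v = min(int(lms[idx].x * w), w - 1), min(int(lms[idx].y * h), h - 1)
        z = float(depth[v, u])                    # depth from the RGB-D camera
        return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

    left = 0.5 * (to_3d(LEFT_EYE[0]) + to_3d(LEFT_EYE[1]))     # interpolated eye centres
    right = 0.5 * (to_3d(RIGHT_EYE[0]) + to_3d(RIGHT_EYE[1]))
    forehead = to_3d(FOREHEAD)
    normal = np.cross(right - left, forehead - left)            # face-plane normal
    normal = normal / np.linalg.norm(normal)
    if normal[2] > 0:                             # orient toward the camera, since the user faces the sensor
        normal = -normal
    origin = 0.5 * (left + right)
    return origin, normal
          </preformat>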
          <p>The basic primitives are fixed: grasping (TAKE), moving (MOVE OVER), waiting (WAIT) and releasing
(RELEASE). The activities are decomposed into several primitives. Users can record the activities at
different speeds, although there are thresholds for correct detection.</p>
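          <p>For clarity, a demonstrated activity can be represented as an ordered list of such primitives; the structure below is a hypothetical sketch using the bottle-and-bowl example discussed in the experiments.</p>
          <preformat>
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class Primitive(Enum):
    TAKE = auto()        # grasping an object
    MOVE_OVER = auto()   # carrying an object over another object
    WAIT = auto()        # holding position
    RELEASE = auto()     # releasing the grasped object

@dataclass
class PrimitiveAction:
    primitive: Primitive
    target_object: str                       # e.g. "bottle"
    reference_object: Optional[str] = None   # e.g. "bowl" for MOVE_OVER
    hand_keypoints: list = field(default_factory=list)  # wrist/index/thumb poses w.r.t. the target centroid

# Example decomposition of a demonstration:
demo = [
    PrimitiveAction(Primitive.TAKE, "bottle"),
    PrimitiveAction(Primitive.MOVE_OVER, "bottle", reference_object="bowl"),
]
          </preformat>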
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2. Task reproduction</title>
          <p>Once the human demonstration has been segmented and processed, the interpreted manipulation task
is to be reproduced by a robot manipulator operating in the same workspace. The robotic platform is
assumed to be a manipulator equipped with a gripper.</p>
          <p>For ease of demonstration, it is assumed that the robot and the human are positioned opposite each
other in the workspace. Consequently, the trajectories and points collected during task segmentation
must first be mirrored and then adapted for the robot’s execution. Mirroring is achieved by applying a
180-degree transformation to the objects’ axes and the hand interaction points relative to the camera
and base marker, both of which are in the same plane as the table where the actions occur.</p>
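          <p>A minimal sketch of this mirroring step is given below, assuming the base-marker frame has its z axis normal to the table so that the 180-degree transformation reduces to a rotation about z; the frame convention is an assumption for illustration.</p>
          <preformat>
import numpy as np

# 180-degree rotation about the base-marker z axis (table plane assumed to be x-y).
R_MIRROR = np.array([[-1.0,  0.0, 0.0],
                     [ 0.0, -1.0, 0.0],
                     [ 0.0,  0.0, 1.0]])

def mirror_point(p_base):
    """Mirror a 3D point expressed in the base-marker frame for the robot side."""
    return R_MIRROR @ np.asarray(p_base)

def mirror_trajectory(points):
    """Apply the same mirroring to every recorded hand or object key point."""
    return [mirror_point(p) for p in points]
          </preformat>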
          <p>To enable rapid task reproduction, the robotic system executes the sequence of detected manipulation
actions step-by-step, operating on the target objects as demonstrated by the human. The robot first
segments the scene using YOLOv8 to detect and locate task-relevant objects before deploying the
demonstrated actions. Given the 3D locations of the objects, key points from the human hand trajectory
- such as the wrist, index finger, and thumb - are mapped to reproduce the trajectory of the robot’s
end-effector and gripper movements relative to the target object. For instance, to replicate object
grasping, the robot's end-effector follows the trajectory of the human hand to reach the pose necessary
for approaching the object, followed by a grasp movement, where the gripper motion is adapted from
the recorded motion of the human’s index finger and thumb. If the task involves multiple steps or
objects, the system continuously monitors the execution of each manipulation action and the status of
the target object until the task is completed.</p>
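          <p>The mapping from the demonstrated hand key points to robot commands can be sketched as follows: the end-effector follows the mirrored wrist trajectory re-anchored to the object centroid detected at execution time, while the gripper aperture is derived from the thumb-index distance. The function names, the normalisation, and the maximum opening are hypothetical and do not reflect the actual controller interface.</p>
          <preformat>
import numpy as np

MAX_GRIPPER_OPENING = 0.08   # metres; illustrative gripper stroke

def gripper_command(thumb, index):
    """Map the demonstrated thumb-index distance to a normalized gripper opening in [0, 1]."""
    aperture = np.linalg.norm(np.asarray(thumb) - np.asarray(index))
    return float(np.clip(aperture / MAX_GRIPPER_OPENING, 0.0, 1.0))

def end_effector_waypoints(wrist_traj_rel, object_centroid):
    """Re-anchor the wrist trajectory, recorded relative to the demonstrated object's
    centroid, onto the centroid of the same object detected at execution time."""
    return [np.asarray(object_centroid) + np.asarray(p) for p in wrist_traj_rel]
          </preformat>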
        </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimentation</title>
      <p>Experiments were conducted using a Kuka IIWA 7 robot, a RealSense D415 camera, an i7 server with
an RTX 3060 GPU, and ROS 2 software. Human demonstrations were recorded and processed offline,
with a processing time of approximately 1 minute for a 15-second video. During the execution phase,
the system receives the task to be performed and begins by segmenting the objects using YOLOv8.
Once segmentation is completed, the system retrieves each detected primitive action of the task and
executes it by leveraging the key points and trajectories associated with the segmented actions.</p>
      <p>Figure 2 illustrates the overall system in action, where a person demonstrates picking up a bottle
of water and placing it on top of a small bowl. During this one-shot task demonstration, the system
identifies two action primitives (Figure 2, second column) involving two target objects (the bottle and
the bowl). The robot is then able to rapidly and flexibly reproduce the demonstrated task (Figure 2,
second column), regardless of the objects’ positions in the workspace, as it learns the relationships
between the hand and the objects for each action segment. In this scenario, we observed reliable and
precise task reproduction, with an error margin of ±1 cm during execution. Additional tasks such
as grasping, carrying, placing, and pouring were also tested, yielding satisfactory results. However,
challenges remain during task demonstration, particularly with depth camera precision and estimation
errors when fingers are occluded, which need to be addressed to ensure more robust task detection. As
for task replication, we currently assume a clear workspace where obstacles and potential collisions are
neglected for simplicity. Future work will focus on developing methods that can handle more complex
tasks and environments with obstacles in a flexible and reliable manner.</p>
      <p>[Figure 2: off-line processing of the demonstration video, the two extracted primitives (first primitive: take bottle; second primitive: move bottle over bowl), and the corresponding robot execution.]</p>
      <p>In our experiments we have used a robot controller implementing obstacle-free movement of the
end-effector toward the desired pose in the robot's operational space. Vision processing is the
most computationally expensive part, both for the segmentation and for the execution itself. However,
a new activity is processed in a few minutes and executed on the robot in a few seconds, since only
the first frame needs to be processed at execution time.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work presents a system capable of learning multi-step tasks from human demonstrations and
reproducing them on a robot manipulator. Operating under a one-shot learning paradigm, the system
aims to enable rapid, flexible, and reliable reproduction of typical manipulation tasks across a dataset
of known objects. Currently, the system is being tested on tasks such as picking, carrying, placing,
and pouring, performed by a robot manipulator equipped with a gripper. Initial results are promising;
however, several challenges remain, particularly in scaling and generalizing task interpretation and
reproduction for more complex manipulation scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research has received funding from projects ROSOGAR PID2021-123020OB-I00 funded by
MCIN/AEI/10.13039/501100011033/FEDER, UE, EIAROB funded by Consejería de Familia of the Junta de
Castilla y León - Next Generation EU, INVERSE (EU Horizon, grant 101136067), euROBIN (EU Horizon,
grant 101070596), and Melody (CUP E53D23017550001).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sie</surname>
          </string-name>
          ,
          <article-title>Position-aware pushing and grasping synergy with deep reinforcement learning in clutter</article-title>
          ,
          <source>CAAI Transactions on Intelligence Technology</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kleeberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bormann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <article-title>A survey on learning-based robotic grasping</article-title>
          ,
          <source>Current Robotics Reports</source>
          <volume>1</volume>
          (
          <year>2020</year>
          )
          <fpage>239</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Newbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chumbley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mousavian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Eppner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Asfour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kragic</surname>
          </string-name>
          , et al.,
          <article-title>Deep learning approaches to grasp synthesis: A review</article-title>
          ,
          <source>IEEE Transactions on Robotics</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ravichandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Polydoros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chernova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Billard</surname>
          </string-name>
          ,
          <article-title>Recent advances in robot learning from demonstration</article-title>
          ,
          <source>Annual Review of Control, Robotics, and Autonomous Systems</source>
          <volume>3</volume>
          (
          <year>2020</year>
          )
          <fpage>297</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Survey of imitation learning for robotic manipulation</article-title>
          ,
          <source>International Journal of Intelligent Robotics and Applications</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          <fpage>362</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Brunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Panerati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Schoellig</surname>
          </string-name>
          ,
          <article-title>Safe learning in robotics: From learning-based control to safe reinforcement learning</article-title>
          ,
          <source>Annual Review of Control, Robotics, and Autonomous Systems</source>
          <volume>5</volume>
          (
          <year>2022</year>
          )
          <fpage>411</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning in robotic applications: a comprehensive survey</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Deep imitation reinforcement learning for selfdriving by vision</article-title>
          ,
          <source>CAAI Transactions on Intelligence Technology</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>493</fpage>
          -
          <lpage>503</lpage>
          . URL: https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/cit2.12025. doi:10.1049/cit2.12025.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lobbezoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <article-title>Simulated and real robotic reach, grasp, and pick-and-place using combined reinforcement learning and traditional controls</article-title>
          ,
          <source>Robotics</source>
          <volume>12</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/2218-6581/12/1/12. doi:10.3390/robotics12010012.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalashnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Irpan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ibarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herzog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Quillen</surname>
          </string-name>
          , E. Holly,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kalakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          , et al.,
          <article-title>Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation</article-title>
          , arXiv preprint arXiv:1806.10293 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mahler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Niyaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Laskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Ojea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics</article-title>
          ,
          <year>2017</year>
          . arXiv:1703.09312.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tremblay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tyree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Birchfield</surname>
          </string-name>
          ,
          <article-title>Handal: A dataset of real-world manipulable object categories with pose annotations, affordances, and reconstructions</article-title>
          ,
          <year>2023</year>
          . arXiv:2308.01477.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] X. Dong, Y. Jiang, F. Zhao, J. Xia, A practical multi-stage grasp detection method for Kinova robot in stacked environments, Micromachines 14 (2023). URL: https://www.mdpi.com/2072-666X/14/1/117. doi:10.3390/mi14010117.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. D. Domingo, J. Gómez-García-Bermejo, E. Zalama, Visual recognition of gymnastic exercise sequences. Application to supervision and robot learning by demonstration, Robotics and Autonomous Systems 143 (2021) 103830.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Z. Qian, M. You, H. Zhou, X. Xu, B. He, Robot learning from human demonstrations with inconsistent contexts, Robotics and Autonomous Systems 166 (2023) 104466. URL: https://www.sciencedirect.com/science/article/pii/S0921889023001057. doi:10.1016/j.robot.2023.104466.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] R. Caccavale, M. Saveriano, G. A. Fontanelli, F. Ficuciello, D. Lee, A. Finzi, Imitation learning and attentional supervision of dual-arm structured tasks, in: Proc. of ICDL-EpiRob, 2017, pp. 66–71.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] R. Caccavale, M. Saveriano, A. Finzi, D. Lee, Kinesthetic teaching and attentional supervision of structured tasks in human-robot interaction, Auton. Robots 43 (2019) 1291–1307.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Jin, L. Petrich, M. Dehghan, Z. Zhang, M. Jagersand, Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach, in: 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 6624–6630. doi:10.1109/ICRA.2019.8793649.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] X. Sun, J. Li, A. V. Kovalenko, W. Feng, Y. Ou, Integrating reinforcement learning and learning from demonstrations to learn nonprehensile manipulation, IEEE Transactions on Automation Science and Engineering 20 (2023) 1735–1744. doi:10.1109/TASE.2022.3185071.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] K. Kamali, I. A. Bonev, C. Desrosiers, Real-time motion planning for robotic teleoperation using dynamic-goal deep reinforcement learning, in: 2020 17th Conference on Computer and Robot Vision (CRV), 2020, pp. 182–189. doi:10.1109/CRV50864.2020.00032.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] S. Cabi, S. G. Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y. Aytar, D. Budden, M. Vecerik, O. Sushkov, D. Barker, J. Scholz, M. Denil, N. de Freitas, Z. Wang, Scaling data-driven robotics with reward sketching and batch reinforcement learning, 2020. arXiv:1909.12200.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Y. Wu, Y. Demiris, Towards one shot learning by imitation for humanoid robots, in: 2010 IEEE International Conference on Robotics and Automation (ICRA 2010), 2010, pp. 2889–2894.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Kopicki, R. Detry, M. Adjigble, R. Stolkin, A. Leonardis, J. L. Wyatt, One-shot learning and generation of dexterous grasps for novel objects, The International Journal of Robotics Research 35 (2015).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. Dasari, A. Gupta, Transformers for one-shot visual imitation, in: J. Kober, F. Ramos, C. Tomlin (Eds.), Proceedings of the 2020 Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 2071–2084.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Fu, S. Levine, P. Abbeel, One-shot learning of manipulation skills with online dynamics adaptation and neural network priors, in: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016, pp. 4019–4026.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] C. Finn, T. Yu, T. Zhang, P. Abbeel, S. Levine, One-shot visual imitation learning via meta-learning, in: Conference on Robot Learning, 2017, pp. 357–368.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] G. Jocher, A. Chaurasia, J. Qiu, Ultralytics YOLOv8, https://docs.ultralytics.com/models/yolov8/, 2023. Last access: 2024-07-15.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] J. Redmon, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] F. Zhang, V. Bazarevsky, A. Vakunov, A. Tkachenka, G. Sung, C.-L. Chang, M. Grundmann, MediaPipe Hands: On-device real-time hand tracking, arXiv preprint arXiv:2006.10214 (2020).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] Y. Kartynnik, A. Ablavatski, I. Grishchenko, M. Grundmann, Real-time facial surface geometry from monocular video on mobile GPUs, arXiv preprint arXiv:1907.06724 (2019).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>