Towards a holistic human perception system for close human-robot collaboration

Matteo Terreran1, Leonardo Barcellona1,2, Davide Allegro1 and Stefano Ghidoni1
1 Intelligent and Autonomous System Laboratory (IAS-Lab), University of Padova, Italy
2 Ph.D. student in Artificial Intelligence, Politecnico di Torino, 10138 Torino, Italy

9th Italian Workshop on Artificial Intelligence and Robotics (AIRO 2022)
matteo.terreran@unipd.it (M. Terreran); leonardo.barcellona@phd.unipd.it (L. Barcellona); davide.allegro.1@phd.unipd.it (D. Allegro); stefano.ghidoni@unipd.it (S. Ghidoni)
ORCID: 0000-0001-9862-8469 (M. Terreran); 0000-0003-4281-0610 (L. Barcellona); 0000-0003-3406-8719 (S. Ghidoni)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
When considering close human-robot collaboration, perception plays a central role in guaranteeing a safe and intuitive interaction. In this work, we present an AI-based perception system composed of different modules that understand human activities at multiple levels, namely: human pose estimation, body parts segmentation and human action recognition. Pose estimation and body parts segmentation make it possible to estimate the worker's position within the workcell and the volume he/she occupies, while human action and intention recognition provides information on what the human is doing and how he/she is performing a certain action. The proposed system is demonstrated in a mock-up scenario targeting the collaborative assembly of a wooden table, highlighting the potential of action recognition and body parts segmentation to enable a safe and natural close human-robot collaboration.

Keywords: human-robot collaboration, human perception, body parts segmentation, action recognition

1. Introduction

Human-robot collaboration (HRC) aims at a close and direct collaboration between robots and humans to reach higher productivity, thanks to the synergy between human intelligence on one side and robot artificial intelligence and mechanical power on the other. This collaboration offers several benefits, such as improved worker ergonomics, higher productivity, production flexibility and mass customization. Currently, many practical uses of human-robot collaboration in industry adopt a simplified form of collaboration, where humans and robots share the same workspace but at different times to guarantee human safety: if a person gets close to a robot that is in operation, the robot stops until the person moves away [1]. Such a form of collaboration may introduce slowdowns in the production process and does not allow for collaborative tasks in which the person and the robot must be in close contact (e.g., assembly or object passing) or must handle a large and heavy object together. However, such close interaction is often exactly what is needed: in many assembly operations the best option would be for the human and the robot to work side by side to assemble an object composed of several components, with the robot assisting the human by passing tools or parts while the human completes the operations requiring dexterous manipulation. All such operations involve a close collaboration, which means coordinating actions and intentions between the robot and the human to maximize efficiency and to guarantee human safety.
A crucial step to reach such adaptive behavior is the development of a perception system capable of monitoring human position and activity within the workcell. Many approaches have been proposed in the literature to estimate the position and the volume of a person in the scene, using for example volumetric representations [2] or 3D bounding boxes [1]. When close human-robot collaboration is addressed, however, the skeletal representations provided by human pose estimation algorithms are generally adopted, since they allow monitoring the distance of the robot from the various joints of the person's skeleton [3, 4]. Several human representations specialized for collision avoidance can be further derived from skeletons: in [5] a volumetric voxel-grid representation derived from skeletons is used to prevent potential robot collisions with humans, while in [6] human occupancy is represented in terms of convex volumes computed from skeleton joint positions. However, such representations tend to overestimate a person's body size, and they strongly depend on the output of pose estimation, which may be noisy or incomplete due to occlusions.

Recently, human pose has also been used as an input for human action recognition, outperforming other approaches on popular action recognition datasets [7]. Action recognition is usually addressed focusing only on body information but, especially in collaborative assembly tasks, hand information can be very important to discriminate between very similar gestures (e.g., ok, stop) or actions where the body is mainly still (e.g., tightening a screw, assembling two interlocking pieces). However, obtaining accurate hand poses in these contexts is even more complicated than body pose estimation, leading many works to address hand pose estimation using ad-hoc setups with cameras that frame the hands very closely [8]: hands contain many joints to be estimated very close to each other, and they are easily occluded when objects or tools are manipulated.

In this work, we investigate AI perception methods to enable a closer collaboration in applications where the robot and the human operator work in the same space, at the same time, on the same objects. To address this challenge, we propose an intelligent perception system capable of monitoring the whole robot workcell and providing information about the human workers: the system should be capable not only of detecting human position and volume, but also of recognizing what the human is doing (i.e., actions) and what he/she wants to achieve (i.e., intentions). Specifically, the perception system includes modules for pose estimation and action recognition, as well as a body parts segmentation module. The latter module leads to several advantages compared to other works in the literature: (i) body parts segmentation provides an accurate estimate of the person's volume without relying on predefined geometric volumes or on pose estimation results; (ii) body parts segmentation allows refining the output of the pose estimation module (e.g., by recovering missing joints), resulting in a more complete representation of the human posture, especially with regard to hands.
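As a purely illustrative example of point (ii), a joint missed by the pose estimator could be approximated from the segmented body part it belongs to, for instance as the centroid of the corresponding 3D points. The minimal sketch below assumes hypothetical data structures (a dictionary of labelled 3D point arrays and a skeleton dictionary with NaN entries for missing joints) and is not the exact refinement procedure used in our system.

```python
import numpy as np

# Hypothetical mapping between skeleton joints and body-part labels
JOINT_TO_PART = {"right_hand": "right_hand_part", "left_hand": "left_hand_part"}

def recover_missing_joints(skeleton, part_points, joint_to_part=JOINT_TO_PART):
    """Fill missing (NaN) joints with the centroid of the associated body-part points.

    skeleton:    dict joint name -> (3,) array, NaN if the joint was not detected
    part_points: dict part name  -> (N, 3) array of labelled 3D points
    """
    refined = dict(skeleton)
    for joint, part in joint_to_part.items():
        missing = joint not in skeleton or np.isnan(skeleton[joint]).any()
        if missing and len(part_points.get(part, ())) > 0:
            # Centroid of the segmented body part as a rough estimate of the joint
            refined[joint] = part_points[part].mean(axis=0)
    return refined
```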
The paper is organized as follows: Section 2 presents each module of the system in detail, Section 3 provides an experimental validation, and Section 4 draws conclusions and outlines future developments of the system.

2. Human perception system

The proposed system is based on a network of RGB-D cameras positioned around the robot workcell, providing information from multiple points of view so as to be robust to occlusions. All the cameras are calibrated both intrinsically and extrinsically, in order to express the information from each camera in a common reference frame (e.g., the robot base); the position of each camera with respect to the robot base frame is therefore known. Each camera is attached to a processing device (e.g., a PC) which analyzes the RGB-D data stream by means of AI perception modules, providing mid-level information about the people in the scene (e.g., pose estimation and body parts segmentation). All this information is gathered by a central PC which fuses it to compute a unique 3D representation of the human worker describing his/her pose and volume; such a representation is then used to compute high-level information about human activity (i.e., actions and intentions). An overview of the system and its main AI modules is shown in Figure 1, while each module is described in detail in the following sections.

Figure 1: Overview of the human perception system. A dedicated processing node analyzes the RGB-D stream of each camera, computing human pose estimation and body parts segmentation. All information is combined by a central PC exploiting the camera network calibration, allowing the system to compute a volumetric human representation and to recognize human actions.

2.1. Camera network calibration

When dealing with multi-camera systems, it is very important to know precisely where each camera is with respect to the others. This information is the result of a calibration procedure, usually carried out by acquiring several images of a known pattern (e.g., a checkerboard) from all cameras and by running an optimization process which estimates the unknown rigid transformations between the sensors' reference frames [9]. However, when considering a human-robot collaboration scenario, we have the additional requirement of calibrating the camera network with respect to the robot base frame: in this way the robot can directly exploit the information provided by the perception system, such as the position of the human worker, ensuring human safety by avoiding possible collisions.

Figure 2: Different representations of a human body from the same scene: (a) RGB input image; (b) skeleton obtained as output of the pose estimation; (c) body parts segmentation, with a different color for each body part; (d) cylindrical representation obtained from (b) and (c).

To meet this requirement, our perception system is calibrated by means of a hand-eye calibration procedure which estimates the rigid transformation between the robot base and each camera in the network. In particular, we rely on an iterative hand-eye calibration method based on non-linear optimization [10]. During the calibration phase, a planar checkerboard is mounted on the robot end-effector and moved around the workcell while images are acquired from all cameras; the pose of each camera with respect to the robot base frame is estimated as the rigid transformation minimizing the 2D Euclidean distance between each reprojected 3D checkerboard corner on the image plane and the corresponding detected 2D corner.
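In symbols, and using illustrative notation that is not taken from [10], the pose of camera c can be written as the rigid transformation minimizing the total reprojection error over all acquisitions t and checkerboard corners j:

\[
\hat{T}^{base}_{c} \;=\; \arg\min_{T^{base}_{c}} \;\sum_{t}\sum_{j} \left\| \pi\!\left(K_c,\; \left(T^{base}_{c}\right)^{-1} T^{base}_{board}(t)\, p_j\right) - u_{c,t,j} \right\|_2^2
\]

where \(p_j\) is the j-th corner expressed in the checkerboard frame, \(T^{base}_{board}(t)\) is the checkerboard pose in the robot base frame at acquisition t (derived from the robot kinematics, possibly together with a fixed board-to-end-effector offset estimated jointly), \(K_c\) are the intrinsic parameters of camera c, \(\pi(\cdot)\) denotes the pinhole projection onto the image plane, and \(u_{c,t,j}\) is the corresponding detected 2D corner.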
The main advantage of this method is that it avoids using the camera-to-board transformation directly and therefore does not rely on Perspective-n-Point algorithms, which can be unreliable with blurred images and would negatively affect the whole calibration.

2.2. Human pose estimation and body parts segmentation

The human perception system is composed of several AI-based modules developed to provide a holistic understanding of the human worker, considering different types of information such as human pose and human volume. The human pose estimation module is based on the state-of-the-art OpenPose [11] detector, which analyzes RGB input images and computes, for each person in the scene, a set of 2D points describing the joints of a skeletal representation, as in Figure 2b. OpenPose follows a bottom-up approach: it first extracts the joint positions without inferring the person they are related to, and then associates each joint with a person identifier by exploiting part affinity fields [12]. Such 2D points are then projected into 3D space using the depth associated with the input RGB image and the camera intrinsic parameters, while a Kalman filter merges the contributions from the different cameras.

Despite providing very detailed information about the human pose and the position of the human worker within the workcell, the output of the pose estimation module does not describe the volume occupied by the person, which is also very important to implement collision avoidance strategies in close collaborative tasks. To provide such complementary information, the proposed system also includes a human parsing module which runs in parallel with the pose estimation one. Such a module semantically segments the input RGB image, assigning to each pixel a label representing a human body part (e.g., head, torso, arms, legs). The module is based on the SCHP [13] architecture, a state-of-the-art deep learning network for body parts segmentation on RGB images. The segmented output is then projected into the 3D world, obtaining a labelled point cloud of the human worker (Figure 2c).

Finally, the information from the pose estimation and human parsing modules is combined into a parametric human model representing each limb with a cylinder, as shown in Figure 2d. Each cylinder takes its direction and length from the corresponding skeleton link, and its radius from the labelled point cloud. Such a cylindrical representation is useful not only for obstacle and collision avoidance, but also for direct interaction with the human: for example, the forearm information can be used to pass objects similarly to [14].
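As an illustration of how one cylinder of this model can be computed, the following minimal sketch derives the cylinder axis and length from the two skeleton joints delimiting a limb, and its radius from the labelled 3D points of that limb; using the median point-to-axis distance as radius is an assumption made for this example, not necessarily the exact rule used in the system.

```python
import numpy as np

def limb_cylinder(joint_a, joint_b, limb_points):
    """Fit a cylinder to one limb.

    joint_a, joint_b: (3,) 3D positions of the two skeleton joints delimiting the limb
    limb_points:      (N, 3) labelled 3D points belonging to the same limb
    Returns (center, axis, length, radius).
    """
    joint_a, joint_b = np.asarray(joint_a, float), np.asarray(joint_b, float)
    axis = joint_b - joint_a
    length = np.linalg.norm(axis)
    axis = axis / length                      # unit direction of the skeleton link
    center = 0.5 * (joint_a + joint_b)

    # Radius: distance of the labelled points from the limb axis
    rel = limb_points - joint_a               # points expressed w.r.t. one joint
    along = rel @ axis                        # projection on the axis
    radial = rel - np.outer(along, axis)      # component orthogonal to the axis
    radius = np.median(np.linalg.norm(radial, axis=1))  # robust to mislabelled points

    return center, axis, length, radius
```

Using the median rather than the maximum distance makes the radius robust to a few mislabelled points, at the cost of a slightly tighter estimated volume.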
2.3. Human action recognition

The human action recognition module is based on graph convolutional networks (GCNs) [15], taking as input the sequences of human body configurations (i.e., human skeletons) computed by the previous modules. Sequences of skeletons provide a representation of human movements that is free of disturbances like external objects, lighting, and aesthetic differences between people (e.g., clothes or skin color). Therefore, they represent an interesting source of information for robust and general action recognition, especially in collaborative scenarios where both human and robot are moving and the human worker interacts with many objects and tools.

To improve accuracy and robustness, the system relies on an ensemble of GCNs in which each network is trained to recognize actions from a different set of joints (e.g., body joints, hand joints, arm joints), and the final prediction is obtained by averaging the predictions of all the networks [16]. In particular, the vision system recognizes the person's actions at various levels: a general classification of the type of action taking place (e.g., pick, place, request, hand to), and a finer recognition of the main direction of the movement and its intensity (e.g., small, medium, high), which is useful to better characterize particular actions such as pulling, pushing or pointing.

3. Experimental validation

The proposed human perception system has been validated in a real scenario, namely the collaborative assembly of the small wooden table shown in Figure 3. In particular, the human operator and the robot work together to build a table composed of wooden and 3D-printed parts: the person performs the actions that require more manual dexterity, such as inserting parts (Figure 3a), while the robot assists the human by passing, at the right time, the parts that the human partner needs (Figure 3c).

The experimental setup is composed of a Franka Emika robot arm and a camera network of four RGB-D cameras (Microsoft Kinect V2) positioned at the four corners of the lab room, so as to observe the scene from multiple viewpoints and reduce the possibility of occlusions. Each camera is attached to a local PC equipped with a high-end NVIDIA GeForce RTX 2080 GPU, running both the pose estimation and body parts segmentation modules. All PCs are connected to a local network and send the outputs of the perception modules to a central PC through ROS (Robot Operating System). The central PC merges all contributions to obtain a unique representation of the person for each module (i.e., 3D pose and body parts segmentation), which is then used as input to the action recognition module.

Figure 3: Example of a collaborative assembly task: (a) the operator performs the actions that require high manual dexterity, such as inserting parts; (b) the human worker requests a new object using a "pointing" gesture; (c) the robot moves to pick the requested object while the human continues the assembly process; (d) the robot passes the object to the human partner, exploiting pose estimation and human parsing information to precisely localize the human hand and to avoid possible collisions.

The position of the human worker and his/her activities are constantly monitored by the perception system, enabling an effective and intuitive human-robot interaction. When a new object is required during the assembly process, the worker can simply point to the object he/she wants to receive, as in Figure 3b: the perception system recognizes the "pointing" action and the corresponding intention (i.e., the direction given by the arm), triggering the robot to move and pick the requested object. Once the robot has picked up the object, it moves in front of the operator at a safe distance to signal that it is ready to deliver the object. When the operator is ready to receive the object, he/she extends the arm with the hand open (i.e., the "pass object" action) and the robot passes the object to the human partner, exploiting the pose estimation and human parsing information provided by the perception system to precisely localize the human hand and to avoid possible collisions (Figure 3d).
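The interaction logic just described can be summarized as a simple mapping from recognized actions to high-level robot commands. The sketch below is only a schematic illustration of this workflow: the class, function and command names are hypothetical and do not correspond to the actual ROS interfaces of the system.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class RecognizedAction:
    label: str                                  # e.g., "pointing", "pass_object", "pick", "place"
    arm_direction: Optional[np.ndarray] = None  # 3D direction of the pointing arm (intention)
    hand_position: Optional[np.ndarray] = None  # 3D hand position from pose estimation + parsing

def next_robot_command(action: RecognizedAction, holding_object: bool) -> str:
    """Map the recognized human action to a high-level robot command."""
    if action.label == "pointing" and action.arm_direction is not None:
        # The pointing direction identifies the requested part: pick it up,
        # then wait in front of the operator at a safe distance
        return "pick_requested_object_and_wait"
    if action.label == "pass_object" and holding_object and action.hand_position is not None:
        # Deliver the object at the hand position refined by body parts segmentation
        return "deliver_object_to_hand"
    return "stay_idle"

# Example: the operator points towards a part lying along the x axis of the robot base
print(next_robot_command(RecognizedAction("pointing", arm_direction=np.array([1.0, 0.0, 0.0])), False))
```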
4. Conclusions

In this work, the design of an AI-based perception system for close human-robot collaboration was presented. Special emphasis was placed on achieving an effective and intuitive collaboration for the human operator through body parts segmentation and action recognition. The proposed system has been applied to a collaborative assembly task in a mock-up scenario, highlighting its potential for enabling a safe and natural human-robot collaboration in industrial scenarios.

References

[1] M. Terreran, E. Lamon, S. Michieletto, E. Pagello, Low-cost scalable people tracking system for human-robot collaboration in industrial environment, Procedia Manufacturing 51 (2020) 116–124.
[2] M. J. Rosenstrauch, J. Krüger, Safe human robot collaboration - operation area segmentation for dynamic adjustable distance monitoring, in: 2018 4th International Conference on Control, Automation and Robotics (ICCAR), IEEE, 2018, pp. 17–21.
[3] M. J. Rosenstrauch, T. J. Pannen, J. Krüger, Human robot collaboration - using Kinect V2 for ISO/TS 15066 speed and separation monitoring, Procedia CIRP 76 (2018) 183–186.
[4] S. Yang, W. Xu, Z. Liu, Z. Zhou, D. T. Pham, Multi-source vision perception for human-robot collaboration in manufacturing, in: 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), IEEE, 2018, pp. 1–6.
[5] H. Liu, L. Wang, Collision-free human-robot collaboration based on context awareness, Robotics and Computer-Integrated Manufacturing 67 (2021) 101997.
[6] M. Ragaglia, A. M. Zanchettin, P. Rocco, Trajectory generation algorithm for safe human-robot collaboration based on multiple depth sensor measurements, Mechatronics 55 (2018) 267–281.
[7] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019.
[8] T. Kobayashi, Y. Aoki, S. Shimizu, K. Kusano, S. Okumura, Fine-grained action recognition in assembly work scenes by drawing attention to the hands, in: 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), IEEE, 2019, pp. 440–446.
[9] M. Munaro, F. Basso, E. Menegatti, OpenPTrack: Open source multi-camera calibration and people tracking for RGB-D camera networks, Robotics and Autonomous Systems 75 (2016) 525–538.
[10] D. Evangelista, D. Allegro, M. Terreran, A. Pretto, S. Ghidoni, An unified iterative hand-eye calibration method for eye-on-base and eye-in-hand setups, in: 2022 IEEE 27th International Conference on Emerging Technologies and Factory Automation (ETFA), 2022, pp. 1–7. doi:10.1109/ETFA52439.2022.9921738.
[11] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, OpenPose: Realtime multi-person 2D pose estimation using part affinity fields, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2019) 172–186.
[12] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
[13] P. Li, Y. Xu, Y. Wei, Y. Yang, Self-correction for human parsing, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[14] P. Rosenberger, A. Cosgun, R. Newbury, J. Kwan, V. Ortenzi, P. Corke, M. Grafinger, Object-independent human-to-robot handovers using real time robotic vision, IEEE Robotics and Automation Letters 6 (2020) 17–23.
[15] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, H. Lu, Skeleton-based action recognition with shift graph convolutional network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183–192.
[16] M. Terreran, M. Lazzaretto, S. Ghidoni, Skeleton-based action and gesture recognition for human-robot collaboration, in: International Conference on Intelligent Autonomous Systems, Springer, 2022.