=Paper=
{{Paper
|id=Vol-3762/505
|storemode=property
|title=Industrial Datasets for Multi-Modal Monitoring of an Assembly Task for Human Action Recognition and Segmentation
|pdfUrl=https://ceur-ws.org/Vol-3762/505.pdf
|volume=Vol-3762
|authors=Laura Romeo,Annaclaudia Bono,Grazia Cicirelli,Tiziana D'Orazio
|dblpUrl=https://dblp.org/rec/conf/ital-ia/RomeoBCD24
}}
==Industrial Datasets for Multi-Modal Monitoring of an Assembly Task for Human Action Recognition and Segmentation==
Laura Romeo1,*, Annaclaudia Bono1,2, Grazia Cicirelli1 and Tiziana D'Orazio1
1 Institute of Intelligent Industrial Systems and Technologies for Advanced Manufacturing (STIIMA), National Research Council (CNR), Bari, Italy
2 Department of Electrical and Information Engineering (DEI), Polytechnic of Bari, Bari, Italy
Abstract
With the rapid evolution of advanced industrial systems exploiting deep learning techniques, the availability of multimodal
and heterogeneous datasets of operators working in industrial scenarios is essential. Such datasets allow in-depth studies for
accurate segmentation and recognition of the actions of operators working alongside collaborative robots. Using multimodal
information guarantees the capture of relevant features to analyze human movements properly. This paper presents our recent
research activity on the development of two datasets representing human operators performing assembly tasks in industrial
contexts. The dataset for Human Action Multi-Modal Monitoring in Manufacturing (HA4M) is a collection of multimodal data
recorded using a Microsoft Azure Kinect camera observing 41 subjects while performing 12 actions to assemble an Epicyclic
Gear Train (EGT). The dataset for Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA)
focuses on the interaction between 27 subjects and a collaborative robot while assembling the EGT in 7 actions. In this case,
the acquisition setup consisted of two Microsoft Azure Kinect cameras. Both datasets were collected in controlled laboratories.
To prove the validity of the HA4M and HARMA datasets, state-of-the-art temporal action segmentation models, i.e. MS-TCN++
and ASFormer, were trained using both skeletal and video features. The results successfully prove the effectiveness of the
presented datasets in segmenting human actions in industrial contexts.
Keywords
Image processing, Assembly Datasets, Action Segmentation, Action Recognition, Manufacturing
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
laura.romeo@stiima.cnr.it (L. Romeo); annaclaudia.bono@stiima.cnr.it (A. Bono); grazia.cicirelli@stiima.cnr.it (G. Cicirelli); tiziana.dorazio@stiima.cnr.it (T. D'Orazio)
ORCID: 0000-0001-8138-893X (L. Romeo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In Industry 5.0, the interaction between humans and collaborative robots (cobots) is becoming more and more important for manufacturing processes [1]. Cobots represent a shift in robotic technology. Traditional robots typically operate in confined work cells or dedicated spaces with predefined and automated tasks. Unlike traditional robots, cobots operate in environments where they can interact directly with human workers to solve tasks that require a combination of human cognition and robot strength and repeatability.

In manufacturing processes, human action recognition and segmentation are crucial for many reasons: to promote human-robot cooperation [2]; to assist operators [3]; to support employee training [4, 5]; to increase productivity and safety [6]; or to promote workers' good mental health [7]. In particular, the accurate recognition and segmentation of actions, including the timing of when the actions commence and conclude, is essential for the cobot to understand and interpret the intended actions of the human collaborator, to synchronize its own actions, to respond in real time, and to ensure smooth cooperation with the human collaborator [8, 9].

Recently, research has notably focused on using multimodal data, which can contribute to developing more sophisticated and adaptive action recognition systems. In particular, the information derived from skeletal joints enables researchers to capture temporal variations in body movements. It offers flexibility in focusing on the entire body or on specific body parts, allowing for a comprehensive representation of the action while bypassing eventual privacy concerns [10, 11].

To the best of the authors' knowledge, few vision-based datasets exist on human-cobot cooperation for object assembly in industrial manufacturing. For this reason, in the last few years, our research has focused on generating real datasets for practical applications of action recognition in the manufacturing context. The dataset for Human Action Multi-Modal Monitoring in Manufacturing (HA4M) and the dataset for Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA) consist of multimodal information acquired during the assembly of an Epicyclic Gear Train (EGT), depicted in Figure 1, without and with the collaboration of a cobot, respectively.
Figure 1: Components involved in the assembly of the Epicyclic Gear Train. The CAD model of the components is publicly
available at [12].
The HA4M dataset was recorded using one single depth camera, while the HARMA dataset was recorded using two depth cameras. Microsoft® Azure Kinect cameras were selected as depth sensors in both cases.

The two proposed datasets offer several main contributions compared to the existing ones [13, 14] in the context of object assembly in industrial manufacturing:

• The datasets provide untrimmed sequences of several types of data: RGB frames, Depth maps, RGB-to-Depth-Aligned (RGB-A) frames, and Skeleton data. The availability of a variety of multi-modal data represents an added value for the scientific community to test different machine learning approaches in action segmentation as well as action recognition tasks, by using one or more data modalities.
• The datasets present a variety in action execution due to the different order followed by the subjects to perform the actions and the interchangeable use of both hands.
• The actions have a high granularity, as the components to be assembled and the actions themselves appear visually similar. As a result, recognizing different actions is very challenging and requires a high level of context understanding and object-tracking skills.
• Both datasets provide a good base for developing, validating, and testing techniques and methodologies for the recognition and segmentation of assembly actions.

Preliminary experiments have been conducted to test state-of-the-art temporal action segmentation methods, ASFormer [15] and MS-TCN++ [16], on RGB and skeletal data, achieving considerable accuracy rates in action segmentation.

The remainder of this paper is organized as follows: Section 2 presents the datasets and describes the assembly task, reporting details on the acquisition setup, study participants, and data annotation. Section 3 reports some experimental results on action segmentation. Finally, Section 4 delineates conclusive remarks.

2. Datasets description

The task involves the assembly of an Epicyclic Gear Train (EGT) (see Figure 1), which comprises three phases: the assembly of Block 1, the assembly of Block 2, and the completion of the EGT that joins both blocks. The HA4M dataset contains videos of different operators that assemble the complete EGT. The HARMA dataset, instead, contains videos of different operators that assemble the EGT in collaboration with a cobot. All the subjects participated voluntarily in the experiments. They were asked to execute the task several times as preferred (e.g. with both hands), independently of their dominant hand. Furthermore, the subjects performed the task at their comfortable self-selected speed, so that high time variance could be noticed among the different subjects. The subsequent sections give more details on both datasets.

2.1. HA4M dataset

The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects. The acquisition setup is composed of a Microsoft® Azure Kinect camera placed on a tripod in front of the operator, as pictured in Fig. 2. The camera is at a height of 1.54 m above the floor, at a horizontal distance of 1.78 m from the far border of the table, and is tilted down by an angle of 17°. As shown in Figure 2, the individual components to be assembled are spread on the table in front of the operator and are placed according to the order of assembly. The operator can pick up one component at a time to perform the assembly task standing in front of the table. The experiments took place in two laboratories: one in Italy and one in Spain. Two typical RGB frames captured by the camera in both laboratories are shown in Figure 3. The Figure also depicts the two supports fixed on the table to facilitate the assembly of Block 1 and Block 2.
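As an illustration of how the multi-modal streams listed above (RGB, Depth, RGB-A, and Skeleton data) might be consumed together, the following is a minimal Python sketch. The directory layout, file names, and skeleton file format are hypothetical and do not reflect the official structure of the released datasets; the only dataset-independent fact used is that Azure Kinect body tracking provides 32 joints per subject.

```python
import json
from pathlib import Path

import cv2
import numpy as np

# Hypothetical per-trial folder; the released datasets may organise files differently.
root = Path("HA4M/subject_01/trial_01")
frame_id = 120

# RGB frame from the colour camera.
rgb = cv2.imread(str(root / "rgb" / f"{frame_id:06d}.png"))
# Depth map, stored as 16-bit values in millimetres.
depth = cv2.imread(str(root / "depth" / f"{frame_id:06d}.png"), cv2.IMREAD_UNCHANGED)
# RGB frame re-projected onto the depth camera geometry (RGB-A stream).
rgb_aligned = cv2.imread(str(root / "rgb_aligned" / f"{frame_id:06d}.png"))

# Azure Kinect body tracking exposes 32 joints; a JSON file with per-joint
# (x, y, z) coordinates is assumed here as the on-disk skeleton format.
with open(root / "skeleton" / f"{frame_id:06d}.json") as f:
    skeleton = np.array(json.load(f)["joints"], dtype=np.float32)  # shape (32, 3)

print(rgb.shape, depth.dtype, rgb_aligned.shape, skeleton.shape)
```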
Figure 2: Sketch of the acquisition setup of the HA4M dataset: a Microsoft® Azure Kinect is placed in front of the operator and the table where the components are spread over.

Figure 3: Typical video frames acquired by the RGB-D camera in the (a) Italian and (b) Spanish laboratories.

Table 1: List of Block 1, Block 2, and EGT components, respectively.

Block     Quantity  Description
Block 1   3         Planet Gear
Block 1   3         Planet Gear Bearing
Block 1   1         Carrier Shaft
Block 1   1         Carrier
Block 2   1         Ring Bear
Block 2   1         Sun Gear Bearing
Block 2   1         Sun Gear
Block 2   1         Sun Shaft
EGT       1         Block 1
EGT       1         Block 2
EGT       1         Cover

Tables 1 and 2 list the components and the actions necessary for assembling Block 1, Block 2, and the whole EGT, respectively. Notice that the final action (ID=12) involves additional tools, such as two screws and an Allen key, to secure the EGT. As listed in Table 2, the total number of actions is 12, divided as follows: four actions for building Block 1, four for building Block 2, and four for assembling the two blocks and completing the EGT. Some actions are performed more times as there are more components of the same type to be assembled: actions 2 and 3 are executed three times, while action 11 is repeated two times. Finally, a "don't care" action (ID=0) has been added to manage pauses between action transitions or unexpected events, such as the loss of a component during the assembly.

Table 2: List of actions to build Block 1, Block 2, and the EGT in the HA4M dataset.

Block     ID  Description
–         0   "don't care" action
Block 1   1   Pick up/Place Carrier over Support 1
Block 1   2   Pick up/Place Gear Bearings (×3)
Block 1   3   Pick up/Place Planet Gears (×3)
Block 1   4   Pick up/Place Carrier Shaft
Block 2   5   Pick up/Place Sun Shaft over Support 2
Block 2   6   Pick up/Place Sun Gear
Block 2   7   Pick up/Place Sun Gear Bearing
Block 2   8   Pick up/Place Ring Bear
EGT       9   Pick up Block 2 and place it on Block 1
EGT       10  Pick up/Place Cover
EGT       11  Pick up/Place Screw (×2)
EGT       12  Pick up Allen Key, Turn both screws, Return Allen Key and the EGT
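When parsing per-frame annotations, the action vocabulary of Table 2 can be encoded as a plain label map. This is only a convenience sketch; it does not assume anything about the annotation file format shipped with the dataset, and the repeated actions (×3, ×2) share a single ID.

```python
# HA4M action IDs from Table 2 (sketch; names lightly shortened).
HA4M_ACTIONS = {
    0: "don't care",
    1: "Pick up/Place Carrier over Support 1",
    2: "Pick up/Place Gear Bearings",
    3: "Pick up/Place Planet Gears",
    4: "Pick up/Place Carrier Shaft",
    5: "Pick up/Place Sun Shaft over Support 2",
    6: "Pick up/Place Sun Gear",
    7: "Pick up/Place Sun Gear Bearing",
    8: "Pick up/Place Ring Bear",
    9: "Pick up Block 2 and place it on Block 1",
    10: "Pick up/Place Cover",
    11: "Pick up/Place Screw",
    12: "Pick up Allen Key, turn both screws, return Allen Key and the EGT",
}
NAME_TO_ID = {name: idx for idx, name in HA4M_ACTIONS.items()}
```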
2.2. HARMA dataset

The HARMA dataset comprises 160 videos (80 videos per camera) capturing the assembly task performed by 27 subjects in collaboration with a cobot (a Fanuc CRX10ia/L robotic arm). Each subject performed the task multiple times, resulting in 240 task executions in the dataset. The acquisition setup is pictured in Fig. 4. The two Microsoft® Azure Kinect cameras are placed on tripods in a Frontal and a Lateral position with respect to the Operator Workplace. The Frontal Camera is at a height of 1.72 m above the floor and is tilted down by an angle of 6 degrees, while the Lateral Camera is at a height of 2.07 m and is tilted down by 19 degrees. Two typical RGB frames captured by both cameras are shown in Fig. 5. As shown in Fig. 5, the EGT components are spread over the Operator Workplace, so the operator can pick up one component at a time to perform the assembly task in seven pick-and-place actions [14]. The operator assembles Block 1, whereas the cobot assembles Block 2.
The assembly of Block 2 done by the cobot is not considered in the HARMA dataset, as our goal is to recognize the actions performed by the operator in order to trigger the cobot when it has to approach the operator to perform the collaborative action. So, the HARMA dataset comprises videos of only the assembly task performed by the subjects, including the collaborative action needed to join Block 1 and Block 2 (action 5 in Tab. 3). Table 3 lists the seven actions included in the HARMA dataset. As can be noticed in Table 3, unlike the HA4M dataset, the Cover is secured with two hooks (see Figure 6).

Figure 4: Sketch of the acquisition setup of the HARMA dataset: two Microsoft® Azure Kinect cameras are placed in a Frontal and a Lateral position with respect to the operator's workplace.

Figure 5: Sample frames captured by the (a) Frontal and (b) Lateral camera, respectively, during the assembly task.

Figure 6: Completion of the EGT by placing the Cover and the two Hooks, as included in Action 7 of Table 3.

Table 3: List of the actions carried out by the operator for the construction of the EGT in the HARMA dataset.

Block     ID  Description
–         0   "don't care" action
Block 1   1   Pick up/Place Carrier over the Support
Block 1   2   Pick up/Place Planet Gear Bearing (×3)
Block 1   3   Pick up/Place Planet Gear (×3)
Block 1   4   Pick up/Place Carrier Shaft
EGT       5   Pick up Block 1 and join it with Block 2 held by the cobot
EGT       6   Pick up/Place the Cover
EGT       7   Pick up/Place the 2 Hooks, then leave the EGT on the table

3. Experiments

This section presents preliminary experiments and results on temporal action segmentation, obtained by applying state-of-the-art deep learning methods to the HA4M and HARMA datasets. Both datasets were split into non-overlapping training and testing sets by considering 70% of the videos for training and the remaining 30% for testing, ensuring that videos of the same operator do not appear in both the training and testing sets.
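A minimal sketch of such a subject-wise split is given below. It assumes that per-video subject IDs are available (the file names and the ID mapping here are illustrative placeholders), and it uses scikit-learn's GroupShuffleSplit, which holds out roughly 30% of the operators rather than exactly 30% of the videos.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-video metadata: 217 HA4M videos and the operator appearing
# in each of them (here filled with a fake round-robin mapping over 41 subjects).
video_paths = [f"video_{i:03d}.mp4" for i in range(217)]
subject_ids = [i % 41 for i in range(217)]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=0)
train_idx, test_idx = next(splitter.split(video_paths, groups=subject_ids))

train_videos = [video_paths[i] for i in train_idx]
test_videos = [video_paths[i] for i in test_idx]

# Sanity check: no operator contributes videos to both sets.
assert {subject_ids[i] for i in train_idx}.isdisjoint({subject_ids[i] for i in test_idx})
print(len(train_videos), len(test_videos))
```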
ASFormer [15] and MS-TCN++ [16] have been applied to test action segmentation performance. The ASFormer (resp. MS-TCN++) models were fed with RGB and Skeletal data extracted from both datasets, performing the training over 120 (resp. 100) epochs and collecting the loss at each iteration. The best model is chosen as the one with the lowest loss within the total number of iterations and is used in the test phase.
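The following is a generic PyTorch-style sketch of this selection rule, not the authors' training script: the actual ASFormer and MS-TCN++ implementations use their own multi-stage losses and optimizers, while here a single criterion and a generic data loader stand in for them.

```python
import copy

def train_and_select(model, loader, optimizer, criterion, epochs=120):
    """Train for a fixed number of epochs, track the loss of every iteration,
    and keep the checkpoint with the lowest loss for the test phase."""
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        for features, labels in loader:        # RGB or skeletal feature sequences
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:        # lowest loss over all iterations
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)          # best model used for testing
    return model, best_loss
```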
Tab. 4 lists the performance rates in terms of Accuracy, Edit Score, and F1-score. Accuracy is a frame-wise metric that measures the proportion of correctly classified frames in the entire video sequence, without capturing the temporal dependencies between action segments. The Edit Score, instead, measures how well the model predicts the ordering of the action segments, without requiring exact frame-level alignment. Finally, the F1-score with a threshold τ, often denoted as F1@τ, accounts for the degree of overlap, measured as the Intersection over Union (IoU), between each predicted segment and the ground-truth segments [17]. In the experiments, the threshold τ has been set to 60%, 70% and 80%. Focusing on these metrics, it can be noticed that all the considered models succeeded in correctly segmenting the actions of the assembly task. In particular, the Accuracy rates reached high values (over 91%) in both cases of using RGB or skeletal features.
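The three metrics can be summarized with the following self-contained sketch, which is not the authors' evaluation code: frame-wise accuracy, a segmental edit score based on the normalized Levenshtein distance between segment label orderings, and F1@τ computed by matching predicted and ground-truth segments of the same label through their IoU. Calling f1_at_tau with tau set to 0.6, 0.7 and 0.8 corresponds to the F1@60, F1@70 and F1@80 columns of Table 4.

```python
def to_segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) segments."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))  # end index is exclusive
            start = i
    return segments

def frame_accuracy(gt, pred):
    """Fraction of frames whose predicted label matches the ground truth."""
    return sum(g == p for g, p in zip(gt, pred)) / len(gt)

def edit_score(gt, pred):
    """1 minus the normalized Levenshtein distance between segment label orderings."""
    a = [s[0] for s in to_segments(gt)]
    b = [s[0] for s in to_segments(pred)]
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1.0 - d[-1][-1] / max(len(a), len(b), 1)

def f1_at_tau(gt, pred, tau):
    """Segmental F1: a predicted segment is a true positive when its IoU with an
    unmatched ground-truth segment of the same label reaches the threshold tau."""
    gt_segs, pred_segs = to_segments(gt), to_segments(pred)
    matched, tp = [False] * len(gt_segs), 0
    for label, ps, pe in pred_segs:
        best_iou, best_idx = 0.0, -1
        for idx, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or matched[idx]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_iou >= tau:
            tp += 1
            matched[best_idx] = True
    fp, fn = len(pred_segs) - tp, len(gt_segs) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Tiny example with per-frame action IDs (0 is the "don't care" class).
gt = [1, 1, 1, 2, 2, 2, 2, 0, 3, 3]
pred = [1, 1, 2, 2, 2, 2, 2, 0, 3, 3]
print(frame_accuracy(gt, pred), edit_score(gt, pred), f1_at_tau(gt, pred, 0.6))
```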
For completeness, Figure 7 shows a qualitative representation of the action segmentation obtained by applying the MS-TCN++ and ASFormer models to one video from the HA4M dataset and one from the HARMA dataset.
Table 4: Performance rates on action segmentation obtained by applying the ASFormer and MS-TCN++ architectures, using RGB and Skeletal data grabbed from the HA4M and HARMA datasets.

TAS Model      Dataset  Features  Acc.    Edit    F1@60   F1@70   F1@80
ASFormer [15]  HA4M     RGB       91.79%  95.10%  87.81%  80.82%  70.27%
ASFormer [15]  HA4M     Skeleton  92.43%  93.01%  86.71%  79.28%  69.42%
ASFormer [15]  HARMA    RGB       94.20%  93.60%  92.00%  88.70%  83.40%
ASFormer [15]  HARMA    Skeleton  94.51%  95.08%  91.03%  87.97%  78.24%
MS-TCN++ [16]  HA4M     RGB       93.53%  93.85%  91.12%  86.01%  76.22%
MS-TCN++ [16]  HA4M     Skeleton  94.92%  95.90%  92.57%  88.57%  81.85%
MS-TCN++ [16]  HARMA    RGB       92.13%  86.23%  78.18%  74.54%  66.00%
MS-TCN++ [16]  HARMA    Skeleton  94.45%  93.89%  90.24%  87.80%  81.80%
Figure 7: Action segmentation results over a video from the HA4M (a) and a video from the HARMA (b) dataset. GT, RGB, and Skel stand for Ground Truth, use of RGB features, and use of Skeletal features, respectively. The labels in orange indicate the results obtained by the MS-TCN++ model, while the labels in blue indicate the outcomes of the ASFormer architecture.
These videos have been chosen to display challenging situations, such as the case of Action 2 (dark blue bars) and Action 3 (light blue bars), which in the case of HA4M (Fig. 7(a)) are not always detected properly, depending on the used features or the applied model. On the contrary, Fig. 7(b) shows better segmentation results also for actions 2 and 3. Furthermore, in the HARMA dataset, the availability of two cameras allows us to compensate for the lack of data when one camera fails to provide skeletal data due to occlusions or out-of-range positions [18].
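A minimal sketch of this per-frame fallback between the two views is shown below. It assumes that missing skeletons are encoded as NaN and that the two sequences are either expressed in a common reference frame or consumed by view-invariant features; the function name and array shapes are illustrative, not part of the released datasets.

```python
import numpy as np

def fuse_skeletons(frontal: np.ndarray, lateral: np.ndarray) -> np.ndarray:
    """frontal, lateral: (T, 32, 3) joint sequences from the two Azure Kinects.
    Returns a (T, 32, 3) sequence that prefers the frontal view and falls back
    to the lateral view for frames where the frontal skeleton is missing."""
    fused = frontal.copy()
    missing = np.isnan(frontal).any(axis=(1, 2))  # frames with no valid frontal skeleton
    fused[missing] = lateral[missing]
    return fused

# Example with synthetic data: frame 1 has no frontal skeleton.
frontal = np.random.rand(3, 32, 3)
lateral = np.random.rand(3, 32, 3)
frontal[1] = np.nan
assert not np.isnan(fuse_skeletons(frontal, lateral)).any()
```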
4. Conclusions

The present paper presented an examination of two industrial datasets, namely the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset and the Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA) dataset. Both datasets address the high demand for human action recognition and segmentation within industrial manufacturing contexts, particularly regarding scenarios involving human-robot collaboration and interaction. The multimodal features within the datasets encompass a variety of actions and interactions in industrial assembly tasks, allowing this work to lay the foundation for the development and enhancement of intelligent systems aimed at understanding and assisting human operators in manufacturing production lines.

To properly evaluate HA4M and HARMA, state-of-the-art temporal action segmentation models were considered, namely ASFormer and MS-TCN++, which demonstrated notable success in exploiting the data provided by the datasets. The comparison between the RGB and Skeletal features underlines the potential of a multimodal approach to balance the computational efficiency with the precision required for the recognition and segmentation of complex tasks.

The conducted experiments prove that, overall, both RGB and Skeletal features performed properly. RGB data provides rich visual information about the scene but typically requires higher storage space and computational complexity compared to a skeleton-based data representation. On the other hand, using skeleton data makes it possible to abstract away detailed appearance information and focus solely on the spatial configuration of body joints and their movements. Therefore, it is essential to carefully find a good trade-off and select the data modality that best aligns with the goals and constraints of the working context.

The presented datasets are benchmarks for further studies on novel models and algorithms that can improve the accuracy and reliability of action recognition and segmentation systems in industrial settings. HA4M and HARMA offer a valuable resource for the research community, allowing ongoing innovation and development of human-robot collaboration systems in complex, real-world scenarios.
Acknowledgments

This research has been partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - "FAIR - Future Artificial Intelligence Research" - Spoke 8 "Pervasive AI", funded by the European Commission under the NextGeneration EU program.

References

[1] A. Keshvarparast, D. Battini, O. Battaia, A. Pirayesh, Collaborative robots in manufacturing and assembly systems: literature review and future research agenda, Journal of Intelligent Manufacturing (2023).
[2] L. Wang, R. Gao, J. Vancza, J. Krüger, X. Wang, S. Makris, Symbiotic human-robot collaborative assembly, CIRP Annals - Manufacturing Technology 68 (2019) 701–726.
[3] W. Tao, M. Al-Amin, H. Chen, M. C. Leu, Z. Yin, R. Qin, Real-Time Assembly Operation Recognition with Fog Computing and Transfer Learning for Human-Centered Intelligent Manufacturing, Procedia Manufacturing 48 (2020) 926–931.
[4] J. Patalas-Maliszewska, D. Halikowski, R. Damaševičius, An Automated Recognition of Work Activity in Industrial Manufacturing Using Convolutional Neural Networks, Electronics 10 (2021) 1–17.
[5] M. A. Zamora-Hernandez, J. A. Castro-Vergas, J. Azorin-Lopez, J. Garcia-Rodriguez, Deep learning-based visual control assistant for assembly in industry 4.0, Computers in Industry 131 (2021) 1–15.
[6] T. Kobayashi, Y. Aoki, S. Shimizu, K. Kusano, S. Okumura, Fine-grained Action Recognition in Assembly Work Scenes by Drawing Attention to the Hands, in: 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2019, pp. 440–446. doi:10.1109/SITIS.2019.00077.
[7] M. L. Nicora, E. André, D. Berkmans, C. Carissoli, T. D'Orazio, et al., A human-driven control architecture for promoting good mental health in collaborative robot scenarios, in: 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), 2021, pp. 285–291.
[8] G. Cicirelli, C. Attolico, C. Guaragnella, T. D'Orazio, A Kinect-based gesture recognition approach for a natural human robot interface, International Journal of Advanced Robotic Systems 12 (2015).
[9] M. V. Maselli, R. Marani, G. Cicirelli, T. D'Orazio, Continuous Action Recognition in Manufacturing Contexts by Deep Graph Convolutional Networks, volume 825, Springer, 2024.
[10] L. Romeo, R. Marani, A. Perri, T. D'Orazio, Microsoft Azure Kinect Calibration for Three-Dimensional Dense Point Clouds and Reliable Skeletons, Sensors 22 (2022) 4986.
[11] C. Brambilla, R. Marani, L. Romeo, M. L. Nicora, F. A. Storm, G. Reni, M. Malosio, T. D'Orazio, A. Scano, Azure Kinect performance evaluation for human motion and upper limb biomechanical analysis, Heliyon 9 (2023).
[12] D. F. Redaelli, F. A. Storm, G. Fioretta, MindBot Planetary Gearbox, 2021. URL: https://zenodo.org/record/5675810#.YZZJXrVKjcs. doi:10.5281/zenodo.5675810.
[13] G. Cicirelli, R. Marani, L. Romeo, M. G. Dominguez, J. Heras, A. G. Perri, T. D'Orazio, The HA4M dataset: Multi-Modal Monitoring of an assembly task for Human Action recognition in Manufacturing, Scientific Data 9 (2022).
[14] L. Romeo, R. Marani, G. Cicirelli, T. D'Orazio, A Dataset on Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly, 2024. Submitted to CoDiT2024.
[15] F. Yi, H. Wen, T. Jiang, ASFormer: Transformer for Action Segmentation, in: The British Machine Vision Conference (BMVC), 2021.
[16] S.-J. Li, Y. AbuFarha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 6647–6658.
[17] G. Ding, F. Sener, A. Yao, Temporal Action Segmentation: An analysis of modern techniques, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[18] L. Romeo, G. Cicirelli, T. D'Orazio, Multi-view skeleton analysis for human action recognition and segmentation tasks, 2024. Submitted to CASE2024.