<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Industrial Datasets for Multi-Modal Monitoring of an Assembly Task for Human Action Recognition and Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Romeo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annaclaudia Bono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grazia Cicirelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiziana D'Orazio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Information Engineering (DEI), Polytechnic of Bari</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Intelligent Industrial Systems and Technologies for Advanced Manufacturing (STIIMA), National Research Council (CNR)</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapid evolution of advanced industrial systems exploiting deep learning techniques, the availability of multimodal and heterogeneous datasets of operators working in industrial scenarios is essential. Such datasets allow in-depth studies for accurate segmentation and recognition of the actions of operators working alongside collaborative robots. Using multimodal information guarantees the capture of relevant features to analyze human movements properly. This paper presents our recent research activity on the development of two datasets representing human operators performing assembly tasks in industrial contexts. The dataset for Human Action Multi-Modal Monitoring in Manufacturing (HA4M) is a collection of multimodal data recorded using a Microsoft Azure Kinect camera observing 41 subjects while performing 12 actions to assemble an Epicyclic Gear Train (EGT). The dataset for Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA) focuses on the interaction between 27 subjects and a collaborative robot while assembling the EGT in 7 actions. In this case, the acquisition setup consisted of two Microsoft Azure Kinect cameras. Both datasets were collected in controlled laboratories. To prove the validity of the HA4M and HARMA datasets, state-of-the-art temporal action segmentation models, i.e. MS-TCN++ and ASFormer, were trained using both skeletal and video features. The results prove the effectiveness of the presented datasets in segmenting human actions in industrial contexts.</p>
      </abstract>
      <kwd-group>
        <kwd>Image processing</kwd>
        <kwd>Assembly Datasets</kwd>
        <kwd>Action Segmentation</kwd>
        <kwd>Action Recognition</kwd>
        <kwd>Manufacturing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Preliminary experiments have been conducted to test state-of-the-art temporal action segmentation methods, namely ASFormer [15] and MS-TCN++ [16], on RGB and skeletal data, achieving considerable accuracy rates in action segmentation. The remainder of this paper is organized as follows: Section 2 presents the datasets and describes the assembly task, reporting details on the acquisition setup, study participants, and data annotation. Section 3 reports some experimental results on action segmentation. Finally, Section 4 delineates conclusive remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets description</title>
      <sec id="sec-2-1">
        <title>The HA4M dataset was recorded using one single depth</title>
        <p>camera, while the HARMA dataset was recorded using
two depth cameras. The Microsoft® Azure Kinects have
been selected as depth cameras in both cases.</p>
        <p>The two proposed datasets present various main
contributions compared to the existing ones [13, 14] in the
context of object assembly in industrial manufacturing:
The task involves the assembly of an Epicyclic Gear Train
(EGT) (see Figure 1), which involves three phases: the
• The datasets provide untrimmed sequences of sev- assembly of Block 1, the assembly of Block 2, and then
eral types of data: RGB frames, Depth maps, RGB- the completion of the EGT that makes up both blocks.
to-depth-Aligned (RGB-A) frames, and Skeleton The HA4M dataset contains videos of diferent operators
data. The availability of a variety of multi-modal that assemble the complete EGT. The HARMA dataset,
data represents an added value for the scientific instead, contains videos of diferent operators that
assemcommunity to test diferent machine learning ap- ble the EGT in collaboration with a cobot. All the subjects
proaches in action segmentation as well as ac- participated voluntarily in the experiments. They were
tion recognition tasks, by using one or more data asked to execute the task several times as preferred (e.g.
modalities. with both hands), independently of their dominant hand.
• The datasets present a variety in action execution Furthermore, the subjects performed the task at their
due to the diferent order followed by the subjects comfortable self-selected speed so that high time
varito perform the actions and the interchangeable ance could be noticed among the diferent subjects. The
use of both hands. subsequent sections give more details on both datasets.
• The actions have a high granularity as the
components to be assembled and the actions themselves 2.1. HA4M dataset
appear visually similar. As a result, recognizing
diferent actions is very challenging and requires
a high level of context understanding and
objecttracking skills.
• Both datasets provide a good base for developing,
validating, and testing techniques and
methodologies for the recognition and segmentation of
assembly actions.</p>
      </sec>
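      <p>For illustration only, the following sketch shows one possible in-memory representation of a single multi-modal frame of the kind described above; the class and field names are assumptions of this example and do not correspond to an official loader released with the datasets.</p>
      <preformat>
# Illustrative container for one multi-modal frame of HA4M/HARMA-style data.
# Class name, field names, and shapes are assumptions of this sketch.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultiModalFrame:
    rgb: np.ndarray        # H x W x 3 colour image
    depth: np.ndarray      # H x W depth map (millimetres)
    rgb_a: np.ndarray      # H x W x 3 RGB aligned to the depth geometry
    skeleton: np.ndarray   # J x 3 joint positions (e.g. 32 Azure Kinect joints)
    action_id: int         # ground-truth action label of this frame

def modality(frame: MultiModalFrame, name: str) -> np.ndarray:
    """Select a single modality, e.g. to feed a single-modality model."""
    return {"rgb": frame.rgb, "depth": frame.depth,
            "rgb_a": frame.rgb_a, "skeleton": frame.skeleton}[name]
</preformat>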
      <sec id="sec-2-2">
        <title>2.1. HA4M dataset</title>
        <p>The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects. The acquisition setup is composed of a Microsoft Azure Kinect® camera placed on a tripod in front of the operator, as pictured in Fig. 2.</p>
        <p>The camera is at a height of 1.54 m above the floor, at a horizontal distance of 1.78 m from the far border of the table, and is tilted down at an angle of 17°. As shown in Figure 2, the individual components to be assembled are spread on the table in front of the operator and are placed according to the order of assembly. The operator can pick up one component at a time to perform the assembly task while standing in front of the table. The experiments took place in two laboratories: one in Italy and one in Spain. Two typical RGB frames captured by the camera in both laboratories are shown in Figure 3. The Figure also depicts the two supports fixed on the table to facilitate the assembly of Block 1 and Block 2.</p>
        <p>Tables 1 and 2 list the components and the actions necessary for assembling Block 1, Block 2, and the whole EGT, respectively. Notice that the final action (ID=12) involves additional tools, such as two screws and an Allen key to secure the EGT. As listed in Table 2, the total number of actions is 12, divided as follows: four actions for building Block 1, four for building Block 2, and four for assembling the two blocks into the complete EGT.</p>
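        <p>As a minimal illustration of this 4+4+4 breakdown, the helper below maps an action ID to its assembly phase; it is only a sketch, and it assumes that the IDs follow the order of the three phases.</p>
        <preformat>
# Sketch: map an HA4M action ID (1-12) to its assembly phase, following the
# 4 + 4 + 4 breakdown described in the text; the ID ordering is an assumption.
def ha4m_phase(action_id: int) -> str:
    if action_id not in range(1, 13):
        raise ValueError("HA4M defines 12 actions (IDs 1-12)")
    if action_id in range(1, 5):
        return "Block 1 assembly"
    if action_id in range(5, 9):
        return "Block 2 assembly"
    return "EGT completion"  # IDs 9-12; action 12 also uses two screws and an Allen key
</preformat>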
      </sec>
      <sec id="sec-2-3">
        <title>2.2. HARMA dataset</title>
        <p>The HARMA dataset comprises 160 videos (80 videos per camera) capturing the assembly task performed by 27 subjects in collaboration with a cobot (a Fanuc CRX10ia/L robotic arm). Each subject performed the task multiple times, resulting in 240 task executions in the dataset. The acquisition setup is pictured in Fig. 4. The two Microsoft® Azure Kinect cameras are placed on tripods in Frontal and Lateral positions with respect to the Operator Workplace. The Frontal Camera is at a height of 1.72 m above the floor and is tilted down by an angle of 6°, while the Lateral Camera is at a height of 2.07 m and is tilted down by 19°. Two typical RGB frames captured by both cameras are shown in Fig. 5.</p>
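        <p>For reference, the acquisition geometry just described can be captured in a small configuration structure such as the sketch below; the keys are hypothetical, while the numerical values are those reported above.</p>
        <preformat>
# Sketch of the HARMA two-camera acquisition setup as a plain configuration.
# The dictionary layout is ours; the numbers are those reported in the text.
HARMA_CAMERAS = {
    "frontal": {"height_m": 1.72, "tilt_down_deg": 6.0},
    "lateral": {"height_m": 2.07, "tilt_down_deg": 19.0},
}

for name, cam in HARMA_CAMERAS.items():
    print(f"{name} camera: {cam['height_m']} m above the floor, "
          f"tilted down by {cam['tilt_down_deg']} degrees")
</preformat>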
        <p>As shown in Fig. 5, the EGT components are spread over the Operator Workplace, so the operator can pick up one component at a time to perform the assembly task in seven pick-and-place actions [14]. The operator assembles Block 1, whereas the cobot assembles Block 2. The assembly of Block 2 performed by the cobot is not considered in the HARMA dataset, as our goal is to recognize the actions performed by the operator in order to trigger the cobot when it has to approach the operator for the collaborative action. So, the HARMA dataset comprises videos of only the assembly task performed by the subjects, including the collaborative action needed to join Block 1 and Block 2 (action 5 in Tab. 3). Table 3 lists the seven actions included in the HARMA dataset. As can be noticed in Table 3, unlike the HA4M dataset, the Cover is secured with two hooks (see Figure 6).</p>
        <p>Figure 5: Sample frames captured by the (a) Frontal and (b) Lateral camera, respectively, during the assembly task.</p>
        <p>Figure 6: Completion of the EGT by placing the Cover and the two Hooks, as included in Action 7 of Table 3.</p>
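        <p>Since the recognized actions are meant to trigger the cobot, the sketch below illustrates one plausible way to turn a stream of per-frame predictions into such a trigger; it is not part of the dataset tooling, and the cobot interface (approach_operator()) is a hypothetical placeholder.</p>
        <preformat>
# Illustrative trigger: ask the cobot to approach as soon as the collaborative
# action (action 5 in Tab. 3) has been predicted for enough consecutive frames,
# so that isolated mis-classified frames do not fire the trigger.
COLLABORATIVE_ACTION = 5
MIN_CONSECUTIVE_FRAMES = 15   # assumption: roughly half a second at 30 fps

def monitor(predicted_labels, cobot, min_frames=MIN_CONSECUTIVE_FRAMES):
    """predicted_labels: iterable of per-frame action IDs from the segmentation model.
    cobot: any object exposing a hypothetical approach_operator() method."""
    run = 0
    for label in predicted_labels:
        run = run + 1 if label == COLLABORATIVE_ACTION else 0
        if run == min_frames:   # fires once per contiguous collaborative segment
            cobot.approach_operator()
</preformat>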
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <p>This section presents preliminary experiments and results on temporal action segmentation, obtained by applying state-of-the-art deep learning methods to the HA4M and HARMA datasets. Both datasets were split into non-overlapping training and testing sets by considering 70% of the videos for training and the remaining 30% for testing, ensuring that videos of the same operator do not appear in both the training and testing sets.</p>
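      <p>A subject-wise split of this kind can be realized, for example, as in the following sketch; it is not the code used for the reported experiments, and the grouping of videos by operator ID is an assumption of the example.</p>
      <preformat>
# Sketch: 70/30 train/test split at the subject level, so that all videos of a
# given operator end up in exactly one of the two sets.
import random

def subject_wise_split(videos_by_subject, train_ratio=0.7, seed=0):
    """videos_by_subject: dict mapping a subject ID to the list of its video files."""
    subjects = sorted(videos_by_subject)
    random.Random(seed).shuffle(subjects)
    n_train = round(train_ratio * len(subjects))
    train_subjects = set(subjects[:n_train])
    train, test = [], []
    for subject, videos in videos_by_subject.items():
        (train if subject in train_subjects else test).extend(videos)
    return train, test

# Tiny usage example with dummy file names (two recordings per subject).
demo = {s: [f"subject{s:02d}_take{t}.mkv" for t in (1, 2)] for s in range(10)}
train_videos, test_videos = subject_wise_split(demo)
</preformat>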
      <p>ASFormer [15] and MS-TCN++ [16] have been applied to test action segmentation performance. The ASFormer (resp. MS-TCN++) models were fed with RGB and Skeletal data extracted from both datasets, performing the training over 120 (resp. 100) epochs and collecting the losses at each iteration. The best model is chosen as the one with the lowest loss within the total number of iterations and is used in the test phase.</p>
      <p>Tab. 4 lists the performance rates in terms of Accuracy, Edit Score, and F1-score. Accuracy is a frame-wise metric that measures the proportion of correctly classified frames in the entire video sequence, without capturing the temporal dependencies between action segments. The Edit Score, instead, measures how well the model predicts the ordering of the action segments without requiring exact frame-level alignment. Finally, the F1-score with a threshold τ, often denoted as F1@τ, accounts for the degree of overlap, measured by the Intersection over Union (IoU), between each predicted segment and the ground-truth segments [17]. In the experiments, the threshold τ has been set to 60%, 70% and 80%. Focusing on these metrics, it can be noticed that all the considered models succeeded in correctly segmenting the actions of the assembly task. In particular, the Accuracy rates reached high values (over 91%) both when using RGB and when using skeletal features.</p>
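      <p>To make the segmental metric concrete, the sketch below re-implements an F1@τ score in the spirit of [17]: per-frame labels are first merged into segments, and a predicted segment counts as a true positive when it overlaps an unmatched ground-truth segment of the same class with an IoU of at least τ. This is an illustrative implementation, not necessarily the exact one used to produce Tab. 4.</p>
      <preformat>
# Sketch of the segmental F1@tau metric: per-frame labels are merged into
# segments, and prediction/ground-truth segments of the same class are matched
# when their temporal IoU reaches the threshold tau.
def segments(labels):
    """Turn a per-frame label sequence into (label, start, end) segments, end exclusive."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_tau(gt, pred, tau=0.6):
    gt_segs, pred_segs = segments(gt), segments(pred)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for p_lab, p_s, p_e in pred_segs:
        best_iou, best_j = 0.0, None
        for j, (g_lab, g_s, g_e) in enumerate(gt_segs):
            if g_lab != p_lab or matched[j]:
                continue
            inter = max(0, min(p_e, g_e) - max(p_s, g_s))
            union = max(p_e, g_e) - min(p_s, g_s)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j is not None and best_iou >= tau:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Example: F1@0.6 on a toy label sequence.
print(f1_at_tau([1, 1, 1, 2, 2, 3, 3, 3], [1, 1, 2, 2, 2, 3, 3, 3], tau=0.6))
</preformat>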
        <sec id="sec-2-2-1">
          <title>ASFormer [15] MS-TCN++ [16] HA4M</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>HARMA</title>
          <p>HA4M</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>HARMA RGB</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Skeleton RGB</title>
        </sec>
        <sec id="sec-2-2-5">
          <title>Skeleton RGB</title>
        </sec>
        <sec id="sec-2-2-6">
          <title>Skeleton RGB</title>
        </sec>
        <sec id="sec-2-2-7">
          <title>Skeleton</title>
      <p>For completeness, Figure 7 shows a qualitative representation of the action segmentation obtained by applying the MS-TCN++ and ASFormer models to one video from the HA4M dataset and one from the HARMA dataset. These videos have been chosen to display challenging situations, such as the case of Action 2 (dark blue bars) and Action 3 (light blue bars), which in the case of HA4M (Fig. 7(a)) are not always detected properly, depending on the used features or the applied model. On the contrary, Fig. 7(b) shows better segmentation results also for Actions 2 and 3. Furthermore, in the HARMA dataset, the availability of two cameras allows us to compensate for the lack of data when one camera fails to provide skeletal data due to occlusion or out-of-range conditions [18].</p>
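      <p>The following sketch illustrates a simple way to exploit the second viewpoint when the skeleton stream of one camera has gaps; the fall-back rule shown here (prefer the frontal view, otherwise use the lateral one) is an illustrative choice and not the procedure adopted in [18].</p>
      <preformat>
# Sketch: fill gaps in the frontal-camera skeleton stream with the lateral one.
# Each stream is a frame-aligned list whose elements are either a skeleton
# (e.g. a J x 3 array of joints) or None when body tracking failed
# (occlusion, subject out of range, ...).
def fuse_skeleton_streams(frontal, lateral):
    assert len(frontal) == len(lateral), "streams must be frame-aligned"
    fused, missing = [], 0
    for f_skel, l_skel in zip(frontal, lateral):
        chosen = f_skel if f_skel is not None else l_skel
        # In practice the lateral skeleton should first be registered to the
        # frontal camera frame before it can replace a missing frontal skeleton.
        if chosen is None:
            missing += 1   # neither camera tracked the body in this frame
        fused.append(chosen)
    return fused, missing
</preformat>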
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions</title>
      <p>The present paper depicted an examination of two industrial datasets, namely the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset and the Human-Cobot Collaboration for Action Recognition in Manufacturing (HARMA) dataset. Both datasets address the high demand for human action recognition and segmentation within industrial manufacturing contexts, particularly regarding scenarios involving Human-Robot collaboration and interaction. The multimodal features within the datasets encompass a variety of actions and interactions in industrial assembly tasks, allowing this work to lay the foundation for the development and enhancement of intelligent systems aimed at understanding and assisting human operators in manufacturing production lines.</p>
      <p>To properly evaluate HA4M and HARMA, state-of-the-art temporal action segmentation models were considered, namely ASFormer and MS-TCN++, which demonstrated notable success in exploiting the data provided by the datasets. The comparison between the RGB and Skeletal features underlines the potential of a multimodal approach to balance the computational efficiency with the precision required for the recognition and segmentation of complex tasks.</p>
      <p>The conducted experiments prove that, overall, both RGB and Skeletal features performed properly. RGB data provides rich visual information about the scene, but typically requires higher storage space and computational complexity compared to a skeleton-based data representation. On the other hand, by using skeleton data it is possible to abstract away detailed appearance information and focus solely on the spatial configuration of body joints and movements. Therefore, it is essential to carefully find a good trade-off and select the data modality that best aligns with the goals and constraints of the working context.</p>
      <p>The presented datasets are benchmarks for further studies on novel models and algorithms that can improve the accuracy and reliability of action recognition and segmentation systems in industrial settings. HA4M and HARMA offer a valuable resource for the research community, allowing ongoing innovation and development of human-robot collaboration systems in complex, real-world scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This research has been partly funded by PNRR - M4C2</title>
        <p>Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR
- Future Artificial Intelligence Research" - Spoke 8
“Pervasive AI", funded by the European Commission under
the NextGeneration EU program.
complexity compared to skeleton-based data
representation. On the other hand, by using skeleton data is possible
to abstract away detailed appearance information and
focus solely on the spatial configuration of body joints
and movements. Therefore, it’s essential to carefully
ifnd a good trade-of and select the data modality that
best aligns with the goals and constraints of the working
context.</p>
        <p>The presented datasets are benchmarks for further
studies in novel models and algorithms that can improve
the accuracy and reliability of action recognition and
segmentation systems in industrial settings. HA4M and
HARMA ofer a valuable resource for the research
community, allowing ongoing innovation and development
of human-robot collaboration systems in complex,
realworld scenarios.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>1</label><mixed-citation>A. Keshvarparast, D. Battini, O. Battaia, A. Pirayesh, Collaborative robots in manufacturing and assembly systems: literature review and future research agenda, Journal of Intelligent Manufacturing (2023).</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>L. Wang, R. Gao, J. Vancza, J. Krüger, X. Wang, S. Makris, Symbiotic human-robot collaborative assembly, CIRP Annals - Manufacturing Technology 68 (2019) 701–726.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>W. Tao, M. Al-Amin, H. Chen, M. C. Leu, Z. Yin, R. Qin, Real-Time Assembly Operation Recognition with Fog Computing and Transfer Learning for Human-Centered Intelligent Manufacturing, Procedia Manufacturing 48 (2020) 926–931.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>J. Patalas-Maliszewska, D. Halikowski, R. Damaševičius, An Automated Recognition of Work Activity in Industrial Manufacturing Using Convolutional Neural Networks, Electronics 10 (2021) 1–17.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>M. A. Zamora-Hernandez, J. A. Castro-Vargas, J. Azorin-Lopez, J. Garcia-Rodriguez, Deep learning-based visual control assistant for assembly in industry 4.0, Computers in Industry 131 (2021) 1–15.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>T. Kobayashi, Y. Aoki, S. Shimizu, K. Kusano, S. Okumura, Fine-grained Action Recognition in Assembly Work Scenes by Drawing Attention to the Hands, in: 15th International Conference on Signal-Image Technology &amp; Internet-Based Systems (SITIS), 2019, pp. 440–446. doi:10.1109/SITIS.2019.00077.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>M. L. Nicora, E. André, D. Berkmans, C. Carissoli, T. D'Orazio, et al., A human-driven control architecture for promoting good mental health in collaborative robot scenarios, in: 2021 30th IEEE International Conference on Robot &amp; Human Interactive Communication (RO-MAN), 2021, pp. 285–291.</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>G. Cicirelli, C. Attolico, C. Guaragnella, T. D'Orazio, A kinect-based gesture recognition approach for a natural human robot interface, International Journal of Advanced Robotic Systems 12 (2015).</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>M. V. Maselli, R. Marani, G. Cicirelli, T. D'Orazio, Continuous Action Recognition in Manufacturing Contexts by Deep Graph Convolutional Networks, volume 825, Springer, 2024.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>L. Romeo, R. Marani, A. Perri, T. D'Orazio, Microsoft Azure Kinect Calibration for Three-Dimensional Dense Point Clouds and Reliable Skeletons, Sensors 22 (2022) 4986.</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>C. Brambilla, R. Marani, L. Romeo, M. L. Nicora, F. A. Storm, G. Reni, M. Malosio, T. D'Orazio, A. Scano, Azure kinect performance evaluation for human motion and upper limb biomechanical analysis, Heliyon 9 (2023).</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>D. F. Redaelli, F. A. Storm, G. Fioretta, MindBot Planetary Gearbox, 2021. URL: https://zenodo.org/record/5675810#.YZZJXrVKjcs. doi:10.5281/zenodo.5675810.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>G. Cicirelli, R. Marani, L. Romeo, M. G. Dominguez, J. Heras, A. G. Perri, T. D'Orazio, The HA4M dataset: Multi-Modal Monitoring of an assembly task for Human Action recognition in Manufacturing, Scientific Data 9 (2022).</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>L. Romeo, R. Marani, G. Cicirelli, T. D'Orazio, A Dataset on Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly, 2024. Submitted to CoDiT2024.</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>F. Yi, H. Wen, T. Jiang, ASFormer: Transformer for Action Segmentation, in: The British Machine Vision Conference (BMVC), 2021.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>S.-J. Li, Y. AbuFarha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 6647–6658.</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>G. Ding, F. Sener, A. Yao, Temporal Action Segmentation: An analysis of modern techniques, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).</mixed-citation></ref>
      <ref id="ref18"><label>18</label><mixed-citation>L. Romeo, G. Cicirelli, T. D'Orazio, Multi-view skeleton analysis for human action recognition and segmentation tasks, 2024. Submitted to CASE2024.</mixed-citation></ref>
    </ref-list>
  </back>
</article>