<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Industrial Datasets for Multi-Modal Monitoring of an Assembly Task for Human Action Recognition and Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Romeo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annaclaudia Bono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grazia Cicirelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiziana D'Orazio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Information Engineering (DEI), Polytechnic of Bari</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Intelligent Industrial Systems and Technologies for Advanced Manufacturing (STIIMA), National Research Council (CNR)</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapid evolution of advanced industrial systems exploiting deep learning techniques, the availability of multimodal and heterogeneous datasets of operators working in industrial scenarios is essential. Such datasets allow in-depth studies for accurate segmentation and recognition of the actions of operators working alongside collaborative robots. Using multimodal information guarantees the capture of relevant features to analyze human movements properly. This paper presents our recent research activity on the development of two datasets representing human operators performing assembly tasks in industrial contexts. The dataset for Human Action Multi-Modal Monitoring in Manufacturing (HA4M) is a collection of multimodal data recorded using a Microsoft Azure Kinect camera observing 41 subjects while performing 12 actions to assemble an Epicyclic Gear Train (EGT). The dataset for Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA) focuses on the interaction between 27 subjects and a collaborative robot while assembling the EGT in 7 actions. In this case, the acquisition setup consisted of two Microsoft Azure Kinect cameras. Both datasets were collected in controlled laboratories. To prove the validity of the HA4M and HARMA datasets, state-of-the-art temporal action segmentation models, i.e. MS-TCN++ and ASFormer, were trained using both skeletal and video features. The results prove the effectiveness of the presented datasets in segmenting human actions in industrial contexts.</p>
      </abstract>
      <kwd-group>
        <kwd>Image processing</kwd>
        <kwd>Assembly Datasets</kwd>
        <kwd>Action Segmentation</kwd>
        <kwd>Action Recognition</kwd>
        <kwd>Manufacturing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Preliminary experiments have been conducted to test state-of-the-art temporal action segmentation methods, namely ASFormer [15] and MS-TCN++ [16], on RGB and skeletal data, achieving considerable accuracy rates in action segmentation. The remainder of this paper is organized as follows: Section 2 presents the datasets and describes the assembly task, reporting details on the acquisition setup, study participants, and data annotation. Section 3 reports some experimental results on action segmentation. Finally, Section 4 delineates conclusive remarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Datasets description</title>
      <sec id="sec-2-1">
        <title>The HA4M dataset was recorded using one single depth</title>
        <p>camera, while the HARMA dataset was recorded using
two depth cameras. The Microsoft® Azure Kinects have
been selected as depth cameras in both cases.</p>
        <p>The two proposed datasets present various main
contributions compared to the existing ones [13, 14] in the
context of object assembly in industrial manufacturing:
The task involves the assembly of an Epicyclic Gear Train
(EGT) (see Figure 1), which involves three phases: the
• The datasets provide untrimmed sequences of sev- assembly of Block 1, the assembly of Block 2, and then
eral types of data: RGB frames, Depth maps, RGB- the completion of the EGT that makes up both blocks.
to-depth-Aligned (RGB-A) frames, and Skeleton The HA4M dataset contains videos of diferent operators
data. The availability of a variety of multi-modal that assemble the complete EGT. The HARMA dataset,
data represents an added value for the scientific instead, contains videos of diferent operators that
assemcommunity to test diferent machine learning ap- ble the EGT in collaboration with a cobot. All the subjects
proaches in action segmentation as well as ac- participated voluntarily in the experiments. They were
tion recognition tasks, by using one or more data asked to execute the task several times as preferred (e.g.
modalities. with both hands), independently of their dominant hand.
• The datasets present a variety in action execution Furthermore, the subjects performed the task at their
due to the diferent order followed by the subjects comfortable self-selected speed so that high time
varito perform the actions and the interchangeable ance could be noticed among the diferent subjects. The
use of both hands. subsequent sections give more details on both datasets.
• The actions have a high granularity as the
components to be assembled and the actions themselves 2.1. HA4M dataset
appear visually similar. As a result, recognizing
diferent actions is very challenging and requires
a high level of context understanding and
objecttracking skills.
• Both datasets provide a good base for developing,
validating, and testing techniques and
methodologies for the recognition and segmentation of
assembly actions.</p>
      </sec>
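      <p>For illustration only, the following sketch shows one possible in-memory representation of a single multi-modal frame of the kind described above; the class and field names are assumptions of this example and do not correspond to an official loader released with the datasets.</p>
      <preformat>
# Illustrative container for one multi-modal frame of HA4M/HARMA-style data.
# Class name, field names, and shapes are assumptions of this sketch.
from dataclasses import dataclass
import numpy as np

@dataclass
class MultiModalFrame:
    rgb: np.ndarray        # H x W x 3 colour image
    depth: np.ndarray      # H x W depth map (millimetres)
    rgb_a: np.ndarray      # H x W x 3 RGB aligned to the depth geometry
    skeleton: np.ndarray   # J x 3 joint positions (e.g. 32 Azure Kinect joints)
    action_id: int         # ground-truth action label of this frame

def modality(frame: MultiModalFrame, name: str) -> np.ndarray:
    """Select a single modality, e.g. to feed a single-modality model."""
    return {"rgb": frame.rgb, "depth": frame.depth,
            "rgb_a": frame.rgb_a, "skeleton": frame.skeleton}[name]
</preformat>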
      <sec id="sec-2-2">
        <title>2.1. HA4M dataset</title>
        <p>The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects. The acquisition setup is composed of a Microsoft Azure Kinect® camera placed on a tripod in front of the operator, as pictured in Fig. 2.</p>
        <p>The camera is at a height of 1.54 m above the floor, at a horizontal distance of 1.78 m from the far border of the table, and is tilted down at an angle of 17°. As shown in Figure 2, the individual components to be assembled are spread on the table in front of the operator and are placed according to the order of assembly. The operator can pick up one component at a time to perform the assembly task while standing in front of the table. The experiments took place in two laboratories: one in Italy and one in Spain. Two typical RGB frames captured by the camera in both laboratories are shown in Figure 3. The Figure also depicts the two supports fixed on the table to facilitate the assembly of Block 1 and Block 2.</p>
        <p>Tables 1 and 2 list the components and the actions necessary for assembling Block 1, Block 2, and the whole EGT, respectively. Notice that the final action (ID=12) involves additional tools, such as two screws and an Allen key to secure the EGT. As listed in Table 2, the total number of actions is 12, divided as follows: four actions for building Block 1, four for building Block 2, and four for assembling the two blocks into the complete EGT.</p>
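        <p>As a minimal illustration of this 4+4+4 breakdown, the helper below maps an action ID to its assembly phase; it is only a sketch, and it assumes that the IDs follow the order of the three phases.</p>
        <preformat>
# Sketch: map an HA4M action ID (1-12) to its assembly phase, following the
# 4 + 4 + 4 breakdown described in the text; the ID ordering is an assumption.
def ha4m_phase(action_id: int) -> str:
    if action_id not in range(1, 13):
        raise ValueError("HA4M defines 12 actions (IDs 1-12)")
    if action_id in range(1, 5):
        return "Block 1 assembly"
    if action_id in range(5, 9):
        return "Block 2 assembly"
    return "EGT completion"  # IDs 9-12; action 12 also uses two screws and an Allen key
</preformat>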
      </sec>
      <sec id="sec-2-3">
        <title>2.2. HARMA dataset</title>
        <p>The HARMA dataset comprises 160 videos (80 videos per camera) capturing the assembly task performed by 27 subjects in collaboration with a cobot (a Fanuc CRX10ia/L robotic arm). Each subject performed the task multiple times, resulting in 240 task executions in the dataset. The acquisition setup is pictured in Fig. 4. The two Microsoft® Azure Kinect cameras are placed on tripods in Frontal and Lateral positions with respect to the Operator Workplace. The Frontal Camera is at a height of 1.72 m above the floor and is tilted down by an angle of 6°, while the Lateral Camera is at a height of 2.07 m and is tilted down by 19°. Two typical RGB frames captured by both cameras are shown in Fig. 5.</p>
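        <p>For reference, the acquisition geometry just described can be captured in a small configuration structure such as the sketch below; the keys are hypothetical, while the numerical values are those reported above.</p>
        <preformat>
# Sketch of the HARMA two-camera acquisition setup as a plain configuration.
# The dictionary layout is ours; the numbers are those reported in the text.
HARMA_CAMERAS = {
    "frontal": {"height_m": 1.72, "tilt_down_deg": 6.0},
    "lateral": {"height_m": 2.07, "tilt_down_deg": 19.0},
}

for name, cam in HARMA_CAMERAS.items():
    print(f"{name} camera: {cam['height_m']} m above the floor, "
          f"tilted down by {cam['tilt_down_deg']} degrees")
</preformat>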
        <p>As shown in Fig. 5, the EGT components are spread over the Operator Workplace, so the operator can pick up one component at a time to perform the assembly task in seven pick-and-place actions [14]. The operator assembles Block 1, whereas the cobot assembles Block 2. The assembly of Block 2 performed by the cobot is not considered in the HARMA dataset, as our goal is to recognize the actions performed by the operator in order to trigger the cobot when it has to approach the operator for the collaborative action. So, the HARMA dataset comprises videos of only the assembly task performed by the subjects, including the collaborative action needed to join Block 1 and Block 2 (action 5 in Tab. 3). Table 3 lists the seven actions included in the HARMA dataset. As can be noticed in Table 3, unlike the HA4M dataset, the Cover is secured with two hooks (see Figure 6).</p>
        <p>Figure 5: Sample frames captured by the (a) Frontal and (b) Lateral camera, respectively, during the assembly task.</p>
        <p>Figure 6: Completion of the EGT by placing the Cover and the two Hooks, as included in Action 7 of Table 3.</p>
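        <p>Since the recognized actions are meant to trigger the cobot, the sketch below illustrates one plausible way to turn a stream of per-frame predictions into such a trigger; it is not part of the dataset tooling, and the cobot interface (approach_operator()) is a hypothetical placeholder.</p>
        <preformat>
# Illustrative trigger: ask the cobot to approach as soon as the collaborative
# action (action 5 in Tab. 3) has been predicted for enough consecutive frames,
# so that isolated mis-classified frames do not fire the trigger.
COLLABORATIVE_ACTION = 5
MIN_CONSECUTIVE_FRAMES = 15   # assumption: roughly half a second at 30 fps

def monitor(predicted_labels, cobot, min_frames=MIN_CONSECUTIVE_FRAMES):
    """predicted_labels: iterable of per-frame action IDs from the segmentation model.
    cobot: any object exposing a hypothetical approach_operator() method."""
    run = 0
    for label in predicted_labels:
        run = run + 1 if label == COLLABORATIVE_ACTION else 0
        if run == min_frames:   # fires once per contiguous collaborative segment
            cobot.approach_operator()
</preformat>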
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiments</title>
      <p>This section presents preliminary experiments and results on temporal action segmentation, obtained by applying state-of-the-art deep learning methods to the HA4M and HARMA datasets. Both datasets were split into non-overlapping training and testing sets by considering 70% of the videos for training and the remaining 30% for testing, ensuring that videos of the same operator do not appear in both the training and testing sets.</p>
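      <p>A subject-wise split of this kind can be realized, for example, as in the following sketch; it is not the code used for the reported experiments, and the grouping of videos by operator ID is an assumption of the example.</p>
      <preformat>
# Sketch: 70/30 train/test split at the subject level, so that all videos of a
# given operator end up in exactly one of the two sets.
import random

def subject_wise_split(videos_by_subject, train_ratio=0.7, seed=0):
    """videos_by_subject: dict mapping a subject ID to the list of its video files."""
    subjects = sorted(videos_by_subject)
    random.Random(seed).shuffle(subjects)
    n_train = round(train_ratio * len(subjects))
    train_subjects = set(subjects[:n_train])
    train, test = [], []
    for subject, videos in videos_by_subject.items():
        (train if subject in train_subjects else test).extend(videos)
    return train, test

# Tiny usage example with dummy file names (two recordings per subject).
demo = {s: [f"subject{s:02d}_take{t}.mkv" for t in (1, 2)] for s in range(10)}
train_videos, test_videos = subject_wise_split(demo)
</preformat>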
      <p>ASFormer [15] and MS-TCN++ [16] have been applied to test action segmentation performance. The ASFormer (resp. MS-TCN++) models were fed with RGB and Skeletal data extracted from both datasets, performing the training over 120 (resp. 100) epochs and collecting the losses at each iteration. The best model is chosen as the one with the lowest loss within the total number of iterations and is used in the test phase.</p>
      <p>Tab. 4 lists the performance rates in terms of Accuracy, Edit Score, and F1-score. Accuracy is a frame-wise metric that measures the proportion of correctly classified frames in the entire video sequence, without capturing the temporal dependencies between action segments. The Edit Score, instead, measures how well the model predicts the ordering of the action segments without requiring exact frame-level alignment. Finally, the F1-score with a threshold τ, often denoted as F1@τ, accounts for the degree of overlap, measured by the Intersection over Union (IoU), between each predicted segment and the ground-truth segments [17]. In the experiments, the threshold τ has been set to 60%, 70% and 80%. Focusing on these metrics, it can be noticed that all the considered models succeeded in correctly segmenting the actions of the assembly task. In particular, the Accuracy rates reached high values (over 91%) both when using RGB and when using skeletal features.</p>
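      <p>To make the segmental metric concrete, the sketch below re-implements an F1@τ score in the spirit of [17]: per-frame labels are first merged into segments, and a predicted segment counts as a true positive when it overlaps an unmatched ground-truth segment of the same class with an IoU of at least τ. This is an illustrative implementation, not necessarily the exact one used to produce Tab. 4.</p>
      <preformat>
# Sketch of the segmental F1@tau metric: per-frame labels are merged into
# segments, and prediction/ground-truth segments of the same class are matched
# when their temporal IoU reaches the threshold tau.
def segments(labels):
    """Turn a per-frame label sequence into (label, start, end) segments, end exclusive."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_tau(gt, pred, tau=0.6):
    gt_segs, pred_segs = segments(gt), segments(pred)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for p_lab, p_s, p_e in pred_segs:
        best_iou, best_j = 0.0, None
        for j, (g_lab, g_s, g_e) in enumerate(gt_segs):
            if g_lab != p_lab or matched[j]:
                continue
            inter = max(0, min(p_e, g_e) - max(p_s, g_s))
            union = max(p_e, g_e) - min(p_s, g_s)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j is not None and best_iou >= tau:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)

# Example: F1@0.6 on a toy label sequence.
print(f1_at_tau([1, 1, 1, 2, 2, 3, 3, 3], [1, 1, 2, 2, 2, 3, 3, 3], tau=0.6))
</preformat>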
        <sec id="sec-2-2-1">
          <title>ASFormer [15] MS-TCN++ [16] HA4M</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>HARMA</title>
          <p>HA4M</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>HARMA RGB</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Skeleton RGB</title>
        </sec>
        <sec id="sec-2-2-5">
          <title>Skeleton RGB</title>
        </sec>
        <sec id="sec-2-2-6">
          <title>Skeleton RGB</title>
        </sec>
        <sec id="sec-2-2-7">
          <title>Skeleton</title>
      <p>For completeness, Figure 7 shows a qualitative representation of the action segmentation obtained by applying the MS-TCN++ and ASFormer models to one video from the HA4M dataset and one from the HARMA dataset. These videos have been chosen to display challenging situations, such as the case of Action 2 (dark blue bars) and Action 3 (light blue bars), which in the case of HA4M (Fig. 7(a)) are not always detected properly, depending on the used features or the applied model. On the contrary, Fig. 7(b) shows better segmentation results also for Actions 2 and 3. Furthermore, in the HARMA dataset, the availability of two cameras allows us to compensate for the lack of data when one camera fails to provide skeletal data due to occlusion or out-of-range conditions [18].</p>
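      <p>The following sketch illustrates a simple way to exploit the second viewpoint when the skeleton stream of one camera has gaps; the fall-back rule shown here (prefer the frontal view, otherwise use the lateral one) is an illustrative choice and not the procedure adopted in [18].</p>
      <preformat>
# Sketch: fill gaps in the frontal-camera skeleton stream with the lateral one.
# Each stream is a frame-aligned list whose elements are either a skeleton
# (e.g. a J x 3 array of joints) or None when body tracking failed
# (occlusion, subject out of range, ...).
def fuse_skeleton_streams(frontal, lateral):
    assert len(frontal) == len(lateral), "streams must be frame-aligned"
    fused, missing = [], 0
    for f_skel, l_skel in zip(frontal, lateral):
        chosen = f_skel if f_skel is not None else l_skel
        # In practice the lateral skeleton should first be registered to the
        # frontal camera frame before it can replace a missing frontal skeleton.
        if chosen is None:
            missing += 1   # neither camera tracked the body in this frame
        fused.append(chosen)
    return fused, missing
</preformat>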
    </sec>
    <sec id="sec-5">
      <title>4. Conclusions</title>
      <p>The present paper depicted an examination of two industrial datasets, namely the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset and the Human-Cobot Collaboration for Action Recognition in Manufacturing (HARMA) dataset. Both datasets address the high demand for human action recognition and segmentation within industrial manufacturing contexts, particularly regarding scenarios involving Human-Robot collaboration and interaction. The multimodal features within the datasets encompass a variety of actions and interactions in industrial assembly tasks, allowing this work to lay the foundation for the development and enhancement of intelligent systems aimed at understanding and assisting human operators in manufacturing production lines.</p>
      <p>To properly evaluate HA4M and HARMA, state-of-the-art temporal action segmentation models were considered, namely ASFormer and MS-TCN++, which demonstrated notable success in exploiting the data provided by the datasets. The comparison between the RGB and Skeletal features underlines the potential of a multimodal approach to balance the computational efficiency with the precision required for the recognition and segmentation of complex tasks.</p>
      <p>The conducted experiments prove that, overall, both RGB and Skeletal features performed properly. RGB data provides rich visual information about the scene, but typically requires higher storage space and computational complexity compared to a skeleton-based data representation. On the other hand, by using skeleton data it is possible to abstract away detailed appearance information and focus solely on the spatial configuration of body joints and movements. Therefore, it is essential to carefully find a good trade-off and select the data modality that best aligns with the goals and constraints of the working context.</p>
      <p>The presented datasets are benchmarks for further studies on novel models and algorithms that can improve the accuracy and reliability of action recognition and segmentation systems in industrial settings. HA4M and HARMA offer a valuable resource for the research community, allowing ongoing innovation and development of human-robot collaboration systems in complex, real-world scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This research has been partly funded by PNRR - M4C2</title>
        <p>Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR
- Future Artificial Intelligence Research" - Spoke 8
“Pervasive AI", funded by the European Commission under
the NextGeneration EU program.
complexity compared to skeleton-based data
representation. On the other hand, by using skeleton data is possible
to abstract away detailed appearance information and
focus solely on the spatial configuration of body joints
and movements. Therefore, it’s essential to carefully
ifnd a good trade-of and select the data modality that
best aligns with the goals and constraints of the working
context.</p>
        <p>The presented datasets are benchmarks for further
studies in novel models and algorithms that can improve
the accuracy and reliability of action recognition and
segmentation systems in industrial settings. HA4M and
HARMA ofer a valuable resource for the research
community, allowing ongoing innovation and development
of human-robot collaboration systems in complex,
realworld scenarios.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>1</label><mixed-citation>A. Keshvarparast, D. Battini, O. Battaia, A. Pirayesh, Collaborative robots in manufacturing and assembly systems: literature review and future research agenda, Journal of Intelligent Manufacturing (2023).</mixed-citation></ref>
      <ref id="ref2"><label>2</label><mixed-citation>L. Wang, R. Gao, J. Vancza, J. Krüger, X. Wang, S. Makris, Symbiotic human-robot collaborative assembly, CIRP Annals - Manufacturing Technology 68 (2019) 701–726.</mixed-citation></ref>
      <ref id="ref3"><label>3</label><mixed-citation>W. Tao, M. Al-Amin, H. Chen, M. C. Leu, Z. Yin, R. Qin, Real-Time Assembly Operation Recognition with Fog Computing and Transfer Learning for Human-Centered Intelligent Manufacturing, Procedia Manufacturing 48 (2020) 926–931.</mixed-citation></ref>
      <ref id="ref4"><label>4</label><mixed-citation>J. Patalas-Maliszewska, D. Halikowski, R. Damaševičius, An Automated Recognition of Work Activity in Industrial Manufacturing Using Convolutional Neural Networks, Electronics 10 (2021) 1–17.</mixed-citation></ref>
      <ref id="ref5"><label>5</label><mixed-citation>M. A. Zamora-Hernandez, J. A. Castro-Vargas, J. Azorin-Lopez, J. Garcia-Rodriguez, Deep learning-based visual control assistant for assembly in industry 4.0, Computers in Industry 131 (2021) 1–15.</mixed-citation></ref>
      <ref id="ref6"><label>6</label><mixed-citation>T. Kobayashi, Y. Aoki, S. Shimizu, K. Kusano, S. Okumura, Fine-grained Action Recognition in Assembly Work Scenes by Drawing Attention to the Hands, in: 15th International Conference on Signal-Image Technology &amp; Internet-Based Systems (SITIS), 2019, pp. 440–446. doi:10.1109/SITIS.2019.00077.</mixed-citation></ref>
      <ref id="ref7"><label>7</label><mixed-citation>M. L. Nicora, E. André, D. Berkmans, C. Carissoli, T. D'Orazio, et al., A human-driven control architecture for promoting good mental health in collaborative robot scenarios, in: 2021 30th IEEE International Conference on Robot &amp; Human Interactive Communication (RO-MAN), 2021, pp. 285–291.</mixed-citation></ref>
      <ref id="ref8"><label>8</label><mixed-citation>G. Cicirelli, C. Attolico, C. Guaragnella, T. D'Orazio, A kinect-based gesture recognition approach for a natural human robot interface, International Journal of Advanced Robotic Systems 12 (2015).</mixed-citation></ref>
      <ref id="ref9"><label>9</label><mixed-citation>M. V. Maselli, R. Marani, G. Cicirelli, T. D'Orazio, Continuous Action Recognition in Manufacturing Contexts by Deep Graph Convolutional Networks, volume 825, Springer, 2024.</mixed-citation></ref>
      <ref id="ref10"><label>10</label><mixed-citation>L. Romeo, R. Marani, A. Perri, T. D'Orazio, Microsoft Azure Kinect Calibration for Three-Dimensional Dense Point Clouds and Reliable Skeletons, Sensors 22 (2022) 4986.</mixed-citation></ref>
      <ref id="ref11"><label>11</label><mixed-citation>C. Brambilla, R. Marani, L. Romeo, M. L. Nicora, F. A. Storm, G. Reni, M. Malosio, T. D'Orazio, A. Scano, Azure kinect performance evaluation for human motion and upper limb biomechanical analysis, Heliyon 9 (2023).</mixed-citation></ref>
      <ref id="ref12"><label>12</label><mixed-citation>D. F. Redaelli, F. A. Storm, G. Fioretta, MindBot Planetary Gearbox, 2021. URL: https://zenodo.org/record/5675810#.YZZJXrVKjcs. doi:10.5281/zenodo.5675810.</mixed-citation></ref>
      <ref id="ref13"><label>13</label><mixed-citation>G. Cicirelli, R. Marani, L. Romeo, M. G. Dominguez, J. Heras, A. G. Perri, T. D'Orazio, The HA4M dataset: Multi-Modal Monitoring of an assembly task for Human Action recognition in Manufacturing, Scientific Data 9 (2022).</mixed-citation></ref>
      <ref id="ref14"><label>14</label><mixed-citation>L. Romeo, R. Marani, G. Cicirelli, T. D'Orazio, A Dataset on Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly, 2024. Submitted to CoDiT2024.</mixed-citation></ref>
      <ref id="ref15"><label>15</label><mixed-citation>F. Yi, H. Wen, T. Jiang, ASFormer: Transformer for Action Segmentation, in: The British Machine Vision Conference (BMVC), 2021.</mixed-citation></ref>
      <ref id="ref16"><label>16</label><mixed-citation>S.-J. Li, Y. AbuFarha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 6647–6658.</mixed-citation></ref>
      <ref id="ref17"><label>17</label><mixed-citation>G. Ding, F. Sener, A. Yao, Temporal Action Segmentation: An analysis of modern techniques, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).</mixed-citation></ref>
      <ref id="ref18"><label>18</label><mixed-citation>L. Romeo, G. Cicirelli, T. D'Orazio, Multi-view skeleton analysis for human action recognition and segmentation tasks, 2024. Submitted to CASE2024.</mixed-citation></ref>
    </ref-list>
  </back>
</article>