=Paper=
{{Paper
|id=Vol-3762/505
|storemode=property
|title=Industrial Datasets for Multi-Modal Monitoring of an Assembly Task for Human Action Recognition and Segmentation
|pdfUrl=https://ceur-ws.org/Vol-3762/505.pdf
|volume=Vol-3762
|authors=Laura Romeo,Annaclaudia Bono,Grazia Cicirelli,Tiziana D'Orazio
|dblpUrl=https://dblp.org/rec/conf/ital-ia/RomeoBCD24
}}
==Industrial Datasets for Multi-Modal Monitoring of an Assembly Task for Human Action Recognition and Segmentation==
Laura Romeo1,*, Annaclaudia Bono1,2, Grazia Cicirelli1 and Tiziana D'Orazio1
1 Institute of Intelligent Industrial Systems and Technologies for Advanced Manufacturing (STIIMA), National Research Council (CNR), Bari, Italy
2 Department of Electrical and Information Engineering (DEI), Polytechnic of Bari, Bari, Italy
Abstract
With the rapid evolution of advanced industrial systems exploiting deep learning techniques, the availability of multimodal
and heterogeneous datasets of operators working in industrial scenarios is essential. Such datasets allow in-depth studies for
accurate segmentation and recognition of the actions of operators working alongside collaborative robots. Using multimodal
information guarantees the capture of relevant features to analyze human movements properly. This paper presents our recent
research activity on the development of two datasets representing human operators performing assembly tasks in industrial
contexts. The dataset for Human Action Multi-Modal Monitoring in Manufacturing (HA4M) is a collection of multimodal data
recorded using a Microsoft Azure Kinect camera observing 41 subjects while performing 12 actions to assemble an Epicyclic
Gear Train (EGT). The dataset for Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA)
focuses on the interaction between 27 subjects and a collaborative robot while assembling the EGT in 7 actions. In this case,
the acquisition setup consisted of two Microsoft Azure Kinect cameras. Both datasets were collected in controlled laboratories.
To prove the validity of the HA4M and HARMA datasets, state-of-the-art temporal action segmentation models, i.e. MS-TCN++
and ASFormer, were trained using both skeletal and video features. The results successfully prove the effectiveness of the
presented datasets in segmenting human actions in industrial contexts.
Keywords
Image processing, Assembly Datasets, Action Segmentation, Action Recognition, Manufacturing
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
laura.romeo@stiima.cnr.it (L. Romeo); annaclaudia.bono@stiima.cnr.it (A. Bono); grazia.cicirelli@stiima.cnr.it (G. Cicirelli); tiziana.dorazio@stiima.cnr.it (T. D'Orazio)
ORCID: 0000-0001-8138-893X (L. Romeo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In Industry 5.0, the interaction between humans and collaborative robots (cobots) is becoming more and more important for manufacturing processes [1]. Cobots represent a shift in robotic technology. Traditional robots typically operate in confined work cells or dedicated spaces with predefined and automated tasks. Unlike traditional robots, cobots operate in environments where they can interact directly with human workers to solve tasks that require a combination of human cognition and robot strength and repeatability.

In manufacturing processes, human action recognition and segmentation are crucial for many reasons: to promote human-robot cooperation [2]; to assist operators [3]; to support employee training [4, 5]; to increase productivity and safety [6]; or to promote workers' good mental health [7]. In particular, the accurate recognition and segmentation of actions, including the timing of when the actions commence and conclude, is essential for the cobot to understand and interpret the intended actions of the human collaborator, to synchronize its own actions, to respond in real time, and to ensure smooth cooperation with the human collaborator [8, 9].

Recently, research has notably focused on using multimodal data, which can contribute to developing more sophisticated and adaptive action recognition systems. In particular, the information derived from skeletal joints enables researchers to capture temporal variations in body movements. It offers flexibility in focusing on the entire body or on specific body parts, allowing for a comprehensive representation of the action while bypassing eventual privacy concerns [10, 11].

To the best of the authors' knowledge, few vision-based datasets exist on human-cobot cooperation for object assembly in industrial manufacturing. For this reason, in the last few years, our research has focused on generating real datasets for practical applications of action recognition in the manufacturing context. The dataset for Human Action Multi-Modal Monitoring in Manufacturing (HA4M) and the dataset for Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA) consist of multimodal information acquired during the assembly of an Epicyclic Gear Train (EGT), depicted in Figure 1, without and with the collaboration of a cobot, respectively.
Figure 1: Components involved in the assembly of the Epicyclic Gear Train. The CAD model of the components is publicly
available at [12].
The HA4M dataset was recorded using one single depth camera, while the HARMA dataset was recorded using two depth cameras. Microsoft® Azure Kinect cameras were selected as depth sensors in both cases.

The two proposed datasets offer several main contributions compared to the existing ones [13, 14] in the context of object assembly in industrial manufacturing:

• The datasets provide untrimmed sequences of several types of data: RGB frames, Depth maps, RGB-to-Depth-Aligned (RGB-A) frames, and Skeleton data. The availability of a variety of multi-modal data represents an added value for the scientific community to test different machine learning approaches in action segmentation as well as action recognition tasks, by using one or more data modalities.
• The datasets present a variety in action execution due to the different order followed by the subjects to perform the actions and the interchangeable use of both hands.
• The actions have a high granularity, as the components to be assembled and the actions themselves appear visually similar. As a result, recognizing different actions is very challenging and requires a high level of context understanding and object-tracking skills.
• Both datasets provide a good base for developing, validating, and testing techniques and methodologies for the recognition and segmentation of assembly actions.

Preliminary experiments have been conducted to test state-of-the-art temporal action segmentation methods, ASFormer [15] and MS-TCN++ [16], on RGB and skeletal data, achieving considerable accuracy rates in action segmentation.

The remainder of this paper is organized as follows: Section 2 presents the datasets and describes the assembly task, reporting details on the acquisition setup, study participants, and data annotation. Section 3 reports some experimental results on action segmentation. Finally, Section 4 delineates conclusive remarks.

2. Datasets description

The task involves the assembly of an Epicyclic Gear Train (EGT) (see Figure 1), which comprises three phases: the assembly of Block 1, the assembly of Block 2, and the completion of the EGT that joins both blocks. The HA4M dataset contains videos of different operators that assemble the complete EGT. The HARMA dataset, instead, contains videos of different operators that assemble the EGT in collaboration with a cobot. All the subjects participated voluntarily in the experiments. They were asked to execute the task several times as preferred (e.g. with both hands), independently of their dominant hand. Furthermore, the subjects performed the task at their comfortable self-selected speed, so that high time variance could be noticed among the different subjects. The subsequent sections give more details on both datasets.

2.1. HA4M dataset

The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects. The acquisition setup is composed of a Microsoft® Azure Kinect camera placed on a tripod in front of the operator, as pictured in Fig. 2. The camera is at a height of 1.54 m above the floor, at a horizontal distance of 1.78 m from the far border of the table, and is tilted down by an angle of 17°. As shown in Figure 2, the individual components to be assembled are spread on the table in front of the operator and are placed according to the order of assembly. The operator can pick up one component at a time to perform the assembly task standing in front of the table. The experiments took place in two laboratories: one in Italy and one in Spain. Two typical RGB frames captured by the camera in both laboratories are shown in Figure 3. The Figure also depicts the two supports fixed on the table to facilitate the assembly of Block 1 and Block 2.
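As an illustration of how the multi-modal streams listed above (RGB, Depth, RGB-A, and Skeleton data) might be consumed together, the following is a minimal Python sketch. The directory layout, file names, and skeleton file format are hypothetical and do not reflect the official structure of the released datasets; the only dataset-independent fact used is that Azure Kinect body tracking provides 32 joints per subject.

```python
import json
from pathlib import Path

import cv2
import numpy as np

# Hypothetical per-trial folder; the released datasets may organise files differently.
root = Path("HA4M/subject_01/trial_01")
frame_id = 120

# RGB frame from the colour camera.
rgb = cv2.imread(str(root / "rgb" / f"{frame_id:06d}.png"))
# Depth map, stored as 16-bit values in millimetres.
depth = cv2.imread(str(root / "depth" / f"{frame_id:06d}.png"), cv2.IMREAD_UNCHANGED)
# RGB frame re-projected onto the depth camera geometry (RGB-A stream).
rgb_aligned = cv2.imread(str(root / "rgb_aligned" / f"{frame_id:06d}.png"))

# Azure Kinect body tracking exposes 32 joints; a JSON file with per-joint
# (x, y, z) coordinates is assumed here as the on-disk skeleton format.
with open(root / "skeleton" / f"{frame_id:06d}.json") as f:
    skeleton = np.array(json.load(f)["joints"], dtype=np.float32)  # shape (32, 3)

print(rgb.shape, depth.dtype, rgb_aligned.shape, skeleton.shape)
```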
Figure 2: Sketch of the acquisition setup of the HA4M dataset: a Microsoft® Azure Kinect is placed in front of the operator and the table where the components are spread over.

Figure 3: Typical video frames acquired by the RGB-D camera in the (a) Italian and (b) Spanish laboratories.

Table 1: List of Block 1, Block 2, and EGT components, respectively.

Block     Quantity  Description
Block 1   3         Planet Gear
Block 1   3         Planet Gear Bearing
Block 1   1         Carrier Shaft
Block 1   1         Carrier
Block 2   1         Ring Bear
Block 2   1         Sun Gear Bearing
Block 2   1         Sun Gear
Block 2   1         Sun Shaft
EGT       1         Block 1
EGT       1         Block 2
EGT       1         Cover

Tables 1 and 2 list the components and the actions necessary for assembling Block 1, Block 2, and the whole EGT, respectively. Notice that the final action (ID=12) involves additional tools, such as two screws and an Allen key, to secure the EGT. As listed in Table 2, the total number of actions is 12, divided as follows: four actions for building Block 1, four for building Block 2, and four for assembling the two blocks and completing the EGT. Some actions are performed more times as there are more components of the same type to be assembled: actions 2 and 3 are executed three times, while action 11 is repeated two times. Finally, a "don't care" action (ID=0) has been added to manage pauses between action transitions or unexpected events, such as the loss of a component during the assembly.

Table 2: List of actions to build Block 1, Block 2, and the EGT in the HA4M dataset.

Block     ID  Description
–         0   "don't care" action
Block 1   1   Pick up/Place Carrier over Support 1
Block 1   2   Pick up/Place Gear Bearings (×3)
Block 1   3   Pick up/Place Planet Gears (×3)
Block 1   4   Pick up/Place Carrier Shaft
Block 2   5   Pick up/Place Sun Shaft over Support 2
Block 2   6   Pick up/Place Sun Gear
Block 2   7   Pick up/Place Sun Gear Bearing
Block 2   8   Pick up/Place Ring Bear
EGT       9   Pick up Block 2 and place it on Block 1
EGT       10  Pick up/Place Cover
EGT       11  Pick up/Place Screw (×2)
EGT       12  Pick up Allen Key, Turn both screws, Return Allen Key and the EGT
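When parsing per-frame annotations, the action vocabulary of Table 2 can be encoded as a plain label map. This is only a convenience sketch; it does not assume anything about the annotation file format shipped with the dataset, and the repeated actions (×3, ×2) share a single ID.

```python
# HA4M action IDs from Table 2 (sketch; names lightly shortened).
HA4M_ACTIONS = {
    0: "don't care",
    1: "Pick up/Place Carrier over Support 1",
    2: "Pick up/Place Gear Bearings",
    3: "Pick up/Place Planet Gears",
    4: "Pick up/Place Carrier Shaft",
    5: "Pick up/Place Sun Shaft over Support 2",
    6: "Pick up/Place Sun Gear",
    7: "Pick up/Place Sun Gear Bearing",
    8: "Pick up/Place Ring Bear",
    9: "Pick up Block 2 and place it on Block 1",
    10: "Pick up/Place Cover",
    11: "Pick up/Place Screw",
    12: "Pick up Allen Key, turn both screws, return Allen Key and the EGT",
}
NAME_TO_ID = {name: idx for idx, name in HA4M_ACTIONS.items()}
```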
2.2. HARMA dataset

The HARMA dataset comprises 160 videos (80 videos per camera) capturing the assembly task performed by 27 subjects in collaboration with a cobot (a Fanuc CRX10ia/L robotic arm). Each subject performed the task multiple times, resulting in 240 task executions in the dataset. The acquisition setup is pictured in Fig. 4. The two Microsoft® Azure Kinect cameras are placed on tripods in a Frontal and a Lateral position with respect to the Operator Workplace. The Frontal Camera is at a height of 1.72 m above the floor and is tilted down by an angle of 6 degrees, while the Lateral Camera is at a height of 2.07 m and is tilted down by 19 degrees. Two typical RGB frames captured by both cameras are shown in Fig. 5. As shown in Fig. 5, the EGT components are spread over the Operator Workplace, so the operator can pick up one component at a time to perform the assembly task in seven pick-and-place actions [14]. The operator assembles Block 1, whereas the cobot assembles Block 2.
The assembly of Block 2 done by the cobot is not considered in the HARMA dataset, as our goal is to recognize the actions performed by the operator in order to trigger the cobot when it has to approach the operator to perform the collaborative action. So, the HARMA dataset comprises videos of only the assembly task performed by the subjects, including the collaborative action needed to join Block 1 and Block 2 (action 5 in Tab. 3). Table 3 lists the seven actions included in the HARMA dataset. As can be noticed in Table 3, unlike the HA4M dataset, the Cover is secured with two hooks (see Figure 6).

Figure 4: Sketch of the acquisition setup of the HARMA dataset: two Microsoft® Azure Kinect cameras are placed in a Frontal and a Lateral position with respect to the operator's workplace.

Figure 5: Sample frames captured by the (a) Frontal and (b) Lateral camera, respectively, during the assembly task.

Figure 6: Completion of the EGT by placing the Cover and the two Hooks, as included in Action 7 of Table 3.

Table 3: List of the actions carried out by the operator for the construction of the EGT in the HARMA dataset.

Block     ID  Description
–         0   "don't care" action
Block 1   1   Pick up/Place Carrier over the Support
Block 1   2   Pick up/Place Planet Gear Bearing (×3)
Block 1   3   Pick up/Place Planet Gear (×3)
Block 1   4   Pick up/Place Carrier Shaft
EGT       5   Pick up Block 1 and join it with Block 2 held by the cobot
EGT       6   Pick up/Place the Cover
EGT       7   Pick up/Place the 2 Hooks, then leave the EGT on the table

3. Experiments

This section presents preliminary experiments and results on temporal action segmentation, obtained by applying state-of-the-art deep learning methods to the HA4M and HARMA datasets. Both datasets were split into non-overlapping training and testing sets by considering 70% of the videos for training and the remaining 30% for testing, ensuring that videos of the same operator do not appear in both the training and testing sets.
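A minimal sketch of such a subject-wise split is given below. It assumes that per-video subject IDs are available (the file names and the ID mapping here are illustrative placeholders), and it uses scikit-learn's GroupShuffleSplit, which holds out roughly 30% of the operators rather than exactly 30% of the videos.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-video metadata: 217 HA4M videos and the operator appearing
# in each of them (here filled with a fake round-robin mapping over 41 subjects).
video_paths = [f"video_{i:03d}.mp4" for i in range(217)]
subject_ids = [i % 41 for i in range(217)]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=0)
train_idx, test_idx = next(splitter.split(video_paths, groups=subject_ids))

train_videos = [video_paths[i] for i in train_idx]
test_videos = [video_paths[i] for i in test_idx]

# Sanity check: no operator contributes videos to both sets.
assert {subject_ids[i] for i in train_idx}.isdisjoint({subject_ids[i] for i in test_idx})
print(len(train_videos), len(test_videos))
```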
ASFormer [15] and MS-TCN++ [16] have been applied to test action segmentation performance. The ASFormer (resp. MS-TCN++) models were fed with RGB and Skeletal data extracted from both datasets, performing the training over 120 (resp. 100) epochs and collecting the loss at each iteration. The best model is chosen as the one with the lowest loss within the total number of iterations and is used in the test phase.
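The following is a generic PyTorch-style sketch of this selection rule, not the authors' training script: the actual ASFormer and MS-TCN++ implementations use their own multi-stage losses and optimizers, while here a single criterion and a generic data loader stand in for them.

```python
import copy

def train_and_select(model, loader, optimizer, criterion, epochs=120):
    """Train for a fixed number of epochs, track the loss of every iteration,
    and keep the checkpoint with the lowest loss for the test phase."""
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        for features, labels in loader:        # RGB or skeletal feature sequences
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:        # lowest loss over all iterations
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)          # best model used for testing
    return model, best_loss
```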
Tab. 4 lists the performance rates in terms of Accuracy, Edit Score, and F1-score. Accuracy is a frame-wise metric that measures the proportion of correctly classified frames in the entire video sequence, without capturing the temporal dependencies between action segments. The Edit Score, instead, measures how well the model predicts the ordering of the action segments, without requiring exact frame-level alignment. Finally, the F1-score with a threshold τ, often denoted as F1@τ, accounts for the degree of overlap, measured as the Intersection over Union (IoU), between each predicted segment and the ground-truth segments [17]. In the experiments, the threshold τ has been set to 60%, 70% and 80%. Focusing on these metrics, it can be noticed that all the considered models succeeded in correctly segmenting the actions of the assembly task. In particular, the Accuracy rates reached high values (over 91%) in both cases of using RGB or skeletal features.
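The three metrics can be summarized with the following self-contained sketch, which is not the authors' evaluation code: frame-wise accuracy, a segmental edit score based on the normalized Levenshtein distance between segment label orderings, and F1@τ computed by matching predicted and ground-truth segments of the same label through their IoU. Calling f1_at_tau with tau set to 0.6, 0.7 and 0.8 corresponds to the F1@60, F1@70 and F1@80 columns of Table 4.

```python
def to_segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) segments."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))  # end index is exclusive
            start = i
    return segments

def frame_accuracy(gt, pred):
    """Fraction of frames whose predicted label matches the ground truth."""
    return sum(g == p for g, p in zip(gt, pred)) / len(gt)

def edit_score(gt, pred):
    """1 minus the normalized Levenshtein distance between segment label orderings."""
    a = [s[0] for s in to_segments(gt)]
    b = [s[0] for s in to_segments(pred)]
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return 1.0 - d[-1][-1] / max(len(a), len(b), 1)

def f1_at_tau(gt, pred, tau):
    """Segmental F1: a predicted segment is a true positive when its IoU with an
    unmatched ground-truth segment of the same label reaches the threshold tau."""
    gt_segs, pred_segs = to_segments(gt), to_segments(pred)
    matched, tp = [False] * len(gt_segs), 0
    for label, ps, pe in pred_segs:
        best_iou, best_idx = 0.0, -1
        for idx, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or matched[idx]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_iou >= tau:
            tp += 1
            matched[best_idx] = True
    fp, fn = len(pred_segs) - tp, len(gt_segs) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Tiny example with per-frame action IDs (0 is the "don't care" class).
gt = [1, 1, 1, 2, 2, 2, 2, 0, 3, 3]
pred = [1, 1, 2, 2, 2, 2, 2, 0, 3, 3]
print(frame_accuracy(gt, pred), edit_score(gt, pred), f1_at_tau(gt, pred, 0.6))
```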
For completeness, Figure 7 shows a qualitative representation of the action segmentation obtained by applying the MS-TCN++ and ASFormer models to one video from the HA4M dataset and one from the HARMA dataset.
Table 4: Performance rates on action segmentation obtained by applying the ASFormer and MS-TCN++ architectures, using RGB and Skeletal data grabbed from the HA4M and HARMA datasets.

TAS Model      Dataset  Features  Acc.    Edit    F1@60   F1@70   F1@80
ASFormer [15]  HA4M     RGB       91.79%  95.10%  87.81%  80.82%  70.27%
ASFormer [15]  HA4M     Skeleton  92.43%  93.01%  86.71%  79.28%  69.42%
ASFormer [15]  HARMA    RGB       94.20%  93.60%  92.00%  88.70%  83.40%
ASFormer [15]  HARMA    Skeleton  94.51%  95.08%  91.03%  87.97%  78.24%
MS-TCN++ [16]  HA4M     RGB       93.53%  93.85%  91.12%  86.01%  76.22%
MS-TCN++ [16]  HA4M     Skeleton  94.92%  95.90%  92.57%  88.57%  81.85%
MS-TCN++ [16]  HARMA    RGB       92.13%  86.23%  78.18%  74.54%  66.00%
MS-TCN++ [16]  HARMA    Skeleton  94.45%  93.89%  90.24%  87.80%  81.80%
Figure 7: Action segmentation results over a video from the HA4M (a) and a video from the HARMA (b) dataset. GT, RGB, and Skel stand for Ground Truth, use of RGB features, and use of Skeletal features, respectively. The labels in orange indicate the results obtained by the MS-TCN++ model, while the labels in blue indicate the outcomes of the ASFormer architecture.
These videos have been chosen to display challenging situations, such as the case of Action 2 (dark blue bars) and Action 3 (light blue bars), which in the case of HA4M (Fig. 7(a)) are not always detected properly, depending on the used features or the applied model. On the contrary, Fig. 7(b) shows better segmentation results also for actions 2 and 3. Furthermore, in the HARMA dataset, the availability of two cameras allows us to compensate for the lack of data when one camera fails to provide skeletal data due to occlusions or out-of-range positions [18].
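A minimal sketch of this per-frame fallback between the two views is shown below. It assumes that missing skeletons are encoded as NaN and that the two sequences are either expressed in a common reference frame or consumed by view-invariant features; the function name and array shapes are illustrative, not part of the released datasets.

```python
import numpy as np

def fuse_skeletons(frontal: np.ndarray, lateral: np.ndarray) -> np.ndarray:
    """frontal, lateral: (T, 32, 3) joint sequences from the two Azure Kinects.
    Returns a (T, 32, 3) sequence that prefers the frontal view and falls back
    to the lateral view for frames where the frontal skeleton is missing."""
    fused = frontal.copy()
    missing = np.isnan(frontal).any(axis=(1, 2))  # frames with no valid frontal skeleton
    fused[missing] = lateral[missing]
    return fused

# Example with synthetic data: frame 1 has no frontal skeleton.
frontal = np.random.rand(3, 32, 3)
lateral = np.random.rand(3, 32, 3)
frontal[1] = np.nan
assert not np.isnan(fuse_skeletons(frontal, lateral)).any()
```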
4. Conclusions

The present paper presented an examination of two industrial datasets, namely the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset and the Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly (HARMA) dataset. Both datasets address the high demand for human action recognition and segmentation within industrial manufacturing contexts, particularly regarding scenarios involving human-robot collaboration and interaction. The multimodal features within the datasets encompass a variety of actions and interactions in industrial assembly tasks, allowing this work to lay the foundation for the development and enhancement of intelligent systems aimed at understanding and assisting human operators in manufacturing production lines.

To properly evaluate HA4M and HARMA, state-of-the-art temporal action segmentation models were considered, namely ASFormer and MS-TCN++, which demonstrated notable success in exploiting the data provided by the datasets. The comparison between the RGB and Skeletal features underlines the potential of a multimodal approach to balance the computational efficiency with the precision required for the recognition and segmentation of complex tasks.

The conducted experiments prove that, overall, both RGB and Skeletal features performed properly. RGB data provides rich visual information about the scene but typically requires higher storage space and computational complexity compared to a skeleton-based data representation. On the other hand, using skeleton data makes it possible to abstract away detailed appearance information and focus solely on the spatial configuration of body joints and their movements. Therefore, it is essential to carefully find a good trade-off and select the data modality that best aligns with the goals and constraints of the working context.

The presented datasets are benchmarks for further studies on novel models and algorithms that can improve the accuracy and reliability of action recognition and segmentation systems in industrial settings. HA4M and HARMA offer a valuable resource for the research community, allowing ongoing innovation and development of human-robot collaboration systems in complex, real-world scenarios.
Acknowledgments

This research has been partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - "FAIR - Future Artificial Intelligence Research" - Spoke 8 "Pervasive AI", funded by the European Commission under the NextGeneration EU program.

References

[1] A. Keshvarparast, D. Battini, O. Battaia, A. Pirayesh, Collaborative robots in manufacturing and assembly systems: literature review and future research agenda, Journal of Intelligent Manufacturing (2023).
[2] L. Wang, R. Gao, J. Vancza, J. Krüger, X. Wang, S. Makris, Symbiotic human-robot collaborative assembly, CIRP Annals - Manufacturing Technology 68 (2019) 701–726.
[3] W. Tao, M. Al-Amin, H. Chen, M. C. Leu, Z. Yin, R. Qin, Real-Time Assembly Operation Recognition with Fog Computing and Transfer Learning for Human-Centered Intelligent Manufacturing, Procedia Manufacturing 48 (2020) 926–931.
[4] J. Patalas-Maliszewska, D. Halikowski, R. Damaševičius, An Automated Recognition of Work Activity in Industrial Manufacturing Using Convolutional Neural Networks, Electronics 10 (2021) 1–17.
[5] M. A. Zamora-Hernandez, J. A. Castro-Vergas, J. Azorin-Lopez, J. Garcia-Rodriguez, Deep learning-based visual control assistant for assembly in industry 4.0, Computers in Industry 131 (2021) 1–15.
[6] T. Kobayashi, Y. Aoki, S. Shimizu, K. Kusano, S. Okumura, Fine-grained Action Recognition in Assembly Work Scenes by Drawing Attention to the Hands, in: 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2019, pp. 440–446. doi:10.1109/SITIS.2019.00077.
[7] M. L. Nicora, E. André, D. Berkmans, C. Carissoli, T. D'Orazio, et al., A human-driven control architecture for promoting good mental health in collaborative robot scenarios, in: 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), 2021, pp. 285–291.
[8] G. Cicirelli, C. Attolico, C. Guaragnella, T. D'Orazio, A Kinect-based gesture recognition approach for a natural human robot interface, International Journal of Advanced Robotic Systems 12 (2015).
[9] M. V. Maselli, R. Marani, G. Cicirelli, T. D'Orazio, Continuous Action Recognition in Manufacturing Contexts by Deep Graph Convolutional Networks, volume 825, Springer, 2024.
[10] L. Romeo, R. Marani, A. Perri, T. D'Orazio, Microsoft Azure Kinect Calibration for Three-Dimensional Dense Point Clouds and Reliable Skeletons, Sensors 22 (2022) 4986.
[11] C. Brambilla, R. Marani, L. Romeo, M. L. Nicora, F. A. Storm, G. Reni, M. Malosio, T. D'Orazio, A. Scano, Azure Kinect performance evaluation for human motion and upper limb biomechanical analysis, Heliyon 9 (2023).
[12] D. F. Redaelli, F. A. Storm, G. Fioretta, MindBot Planetary Gearbox, 2021. URL: https://zenodo.org/record/5675810#.YZZJXrVKjcs. doi:10.5281/zenodo.5675810.
[13] G. Cicirelli, R. Marani, L. Romeo, M. G. Dominguez, J. Heras, A. G. Perri, T. D'Orazio, The HA4M dataset: Multi-Modal Monitoring of an assembly task for Human Action recognition in Manufacturing, Scientific Data 9 (2022).
[14] L. Romeo, R. Marani, G. Cicirelli, T. D'Orazio, A Dataset on Human-Cobot Collaboration for Action Recognition in Manufacturing Assembly, 2024. Submitted to CoDiT2024.
[15] F. Yi, H. Wen, T. Jiang, ASFormer: Transformer for Action Segmentation, in: The British Machine Vision Conference (BMVC), 2021.
[16] S.-J. Li, Y. AbuFarha, Y. Liu, M.-M. Cheng, J. Gall, MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 6647–6658.
[17] G. Ding, F. Sener, A. Yao, Temporal Action Segmentation: An analysis of modern techniques, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[18] L. Romeo, G. Cicirelli, T. D'Orazio, Multi-view skeleton analysis for human action recognition and segmentation tasks, 2024. Submitted to CASE2024.