<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jihao Gu</string-name>
          <email>jihao.gu.23@ucl.ac.uk</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Wang</string-name>
          <email>jiafei127@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kun Li</string-name>
          <email>kunli.hfut@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanyan Wei</string-name>
          <email>weiyy@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiliang Wu</string-name>
          <email>wu_zhiliang@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Guo</string-name>
          <email>guodan@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <kwd-group>
          <kwd>Micro-Gesture</kwd>
          <kwd>Action Recognition</kwd>
          <kwd>Multi-modal</kwd>
          <kwd>Ensemble Fusion</kwd>
          <kwd>Transfer Learning</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Artificial Intelligence, Hefei Comprehensive National Science Center</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ReLER, CCAI, Zhejiang University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computer Science and Information Engineering, School of Artificial Intelligence, Hefei University of Technology</institution>
          ,
          <addr-line>HFUT</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University College London (UCL)</institution>
          ,
          <addr-line>Gower Street, London, WC1E 6BT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Xinsight Lab, Research Institute, Hefei Zhongjuyuan Intelligent Technology Co., Ltd.</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%. Code is available at: https://github.com/momiji-bit/MM-Gesture.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Our method builds on PoseConv3D [21] and the Video Swin Transformer [
        <xref ref-type="bibr" rid="ref6 ref9">9, 6</xref>
        ], integrating information across six complementary
modalities: joint, limb, RGB video, Taylor video, optical flow video, and depth video. In addition,
to enhance the performance of the RGB modality, we apply transfer learning by pre-training on the
Micro-Action 52 dataset [11] and fine-tuning on the iMiGUE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The key contributions of this paper can be summarized as follows:
• We present an integrated multi-modal MGs classification network that utilizes complementary
information from six diverse modalities: joint, limb, RGB video, Taylor video, optical flow video,
and depth video.
• We propose an effective ensemble fusion method capable of efficiently integrating six modalities,
enabling the joint exploitation of modality-specific strengths for improved MGs classification
accuracy.
• Extensive experiments on the iMiGUE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] demonstrate that the proposed MM‑Gesture
achieves state-of-the-art performance, reaching a Top-1 accuracy of 73.213%, which is the highest
reported accuracy across previous Micro-gesture Analysis (MiGA) challenges.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Micro-Gestures (MGs) are becoming increasingly important in understanding human emotions, focusing
on subtle body movements in daily interactions. Advances in this field have been driven by the
development of large benchmark datasets and sophisticated model architectures [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3, 11</xref>
        ]. Key
datasets include the SMG dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which consists of recordings from 40 participants engaged in
storytelling, capturing upper limb micro-gestures and emotional states. The iMiGUE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] offers
identity-free videos of 72 athletes at press conferences, annotated with 32 micro-gesture categories
for analyzing both actions and emotions. The MA-52 dataset [11] expands the focus to full-body
micro-actions, with 22,000 samples covering 52 action-level and 7 body-level categories, sourced from
psychological interviews to recognize subtle visual cues.
      </p>
      <p>
        Current models primarily focus on limited modalities. RGB-based methods leverage spatial-temporal
modeling strategies, such as a pure Transformer backbone with shifted 3D local attention windows [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
MANet [11] integrates SE and TSM modules with semantic embedding loss for fine-grained
micro-action recognition. Skeleton-based approaches include a 3D-CNN model with joint and semantic
embedding losses [12], and an EHCT framework [13] employs hypergraph-based attention and ensemble
Transformers [22, 23] to capture high-order joint relations and address class imbalance. In contrast,
skeleton sequences can be encoded as 3D heatmaps and fused with RGB inputs through a dual-branch
multimodal network [21]. Inspired by this network, Chen et al. [19] adopt channel-wise cross-attention
and prototype refinement to enhance feature fusion and category discrimination, while Huang et al. [24]
design a multi-scale heterogeneous fusion network. Recently, Li et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a hierarchical
prototype-based calibration method to resolve ambiguity in fine-grained actions. Overall, current
methods focus only on RGB or skeleton data.
      </p>
      <p>To exploit the complementarity between different multimodal data, we propose the MM-Gesture
model, adopting a comprehensive multimodal approach that integrates six modalities: joint, limb, RGB
video, Taylor video, optical flow video, and depth video. This approach enables a deeper understanding
and representation of micro-gestures, capturing their nuances and dynamics. Additionally, we leverage
transfer learning from the MA-52 dataset to infuse valuable prior knowledge into the RGB modality,
further enhancing its recognition accuracy. Consequently, our model improves performance on existing
benchmarks and paves the way for advanced applications in human emotion understanding through
micro-gesture analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Data Pre-processing</title>
        <p>We adopted the RGB videos (R ∈ ℝ^(T×H×W×3)) provided by the official dataset, along with a subset of V = 36
skeleton keypoints selected from the original 137 points, to form the input joint data (J ∈ ℝ^(T×V×2)).</p>
        <p>These cleaned keypoints focus specifically on the upper body, hands, and facial joints. Additionally, we
constructed input limb data (L ∈ ℝ^(T×E×2)) by computing spatial differences between the E adjacent joint pairs
defined by the skeletal edges connecting the selected keypoints.</p>
        <p>To effectively capture multi-modal gesture information, we employ advanced, off-the-shelf modality
extraction methods to generate complementary auxiliary modalities. Specifically, we utilize
Taylor-series temporal expansion videos, optical-flow videos, and depth-estimation videos, each modality
providing distinct yet complementary gesture-related information. By leveraging the ensemble among
these diverse modalities, our proposed MM-Gesture model effectively exploits multi-modal feature
complementarity.</p>
        <p>T ∈ ℝ^((T−k)×H×W×3),  T_t = ℱ_taylor(R_{t:t+k}),</p>
        <p>F ∈ ℝ^((T−1)×H×W×3),  F_t = ℱ_flow(R_{t:t+1}),</p>
        <p>D ∈ ℝ^(T×H×W×3),  D_t = ℱ_depth(R_t),
(1)
where each symbol is defined as follows:
• T: temporal length of the input RGB video.
• H, W: height and width of the input RGB video frames.
• k: temporal window length for computing the truncated Taylor-series expansion.
• R_t: the RGB frame at time step t.
• ℱ_taylor: the Taylor-series-based video calculated according to the approach of [25], where the maximum order of the truncated Taylor-series expansion and the temporal window length k govern the aggregation of local temporal context.
• ℱ_flow: the optical-flow-based modality computed using the MemFlow network [26], which estimates optical-flow representations F_t from consecutive frames R_t and R_{t+1}.
• ℱ_depth: the depth-estimation-based modality generated using the monocular depth estimation algorithm [27], resulting in depth representations D_t.</p>
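To make the joint/limb construction above concrete, here is a minimal numpy sketch of computing the limb data L as spatial differences between edge-connected joints. The edge list and array sizes are toy placeholders, not the actual 36-keypoint upper-body/hand/face skeleton used in the paper.

```python
import numpy as np

def build_limb_data(joints: np.ndarray, edges: list) -> np.ndarray:
    """Limb data L: spatial differences between adjacent joint pairs.

    joints: (T, V, 2) array of 2D keypoint coordinates over T frames.
    edges:  list of (parent, child) index pairs defining the skeletal edges.
    Returns an array of shape (T, len(edges), 2), one 2D vector per limb.
    """
    parents = np.array([p for p, _ in edges])
    children = np.array([c for _, c in edges])
    # Difference between the two endpoints of every edge, for all frames.
    return joints[:, children, :] - joints[:, parents, :]

# Toy example: 4 frames, 3 keypoints, 2 edges (hypothetical skeleton).
J = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
E = [(0, 1), (1, 2)]
L = build_limb_data(J, E)
```

This mirrors the L ∈ ℝ^(T×E×2) definition above: each limb feature is the child-joint coordinate minus the parent-joint coordinate.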
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Network Architecture</title>
        <p>As shown in Figure 1, the proposed multi-modal micro-gesture recognition framework (MM-Gesture)
consists of three main modules:</p>
        <p>Cross-Modal Fusion Module: In this module, skeletal coordinates are initially transformed into
Gaussian heatmap-based 3D volumes (H) for Joint and Limb modalities individually. RGB, Joint, and
Limb modalities are all separately trained through PoseConv3D [21], capturing spatial-temporal skeleton
dynamics and RGB spatial context, respectively. Subsequently, the extracted RGB and skeleton features
are combined via a cross-modal fusion training stage to exploit complementary information between
these modalities comprehensively.</p>
        <p>
          Uni-Modal Encoding Module: We leverage the VideoSwinT network [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to independently encode
four distinct modalities: RGB frames, Taylor-based temporal encoding, optical flow (computed via
MemFlow), and depth estimates. Specifically, for the RGB modality, we first employ transfer learning by
pretraining VideoSwinT on the MA-52 dataset and subsequently fine-tune the pretrained model on the
iMiGUE dataset. For the remaining modalities (Taylor, optical flow, and depth), VideoSwinT is directly
trained from scratch on the iMiGUE dataset. VideoSwinT uses a 3D shifted-window self-attention
mechanism that effectively captures fine-grained spatial-temporal details within each modality.
        </p>
        <p>Ensemble Module: Probabilities from the PoseConv3D Cross-Modal Fusion Module and VideoSwinT
Uni-Modal Encoding Module are combined via weighted ensemble, with weights set empirically
according to validation performance. This integration approach effectively exploits modality complementarity,
improving robustness and accuracy in micro-gesture recognition.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. PoseConv3D Cross-Modal Fusion Module</title>
        <p>To effectively align skeleton-based information (consisting of joints and limbs) with RGB video
representations and facilitate fine-grained complementary interactions across these modalities, we adopt
PoseConv3D [21] for cross-modal integration.</p>
        <p>Specifically, we first transform the 2D coordinates of skeletal keypoints into heatmap-based
representations. By applying Gaussian distributions and calculating the heatmap values using the point-to-segment
distance formula, we compute and stack the heatmaps of each keypoint across all frames to generate
3D heatmap volumes. The resulting heatmaps are as follows:</p>
        <p>H_J ∈ ℝ^(V×T×H×W),  H_L ∈ ℝ^(E×T×H×W),
(2)
where H_J denotes the joint-position heatmaps, and H_L denotes the limb-connection heatmaps. Here, T
is the total number of frames, V is the number of skeletal joints, and E is the number of skeletal limbs
(connections between joints). H and W represent the spatial resolution (height and width) of each
heatmap. Subsequently, the RGB frames R ∈ ℝ^(T×H×W×3) and skeleton heatmaps H_J, H_L are taken as
input data.</p>
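The Gaussian heatmap construction can be sketched as follows. This is a simplified illustration covering joint heatmaps only, with an assumed sigma value; PoseConv3D's actual pipeline also builds limb heatmaps from the point-to-segment distance mentioned above.

```python
import numpy as np

def joint_heatmaps(joints, H, W, sigma=0.6):
    """Turn 2D keypoints into Gaussian heatmap volumes.

    joints: (T, V, 2) array of (x, y) coordinates.
    Returns heatmaps of shape (V, T, H, W), one Gaussian bump per
    keypoint per frame, matching the H_J layout in eq. (2).
    """
    T, V, _ = joints.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.zeros((V, T, H, W))
    for t in range(T):
        for v in range(V):
            x, y = joints[t, v]
            # Isotropic Gaussian centered on the keypoint location.
            out[v, t] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return out

# One frame, one keypoint at (x=2, y=3) on an 8x8 grid.
hm = joint_heatmaps(np.array([[[2.0, 3.0]]]), H=8, W=8)
```

Stacking these per-frame heatmaps yields the 3D heatmap volumes that PoseConv3D consumes alongside the RGB frames.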
        <p>Prior to network training, data augmentation processes (e.g., scaling, cropping) are consistently
applied to both RGB video frames and skeleton heatmap modalities to enhance data diversity and improve
model robustness. Subsequently, the augmented data from each modality is separately forwarded into
the PoseConv3D module, which extracts deep spatiotemporal feature representations. The PoseConv3D
network generates modality-specific predictions denoted formally as ŷ_m, where m ∈ {R, J, L} indicates the
RGB, joint heatmap, and limb heatmap modalities, respectively. Each modality-specific network is
initially pretrained independently by minimizing the cross-entropy (CE) classification loss:
ℒ_m = CE(ŷ_m, y),  m ∈ {R, J, L},
(3)
where y denotes the ground-truth action labels.</p>
        <p>Next, we conduct a joint fine-tuning procedure by simultaneously optimizing combined RGB and
skeleton-based modalities using the following paired-training losses:
ℒ_{R+J} = ℒ_R + ℒ_J,  ℒ_{R+L} = ℒ_R + ℒ_L.
(4)</p>
        <p>During model inference, the predictions yielded by distinct modalities are integrated at the probability
level via a late fusion strategy. Formally, let P_⋆ = SoftMax(ŷ_⋆), ⋆ ∈ {R, J, L}, represent the modality-specific
probability distributions. We then fuse predictions through average fusion to achieve the final predictive
distributions:
P_{R+J} = (1/2)(P_R + P_J),  P_{R+L} = (1/2)(P_R + P_L).
(5)</p>
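The average late fusion described above can be sketched in a few lines; the logit values here are illustrative inputs, not outputs of the actual networks.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def average_fusion(logits_a, logits_b):
    """Average the two modality-specific probability distributions,
    mirroring P = (P_a + P_b) / 2 from the late-fusion step above."""
    return 0.5 * (softmax(logits_a) + softmax(logits_b))

# Toy class logits from two modalities (e.g. RGB and joint heatmaps).
p = average_fusion(np.array([1.0, 2.0, 0.5]), np.array([0.2, 1.5, 0.1]))
```

Because each input is a valid distribution, the averaged output also sums to one, so no renormalization is needed for this two-way fusion.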
      </sec>
      <sec id="sec-3-4">
        <title>3.4. VideoSwinT Uni-Modal Encoding Module</title>
        <p>
          Unlike existing skeleton-video modality fusion methods, we propose a multimodal framework based on
the VideoSwinT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which encodes RGB video, optical flow video, Taylor-expanded video, and depth
video. This encoding strategy effectively integrates color, texture, dynamic motion, and geometric
structural information to better capture multidimensional micro-action features, thus enabling more
fine-grained action recognition.
        </p>
        <p>Specifically, we independently optimize each modality-specific backbone by minimizing the
cross-entropy (CE) classification loss. Prior to training on the target iMiGUE dataset, the RGB modality
network is initially pretrained on the MA-52 dataset (R⋆ ∈ ℝ^(T×H×W×3)) [11], which provides extensive
coverage of 52 types of micro-actions. After pretraining, the RGB modality network is fine-tuned on
the iMiGUE dataset along with other modalities. The loss functions for pretraining and fine-tuning,
along with the probability computation, are formulated as follows:
ℒ_m = CE(ŷ_m, y),  m ∈ {R⋆, R, T, F, D},
(6)</p>
        <p>P_m = SoftMax(ŷ_m),  m ∈ {R, T, F, D}.
(7)</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Ensemble Module</title>
        <p>In the final ensemble stage, we introduce a probability-based weighted fusion strategy to effectively
aggregate predictions derived from multiple modality-specific networks. Specifically, class probability
vectors independently output by the PoseConv3D (RGB+J, RGB+L) and VideoSwin Transformer
(RGB∗, Taylor, Flow, Depth) models are integrated using empirically determined weights obtained
via validation-set performance.</p>
        <p>The ensemble prediction (P_final ∈ ℝ^C, where C is the number of classes) is computed by summing the
weighted contributions of individual modality-specific probabilities, as follows:</p>
        <p>P_final = ∑_m w_m P_m,  m ∈ {R+J, R+L, R, T, F, D},
where each weight w_m is selected based on the classification performance observed on validation samples.</p>
        <p>This proposed ensemble-based fusion mechanism enables comprehensive exploitation of the
complementary strengths inherent in multiple modality-specific models, thereby significantly improving the
robustness and overall effectiveness of our multi-modal micro-gesture recognition framework.</p>
      </sec>
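The weighted ensemble rule P_final = ∑ w_m P_m can be sketched as below. The modality names and weight values are illustrative placeholders, and the final renormalization is our addition so the result remains a probability distribution; the paper only specifies the weighted sum with validation-tuned weights.

```python
import numpy as np

def weighted_ensemble(probs: dict, weights: dict) -> np.ndarray:
    """Weighted late fusion of modality-specific class probabilities.

    probs:   modality name -> (C,) probability vector.
    weights: modality name -> scalar weight (need not sum to 1).
    """
    total = sum(weights[m] * probs[m] for m in probs)
    # Renormalize so the fused scores form a distribution (our addition).
    return total / total.sum()

# Toy probabilities for three of the six ensemble members.
probs = {
    "R+J": np.array([0.6, 0.3, 0.1]),
    "R*":  np.array([0.5, 0.4, 0.1]),
    "T":   np.array([0.2, 0.7, 0.1]),
}
w = {"R+J": 1.0, "R*": 0.8, "T": 0.5}
pred = weighted_ensemble(probs, w)
```

In the full system the dictionaries would hold all six members (R+J, R+L, R*, T, F, D), with weights chosen on the validation split.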
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>
          Dataset. iMiGUE (identity-free video dataset for Micro-Gesture Understanding and Emotion analysis)
dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] consists of micro-gestures (MGs) primarily involving upper limbs, collected from post-match
press conference videos of professional tennis players. It includes 31 MG categories and an additional
non-MG class, comprising a total of 18,499 labeled MG samples annotated from 359 long video sequences
(ranging from 0.5 to 26 minutes), totaling approximately 3.77 million frames. The dataset provides two
modalities: RGB videos and corresponding 2D skeletal joint data extracted via OpenPose. iMiGUE
adopts a cross-subject evaluation protocol, splitting 72 subjects into 37 for training and 35 for testing,
with 12,893 samples in the training set, 777 in validation, and 4,562 in testing. In addition, we pretrain
the proposed method on the Micro-Action 52 [11] dataset and then fine-tune it on the iMiGUE dataset.
Micro-Action 52 is a large-scale, whole-body micro-action dataset collected by a professional interviewer
to capture unconscious human micro-action behaviors. The dataset contains 22,422 (22.4K) samples
interviewed from 205 participants, where the annotations are categorized into two levels: 7 body-level
and 52 action-level micro-action categories. There are 11,250, 5,586, and 5,586 instances in the training,
validation, and test sets, respectively.
Evaluation Metrics. For the micro-gesture classification challenge, we employ top-1 accuracy as the
evaluation metric to quantitatively assess classification performance.
        </p>
        <p>
          Implementation Details. The provided dataset includes original RGB videos and skeletal data
extracted using OpenPose, featuring 137 full-body keypoints. To streamline the input, we select 36 keypoints
covering the upper body, facial landmarks, and hands. We also enhance the data representation by generating
additional modalities: depth using the method by Chen et al. [27], Taylor video modality via Wang et
al.’s [25] approximation, and optical flow through Dong et al.’s [26] MemFlow approach. For modeling,
PoseConv3D [21] is used to capture spatial-temporal dynamics in skeletal information (J), limb
connections (L), and combined RGB with skeletal data (RGB+J and RGB+L). VideoSwin Transformer [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is
applied to RGB, depth, Taylor, and optical flow modalities for spatial-temporal processing. To enhance
robustness, we perform transfer learning with VideoSwinT: initially pretraining on RGB data from
Micro-Action 52 (MA-52) [11], followed by fine-tuning on the iMiGUE dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Finally, we employ an
ensemble fusion strategy, assigning weights to each modality based on contribution and correlation. We
integrate RGB*, Taylor, Flow, and Depth from VideoSwin, along with RGB+Joint and RGB+Limb
from PoseConv3D.
        </p>
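The "weights assigned based on contribution" step can be made concrete with a small validation-set search. The grid values and modality names below are hypothetical; the paper states only that weights were chosen empirically from validation performance.

```python
import itertools
import numpy as np

def grid_search_weights(val_probs, labels, grid=(0.5, 1.0, 1.5)):
    """Pick ensemble weights by top-1 accuracy on validation samples.

    val_probs: modality name -> (N, C) probability array on N val samples.
    labels:    (N,) ground-truth class indices.
    Returns (best_weights_dict, best_accuracy).
    """
    names = list(val_probs)
    best_w, best_acc = None, -1.0
    for ws in itertools.product(grid, repeat=len(names)):
        fused = sum(w * val_probs[n] for w, n in zip(ws, names))
        acc = (fused.argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = ws, acc
    return dict(zip(names, best_w)), best_acc

# Toy check: an informative modality plus an uninformative one.
probs = {
    "R+J": np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]),
    "T":   np.full((3, 3), 1.0 / 3.0),
}
labels = np.array([0, 1, 2])
w, acc = grid_search_weights(probs, labels)
```

An exhaustive grid is feasible here because only a handful of weight levels per modality are explored; any other validation-driven search would serve the same role.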
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>We evaluated the proposed method on the iMiGUE dataset and compared its performance against
state-of-the-art methods reported in the MiGA Challenges from 2023 to 2025. As presented in Table 1,
we provide the classification results of the top three competitors from these three consecutive editions,
clearly demonstrating the consistent superiority of our proposed method over previous best-performing
approaches across all years. Specifically, our approach achieved a Top-1 accuracy of 73.213%, ranking first
in the 2025 competition, significantly outperforming the second-place accuracy of 68.697%. Compared
with the best performance in the 2024 MiGA Challenge, our method realized an improvement of
approximately 3%, thus substantially exceeding the results from the 2023 edition as well.</p>
        <p>
          Here, we conduct comprehensive experimental settings to evaluate multiple modalities, including
skeleton data (joints and limbs), RGB frames, Taylor series approximation videos (Taylor), optical flow,
and depth information. As shown in Table 2, two backbone frameworks, namely PoseConv3D [21]
and VideoSwin [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], were employed to thoroughly explore performance across various modality
combinations. Experimental outcomes demonstrate that while single-modality inputs generally show
moderate competitiveness, they nevertheless yield relatively lower accuracies, highlighting the inherent
challenges of relying on a single modality in micro-gesture classification tasks. However, the
incorporation of multiple modalities consistently results in enhanced performance, clearly emphasizing the
complementary and distinctive nature of the various modalities in improving classification accuracy.
Leaderboards: the 1st MiGA-IJCAI Challenge (2023) Track 1: https://codalab.lisn.upsaclay.fr/competitions/11758#results;
the 2nd MiGA-IJCAI Challenge (2024) Track 1: https://www.kaggle.com/competitions/2nd-miga-ijcai-challengetrack1/leaderboard;
the 3rd MiGA-IJCAI Challenge (2025) Track 1: https://www.kaggle.com/competitions/the-3rd-mi-ga-ijcaichallenge-track-1/leaderboard.
        </p>
        <p>Our subsequent multimodal fusion experiments verify the complementary nature of diverse data
streams. Specifically, integrating skeleton (joint and limb) data with RGB frames results in an accuracy
improvement to 71.416%, clearly demonstrating the strength of combining structural and
appearance-based representations. Incorporating the Taylor modality further boosts accuracy to 72.096%, reflecting
benefits from pixel-level temporal-spatial approximations that effectively capture subtle dynamic
gestures. Additional integration of optical flow and depth modalities improves performance even
further, reaching an accuracy of 72.644%, confirming their roles as valuable supplementary information
sources. Ultimately, through an optimized multimodal fusion weighting strategy, our method achieves a
Top-1 accuracy of 73.213%. These results strongly affirm the advantages of properly designed multimodal
fusion techniques and emphasize the efficacy and robustness of the presented approach over previously
published state-of-the-art methods in micro-gesture recognition tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we proposed MM-Gesture, a novel multimodal ensemble framework for micro-gesture
recognition. Our method integrates complementary features from six modalities—skeleton, limb,
RGB, Taylor series approximation, optical flow, and depth—to leverage their distinct fine-grained
characteristics. Additionally, we employed transfer learning by pretraining the RGB-based model on
the Micro-Action 52 dataset before fine-tuning on the target iMiGUE dataset. Experiments demonstrate
that our multimodal fusion significantly outperforms baselines using a single modality or a subset of modalities. Our model
achieved a top-1 accuracy of 73.213% on the challenging iMiGUE dataset, ranking first in the 3rd MiGA
Competition at IJCAI 2025.</p>
      <p>For future work, we aim to explore the integration of multimodal large language models (MLLMs) [29,
30] and skeleton-based micro-gesture encoders. We plan to utilize MLLMs’ rich semantic understanding
and extensive prior knowledge to enhance micro-gesture recognition through interactive prompts and
contextual reasoning, further advancing multimodal and affective human behavior understanding.
Additionally, we will incorporate modalities such as gaze [31], audio [32], and remote photoplethysmography
(rPPG) [33] to enable comprehensive multimodal emotion analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the National Natural Science Foundation of China
(62272144, 72188101, 62020106007, and U20A20183), the Major Project of Anhui Province
(202203a05020011), the Fundamental Research Funds for the Central Universities (JZ2024HGTG0309),
and the Earth System Big Data Platform of the School of Earth Sciences, Zhejiang University.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT solely for grammar and spelling checks
and minor language refinement. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>References (continued)</title>
      <p>[11] D. Guo, K. Li, B. Hu, Y. Zhang, M. Wang, Benchmarking micro-action recognition: Dataset,
methods, and applications, IEEE Transactions on Circuits and Systems for Video Technology 34
(2024) 6238–6252.</p>
      <p>[12] K. Li, D. Guo, G. Chen, X. Peng, M. Wang, Joint skeletal and semantic embedding loss for
micro-gesture classification, arXiv preprint arXiv:2307.10624 (2023).</p>
      <p>[13] H. Huang, X. Guo, W. Peng, Z. Xia, Micro-gesture classification based on ensemble
hypergraph-convolution transformer, in: MiGA@IJCAI, 2023.</p>
      <p>[14] K. Li, P. Liu, D. Guo, F. Wang, Z. Wu, H. Fan, M. Wang, MMAD: Multi-label micro-action detection
in videos, arXiv preprint arXiv:2407.05311 (2024).</p>
      <p>[15] K. Li, D. Guo, G. Chen, F. Liu, M. Wang, Data augmentation for human behavior analysis in
multi-person conversations, in: Proceedings of the 31st ACM International Conference on Multimedia,
2023, pp. 9516–9520.</p>
      <p>[16] J. Gu, K. Li, F. Wang, Y. Wei, Z. Wu, H. Fan, M. Wang, Motion matters: Motion-guided
modulation network for skeleton-based micro-action recognition, in: Proceedings of the 33rd ACM
International Conference on Multimedia, 2025.</p>
      <p>[17] S. Sun, D. Liu, J. Dong, X. Qu, J. Gao, X. Yang, X. Wang, M. Wang, Unified multi-modal unsupervised
representation learning for skeleton-based action understanding, in: Proceedings of the 31st ACM
International Conference on Multimedia, 2023, pp. 2973–2984.</p>
      <p>[18] J. Dong, S. Sun, Z. Liu, S. Chen, B. Liu, X. Wang, Hierarchical contrast for unsupervised
skeleton-based action representation learning, in: Proceedings of the AAAI Conference on Artificial
Intelligence, volume 37, 2023, pp. 525–533.</p>
      <p>[19] G. Chen, F. Wang, K. Li, Z. Wu, H. Fan, Y. Yang, M. Wang, D. Guo, Prototype learning for
micro-gesture classification, arXiv preprint arXiv:2408.03097 (2024).</p>
      <p>[20] H. Huang, Y. Wang, L. Kerui, Z. Xia, Multi-modal micro-gesture classification via multi-scale
heterogeneous ensemble network, MiGA@IJCAI (2024).</p>
      <p>[21] H. Duan, Y. Zhao, K. Chen, D. Lin, B. Dai, Revisiting skeleton-based action recognition, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp.
2969–2978.</p>
      <p>[22] F. Wang, D. Guo, K. Li, M. Wang, EulerMormer: Robust Eulerian motion magnification via dynamic
filtering within transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 38, 2024, pp. 5345–5353.</p>
      <p>[23] F. Wang, D. Guo, K. Li, Z. Zhong, M. Wang, Frequency decoupling for motion magnification via
multi-level isomorphic architecture, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2024, pp. 18984–18994.</p>
      <p>[24] X. Huang, H. Zhou, K. Yao, K. Han, FROSTER: Frozen CLIP is a strong teacher for open-vocabulary
action recognition, arXiv preprint arXiv:2402.03241 (2024).</p>
      <p>[25] L. Wang, X. Yuan, T. Gedeon, L. Zheng, Taylor videos for action recognition, in: Forty-first
International Conference on Machine Learning, 2024.</p>
      <p>[26] Q. Dong, Y. Fu, MemFlow: Optical flow estimation and prediction with memory, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19068–19078.</p>
      <p>[27] S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, B. Kang, Video Depth Anything: Consistent
depth estimation for super-long videos, arXiv preprint arXiv:2501.12375 (2025).</p>
      <p>[28] H. Xu, L. Cheng, Y. Wang, S. Tang, Z. Zhong, Towards fine-grained emotion understanding via
skeleton-based micro-gesture recognition, arXiv preprint arXiv:2506.12848 (2025).</p>
      <p>[29] Y. Xu, L. Zhu, Y. Yang, MC-Bench: A benchmark for multi-context visual grounding in the era of
MLLMs, arXiv preprint arXiv:2410.12332 (2024).</p>
      <p>[30] Y. Xu, L. Zhu, Y. Yang, GG-Editor: Locally editing 3D avatars with multimodal large language
model guidance, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024,
pp. 10910–10919.</p>
      <p>[31] F. Liu, K. Li, Z. Zhong, W. Jia, B. Hu, X. Yang, M. Wang, D. Guo, Depth matters: Spatial
proximity-based gaze cone generation for gaze following in wild, ACM Transactions on Multimedia
Computing, Communications and Applications 20 (2024) 1–24.</p>
      <p>[32] J. Zhao, F. Wang, K. Li, Y. Wei, S. Tang, S. Zhao, X. Sun, Temporal-frequency state space duality:
An efficient paradigm for speech emotion recognition, in: ICASSP 2025-2025 IEEE International
Conference on Acoustics, Speech and Signal Processing, 2025, pp. 1–5.</p>
      <p>[33] W. Qian, K. Li, D. Guo, B. Hu, M. Wang, Cluster-Phys: Facial clues clustering towards efficient
remote physiological measurement, in: Proceedings of the 32nd ACM International Conference
on Multimedia, 2024, pp. 330–339.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning</article-title>
          ,
          <source>in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10631</fpage>
          -
          <lpage>10642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>131</volume>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>The 2nd challenge on micro-gesture analysis for hidden emotion understanding (MiGA) 2024: Dataset and results</article-title>
          ,
          <source>in: MiGA 2024: Proceedings of IJCAI 2024 Workshop &amp; Challenge on Micro-gesture Analysis for Hidden Emotion Understanding (MiGA 2024) co-located with 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Proposal-free video grounding with contextual pyramid network</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1902</fpage>
          -
          <lpage>1910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Exploiting ensemble learning for cross-view isolated sign language recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2502.02196</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Á. L.</given-names>
            <surname>Tánczos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. v.</given-names>
            <surname>Liechtenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bremond</surname>
          </string-name>
          ,
          <article-title>Bodily behaviors in social interaction: Novel annotations and state-of-the-art evaluation</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kälviäinen</surname>
          </string-name>
          ,
          <article-title>DEEMO: De-identity multimodal emotion recognition and reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2504.19549</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Video swin transformer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3202</fpage>
          -
          <lpage>3211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Prototypical calibrating ambiguous samples for micro-action recognition</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>39</volume>
          ,
          <year>2025</year>
          , pp.
          <fpage>4815</fpage>
          -
          <lpage>4823</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>