<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jihao Gu</string-name>
          <email>jihao.gu.23@ucl.ac.uk</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Wang</string-name>
          <email>jiafei127@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kun Li</string-name>
          <email>kunli.hfut@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanyan Wei</string-name>
          <email>weiyy@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiliang Wu</string-name>
          <email>wu_zhiliang@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Guo</string-name>
          <email>guodan@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <kwd-group>
          <kwd>Micro-Gesture</kwd>
          <kwd>Action Recognition</kwd>
          <kwd>Multi-modal</kwd>
          <kwd>Ensemble Fusion</kwd>
          <kwd>Transfer Learning</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Artificial Intelligence, Hefei Comprehensive National Science Center</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ReLER, CCAI, Zhejiang University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computer Science and Information Engineering, School of Artificial Intelligence, Hefei University of Technology</institution>
          ,
          <addr-line>HFUT</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University College London (UCL)</institution>
          ,
          <addr-line>Gower Street, London, WC1E 6BT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Xinsight Lab, Research Institute, Hefei Zhongjuyuan Intelligent Technology Co., Ltd.</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present MM-Gesture, the solution developed by our team HFUT-VUT, which ranked 1st in the micro-gesture classification track of the 3rd MiGA Challenge at IJCAI 2025, achieving superior performance compared to previous state-of-the-art methods. MM-Gesture is a multimodal fusion framework designed specifically for recognizing subtle and short-duration micro-gestures (MGs), integrating complementary cues from joint, limb, RGB video, Taylor-series video, optical-flow video, and depth video modalities. Utilizing PoseConv3D and Video Swin Transformer architectures with a novel modality-weighted ensemble strategy, our method further enhances RGB modality performance through transfer learning pre-trained on the larger MA-52 dataset. Extensive experiments on the iMiGUE benchmark, including ablation studies across different modalities, validate the effectiveness of our proposed approach, achieving a top-1 accuracy of 73.213%. Code is available at: https://github.com/momiji-bit/MM-Gesture.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Our method builds on PoseConv3D [21] and the Video Swin Transformer [
        <xref ref-type="bibr" rid="ref6 ref9">9, 6</xref>
        ], integrating information across six complementary
modalities: joint, limb, RGB video, Taylor video, optical flow video, and depth video. In addition,
to enhance the performance of the RGB modality, we apply transfer learning by pre-training on the
Micro-Action 52 dataset [11] and fine-tuning on the iMiGUE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The key contributions of this paper can be summarized as follows:
• We present an integrated multi-modal MGs classification network that utilizes complementary
information from six diverse modalities: joint, limb, RGB video, Taylor video, optical flow video,
and depth video.
• We propose an effective ensemble fusion method capable of efficiently integrating six modalities,
enabling the joint exploitation of modality-specific strengths for improved MGs classification
accuracy.
• Extensive experiments on the iMiGUE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] demonstrate that the proposed MM‑Gesture
achieves state-of-the-art performance, reaching a Top-1 accuracy of 73.213%, which is the highest
reported accuracy across previous Micro-gesture Analysis (MiGA) challenges.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Micro-Gestures (MGs) are becoming increasingly important in understanding human emotions, focusing
on subtle body movements in daily interactions. Advances in this field have been driven by the
development of large benchmark datasets and sophisticated model architectures [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3, 11</xref>
        ]. Key
datasets include the SMG dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which consists of recordings from 40 participants engaged in
storytelling, capturing upper limb micro-gestures and emotional states. The iMiGUE dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] offers
identity-free videos of 72 athletes at press conferences, annotated with 32 micro-gesture categories
for analyzing both actions and emotions. The MA-52 dataset [11] expands the focus to full-body
micro-actions, with 22,000 samples covering 52 action-level and 7 body-level categories, sourced from
psychological interviews to recognize subtle visual cues.
      </p>
      <p>
        Current models primarily focus on limited modalities. RGB-based methods leverage spatial-temporal
modeling strategies, such as a pure Transformer backbone with shifted 3D local attention windows [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
MANet [11] integrates SE and TSM modules with semantic embedding loss for fine-grained
micro-action recognition. Skeleton-based approaches include a 3D-CNN model with joint and semantic
embedding losses [12], and an EHCT framework [13] employs hypergraph-based attention and ensemble
Transformers [22, 23] to capture high-order joint relations and address class imbalance. In contrast,
skeleton sequences can be encoded as 3D heatmaps and fused with RGB inputs through a dual-branch
multimodal network [21]. Inspired by this network, Chen et al. [19] adopt channel-wise cross-attention
and prototype refinement to enhance feature fusion and category discrimination, while Huang et al. [24]
design a multi-scale heterogeneous fusion network. Recently, Li et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose a hierarchical
prototype-based calibration method to resolve ambiguity in fine-grained actions. Overall, current
methods focus only on RGB or skeleton data.
      </p>
      <p>To exploit the complementarity between different multimodal data, we propose the MM-Gesture
model, adopting a comprehensive multimodal approach that integrates six modalities: joint, limb, RGB
video, Taylor video, optical flow video, and depth video. This approach enables a deeper understanding
and representation of micro-gestures, capturing their nuances and dynamics. Additionally, we leverage
transfer learning from the MA-52 dataset to infuse valuable prior knowledge into the RGB modality,
further enhancing its recognition accuracy. Consequently, our model improves performance on existing
benchmarks and paves the way for advanced applications in human emotion understanding through
micro-gesture analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Data Pre-processing</title>
        <p>We adopted the RGB videos (R ∈ ℝ^(T×H×W×3)) provided by the official dataset, along with a subset of V = 36
skeleton keypoints selected from the original 137 points, to form the input joint data (J ∈ ℝ^(T×V×2)).</p>
        <p>These cleaned keypoints focus specifically on the upper body, hands, and facial joints. Additionally, we
constructed input limb data (L ∈ ℝ^(T×E×2)) by computing spatial differences between the E adjacent joint pairs
defined by the skeletal edges connecting the selected keypoints.</p>
        <p>To effectively capture multi-modal gesture information, we employ advanced, off-the-shelf modality
extraction methods to generate complementary auxiliary modalities. Specifically, we utilize
Taylor-series temporal expansion videos, optical-flow videos, and depth-estimation videos, each modality
providing distinct yet complementary gesture-related information. By leveraging the ensemble among
these diverse modalities, our proposed MM-Gesture model effectively exploits multi-modal feature
complementarity.</p>
        <p>T ∈ ℝ^((T−k)×H×W×3),  T_t = ℱ_taylor(R_{t:t+k}),</p>
        <p>F ∈ ℝ^((T−1)×H×W×3),  F_t = ℱ_flow(R_{t:t+1}),</p>
        <p>D ∈ ℝ^(T×H×W×3),  D_t = ℱ_depth(R_t),
(1)
where each symbol is defined as follows:
• T: temporal length of the input RGB video.
• H, W: height and width of the input RGB video frames.
• k: temporal window length for computing the truncated Taylor-series expansion.
• R_t: the RGB frame at time step t.
• ℱ_taylor: the Taylor-series-based video calculated according to the approach of [25], where the maximum order of the truncated Taylor-series expansion and the temporal window length k govern the aggregation of local temporal context.
• ℱ_flow: the optical-flow-based modality computed using the MemFlow network [26], which estimates optical-flow representations F_t from consecutive frames R_t and R_{t+1}.
• ℱ_depth: the depth-estimation-based modality generated using the monocular depth estimation algorithm [27], resulting in depth representations D_t.</p>
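To make the joint/limb construction above concrete, here is a minimal numpy sketch of computing the limb data L as spatial differences between edge-connected joints. The edge list and array sizes are toy placeholders, not the actual 36-keypoint upper-body/hand/face skeleton used in the paper.

```python
import numpy as np

def build_limb_data(joints: np.ndarray, edges: list) -> np.ndarray:
    """Limb data L: spatial differences between adjacent joint pairs.

    joints: (T, V, 2) array of 2D keypoint coordinates over T frames.
    edges:  list of (parent, child) index pairs defining the skeletal edges.
    Returns an array of shape (T, len(edges), 2), one 2D vector per limb.
    """
    parents = np.array([p for p, _ in edges])
    children = np.array([c for _, c in edges])
    # Difference between the two endpoints of every edge, for all frames.
    return joints[:, children, :] - joints[:, parents, :]

# Toy example: 4 frames, 3 keypoints, 2 edges (hypothetical skeleton).
J = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
E = [(0, 1), (1, 2)]
L = build_limb_data(J, E)
```

This mirrors the L ∈ ℝ^(T×E×2) definition above: each limb feature is the child-joint coordinate minus the parent-joint coordinate.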
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Network Architecture</title>
        <p>As shown in Figure 1, the proposed multi-modal micro-gesture recognition framework (MM-Gesture)
consists of three main modules:</p>
        <p>Cross-Modal Fusion Module: In this module, skeletal coordinates are initially transformed into
Gaussian heatmap-based 3D volumes (H) for Joint and Limb modalities individually. RGB, Joint, and
Limb modalities are all separately trained through PoseConv3D [21], capturing spatial-temporal skeleton
dynamics and RGB spatial context, respectively. Subsequently, the extracted RGB and skeleton features
are combined via a cross-modal fusion training stage to exploit complementary information between
these modalities comprehensively.</p>
        <p>
          Uni-Modal Encoding Module: We leverage the VideoSwinT network [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to independently encode
four distinct modalities: RGB frames, Taylor-based temporal encoding, optical flow (computed via
MemFlow), and depth estimates. Specifically, for the RGB modality, we first employ transfer learning by
pretraining VideoSwinT on the MA-52 dataset and subsequently fine-tune the pretrained model on the
iMiGUE dataset. For the remaining modalities (Taylor, optical flow, and depth), VideoSwinT is directly
trained from scratch on the iMiGUE dataset. VideoSwinT uses a 3D shifted-window self-attention
mechanism that effectively captures fine-grained spatial-temporal details within each modality.
        </p>
        <p>Ensemble Module: Probabilities from the PoseConv3D Cross-Modal Fusion Module and VideoSwinT
Uni-Modal Encoding Module are combined via weighted ensemble, with weights set empirically
according to validation performance. This integration approach effectively exploits modality complementarity,
improving robustness and accuracy in micro-gesture recognition.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. PoseConv3D Cross-Modal Fusion Module</title>
        <p>To effectively align skeleton-based information (consisting of joints and limbs) with RGB video
representations and facilitate fine-grained complementary interactions across these modalities, we adopt
PoseConv3D [21] for cross-modal integration.</p>
        <p>Specifically, we first transform the 2D coordinates of skeletal keypoints into heatmap-based
representations. By applying Gaussian distributions and calculating the heatmap values using the point-to-segment
distance formula, we compute and stack the heatmaps of each keypoint across all frames to generate
3D heatmap volumes. The resulting heatmaps are as follows:</p>
        <p>H_J ∈ ℝ^(V×T×H×W),  H_L ∈ ℝ^(E×T×H×W),
(2)
where H_J denotes the joint-position heatmaps, and H_L denotes the limb-connection heatmaps. Here, T
is the total number of frames, V is the number of skeletal joints, and E is the number of skeletal limbs
(connections between joints). H and W represent the spatial resolution (height and width) of each
heatmap. Subsequently, the RGB frames R ∈ ℝ^(T×H×W×3) and skeleton heatmaps H_J, H_L are taken as
input data.</p>
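The Gaussian heatmap construction can be sketched as follows. This is a simplified illustration covering joint heatmaps only, with an assumed sigma value; PoseConv3D's actual pipeline also builds limb heatmaps from the point-to-segment distance mentioned above.

```python
import numpy as np

def joint_heatmaps(joints, H, W, sigma=0.6):
    """Turn 2D keypoints into Gaussian heatmap volumes.

    joints: (T, V, 2) array of (x, y) coordinates.
    Returns heatmaps of shape (V, T, H, W), one Gaussian bump per
    keypoint per frame, matching the H_J layout in eq. (2).
    """
    T, V, _ = joints.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.zeros((V, T, H, W))
    for t in range(T):
        for v in range(V):
            x, y = joints[t, v]
            # Isotropic Gaussian centered on the keypoint location.
            out[v, t] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return out

# One frame, one keypoint at (x=2, y=3) on an 8x8 grid.
hm = joint_heatmaps(np.array([[[2.0, 3.0]]]), H=8, W=8)
```

Stacking these per-frame heatmaps yields the 3D heatmap volumes that PoseConv3D consumes alongside the RGB frames.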
        <p>Prior to network training, data augmentation processes (e.g., scaling, cropping) are consistently
applied to both RGB video frames and skeleton heatmap modalities to enhance data diversity and improve
model robustness. Subsequently, the augmented data from each modality is separately forwarded into
the PoseConv3D module, which extracts deep spatiotemporal feature representations. The PoseConv3D
network generates modality-specific predictions denoted formally as ŷ_m, where m ∈ {R, J, L} indicates the
RGB, joint heatmap, and limb heatmap modalities, respectively. Each modality-specific network is
initially pretrained independently by minimizing the cross-entropy (CE) classification loss:
ℒ_m = CE(ŷ_m, y),  m ∈ {R, J, L},
(3)
where y denotes the ground-truth action labels.</p>
        <p>Next, we conduct a joint fine-tuning procedure by simultaneously optimizing combined RGB and
skeleton-based modalities using the following paired-training losses:
ℒ_{R+J} = ℒ_R + ℒ_J,  ℒ_{R+L} = ℒ_R + ℒ_L.
(4)</p>
        <p>During model inference, the predictions yielded by distinct modalities are integrated at the probability
level via a late fusion strategy. Formally, let P_⋆ = SoftMax(ŷ_⋆), ⋆ ∈ {R, J, L}, represent the modality-specific
probability distributions. We then fuse predictions through average fusion to achieve the final predictive
distributions:
P_{R+J} = (1/2)(P_R + P_J),  P_{R+L} = (1/2)(P_R + P_L).
(5)</p>
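The average late fusion described above can be sketched in a few lines; the logit values here are illustrative inputs, not outputs of the actual networks.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def average_fusion(logits_a, logits_b):
    """Average the two modality-specific probability distributions,
    mirroring P = (P_a + P_b) / 2 from the late-fusion step above."""
    return 0.5 * (softmax(logits_a) + softmax(logits_b))

# Toy class logits from two modalities (e.g. RGB and joint heatmaps).
p = average_fusion(np.array([1.0, 2.0, 0.5]), np.array([0.2, 1.5, 0.1]))
```

Because each input is a valid distribution, the averaged output also sums to one, so no renormalization is needed for this two-way fusion.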
      </sec>
      <sec id="sec-3-4">
        <title>3.4. VideoSwinT Uni-Modal Encoding Module</title>
        <p>
          Unlike existing skeleton-video modality fusion methods, we propose a multimodal framework based on
the VideoSwinT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which encodes RGB video, optical flow video, Taylor-expanded video, and depth
video. This encoding strategy effectively integrates color, texture, dynamic motion, and geometric
structural information to better capture multidimensional micro-action features, thus enabling more
fine-grained action recognition.
        </p>
        <p>Specifically, we independently optimize each modality-specific backbone by minimizing the
cross-entropy (CE) classification loss. Prior to training on the target iMiGUE dataset, the RGB modality
network is initially pretrained on the MA-52 dataset (R⋆ ∈ ℝ^(T×H×W×3)) [11], which provides extensive
coverage of 52 types of micro-actions. After pretraining, the RGB modality network is fine-tuned on
the iMiGUE dataset along with other modalities. The loss functions for pretraining and fine-tuning,
along with the probability computation, are formulated as follows:
ℒ_m = CE(ŷ_m, y),  m ∈ {R⋆, R, T, F, D},
(6)</p>
        <p>P_m = SoftMax(ŷ_m),  m ∈ {R, T, F, D}.
(7)</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Ensemble Module</title>
        <p>In the final ensemble stage, we introduce a probability-based weighted fusion strategy to effectively
aggregate predictions derived from multiple modality-specific networks. Specifically, class probability
vectors independently output by the PoseConv3D (RGB+J, RGB+L) and VideoSwin Transformer
(RGB∗, Taylor, Flow, Depth) models are integrated using empirically determined weights obtained
via validation-set performance.</p>
        <p>The ensemble prediction (P_final ∈ ℝ^C, where C is the number of classes) is computed by summing the
weighted contributions of individual modality-specific probabilities, as follows:</p>
        <p>P_final = ∑_m w_m P_m,  m ∈ {R+J, R+L, R, T, F, D},
where each weight w_m is selected based on the classification performance observed on validation samples.</p>
        <p>This proposed ensemble-based fusion mechanism enables comprehensive exploitation of the
complementary strengths inherent in multiple modality-specific models, thereby significantly improving the
robustness and overall effectiveness of our multi-modal micro-gesture recognition framework.</p>
      </sec>
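The weighted ensemble rule P_final = ∑ w_m P_m can be sketched as below. The modality names and weight values are illustrative placeholders, and the final renormalization is our addition so the result remains a probability distribution; the paper only specifies the weighted sum with validation-tuned weights.

```python
import numpy as np

def weighted_ensemble(probs: dict, weights: dict) -> np.ndarray:
    """Weighted late fusion of modality-specific class probabilities.

    probs:   modality name -> (C,) probability vector.
    weights: modality name -> scalar weight (need not sum to 1).
    """
    total = sum(weights[m] * probs[m] for m in probs)
    # Renormalize so the fused scores form a distribution (our addition).
    return total / total.sum()

# Toy probabilities for three of the six ensemble members.
probs = {
    "R+J": np.array([0.6, 0.3, 0.1]),
    "R*":  np.array([0.5, 0.4, 0.1]),
    "T":   np.array([0.2, 0.7, 0.1]),
}
w = {"R+J": 1.0, "R*": 0.8, "T": 0.5}
pred = weighted_ensemble(probs, w)
```

In the full system the dictionaries would hold all six members (R+J, R+L, R*, T, F, D), with weights chosen on the validation split.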
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>
          Dataset. iMiGUE (identity-free video dataset for Micro-Gesture Understanding and Emotion analysis)
dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] consists of micro-gestures (MGs) primarily involving upper limbs, collected from post-match
press conference videos of professional tennis players. It includes 31 MG categories and an additional
non-MG class, comprising a total of 18,499 labeled MG samples annotated from 359 long video sequences
(ranging from 0.5 to 26 minutes), totaling approximately 3.77 million frames. The dataset provides two
modalities: RGB videos and corresponding 2D skeletal joint data extracted via OpenPose. iMiGUE
adopts a cross-subject evaluation protocol, splitting 72 subjects into 37 for training and 35 for testing,
with 12,893 samples in the training set, 777 in validation, and 4,562 in testing. In addition, we pretrain
the proposed method on the Micro-Action 52 [11] dataset and then fine-tune it on the iMiGUE dataset.
Micro-Action 52 is a large-scale, whole-body micro-action dataset collected by a professional interviewer
to capture unconscious human micro-action behaviors. The dataset contains 22,422 (22.4K) samples
interviewed from 205 participants, where the annotations are categorized into two levels: 7 body-level
and 52 action-level micro-action categories. There are 11,250, 5,586, and 5,586 instances in the training,
validation, and test sets, respectively.
Evaluation Metrics. For the micro-gesture classification challenge, we employ top-1 accuracy as the
evaluation metric to quantitatively assess classification performance.
        </p>
        <p>
          Implementation Details. The provided dataset includes original RGB videos and skeletal data
extracted using OpenPose, featuring 137 full-body keypoints. To streamline the input, we select 36 keypoints
covering the upper body, facial landmarks, and hands. We also enhance the data representation by generating
additional modalities: depth using the method by Chen et al. [27], Taylor video modality via Wang et
al.’s [25] approximation, and optical flow through Dong et al.’s [26] MemFlow approach. For modeling,
PoseConv3D [21] is used to capture spatial-temporal dynamics in skeletal information (J), limb
connections (L), and combined RGB with skeletal data (RGB+J and RGB+L). VideoSwin Transformer [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is
applied to RGB, depth, Taylor, and optical flow modalities for spatial-temporal processing. To enhance
robustness, we perform transfer learning with VideoSwinT: initially pretraining on RGB data from
Micro-Action 52 (MA-52) [11], followed by fine-tuning on the iMiGUE dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Finally, we employ an
ensemble fusion strategy, assigning weights to each modality based on contribution and correlation. We
integrate RGB*, Taylor, Flow, and Depth from VideoSwin, along with RGB+Joint and RGB+Limb
from PoseConv3D.
        </p>
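The "weights assigned based on contribution" step can be made concrete with a small validation-set search. The grid values and modality names below are hypothetical; the paper states only that weights were chosen empirically from validation performance.

```python
import itertools
import numpy as np

def grid_search_weights(val_probs, labels, grid=(0.5, 1.0, 1.5)):
    """Pick ensemble weights by top-1 accuracy on validation samples.

    val_probs: modality name -> (N, C) probability array on N val samples.
    labels:    (N,) ground-truth class indices.
    Returns (best_weights_dict, best_accuracy).
    """
    names = list(val_probs)
    best_w, best_acc = None, -1.0
    for ws in itertools.product(grid, repeat=len(names)):
        fused = sum(w * val_probs[n] for w, n in zip(ws, names))
        acc = (fused.argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = ws, acc
    return dict(zip(names, best_w)), best_acc

# Toy check: an informative modality plus an uninformative one.
probs = {
    "R+J": np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]),
    "T":   np.full((3, 3), 1.0 / 3.0),
}
labels = np.array([0, 1, 2])
w, acc = grid_search_weights(probs, labels)
```

An exhaustive grid is feasible here because only a handful of weight levels per modality are explored; any other validation-driven search would serve the same role.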
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Results</title>
        <p>We evaluated the proposed method on the iMiGUE dataset and compared its performance against
state-of-the-art methods reported in the MiGA Challenges from 2023 to 2025. As presented in Table 1,
we provide the classification results of the top three competitors from these three consecutive editions,
clearly demonstrating the consistent superiority of our proposed method over previous best-performing
approaches across all years. Specifically, our approach achieved a Top-1 accuracy of 73.213%, ranking first
in the 2025 competition, significantly outperforming the second-place accuracy of 68.697%. Compared
with the best performance in the 2024 MiGA Challenge, our method realized an improvement of
approximately 3%, thus substantially exceeding the results from the 2023 edition as well.</p>
        <p>
          Here, we conduct comprehensive experimental settings to evaluate multiple modalities, including
skeleton data (joints and limbs), RGB frames, Taylor series approximation videos (Taylor), optical flow,
and depth information. As shown in Table 2, two backbone frameworks, namely PoseConv3D [21]
and VideoSwin [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], were employed to thoroughly explore performance across various modality
combinations. Experimental outcomes demonstrate that while single-modality inputs generally show
moderate competitiveness, they nevertheless yield relatively lower accuracies, highlighting the inherent
challenges of relying on a single modality in micro-gesture classification tasks. However, the
incorporation of multiple modalities consistently results in enhanced performance, clearly emphasizing the
complementary and distinctive nature of the various modalities in improving classification accuracy.
Leaderboards: the 1st MiGA-IJCAI Challenge (2023) Track 1: https://codalab.lisn.upsaclay.fr/competitions/11758#results;
the 2nd MiGA-IJCAI Challenge (2024) Track 1: https://www.kaggle.com/competitions/2nd-miga-ijcai-challengetrack1/leaderboard;
the 3rd MiGA-IJCAI Challenge (2025) Track 1: https://www.kaggle.com/competitions/the-3rd-mi-ga-ijcaichallenge-track-1/leaderboard.
        </p>
        <p>Our subsequent multimodal fusion experiments verify the complementary nature of diverse data
streams. Specifically, integrating skeleton (joint and limb) data with RGB frames results in an accuracy
improvement to 71.416%, clearly demonstrating the strength of combining structural and
appearance-based representations. Incorporating the Taylor modality further boosts accuracy to 72.096%, reflecting
benefits from pixel-level temporal-spatial approximations that effectively capture subtle dynamic
gestures. Additional integration of optical flow and depth modalities improves performance even
further, reaching an accuracy of 72.644%, confirming their roles as valuable supplementary information
sources. Ultimately, through an optimized multimodal fusion weighting strategy, our method achieves a
Top-1 accuracy of 73.213%. These results strongly affirm the advantages of properly designed multimodal
fusion techniques and emphasize the efficacy and robustness of the presented approach over previously
published state-of-the-art methods in micro-gesture recognition tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we proposed MM-Gesture, a novel multimodal ensemble framework for micro-gesture
recognition. Our method integrates complementary features from six modalities—skeleton, limb,
RGB, Taylor series approximation, optical flow, and depth—to leverage their distinct fine-grained
characteristics. Additionally, we employed transfer learning by pretraining the RGB-based model on
the Micro-Action 52 dataset before fine-tuning on the target iMiGUE dataset. Experiments demonstrate
that our multimodal fusion significantly outperforms baselines using a single modality or a subset of modalities. Our model
achieved a top-1 accuracy of 73.213% on the challenging iMiGUE dataset, ranking first in the 3rd MiGA
Competition at IJCAI 2025.</p>
      <p>For future work, we aim to explore the integration of multimodal large language models (MLLMs) [29,
30] and skeleton-based micro-gesture encoders. We plan to utilize MLLMs’ rich semantic understanding
and extensive prior knowledge to enhance micro-gesture recognition through interactive prompts and
contextual reasoning, further advancing multimodal and affective human behavior understanding.
Additionally, we will incorporate modalities such as gaze [31], audio [32], and remote photoplethysmography
(rPPG) [33] to enable comprehensive multimodal emotion analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the National Natural Science Foundation of China
(62272144, 72188101, 62020106007, and U20A20183), the Major Project of Anhui Province
(202203a05020011), the Fundamental Research Funds for the Central Universities (JZ2024HGTG0309),
and the Earth System Big Data Platform of the School of Earth Sciences, Zhejiang University.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT solely for grammar and spelling checks
and minor language refinement. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>References (continued)</title>
      <p>[11] D. Guo, K. Li, B. Hu, Y. Zhang, M. Wang, Benchmarking micro-action recognition: Dataset,
methods, and applications, IEEE Transactions on Circuits and Systems for Video Technology 34
(2024) 6238–6252.</p>
      <p>[12] K. Li, D. Guo, G. Chen, X. Peng, M. Wang, Joint skeletal and semantic embedding loss for
micro-gesture classification, arXiv preprint arXiv:2307.10624 (2023).</p>
      <p>[13] H. Huang, X. Guo, W. Peng, Z. Xia, Micro-gesture classification based on ensemble
hypergraph-convolution transformer, in: MiGA@IJCAI, 2023.</p>
      <p>[14] K. Li, P. Liu, D. Guo, F. Wang, Z. Wu, H. Fan, M. Wang, MMAD: Multi-label micro-action detection
in videos, arXiv preprint arXiv:2407.05311 (2024).</p>
      <p>[15] K. Li, D. Guo, G. Chen, F. Liu, M. Wang, Data augmentation for human behavior analysis in
multi-person conversations, in: Proceedings of the 31st ACM International Conference on Multimedia,
2023, pp. 9516–9520.</p>
      <p>[16] J. Gu, K. Li, F. Wang, Y. Wei, Z. Wu, H. Fan, M. Wang, Motion matters: Motion-guided
modulation network for skeleton-based micro-action recognition, in: Proceedings of the 33rd ACM
International Conference on Multimedia, 2025.</p>
      <p>[17] S. Sun, D. Liu, J. Dong, X. Qu, J. Gao, X. Yang, X. Wang, M. Wang, Unified multi-modal unsupervised
representation learning for skeleton-based action understanding, in: Proceedings of the 31st ACM
International Conference on Multimedia, 2023, pp. 2973–2984.</p>
      <p>[18] J. Dong, S. Sun, Z. Liu, S. Chen, B. Liu, X. Wang, Hierarchical contrast for unsupervised
skeleton-based action representation learning, in: Proceedings of the AAAI Conference on Artificial
Intelligence, volume 37, 2023, pp. 525–533.</p>
      <p>[19] G. Chen, F. Wang, K. Li, Z. Wu, H. Fan, Y. Yang, M. Wang, D. Guo, Prototype learning for
micro-gesture classification, arXiv preprint arXiv:2408.03097 (2024).</p>
      <p>[20] H. Huang, Y. Wang, L. Kerui, Z. Xia, Multi-modal micro-gesture classification via multi-scale
heterogeneous ensemble network, MiGA@IJCAI (2024).</p>
      <p>[21] H. Duan, Y. Zhao, K. Chen, D. Lin, B. Dai, Revisiting skeleton-based action recognition, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp.
2969–2978.</p>
      <p>[22] F. Wang, D. Guo, K. Li, M. Wang, EulerMormer: Robust Eulerian motion magnification via dynamic
filtering within transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 38, 2024, pp. 5345–5353.</p>
      <p>[23] F. Wang, D. Guo, K. Li, Z. Zhong, M. Wang, Frequency decoupling for motion magnification via
multi-level isomorphic architecture, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2024, pp. 18984–18994.</p>
      <p>[24] X. Huang, H. Zhou, K. Yao, K. Han, FROSTER: Frozen CLIP is a strong teacher for open-vocabulary
action recognition, arXiv preprint arXiv:2402.03241 (2024).</p>
      <p>[25] L. Wang, X. Yuan, T. Gedeon, L. Zheng, Taylor videos for action recognition, in: Forty-first
International Conference on Machine Learning, 2024.</p>
      <p>[26] Q. Dong, Y. Fu, MemFlow: Optical flow estimation and prediction with memory, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19068–19078.</p>
      <p>[27] S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, B. Kang, Video Depth Anything: Consistent
depth estimation for super-long videos, arXiv preprint arXiv:2501.12375 (2025).</p>
      <p>[28] H. Xu, L. Cheng, Y. Wang, S. Tang, Z. Zhong, Towards fine-grained emotion understanding via
skeleton-based micro-gesture recognition, arXiv preprint arXiv:2506.12848 (2025).</p>
      <p>[29] Y. Xu, L. Zhu, Y. Yang, MC-Bench: A benchmark for multi-context visual grounding in the era of
MLLMs, arXiv preprint arXiv:2410.12332 (2024).</p>
      <p>[30] Y. Xu, L. Zhu, Y. Yang, GG-Editor: Locally editing 3D avatars with multimodal large language
model guidance, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024,
pp. 10910–10919.</p>
      <p>[31] F. Liu, K. Li, Z. Zhong, W. Jia, B. Hu, X. Yang, M. Wang, D. Guo, Depth matters: Spatial
proximity-based gaze cone generation for gaze following in wild, ACM Transactions on Multimedia
Computing, Communications and Applications 20 (2024) 1–24.</p>
      <p>[32] J. Zhao, F. Wang, K. Li, Y. Wei, S. Tang, S. Zhao, X. Sun, Temporal-frequency state space duality:
An efficient paradigm for speech emotion recognition, in: ICASSP 2025-2025 IEEE International
Conference on Acoustics, Speech and Signal Processing, 2025, pp. 1–5.</p>
      <p>[33] W. Qian, K. Li, D. Guo, B. Hu, M. Wang, Cluster-Phys: Facial clues clustering towards efficient
remote physiological measurement, in: Proceedings of the 32nd ACM International Conference
on Multimedia, 2024, pp. 330–339.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning</article-title>
          ,
          <source>in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10631</fpage>
          -
          <lpage>10642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>131</volume>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>The 2nd challenge on micro-gesture analysis for hidden emotion understanding (MiGA) 2024: Dataset and results</article-title>
          ,
          <source>in: MiGA 2024: Proceedings of IJCAI 2024 Workshop &amp; Challenge on Micro-gesture Analysis for Hidden Emotion Understanding (MiGA 2024) co-located with 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Proposal-free video grounding with contextual pyramid network</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1902</fpage>
          -
          <lpage>1910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Exploiting ensemble learning for cross-view isolated sign language recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2502.02196</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Á. L.</given-names>
            <surname>Tánczos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. v.</given-names>
            <surname>Liechtenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bremond</surname>
          </string-name>
          ,
          <article-title>Bodily behaviors in social interaction: Novel annotations and state-of-the-art evaluation</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kälviäinen</surname>
          </string-name>
          ,
          <article-title>DEEMO: De-identity multimodal emotion recognition and reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2504.19549</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Video swin transformer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3202</fpage>
          -
          <lpage>3211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Prototypical calibrating ambiguous samples for micro-action recognition</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>39</volume>
          ,
          <year>2025</year>
          , pp.
          <fpage>4815</fpage>
          -
          <lpage>4823</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>