<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Skeleton-Based Micro-Gesture Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hao Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lechao Cheng</string-name>
          <email>chenglc@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaxiong Wang</string-name>
          <email>wangyx15@stu.xjtu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shengeng Tang</string-name>
          <email>tangsg@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhun Zhong</string-name>
          <email>zhunzhong007@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Guangzhou, China.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hefei University of Technology</institution>
          ,
          <addr-line>No. 485, Danxia Road, Shushan District, Hefei, 230601</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>We present our solution to the MiGA Challenge at IJCAI 2025, which aims to recognize micro-gestures (MGs) from skeleton sequences for the purpose of hidden emotion understanding. MGs are characterized by their subtlety, short duration, and low motion amplitude, making them particularly challenging to model and classify. We adopt PoseC3D as the baseline framework and introduce three key enhancements: (1) a topology-aware skeleton representation specifically designed for the iMiGUE dataset to better capture fine-grained motion patterns; (2) an improved temporal processing strategy that facilitates smoother and more temporally consistent motion modeling; and (3) the incorporation of semantic label embeddings as auxiliary supervision to improve the model’s generalization ability. Our method achieves a Top-1 accuracy of 67.01% on the iMiGUE test set. As a result of these contributions, our approach ranks third on the official MiGA Challenge leaderboard. The source code is available at https://github.com/EGO-False-Sleep/Miga25_track1.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-gesture</kwd>
        <kwd>action classification</kwd>
        <kwd>data preprocessing</kwd>
        <kwd>skeleton-based action recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our main contributions are summarized as follows:
• We design a topology-aware skeleton representation tailored to the iMiGUE dataset, augmenting the
standard body skeleton with facial keypoints to better capture fine-grained motion patterns.
• We introduce an improved temporal sampling and alignment strategy that departs from the
original ST-GCN formulation. This approach enhances motion continuity and enables a more
coherent representation of raw skeleton sequences.
• Our complete pipeline, including the proposed topological and temporal enhancements, achieves
a Top-1 accuracy of 67.01% on the iMiGUE test set.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this challenge, we experiment with two representative skeleton-based action recognition frameworks:
ST-GCN and PoseC3D. This section details our design choices, empirical findings, and analysis based
on both architectures.</p>
      <sec id="sec-2-1">
        <title>2.1. Skeleton Augmentation with Facial Keypoints</title>
        <p>Human skeletal connectivity is spatially consistent and relatively easy for graph-based models to learn.
However, in the iMiGUE micro-gesture recognition task, many action categories—such as touching
the face, adjusting a hat, or biting lips—are localized in the facial region. To enhance facial motion
perception, we extend the standard 22-joint OpenPose skeleton to a 41-joint structure by incorporating
additional facial landmarks (e.g., cheeks, eyebrows, and lips). This augmentation provides finer spatial
resolution in regions critical to emotion-related micro-gestures.</p>
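        <p>To make the augmentation concrete, the following is a minimal sketch of how the extended joint set could be assembled. The joint counts come from the text above, but the index layout, the nose index, and the choice of attaching facial joints to the nose are illustrative assumptions rather than the exact iMiGUE/OpenPose convention.</p>
        <preformat>
# Minimal sketch of extending a 22-joint body skeleton with facial landmarks.
# The index layout and facial edge pattern below are illustrative assumptions.

NUM_BODY_JOINTS = 22      # assumed base OpenPose-style body layout
NUM_FACE_JOINTS = 19      # e.g. brows, cheeks, lips (hypothetical split)
NUM_JOINTS = NUM_BODY_JOINTS + NUM_FACE_JOINTS   # 41 joints in total

NOSE = 0  # assumed index of the nose in the body layout

# Facial joints are appended after the body joints ...
FACE_JOINTS = list(range(NUM_BODY_JOINTS, NUM_JOINTS))

# ... and each facial joint is attached to the nose so the augmented graph
# stays connected; a real topology may chain lips/brows instead.
FACE_EDGES = [(NOSE, j) for j in FACE_JOINTS]

def augment_frame(body_xy, face_xy):
    """Concatenate body and facial keypoints for a single frame.

    body_xy: list of 22 (x, y) tuples, face_xy: list of 19 (x, y) tuples.
    Returns a list of 41 (x, y) tuples in the augmented joint order.
    """
    assert len(body_xy) == NUM_BODY_JOINTS and len(face_xy) == NUM_FACE_JOINTS
    return list(body_xy) + list(face_xy)
        </preformat>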
        <p>Figure 2 compares the skeletal connectivity diagrams under different keypoint configurations. While
this modification benefits representation, it also diverges from the original graph topology assumptions
of ST-GCN, as discussed in the next subsection.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Partitioning and Graph Reasoning in ST-GCN</title>
        <p>Effective action recognition requires modeling meaningful spatiotemporal motion patterns. ST-GCN
achieves this by partitioning the skeleton graph into sub-regions, such as centripetal, centrifugal, and
stationary limbs [10]. While effective for coarse-scale action categories, we find this partitioning
suboptimal for micro-motion classification, likely due to the lack of distinguishable limb dynamics and
limited data scale.</p>
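        <p>For reference, the spatial-configuration partitioning of ST-GCN [10] labels each neighbor of a root joint by comparing their distances to the gravity center of the skeleton. The snippet below is a minimal sketch of that rule; the function and variable names are our own rather than the released implementation.</p>
        <preformat>
# Minimal sketch of ST-GCN spatial-configuration partitioning (Yan et al., 2018):
# a neighbor is labeled root/stationary, centripetal (closer to the gravity
# center than the root joint), or centrifugal (farther away).

import numpy as np

def partition_label(joints_xy, root, neighbor):
    """joints_xy: (V, 2) array of joint coordinates for one frame."""
    if neighbor == root:
        return 0                                   # root / stationary subset
    center = joints_xy.mean(axis=0)                # gravity center of the skeleton
    d_root = np.linalg.norm(joints_xy[root] - center)
    d_nbr = np.linalg.norm(joints_xy[neighbor] - center)
    return 1 if d_root > d_nbr else 2              # 1: centripetal, 2: centrifugal
        </preformat>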
        <p>To compensate, we first enrich the model input with additional facial keypoints. These regions,
while less dynamic in raw motion, are semantically aligned with many micro-gesture classes. However,
empirical results reveal that these added keypoints degrade performance in ST-GCN. We hypothesize
this is due to:
• Overfitting from increased graph complexity and limited training data.
• The added nodes not contributing salient temporal or relational motion patterns that ST-GCN is
designed to exploit.
• The peripheral nature and low motion magnitude of facial keypoints reducing their relative
attention in the learned graph features.</p>
        <p>Thus, despite their semantic relevance, these keypoints are possibly treated as noise within the ST-GCN’s
fixed topology and partitioning scheme¹.</p>
        <p>¹We acknowledge that this interpretation may be influenced by our limited experience with ST-GCN and time
constraints during the challenge. We welcome future improvements from the community in this direction.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. PoseC3D with Extended Keypoints</title>
        <p>Given the limitations of ST-GCN, we shift our focus to PoseC3D, a 3D-CNN based method that processes
skeletons as spatiotemporal heatmaps. Even in its baseline form, PoseC3D significantly outperforms
ST-GCN. More importantly, when extended with facial keypoints, PoseC3D benefits substantially in
performance. We attribute this to two key properties:
• Heatmap-based Representation: PoseC3D encodes keypoints as dense heatmaps, which
preserve richer spatial information and allow the network to infer latent movement patterns—even
in low-motion regions. This representation has higher information entropy than raw joint
coordinates, enabling stronger generalization.
• Flexible 3D Convolutions: The spatiotemporal convolutions in PoseC3D operate over the
entire motion volume with uniform treatment of all locations. Unlike in GCNs, the receptive field
and feature propagation are not constrained by predefined skeletal graphs, granting PoseC3D
greater expressivity and robustness to irrelevant noise.</p>
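        <p>As a rough illustration of the heatmap-based representation, the sketch below renders each 2D keypoint as a Gaussian blob and stacks the per-joint maps over time into a volume consumed by a 3D CNN. The resolution and Gaussian width are illustrative choices, not the exact PoseC3D settings.</p>
        <preformat>
# Minimal sketch of a keypoint-to-heatmap volume for a PoseC3D-style model.
# Heatmap size and sigma are illustrative assumptions.

import numpy as np

def joint_heatmap(x, y, h=56, w=56, sigma=2.0):
    """Render a single 2D keypoint as a Gaussian heatmap of shape (h, w)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))

def sequence_volume(keypoints, h=56, w=56, sigma=2.0):
    """keypoints: (T, K, 2) array of per-frame joint coordinates.

    Returns a (K, T, h, w) float array, i.e. per-joint heatmaps stacked over time.
    """
    t, k, _ = keypoints.shape
    volume = np.zeros((k, t, h, w), dtype=np.float32)
    for ti in range(t):
        for ki in range(k):
            x, y = keypoints[ti, ki]
            volume[ki, ti] = joint_heatmap(x, y, h, w, sigma)
    return volume
        </preformat>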
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Temporal Frame Stream Processing</title>
        <p>Temporal modeling is a critical component in micro-gesture recognition, given the subtlety and brevity
of such actions. The default temporal sampling strategy in ST-GCN adopts simple rule-based heuristics:
• When the number of frames exceeds the target length, a continuous subsequence is randomly
cropped from the original sequence;
• When the number of frames is insufficient, zero-padding is applied to extend the sequence to the
required length.</p>
        <p>However, these strategies often fail to preserve the complete temporal structure of micro-gestures.
Random cropping may exclude key motion cues, while zero-padding introduces artificial discontinuities
that disrupt temporal coherence. These limitations are particularly detrimental in micro-motion
scenarios, where discriminative features are both sparse and temporally localized. To address these
issues, we propose a structure-preserving temporal alignment strategy as follows:</p>
        <p>• For over-length sequences, we perform uniform interval sampling, ensuring that both the first
and last frames are retained. This guarantees that the sampled sequence spans the full temporal
range of the original gesture;
• For under-length sequences, we apply linear interpolation to generate intermediate frames,
thereby expanding the sequence to the target length while maintaining temporal smoothness
and continuity.</p>
        <p>Compared to conventional approaches, the proposed strategy offers better coverage of the gesture
trajectory and preserves fine-grained motion dynamics. Empirical results further confirm that this
refinement contributes to improved model stability and recognition accuracy in the micro-gesture
classification task. A minimal sketch of this alignment procedure is given below.</p>
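        <p>The sketch assumes a skeleton clip shaped (T, V, C); the function and variable names are our own rather than the exact pipeline implementation.</p>
        <preformat>
# Minimal sketch of the structure-preserving temporal alignment described above.

import numpy as np

def align_length(clip, target_len):
    """Resample a clip of shape (T, V, C) to exactly target_len frames.

    Over-length clips are sampled at uniform intervals that always keep the
    first and last frames; under-length clips are expanded by linear
    interpolation between neighboring frames.
    """
    t = clip.shape[0]
    if t == target_len:
        return clip
    # Fractional indices spanning the full range from first to last frame.
    idx = np.linspace(0, t - 1, num=target_len)
    if t > target_len:
        # Uniform interval sampling: round to the nearest existing frame.
        return clip[np.round(idx).astype(int)]
    # Linear interpolation for under-length sequences.
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    frac = (idx - lo)[:, None, None]
    return (1.0 - frac) * clip[lo] + frac * clip[hi]
        </preformat>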
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset: iMiGUE [7]</title>
        <p>
          The Micro-Gesture Understanding and Emotion Analysis (iMiGUE) dataset consists of 32 micro-gesture
(MG) categories and one additional non-MG class. All data are collected from post-match press
conference videos of professional tennis players. The dataset comprises a total of 18,499 annotated MG
samples, which are labeled from 359 long video sequences ranging in duration from 0.5 to 26 minutes,
totaling approximately 3,765,600 frames. iMiGUE provides two modalities for each sample: (1) RGB
videos, and (2) 2D skeletal joint coordinates extracted using the OpenPose pose estimation framework.
This multi-modal design enables both appearance-based and skeleton-based gesture analysis, supporting
the development of robust models for emotion understanding based on subtle behavioral cues.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Metrics and Implementation Details</title>
        <p>We evaluate micro-gesture classification performance using the Top-1 Accuracy, which measures the
percentage of samples for which the predicted label exactly matches the ground truth. Our method
is implemented based on the open-source PySkl toolbox [15], and the training pipeline incorporates
a loss function inspired by the winning solution of the MiGA 2023 challenge. The model is trained
using Stochastic Gradient Descent (SGD) with a momentum of 0.9, a weight decay of 3 × 10⁻⁴, and a
batch size of 24. The initial learning rate is set to 0.1/3, and we adopt a Cosine Annealing learning rate
schedule. We use ResNet3D-SlowOnly as the feature extraction backbone and I3D as the classification
head. For multi-stream ensemble modeling, which integrates joint and limb modalities, we apply a
weighted fusion scheme with a ratio of 1:1, ensuring equal contribution from both sources of motion
information. An illustrative sketch of these optimization settings is given below.</p>
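        <p>The snippet is a plain-PyTorch rendering of the optimization settings listed above. The model, number of classes, and epoch count are placeholders; the actual experiments are run through the PySkl training pipeline rather than this loop.</p>
        <preformat>
# Illustrative sketch of the stated optimization settings: SGD with momentum 0.9,
# weight decay 3e-4, initial learning rate 0.1/3, and cosine annealing.
# The model and epoch count below are placeholders.

import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(512, 33)   # placeholder head: 32 MG classes + 1 non-MG class
num_epochs = 24              # assumed schedule length

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1 / 3,
    momentum=0.9,
    weight_decay=3e-4,
)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... one pass over a loader with batch size 24 would go here ...
    optimizer.step()     # illustrative; normally called once per batch
    scheduler.step()     # anneal the learning rate once per epoch
        </preformat>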
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experiments</title>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Quantitative Results</title>
        <p>²Available at: https://www.kaggle.com/competitions/the-3rd-mi-ga-ijcai-challenge-track-1/leaderboard</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we present the solution developed for the MiGA Challenge held at IJCAI 2025. Throughout
the process, we employed both the ST-GCN and PoseC3D models, comparing their similarities and
differences to explore the relationship between convolutional approaches and sequential data. Ultimately,
by leveraging joint and limb modality data and adopting PoseC3D as the backbone—combined with the
semantic embedding loss [16] proposed in 2023—our method achieved third place with a Top-1 accuracy
of 67.01%. For this task, we recognize that there remains ample room for further research. Moving
forward, we plan to address the challenges from additional perspectives, such as improved denoising
techniques, strategies for handling imbalanced data, and the integration of RGB video streams, among
others.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
      <p>This work has been supported by the National Natural Science Foundation of China (Grant No. 62472139)
and by the Anhui Provincial Natural Science Foundation, China (Grant No. 2408085QF191).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT in order to: paraphrase and reword
sentences, and check for grammar and spelling errors. After using this tool/service, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] X. Lu, S. Zhao, L. Cheng, Y. Zheng, X. Fan, M. Song, Mixed resolution network with hierarchical motion modeling for efficient action recognition, Knowledge-Based Systems 294 (2024) 111686.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] X. Lu, Y. Hao, L. Cheng, S. Zhao, Y. Liu, M. Song, Mixed attention and channel shift transformer for efficient action recognition, ACM Transactions on Multimedia Computing, Communications and Applications 21 (2025) 1–20.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Tang, J. He, D. Guo, Y. Wei, F. Li, R. Hong, Sign-idd: Iconicity disentangled diffusion for sign language production, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 7266–7274.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Tang, J. He, L. Cheng, J. Wu, D. Guo, R. Hong, Discrete to continuous: Generating smooth transition poses from sign language observations, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3481–3491.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Y. Zhang, L. Cheng, Y. Wang, Z. Zhong, M. Wang, Towards micro-action recognition with limited annotations: An asynchronous pseudo labeling and training approach, arXiv preprint arXiv:2504.07785 (2025).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] C. Fang, L. Cheng, Y. Mao, D. Zhang, Y. Fang, G. Li, H. Qi, L. Jiao, Separating noisy samples from tail classes for long-tailed image classification with label noise, IEEE Transactions on Neural Networks and Learning Systems (2023).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, G. Zhao, iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10631–10642.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] H. Chen, H. Shi, X. Liu, X. Li, G. Zhao, SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis, International Journal of Computer Vision 131 (2023) 1346–1366.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Duan, Y. Zhao, K. Chen, D. Lin, B. Dai, Revisiting skeleton-based action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A deep visual-semantic embedding model, Advances in Neural Information Processing Systems 26 (2013).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M.-C. Yeh, Y.-N. Li, Multilabel deep visual-semantic embedding, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2019) 1530–1536.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Z. Wei, J. Zhang, Z. Lin, J.-Y. Lee, N. Balasubramanian, M. Hoai, D. Samaras, Learning visual emotion representations from web data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13106–13115.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. P. Filntisis, N. Efthymiou, G. Potamianos, P. Maragos, Emotion understanding in videos through body, context, and visual-semantic embedding loss, in: Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, Springer, 2020, pp. 747–755.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] H. Duan, J. Wang, K. Chen, D. Lin, PySkl: Towards good practices for skeleton action recognition, in: Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7351–7354.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] K. Li, D. Guo, G. Chen, X. Peng, M. Wang, Joint skeletal and semantic embedding loss for micro-gesture classification, arXiv preprint arXiv:2307.10624 (2023).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>