<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Santosh Patapati</string-name>
          <email>santosh@cyrionlabs.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trisanth Srinivasan</string-name>
          <email>trisanth@cyrionlabs.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amith Adiraju</string-name>
          <email>aadiraju@cyrionlabs.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cyrion Labs</institution>
          ,
          <addr-line>Texas</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-gesture recognition</kwd>
        <kwd>Vision-language models</kwd>
        <kwd>CLIP adaptation</kwd>
        <kwd>Pose-guided fusion</kwd>
        <kwd>Multi-modal deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Micro-gestures (MGs) are subtle, spontaneous body movements that can reveal hidden emotional
states [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], often occurring when people attempt to suppress their true feelings. Unlike overt actions
or expressive gestures, micro-gestures involve minute motions (e.g., slight fidgeting, brief facial or
limb movements) that are short in duration and low in amplitude, making them hard to detect and
classify [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The analysis of micro-gestures has gained traction in affective computing and human
behavior understanding because these involuntary cues provide valuable insight into a person’s internal
state. Automatic recognition of micro-gestures is therefore important for applications in psychology,
human-computer interaction, and emotion analysis [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>In this paper, we present CLIP-MG, a novel multi-modal framework for micro-gesture classification on
iMiGUE. Our approach builds upon previous work by incorporating pose (skeleton) data in a principled
way. The main contributions are summarized as follows:</p>
      <p>1. We develop a system that uses human pose cues to guide semantic query extraction from
video frames. The skeleton information focuses the CLIP visual encoder on the regions of
subtle motion, producing a semantic query embedding rich in pose-relevant features.</p>
      <p>2. We introduce a gated fusion mechanism to combine visual and skeleton representations effectively.
Our gated fusion learns to weight and integrate the two modalities, allowing pose features to
adaptively modulate the visual features before and during the cross-attention process.</p>
      <p>3. We extend CLIP to a multi-modal setting. The pose-based query is fed into the CLIP transformer
for cross-attention over semantically significant visual token features, constraining the model to
attend to gesture-relevant parts. The result is a fused representation that carries both semantic
and motion-specific information for classification.</p>
      <p>4. We evaluate CLIP-MG on the iMiGUE micro-gesture dataset and additionally perform numerous
ablation studies to quantify how much each proposed component improves performance. The
results of our ablation studies provide insights for future researchers and future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. The iMiGUE Dataset</title>
        <p>Recent progress in this area has been driven by the introduction of specialized datasets for micro-gesture
understanding. In particular, iMiGUE is a large-scale video dataset introduced by Liu et al. [6] for
identity-free micro-gesture understanding and emotion analysis. The iMiGUE dataset consists of video
footage of tennis players during post-match interviews, with detailed frame-level annotations of various
micro-gestures. The dataset contains 72 subjects (split into 37 for training and 35 for testing in a
cross-subject protocol) and a total of 33 gesture classes.</p>
        <p>One thing to note is that the class distribution is highly imbalanced: 28 of the 33 classes are tail classes
with relatively few samples, collectively accounting for less than 60% of the data. This
long-tailed distribution, combined with the subtlety and high intra-class variability of micro-gestures,
makes the recognition task extremely challenging [7].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Signal Processing &amp; Machine Learning Techniques for Micro-Gesture Recognition</title>
        <p>To tackle these challenges, the research community has organized the Micro-Gesture Analysis (MiGA)
challenges in recent years [7, 8]. These competitions have spurred the development of novel
multimodal approaches that leverage both video (RGB) and skeleton (pose) modalities for micro-gesture
recognition. In the 2024 MiGA Challenge, for example, all top-performing methods integrated pose
information alongside RGB frames. The winning entry by Chen et al. introduced a prototype-based
learning approach with a two-stream 3D CNN (PoseConv3D) backbone for RGB and pose, cross-modal
attention fusion, and a prototypical refinement component to calibrate ambiguous samples [ 9]. This
method achieved a Top-1 accuracy of 70.25% on the iMiGUE test set, substantially outperforming earlier
approaches. The second-place method by Huang et al. proposed a multi-scale heterogeneous ensemble
network (M2HEN) combining a 3D convolutional model and a Transformer for feature diversity, reaching
70.19% accuracy [10]. Another notable approach by Wang et al. leveraged the vision-language model
CLIP: they used a frozen CLIP as a teacher network for RGB frames and injected CLIP-derived text
embeddings into a pose-based model, achieving 68.9% accuracy with an ensemble of RGB, joint, and
limb pose streams. These efforts demonstrate that multi-modal fusion and semantic knowledge transfer
are highly important for improving micro-gesture recognition [11].</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.3. CLIP-Based Video Understanding</title>
        <p>Meanwhile, in the broader action recognition field, researchers have explored using pre-trained
vision-language models like CLIP for video understanding. CLIP (Contrastive Language-Image Pre-training)
[12] has shown powerful visual feature representations aligned with semantics via natural language
supervision. However, straightforward fine-tuning of CLIP on video data can neglect smaller semantic
information. To address this, Quan et al. proposed Semantic-Constrained CLIP (SC-CLIP) [13]. SC-CLIP
adapts CLIP to video by generating a compact semantic query from dense visual tokens and using
cross-attention to refocus the model on those action-relevant semantics. This "constrains" CLIP's
attention to discriminative features and yields stronger zero-shot and fine-grained recognition.</p>
        <p>SC-CLIP demonstrates that directing attention to semantically meaningful regions can improve
fine-grained video understanding. Micro-gestures, despite their low movement amplitude, still
manifest as subtle visual semantics. Skeleton key-points give precise spatio-temporal anchors
(hands, face, shoulders) that show where and when these cues occur. Thus, we design a pose-guided
semantic attention mechanism that uses skeletal cues to steer CLIP towards where the gesture is taking
place, creating a query that captures the subtle semantics of micro-gestures.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our model (illustrated in Figure 1) has several components working in sequence: (1) a visual encoder
(based on CLIP's vision transformer) processes the RGB frames, (2) a skeleton encoder processes the
pose sequence, (3) pose-based semantic query generation produces semantic queries from visual features
guided by pose features, (4) gated fusion and semantics-based cross-attention fuse the modalities and
refine the representation, and (5) a classification head outputs the predicted gesture label. In the
following, we detail each component and the overall pipeline.</p>
      <sec id="sec-3-1">
        <title>3.1. Visual Encoder</title>
        <p>We adopt the OpenAI CLIP ViT-B/16 image tower [14] with the standard 224 × 224 input resolution
and 16 × 16 patching, which yields N = 196 patch tokens plus one [CLS] token per frame. The internal
transformer width is 768 dimensions, while CLIP's projection head maps the final [CLS] embeddings to
a 512-dimensional space. From each micro-gesture clip we uniformly sample T′ = 8 frames. Formally,
for frame t we obtain the token sequence
Z_t = { z_{t,CLS}, z_{t,1}, …, z_{t,196} },  z_{t,CLS} ∈ ℝ^768.</p>
        <p>During training we freeze the first 10 of the 12 ViT blocks and fine-tune only the last two blocks
together with our added components [15]. Temporal information is aggregated by average-pooling the
eight [CLS] embeddings to produce the per-clip visual feature.</p>
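<p>The frame sampling and temporal pooling above can be sketched as follows. This is a minimal illustration rather than the training code; the helper names are hypothetical, and the per-frame [CLS] embeddings are assumed to already be in CLIP's 512-dimensional projection space.</p>

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> list:
    # Uniformly sample T' = 8 frame indices from a clip of arbitrary length,
    # taking the centre of each of the num_samples equal segments.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def clip_visual_feature(cls_tokens: np.ndarray) -> np.ndarray:
    # cls_tokens: (T', 512) per-frame [CLS] embeddings.
    # The per-clip visual feature is the average over the T' = 8 frames.
    return cls_tokens.mean(axis=0)
```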
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Skeleton Encoder</title>
        <p>We use the OpenPose format [16] to extract skeleton features. Given a clip, we first sample T′ = 32
pose frames:
P = { p^(1), …, p^(32) },  p^(t) ∈ ℝ^{18×2},
where each p^(t) holds the 2D coordinates of the 18 joints in frame t.</p>
        <p>To stay time-aligned with the eight RGB frames, the 32 pose frames are grouped into eight
non-overlapping four-frame windows centered on frames t_1, …, t_8. The heat-maps of each window are
average-pooled along the temporal axis. This results in eight pose volumes that correspond one-to-one with the
RGB inputs.</p>
        <p>Each joint is then rasterized into a 256 × 256 canvas as a 2D Gaussian [17]:
H^(t)_{j,x,y} = exp( −( (x − x_j^(t))² + (y − y_j^(t))² ) / (2σ²) ),  σ = 2.5 px,
where (x_j^(t), y_j^(t)) is the j-th joint of frame t. Stacking all joints and time-steps yields a 4D tensor.</p>
        <p>The skeleton encoder reduces this tensor to a 256-dimensional clip descriptor h^(s). A linear
projection W_p ∈ ℝ^{512×256} maps it to d = 512 so the pose feature matches the CLIP visual dimension:
h̃^(s) = W_p h^(s).</p>
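<p>The Gaussian rasterization step can be sketched as below, using the σ = 2.5 px value from the text; function names are illustrative only, and the joint coordinates are assumed to already be scaled to the 256 × 256 canvas.</p>

```python
import numpy as np

def joint_heatmap(joint_xy, size=256, sigma=2.5):
    # Rasterize one joint (x, y) as a 2D Gaussian on a size x size canvas.
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def pose_volume(joints, size=256, sigma=2.5):
    # joints: (18, 2) array of joint coordinates for one frame.
    # Returns the stacked per-joint heatmaps, shape (18, size, size).
    return np.stack([joint_heatmap(j, size, sigma) for j in joints])
```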
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Pose-Guided Semantic Query Generation</title>
        <p>The proposed semantic-query approach extracts a representation of the video’s most important cues
with the support of the skeleton features. It does so (1) spatially, by concentrating on visual tokens that
are near body parts exhibiting motion, and (2) temporally, by giving higher weight to frames where the
pose dynamics show that a micro-gesture is taking place.</p>
        <p>Let z_{t,i} be the set of patch embeddings from all selected frames (excluding the global tokens). We
first identify a subset of these visual tokens that are relevant to the micro-gesture. "Pose guidance" is
applied by using the skeleton features to weight or select visual tokens:
• We compute an attention mask over image patches based on the distance of each patch to the
nearest skeletal joint position. If a patch lies close to a joint that is moving significantly, it receives
a higher weight. For example, if J_{t,j} are the coordinates of joint j in frame t, we can define a
relevance score
w_{t,i} = exp( −min_j ‖pos(t, i) − J_{t,j}‖² / τ² ),
where pos(t, i) is the spatial location of patch i and τ controls the spatial scale. This yields
weights w_{t,i} ∈ [0, 1] that highlight patches near active joints.
• Additionally, we leverage the skeleton encoder's output h^(s) as a global descriptor of the motion.
We project h^(s) to the same dimension d and use it to modulate the visual tokens via a simple
gating:
z̃_{t,i} = g_{t,i} z_{t,i},  where g_{t,i} = σ(⟨W h^(s), z_{t,i}⟩).
Here g_{t,i} is the sigmoid of the dot-product between the projected pose feature W h^(s) and the visual
token z_{t,i}, so it down-weights any token not well aligned with the pose direction.</p>
        <p>After computing the pose-based weights w_{t,i}, we flatten the full set of visual tokens
{ z_{t,i} ∣ t = 1, …, T, i = 1, …, N }, with T = 8, N = 196, and M = T × N = 1568. We then aggregate
them into a single d-dimensional semantic query q ∈ ℝ^512 by weighted mean pooling:
q = ( Σ_{t=1}^{T} Σ_{i=1}^{N} w_{t,i} z_{t,i} ) / ( Σ_{t=1}^{T} Σ_{i=1}^{N} w_{t,i} ).</p>
        <p>This pose-weighted query  thus encapsulates the most relevant gesture semantics and is passed to
the cross-attention component to guide the final feature fusion and classification.</p>
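<p>The relevance weighting and weighted mean pooling above can be sketched as follows. This is a simplified single-frame sketch; the helper names and the τ value are illustrative assumptions, not fixed by the paper.</p>

```python
import numpy as np

def patch_relevance(patch_pos, joint_pos, tau=32.0):
    # w_{t,i} = exp(-min_j ||pos(t,i) - J_{t,j}||^2 / tau^2) from Sec. 3.3.
    # patch_pos: (N, 2) patch-centre coords; joint_pos: (J, 2) joint coords.
    d2 = ((patch_pos[:, None, :] - joint_pos[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2.min(axis=1) / tau ** 2)

def semantic_query(tokens, weights):
    # Weighted mean pooling of the flattened visual tokens into one query q.
    # tokens: (M, d); weights: (M,) pose-based relevance weights.
    return (weights[:, None] * tokens).sum(axis=0) / weights.sum()
```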
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Gated Multi-modal Fusion</title>
        <p>
          Before feeding the query into the CLIP transformer, we further integrate the pose information via
a gated fusion mechanism inspired by Arevalo et al. [20]. The goal here is to merge the skeleton
representation with the visual representation in a way that lets the model selectively attend to one or
the other modality as needed. We implement gated fusion at two points in the pipeline:
• We fuse the skeleton encoder output h^(s) with the semantic query q. First we compute a gating
vector
g = σ(W_g h^(s)),
where W_g ∈ ℝ^{d×d} is a learned projection and σ is the sigmoid. We then modulate the query by
q̃ = g ⊙ q + h̃^(s) ⊙ (1 − g).
In our implementation g is element-wise, so each gate value lies in [0, 1], allowing
pose-aligned dimensions to be amplified or suppressed.
• Similarly, we fuse the pose descriptor h^(s) into the CLIP encoder's intermediate token sequence.
Let
V = { v_1, v_2, …, v_N }
be the set of visual features from CLIP's penultimate layer (these serve as keys and values in
cross-attention). We then compute a second gating vector
γ = σ(W_v h^(s)) ∈ ℝ^d,
and apply it element-wise:
ṽ_i = γ ⊙ v_i,  i = 1, …, N.
This global gate highlights or suppresses certain channels based on pose.</p>
        <p>These gating operations are learned end-to-end and ensure that the multi-modal information is
blended before the cross-attention step. The gating is soft (continuous values between 0 and 1), so the
model can learn to rely on pose heavily in some scenarios or ignore it in others. This adaptability is
important because pose data can sometimes be noisy or incomplete (e.g., occluded joints), so a static
fusion might hurt performance if pose is trusted blindly. Our gated fusion allows the network to fall
back to visual cues when pose is uncertain, and vice versa.</p>
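<p>A minimal sketch of the two gates follows, assuming a GMU-style convex combination between the query and the projected pose descriptor (after Arevalo et al.); the function names and zero-initialized weights in the example are hypothetical.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse_query(q, h_pose, W_g):
    # GMU-style gate: g in (0,1)^d chooses, per channel, between the visual
    # query q and the projected pose descriptor h_pose.
    g = sigmoid(W_g @ h_pose)
    return g * q + (1.0 - g) * h_pose

def gate_tokens(tokens, h_pose, W_v):
    # Second gate: a single vector gamma scales every token's channels.
    gamma = sigmoid(W_v @ h_pose)
    return tokens * gamma  # tokens: (N, d); gamma: (d,) broadcasts per token
```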
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Cross-Attention with Semantic Query</title>
        <p>Next, we apply a cross-attention mechanism driven by our pose-guided query. We insert the query
vector   as an extra token into the final transformer layer of the CLIP visual encoder, allowing it to
attend over the gated visual token set  . This focused attention refines the representation by pooling
features most relevant to the detected micro-gesture.</p>
        <p>Consider the transformer architecture of CLIP's visual encoder. Let
K = V = { ṽ_1, …, ṽ_N },  ṽ_i ∈ ℝ^d,  q ∈ ℝ^{1×d},
where the gated visual tokens serve as both keys and values.</p>
        <p>We then insert our pose-guided query q into the final layer and compute cross-attention:
q_out = Attn(q, K, V) = softmax( q K^⊤ / √d ) V,
which yields q_out ∈ ℝ^{1×d}.</p>
        <p>The cross-attention computes an output query embedding q_out that is a weighted sum of the values V,
with weights determined by the compatibility of q with the keys K; the softmax produces a 1 × N vector of
attention weights. The resulting q_out (of dimension 1 × d) is effectively a semantic-aware video representation
that has "pooled" information from the visual tokens, biased by the semantic content of q and thereby by the
pose cues we injected. In other words, q_out should ideally encode the crucial features needed to distinguish
the micro-gesture class.</p>
        <p>This unique attention mechanism forces the model to concentrate on what is important for the
gesture. It acts as a form of feature selection: among the many visual features of a scene (some possibly
irrelevant background or person identity cues), it emphasizes those that correlate with the action
semantics. In our case, because  was guided by pose, the attention is further narrowed to regions of
actual motion or posture change.</p>
        <p>After cross-attention, we obtain  out which we consider as the fused video representation for the
whole clip.</p>
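<p>The single-query cross-attention step can be sketched as below; this is a bare scaled-dot-product illustration without the learned query/key/value projections of a full transformer layer, and the function names are illustrative.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def query_cross_attention(q, K, V):
    # q_out = softmax(q K^T / sqrt(d)) V with a single 1 x d query (Sec. 3.5).
    d = q.shape[-1]
    attn = softmax(q @ K.T / np.sqrt(d))  # (N,) attention weights over tokens
    return attn @ V                        # (d,) fused clip representation
```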
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Classification and Training Objective</title>
        <p>The final stage is the classification of the micro-gesture. We feed the fused representation q_out (dimension
d) into a classifier head, implemented as a simple two-layer MLP followed by softmax. This yields a
probability distribution ŷ ∈ ℝ^C over the C gesture classes (here C = 33 for iMiGUE). We train the model
using a supervised classification objective. The primary loss is the cross-entropy between the predicted
distribution and the ground-truth label. Given a training sample i with true class label y_i (represented
as a one-hot vector) and predicted probabilities ŷ_i, the loss is
L_CE = −Σ_{c=1}^{C} y_{i,c} log ŷ_{i,c},
which is backpropagated through the classifier head, our added components, and
the parts of the CLIP encoder we allow to be fine-tuned.</p>
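<p>For a one-hot target, the cross-entropy reduces to the negative log-probability of the true class, as in this small sketch (the epsilon term is a standard numerical-stability assumption, not from the paper):</p>

```python
import numpy as np

def cross_entropy(probs, label, eps=1e-12):
    # L = -sum_c y_c log(p_c); with a one-hot target this is -log p_label.
    return -np.log(probs[label] + eps)
```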
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Evaluation Protocol</title>
        <p>We conduct experiments on the iMiGUE dataset, focusing on the micro-gesture classification task. As
described earlier, iMiGUE contains 33 micro-gesture classes collected from interview videos of tennis
players. These gestures include subtle body-language cues such as pressing lips or touching one’s jaw.
We set aside a portion of the training data (20%) to serve as a local validation set for our experiments.
This local validation is used for model selection and ablation studies due to the unavailability of a
separate testing environment at the time of experimentation. However, final results on the test set are
referenced for comparison with other approaches [8, 7].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Comparisons</title>
        <p>As shown in Table 1, CLIP-MG achieves a Top-1 accuracy of 61.82%, outperforming a range of
single-modality baselines [21, 22, 23, 24]. Notably, CLIP-MG even performs on par with standalone architectures
such as Video Swin-B [25] and PoseConv3D [19] despite using a largely frozen CLIP backbone and a
compact pose encoder. This shows that steering CLIP’s attention with pose-guided semantic queries
yields more discriminative features for fine-grained micro-gestures. However, the proposed architecture
does not set a new state-of-the-art in this area. It comes close in performance to the dense-sparse
fusion network DSCNet [26] (62.50%), but falls behind architectures presented in previous editions of
the MiGA challenge [9, 10, 11]. These results motivate future work to explore richer query adaptation
and improved temporal fusion to close the gap with top models.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>We performed comprehensive ablation experiments to validate the contribution of each component in
CLIP-MG. Table 2 reports Top-1 accuracy on our validation split.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Without Pose Branch</title>
          <p>Here, we completely eliminate the pose branch to see the benefit of adding pose at all. The model gave
45.30% (–16.52 pp) accuracy. Thus, adding the pose branch (with our fusion and guidance) yields a
16.52-percentage-point gain, which demonstrates that skeleton data carries complementary information for the task.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Without Pose Guidance</title>
          <p>In this variant, we remove the pose influence from the semantic query generation. The query is
generated purely by clustering visual tokens without using skeleton data. The semantics-based
cross-attention still operates, but only on the visual-based query. We found that the accuracy dropped to
51.23% (–10.59 pp). This confirms that pose guidance is essential and provides a significant boost,
which makes sense: without pose, the model may attend to irrelevant semantics or
background context and miss subtle gesture cues.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Without Semantic Cross-Attention</title>
          <p>Here, we skip the semantics-based cross-attention (SCCA). Instead, we simply concatenate the global visual [CLS] embedding with the pose
feature and feed that to a classifier. This essentially tests a late-fusion approach without our semantic
query mechanism. The accuracy was 53.17% (–8.65 pp). This indicates that the semantic query and
cross-attention are effective at focusing on important features that a flat concatenation would miss.</p>
        </sec>
        <sec id="sec-4-3-4">
          <title>4.3.4. Without Gated Fusion</title>
          <p>In this ablation, we disable the gating in both the query generation and the visual token modulation.
We still generate a query using pose (via simple concatenation of average visual token and pose feature)
and perform cross-attention. The accuracy achieved was 60.08% (–1.74 pp), a modest drop. This shows
that gating helps but is not as critical as the presence of pose info or semantics-based cross-attention.
The gating mostly fine-tunes the balance between modalities.</p>
        </sec>
        <sec id="sec-4-3-5">
          <title>4.3.5. Discussion</title>
          <p>These ablations show that each component of CLIP-MG plays a supporting role, although certain
components are more important than others. Dropping the entire pose branch drives accuracy down
to 45.30% (–16.52 pp). This demonstrates how much discriminative information there is within the
skeletal signal. Removing pose guidance lowers accuracy from 61.82% to 51.23% (–10.59 pp), showing
that skeletal information is very important for localizing subtle joint motions, as they act as an attention
prior [27] that focuses the visual stream towards the regions where micro-gestures occur. Eliminating
cross-attention drops accuracy to 53.17% (–8.65 pp), which indicates that without a mechanism to
selectively pool pose-weighted tokens, the model may struggle to tell apart very similar gestures.
Finally, disabling gated fusion yields 60.08% (–1.74 pp), which indicates that adaptively balancing pose
and visual information slightly improves the robustness of the architecture.</p>
          <p>Taken together, these results show how the different components effectively complement each other.
Pose cues localize the gesture, cross-attention extracts the relevant semantics, and gating balances both
streams. We find the highest performance when all the components are combined.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Works</title>
      <p>We introduced CLIP-MG, a pose-guided, multi-modal CLIP architecture for micro-gesture recognition
on the iMiGUE benchmark. By guiding CLIP’s visual attention with skeleton-based spatial priors,
generating compact semantic queries, and fusing pose and appearance via a learnable gate, CLIP-MG
extracts subtle, discriminative features that simple RGB or pose-only models cannot recognize. Our
model achieves 61.82% Top-1 accuracy, outperforming most single-modality baselines and performing
on par with strong 3D-CNN and vision-transformer approaches. Extensive ablation studies confirm
that each component provides a measurable benefit. The experiments provide insights into how each
component interacts with and complements others, which highlights important design patterns that can
inform future model development in micro-gesture classification and similar fine-grained recognition
tasks. Our findings demonstrate the value of integrating multimodal and semantic information to
address challenging visual recognition problems.</p>
      <p>
        Our future work will explore richer temporal approaches and data strategies to close the gap between
CLIP-MG and more recent state-of-the-art models [9]. First, integrating sequence models (temporal
transformers or recurrent layers over cross-attention outputs) should capture patterns that static
sampling loses. Second, video motion magnification [28] could amplify imperceptible movements,
which would help with both pose tracking and visual encoding. Third, joint pre-training on related
action and gesture datasets and weakly- or self-supervised learning could improve feature robustness [29]. Finally,
regarding accuracy, we plan to incorporate uncertainty-aware gating for noisy skeletons and
class-balanced or prototype-based calibrations to address long-tail imbalance [30]. To improve and better
evaluate the explainability of the model, we will incorporate gradient-weighted class activation mapping
(Grad-CAM) [31] and more recent attention-aware token-filtering approaches [32]. Currently, the
proposed architecture suffers from its relatively low speed on commodity hardware. To
address this issue, we plan to experiment with several multimodal compression and optimization
algorithms for more efficient computing [33, 34, 35, 36]. We plan to train an improved version of our
architecture for downstream tasks on the DAIC-WoZ dataset [
        <xref ref-type="bibr" rid="ref5">5, 37, 38</xref>
        ] for low-level mental health
analysis. These different research directions are promising for pushing pose-guided CLIP models closer
to (and beyond) human-level understanding of the subtlest gestures.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o to edit the paper, checking for grammar
and spelling mistakes. GPT-4o was also utilized to revise the draft for brevity and improved flow. After
using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility
for the publication’s content.
for the publication's content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[6] X. Liu, H. Shi, H. Chen, Z. Yu, X. Li, G. Zhao, iMiGUE: An identity-free video dataset for
micro-gesture understanding and emotion analysis, in: CVPR, 2021.
[7] C. Haoyu, et al., The 2nd challenge on micro-gesture analysis for hidden emotion understanding
(MiGA) 2024: Dataset and results, in: Proceedings of the IJCAI 2024 Workshop Challenge on
Micro-gesture Analysis for Hidden Emotion Understanding (MiGA 2024), co-located with the 33rd
International Joint Conference on Artificial Intelligence (IJCAI 2024), 2024.
[8] G. Zhao, et al., The workshop challenge on micro-gesture analysis for hidden emotion
understanding (MiGA), in: Proceedings of the IJCAI 2023 Workshop Challenge on Micro-gesture
Analysis for Hidden Emotion Understanding (MiGA 2023), co-located with the 32nd International Joint
Conference on Artificial Intelligence (IJCAI 2023), 2023.
[9] G. Chen, et al., Prototype learning for micro-gesture classification, in: Proceedings of the
IJCAI 2024 Workshop Challenge on Micro-gesture Analysis for Hidden Emotion Understanding
(MiGA 2024), co-located with the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024), 2024.
[10] H. Huang, et al., Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble
network, in: Proceedings of the IJCAI 2024 Workshop Challenge on Micro-gesture
Analysis for Hidden Emotion Understanding (MiGA 2024), co-located with the 33rd International Joint
Conference on Artificial Intelligence (IJCAI 2024), 2024.
[11] Y. Wang, et al., A multimodal micro-gesture classification model based on CLIP, in: Proceedings of the
IJCAI 2024 Workshop Challenge on Micro-gesture Analysis for Hidden Emotion
Understanding (MiGA 2024), co-located with the 33rd International Joint Conference on Artificial
Intelligence (IJCAI 2024), 2024.
[12] A. Radford, et al., Learning transferable visual models from natural language supervision, ICML (2021).
[13] Z. Quan, et al., Semantic matters: A constrained approach for zero-shot video action recognition,
Pattern Recognition, 2025.
[14] A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at
scale, in: ICLR, 2021.
[15] H. Xu, et al., VideoCLIP: Learning video representations from text and clips, arXiv (2021).
[16] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2D pose estimation using
part affinity fields, in: CVPR, 2017.
[17] X. Zhou, et al., On heatmap representation for 6D pose estimation, in: ICCV, 2019.
[18] S. V. Patapati, T. Srinivasan, H. Musku, A. Adiraju, A framework for ECA-based psychotherapy, 2025.
[19] J. Zhang, Z. Huang, Y. Chen, PoseConv3D: Revisiting skeleton-based action recognition, in:
European Conference on Computer Vision (ECCV), 2020.
[20] J. Arevalo, et al., Gated multimodal units for information fusion, in: ICLR, 2020.
[21] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action
recognition, in: AAAI Conference on Artificial Intelligence, 2018.
[22] S. Liu, et al., MS-G3D: Multi-scale graph convolution for skeleton-based action recognition, in:
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[23] B. Zhou, A. Andonian, A. Oliva, A. Torralba, Temporal relational reasoning in videos, in:
European Conference on Computer Vision (ECCV), 2018.
[24] J. Lin, C. Gan, S. Han, TSM: Temporal shift module for efficient video understanding, in:
IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[25] Z. Liu, et al., Video swin transformer: Hierarchical vision transformer for video recognition, in:
IEEE/CVF International Conference on Computer Vision (ICCV), 2022.
[26] Q. Cheng, et al., DSCNet: Dense-sparse complementary network for human action recognition,
Expert Systems with Applications (2024).
[27] H. Zhang, et al., Look closer to see better: Recurrent attention convolutional neural network for
fine-grained image recognition, CVPR (2019).
[28] N. Wadhwa, et al., Eulerian video magnification for revealing subtle changes in the world, in:
SIGGRAPH, 2013.
[29] T. Han, et al., Self-supervised video representation learning with neighborhood context aggregation,
ECCV (2020).
[30] B. Kang, et al., Decoupling representation and classifier for long-tail recognition, in: ICLR, 2020.
[31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual
explanations from deep networks via gradient-based localization, in: 2017 IEEE International Conference
on Computer Vision (ICCV), IEEE, 2017, pp. 618–626.
[32] T. Naruko, H. Akutsu, Speed-up of vision transformer models by attention-aware token filtering,
2025. arXiv:2506.01519v1, preprint.
[33] Y. Omri, P. Shroff, T. Tambe, Token sequence compression for efficient multimodal computing,
2025. arXiv:2504.17892v1.
[34] L. Lei, J. Gu, X. Ma, C. Tang, J. Chen, T. Xu, Generic token compression in multimodal large
language models from an explainability perspective, 2025. arXiv:2506.01097.
[35] X. Tan, P. Ye, C. Tu, J. Cao, Y. Yang, L. Zhang, D. Zhou, T. Chen, TokenCarve: Information-preserving
visual token compression in multimodal large language models, 2025. arXiv:2503.10501.
[36] J. Cao, P. Ye, S. Li, C. Yu, Y. Tang, J. Lu, T. Chen, MADTP: Multimodal alignment-guided dynamic
token pruning for accelerating vision-language transformer, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[37] J. Gratch, R. Artstein, G. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault,
S. Marsella, D. Traum, S. Rizzo, L.-P. Morency, The distress analysis interview corpus of human
and computer interviews, in: Proceedings of the Ninth International Conference on Language
Resources and Evaluation (LREC), European Language Resources Association (ELRA), 2014, pp.
3123–3128.
[38] F. Ringeval, B. Schuller, M. Valstar, N. Cummins, R. Cowie, L. Tavabi, M. Schmitt, S. Alisamir,
S. Amiriparian, E.-M. Messner, S. Song, S. Liu, Z. Zhao, A. Mallol-Ragolta, Z. Ren, M. Soleymani,
M. Pantic, Avec 2019 workshop and challenge: State-of-mind, detecting depression with ai, and
cross-cultural afect recognition, in: Proceedings of the 9th International on Audio/Visual Emotion
Challenge and Workshop, AVEC ’19, Association for Computing Machinery, New York, NY, USA,
2019, p. 3–12. URL: https://doi.org/10.1145/3347320.3357688. doi:10.1145/3347320.3357688.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          ,
          <article-title>Observing and coding facial expression of emotion</article-title>
          ,
          <source>Handbook of Emotion</source>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Funes</surname>
          </string-name>
          , et al.,
          <article-title>Micro-expression recognition: A survey</article-title>
          ,
          <source>in: FG</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantic</surname>
          </string-name>
          , Affective multimedia databases: Affective video databases,
          <source>Handbook of Affective Computing</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Picard</surname>
          </string-name>
          ,
          <article-title>Automatic prediction of human behavior in social settings</article-title>
          ,
          <source>in: IUI</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Patapati</surname>
          </string-name>
          ,
          <article-title>Integrating large language models into a tri-modal architecture for automated depression classification</article-title>
          ,
          <year>2024</year>
          . arXiv:2407.19340v5, preprint.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>