<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Micro-gesture Classification via Global-Aware Importance Estimation in Vision Transformer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xin Hu</string-name>
          <email>xinhu123@stu.xidian.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chenyang Pu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunan Li</string-name>
          <email>yunanli@xidian.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yulang Xu</string-name>
          <email>yulangxu@stu.xidian.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kun Xie</string-name>
          <email>xiekun@xidian.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiguang Miao</string-name>
          <email>qgmiao@xidian.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Smart Human-Computer Interaction and Wearable Technology of Shaanxi Province</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Technology, Xidian University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Xi'an Key Laboratory of Big Data and Intelligent Vision</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Micro-gesture classification presents persistent challenges due to the subtle nature of gestures, short duration, and high susceptibility to background interference. While Vision Transformers (ViTs) have shown strong potential in modeling spatio-temporal dependencies, their uniform treatment of all patch tokens often causes attention to be diluted across static or irrelevant background regions. This "information averaging" effect becomes particularly problematic when the foreground signals are weak and the background remains visually dominant. To address this, we propose the Global-Aware Importance Estimation (GAIE) module, which analyzes token-level semantic contributions and guides the ViT to focus more effectively on fine-grained and meaningful gesture regions. GAIE preserves contextually valuable background cues while compressing redundant information, thus enhancing the model's sensitivity to subtle foreground movements. Our method achieved second place in Track 1 of the MiGA 2025 Challenge, demonstrating its effectiveness in real-world micro-gesture classification scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-gesture classification</kwd>
        <kwd>Video Vision Transformer</kwd>
        <kwd>Background redundancy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Micro-gesture [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] classification in video presents numerous challenges, primarily due to the subtlety
of gestures, their short duration, and susceptibility to interference from large amounts of irrelevant
background information. In recent years, Transformer-based models—particularly Vision Transformers
(ViTs)—have demonstrated strong potential in modeling spatio-temporal dependencies [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. However,
conventional ViTs treat all patch tokens equally, ignoring their semantic importance differences. This
often causes the model’s attention to be dispersed across background regions, thereby weakening its
focus on critical gesture features.
      </p>
      <p>In the specific context of micro-gesture classification, models face unique difficulties: target regions
(e.g., subtle finger motions or local facial expressions) typically occupy very small spatial areas and
exhibit low motion amplitude, while large portions of the video background remain static or highly
repetitive. Although such background stability may appear "non-distracting" on the surface, it often
leads to an "information averaging" phenomenon in globally modeled architectures like ViTs.</p>
      <p>
        More specifically, since ViTs default to treating all tokens equally [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], when foreground changes are
subtle and the background is dominant in size, the model may over-rely on background tokens. As a
result, discriminative signals from the foreground can be diluted or even overwhelmed. This issue is
especially pronounced in micro-gesture classification, leading to problems such as shifted attention,
weak discriminative features, and insensitivity to subtle variations.
      </p>
      <p>Moreover, while certain background regions may not directly contain gesture-related motion, they can
still provide semantic or temporal context that supports gesture interpretation. Simply discarding these
tokens risks losing potentially useful information. Therefore, there is a pressing need for a mechanism
that can identify key tokens while effectively integrating redundant background information, rather
than eliminating it entirely.</p>
      <p>To this end, we propose the Global-Aware Importance Estimation (GAIE) Module, an
attention-based method for analyzing information contribution. GAIE aims to preserve foreground sensitivity
while adaptively compressing and integrating background tokens, enabling ViTs to better focus on
fine-grained and meaningful gesture regions.</p>
      <p>Our method achieved second place in Track 1 of the MiGA 2025 Challenge, demonstrating the
effectiveness of the GAIE module in enhancing micro-gesture classification performance in real-world
scenarios. The main contributions of our method are summarized as follows:
• We propose GAIE, a Global-Aware Importance Estimation module that enhances ViT’s focus on
subtle and discriminative gesture regions by adaptively weighting token importance.
• Our method achieves 68.70% Top-1 accuracy on the iMiGUE test set, ranking 2nd in Track 1 of
the MiGA 2025 Challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Micro-gesture Classification</title>
        <p>
          Micro-gesture (MiG) [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ] classification aims to automatically recognize and categorize subtle,
low-amplitude movements occurring on human facial regions or body parts, such as transient facial twitches
and subconscious hand gestures. In recent years, several methods have made significant progress in
this task and achieved leading performance in the MiGA Challenge. For example, Ensemble Mode [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
employs a multi-scale heterogeneous ensemble network with residual connections and group training
strategy to enhance micro-gesture representation. M2HEN [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] integrates cross-modal fusion and
prototypical refinement modules to improve feature discriminability. VCLIP [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] adopts a CLIP-based
distillation framework that leverages 3D heatmaps and textual features for cross-modal interaction.
These approaches have all achieved excellent results in micro-gesture recognition within the MiGA
Challenge.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Token Merging</title>
        <p>
          Most token merging methods are designed for Vision Transformers (ViTs) performing image
classification tasks. For example, ToMe [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] merges tokens based on bipartite soft matching, and DiffRate [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
performs adaptive token compression by jointly pruning and merging tokens through a differentiable
compression rate. However, these methods perform dynamic merging over all tokens without explicitly
preserving the most critical ones.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <sec id="sec-3-1">
        <title>3.1. Overall network framework</title>
        <p>
          As shown in Figure 1, we adopt ViT [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ] as the backbone network to process the input video. The
video is represented as X ∈ R^(T×H×W×C), where T denotes the number of frames (temporal dimension),
H and W represent the height and width of each video frame, and C is the number of channels.
        </p>
        <p>
          To efficiently extract spatio-temporal features from the video, we employ the tubelet embedding
method. This approach divides the input video  into a series of non-overlapping spatio-temporal
patches [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], each with a size of t × h × w, where t denotes the temporal length and h, w denote the
spatial dimensions. Each tubelet is linearly projected into a d-dimensional feature space, and cosine
positional encodings are added to form the initial tokens z_0 ∈ R^(N×d).
        </p>
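<p>The tubelet embedding described above can be sketched in PyTorch (the Conv3d formulation and all hyper-parameter values here are illustrative assumptions, not the authors' code):</p>

```python
import torch
import torch.nn as nn

# Hypothetical sketch of tubelet embedding: a Conv3d whose kernel and stride
# both equal the tubelet size cuts the video into non-overlapping t x h x w
# patches and projects each one to a d-dimensional token.
class TubeletEmbedding(nn.Module):
    def __init__(self, in_ch=3, d=768, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, d, kernel_size=tubelet, stride=tubelet)

    def forward(self, x):                    # x: (B, C, T, H, W)
        z = self.proj(x)                     # (B, d, T/t, H/h, W/w)
        return z.flatten(2).transpose(1, 2)  # (B, N, d) token sequence

emb = TubeletEmbedding()
video = torch.randn(1, 3, 16, 224, 224)      # one 16-frame 224x224 RGB clip
tokens = emb(video)                          # N = 8 * 14 * 14 = 1568 tokens
```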
        <p>Feature extraction. After embedding, the tokens z_0 are fed into a stack of L Transformer encoder layers.
Each encoder layer consists of Layer Normalization (LN), Multi-Head Self-Attention (MSA), and a
Multi-Layer Perceptron (MLP).</p>
        <p>[Figure 1: Overall framework. The video is tubelet-embedded and processed by L × (Encoder w/ or w/o GAIE module); each encoder contains Norm, Multi-Head Attention, GAIE, and Norm blocks.]</p>
        <p>The computation at the l-th layer is defined as follows:</p>
        <p>z'_l = MSA(LN(z_{l−1})) + z_{l−1}    (1)</p>
        <p>z_l = MLP(LN(z'_l)) + z'_l    (2)</p>
        <p>It is worth noting that we insert the proposed Global-Aware Importance Estimation (GAIE) module
into selected encoder layers to suppress redundant background information. Finally, the mean of the
output tokens from the last layer, Mean(z_L), is fed into a linear classification head to produce the final
predictions.</p>
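<p>The encoder computation above can be sketched as a standard pre-norm Transformer layer (a hypothetical PyTorch re-implementation; the head count and MLP width are our assumptions, not the authors' exact configuration):</p>

```python
import torch
import torch.nn as nn

# Hypothetical pre-norm encoder layer implementing
#   z' = MSA(LN(z)) + z   and   z = MLP(LN(z')) + z'
class EncoderLayer(nn.Module):
    def __init__(self, d=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d)
        )

    def forward(self, z):
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z  # attention + residual
        z = self.mlp(self.ln2(z)) + z                     # MLP + residual
        return z

layer = EncoderLayer()
z = torch.randn(1, 1568, 768)
out = layer(z)  # token count and width are preserved: (1, 1568, 768)
```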
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Global-Aware Importance Estimation</title>
        <p>To address the issue that Vision Transformer (ViT) treats redundant background regions indiscriminately
in micro-gesture recognition tasks, we propose the Global-Aware Importance Estimation (GAIE) module.
This module is designed to identify and retain regions that are highly relevant to foreground gestures
based on each token’s contribution within the global context, while suppressing background information
that may interfere with gesture discrimination.</p>
        <p>ViT’s multi-head self-attention mechanism exhibits strong global modeling capabilities, and notably,
it begins to distinguish between foreground and background regions even at shallow and intermediate
layers. Motivated by this observation, the GAIE module introduces a token-level global importance
estimation strategy to guide the subsequent feature fusion process more effectively.</p>
        <p>Global Importance Score. Specifically, we first introduce a Global Importance Score to quantify the
significance of each token within the overall spatio-temporal context. For the i-th token in the input
sequence, let q_i denote its query vector, and K represent the key matrix composed of all tokens’ key
vectors. The attention weights w_i assigned by the i-th token to all others are computed as:</p>
        <p>w_i = Softmax(q_i K⊤ / √d)    (3)</p>
        <p>Then, the global-aware importance score for the j-th token is defined as the average degree to which
it is attended to across all attention distributions:</p>
        <p>score_j = (1/N) Σ_{i=1}^{N} w_{i,j}    (4)</p>
        <p>where N is the total number of tokens. This score allows us to assess each token’s potential contribution
to micro-gesture recognition, while retaining its global contextual representation.</p>
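<p>The Global Importance Score can be sketched as follows (single-head attention for clarity; an illustrative reading of the formula, not the authors' implementation):</p>

```python
import torch

# Illustrative single-head version of the Global Importance Score: token j's
# score is the average attention it receives across all tokens' attention
# distributions, score_j = (1/N) * sum_i Softmax(q_i K^T / sqrt(d))_j.
def global_importance_score(q, k):                    # q, k: (N, d)
    d = q.shape[-1]
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # (N, N), rows sum to 1
    return attn.mean(dim=0)                           # (N,) mean incoming weight

q, k = torch.randn(8, 64), torch.randn(8, 64)
score = global_importance_score(q, k)
# Because every row of attn sums to 1, the N scores also sum to 1, so they can
# be read as a distribution over which tokens receive attention.
```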
        <p>[Figure 2: Global-Aware Importance Estimation. From the attention map, tokens are scored by global importance. Step 1: tokens are sorted by global importance score and split into Set ℱ and Set ℬ. Step 2: the cosine similarity between Set ℱ and Set ℬ is computed. Step 3: each token in Set ℬ keeps one edge to its most similar token in Set ℱ. Step 4: connected tokens are merged and the sets are concatenated back together.]</p>
        <p>Foreground regions often have higher scores, while some background areas, such as objects involved in the gesture or
occluding elements, may also yield moderate or high importance due to their contextual relevance.</p>
        <p>Aggregation Strategy. Based on the global importance estimation, the GAIE module aggregates
semantic information from low-importance background tokens into the most relevant foreground
tokens via cosine similarity.</p>
        <p>
          The overall aggregation strategy of GAIE is illustrated in Figure 2 and includes the following steps:
1. Token Selection: All tokens are ranked based on their Global Importance Scores. The top K
tokens are selected to form the foreground candidate set ℱ, while the remaining N − K tokens
constitute the background set ℬ. The number of selected tokens is controlled by a keeping ratio
r = K/N.
2. Bipartite Graph Construction: A fully connected bipartite graph is built [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] between ℱ and ℬ
using cosine similarity to capture the inter-token relationships.
3. Redundancy Mapping: For each token in ℬ, only the most similar connection to a token in ℱ
is preserved. These links indicate spatio-temporal background redundancy and are used to guide
semantic aggregation.
4. Token Fusion: For each foreground token, the features of its connected background tokens are
averaged and merged into it, thus suppressing redundant background information.
        </p>
        <p>By integrating the GAIE module, we effectively compress redundant background content while
preserving critical spatio-temporal context, thereby enhancing the model’s expressiveness and recognition
accuracy in micro-gesture regions.</p>
        <p>When GAIE is inserted into the l-th encoder layer, the original set of N tokens is re-evaluated and
refined. After importance scoring and background-token fusion, only M tokens are retained (M &lt; N),
forming the new token set z_l. This mechanism not only reduces token redundancy but also strengthens
the model’s attention to gesture-related regions, facilitating fine-grained representation and improved
discriminative performance in subsequent layers.</p>
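<p>The four aggregation steps can be sketched as follows (our reading of the procedure in PyTorch; tensor shapes and the looped merge are illustrative, not the authors' code):</p>

```python
import torch
import torch.nn.functional as F

# Sketch of GAIE's aggregation: rank tokens by score, split them into a
# foreground set and a background set, link each background token to its most
# similar foreground token by cosine similarity, then average-merge each group.
def gaie_merge(tokens, scores, keep_ratio=0.7):  # tokens: (N, d), scores: (N,)
    n = tokens.shape[0]
    k = int(n * keep_ratio)
    order = scores.argsort(descending=True)
    fg, bg = tokens[order[:k]], tokens[order[k:]]              # Step 1: split sets
    sim = F.normalize(bg, dim=-1) @ F.normalize(fg, dim=-1).T  # Step 2: cosine sim
    match = sim.argmax(dim=-1)                  # Step 3: one edge per bg token
    merged = fg.clone()
    for j in range(k):                          # Step 4: average-merge groups
        group = bg[match == j]
        if len(group) > 0:
            merged[j] = (fg[j] + group.sum(0)) / (1 + len(group))
    return merged                               # (k, d) with k < N

out = gaie_merge(torch.randn(10, 4), torch.randn(10))  # 10 tokens -> 7 tokens
```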
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          iMiGUE Dataset. The iMiGUE dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] contains 32 micro-gestures and an additional
non-micro-gesture class, collected from post-match press conference videos of professional tennis players. The
challenge follows a cross-subject evaluation protocol, where 72 subjects are divided into a training
set consisting of 37 subjects and a test set consisting of 35 subjects.
        </p>
        <p>For the micro-gesture classification track, a total of 12,893, 777, and 4,562 MG clips from iMiGUE are
used for training, validation, and testing, respectively.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation Details</title>
        <p>Our model is implemented using Python and the PyTorch framework, and trained on two NVIDIA RTX
4090 GPUs. The backbone network is the vanilla ViT-Base with joint spatio-temporal attention. The
tubelet size is set to t × h × w = 2 × 16 × 16.</p>
        <p>GAIE modules are inserted after the MSA components of the 4th, 7th, and 10th encoder layers, with
a keeping ratio r = 0.7.</p>
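<p>Under these settings, a back-of-the-envelope count suggests how strongly GAIE compresses the token sequence (our arithmetic, assuming the keeping ratio applies to the running token count at each insertion):</p>

```python
# Back-of-the-envelope token count under the stated settings (our arithmetic):
# 16 sampled frames, 224 x 224 crops, tubelet size 2 x 16 x 16.
n_tokens = (16 // 2) * (224 // 16) * (224 // 16)
print(n_tokens)  # 1568 tokens enter the first encoder layer

# GAIE follows layers 4, 7 and 10 with keeping ratio r = 0.7; assuming each
# insertion keeps 70% of the running token count:
for _ in range(3):
    n_tokens = int(n_tokens * 0.7)
print(n_tokens)  # roughly a third of the original tokens reach the last layers
```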
        <p>Input video frames are first resized to a spatial resolution of 256 × 256. During training, frames
are randomly cropped to 224 × 224, and during inference, center cropping to 224 × 224 is applied.
Temporally, 16 frames are randomly sampled during training and uniformly sampled during inference.</p>
        <p>We use the AdamW optimizer with a weight decay of 0.05 and an initial learning rate of 2 × 10⁻⁵.
The learning rate follows a cosine decay schedule with a minimum value of 2 × 10⁻⁶. The model is
trained for 30 epochs on the iMiGUE dataset.</p>
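<p>The stated schedule corresponds to a standard cosine decay between the initial and minimum learning rates (a sketch; any warm-up and the exact per-step behavior of the authors' schedule are not specified):</p>

```python
import math

# Sketch of the stated schedule: cosine decay from 2e-5 down to 2e-6 over the
# 30 training epochs.
def lr_at(epoch, base=2e-5, min_lr=2e-6, epochs=30):
    cos = math.cos(math.pi * epoch / epochs)
    return min_lr + 0.5 * (base - min_lr) * (1 + cos)

print(lr_at(0))   # starts at the initial learning rate (~2e-5)
print(lr_at(30))  # ends at the minimum learning rate (~2e-6)
```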
        <p>
          The batch size is set to 2 per GPU, resulting in a total batch size of 4. Model initialization is performed
using ViT-Base weights pretrained and fine-tuned on the K400 dataset [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] via VideoMAE [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. It is
important to note that our training set includes the validation set.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results and Analysis</title>
        <p>4.3.1. Comparison with other entries
Table 1 presents the final Top 3 results of the 3rd MiGA-IJCAI Challenge Track 1. We compare our
method with the top-performing entries on the leaderboard. As shown in Table 1, our method achieved
2nd place with a Top-1 Accuracy of 68.70%. It is worth noting that our method only utilizes the RGB
modality, while other methods may have exploited additional modalities or fused multiple sources of
information. Despite this, our approach demonstrates strong generalization and robustness, achieving
competitive performance purely from RGB data. This highlights the effectiveness and efficiency of our
design, especially in scenarios where additional modalities are unavailable or impractical.
4.3.2. Comparison with State-of-the-art Methods
As shown in Table 2, our method achieves an accuracy of 68.70% using the RGB modality,
significantly outperforming other methods that rely solely on RGB inputs, such as TSM (58.77%), VideoSwin
(61.73%), and ViViT (67.84%). This demonstrates the superior representational capacity of our proposed
architecture in visual modeling.</p>
        <p>For the Skeleton modality, methods like ST-GCN (46.38%) and AAGCN (54.73%) show relatively
lower performance, while the 3D convolution-based PoseC3D achieves 61.11%, still falling short of our
RGB-only method. This suggests that the Skeleton modality alone may suffer from limited information
in the context of micro-gesture recognition.</p>
        <p>Moreover, although multimodal methods such as Ensemble Mode (70.25%), M2HEN (70.19%), and
VCLIP (68.90%) leverage both RGB and Skeleton inputs, our method—using RGB only—achieves results
comparable to some of these multimodal approaches.
4.3.3. Effectiveness of the proposed method
As shown in Table 3, our proposed method significantly reduces computational cost (MACs from
101.848G to 65.820G) and inference latency (from 21.96 ms to 13.74 ms), while achieving improved
recognition accuracy (from 67.84% to 68.70%), all under a similar parameter scale. This clearly
demonstrates that our model enhances the ability to model critical micro-gesture features while maintaining
computational efficiency.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we propose a novel network for micro-gesture recognition, featuring a Global-Aware
Importance Estimation Module (GAIE) designed to suppress irrelevant and redundant background
information. Our method achieved 2nd place in The 3rd MiGA-IJCAI Challenge Track 1, demonstrating
strong performance on micro-gesture recognition tasks using only RGB modality. Experimental results
on the MiGA Track 1 dataset validate the effectiveness of our approach, highlighting its competitive
advantage even against multimodal methods. The proposed network offers a lightweight yet powerful
solution for fine-grained spatiotemporal micro-gesture understanding.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The work is jointly supported by the National Natural Science Foundation of China under grants
No.62472342, and 62272364, the National Science and Technology Major Project under grant
No.2022ZD0117103, the provincial Key Research and Development Program of Shaanxi under grant
No.2024GH-ZDXM-47, the Research Project on Higher Education Teaching Reform of Shaanxi Province
under grant No.23JG003, the Fundamental Research Funds for the Central Universities under grant
No.QTZX25037.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT, Grammarly in order to: Grammar
and spelling check, Paraphrase and reword. After using this tool/service, the author(s) reviewed and
edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>131</volume>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10631</fpage>
          -
          <lpage>10642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>ICLR</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arnab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lučić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Vivit: A video vision transformer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6836</fpage>
          -
          <lpage>6846</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bolya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feichtenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <article-title>Token merging: Your ViT but faster</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kerui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network</article-title>
          ,
          <source>MiGA-IJCAI workshop</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Prototype learning for micro-gesture classification</article-title>
          ,
          <source>MiGA-IJCAI workshop</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A multimodal micro-gesture classification model based on clip</article-title>
          ,
          <source>MiGA-IJCAI workshop</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>DiffRate: Differentiable compression rate for efficient vision transformers</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>17164</fpage>
          -
          <lpage>17174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Back</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natsev</surname>
          </string-name>
          , et al.,
          <article-title>The kinetics human action video dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1705.06950</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>TSM: Temporal shift module for efficient video understanding</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>7083</fpage>
          -
          <lpage>7093</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Video swin transformer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3202</fpage>
          -
          <lpage>3211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Spatial temporal graph convolutional networks for skeleton-based action recognition</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>32</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>