<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Micro-gesture Online Recognition using Learnable Query Points</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pengyu Liu</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Wang</string-name>
          <email>jiafei127@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kun Li</string-name>
          <email>kunli.hfut@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoliang Chen</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanyan Wei</string-name>
          <email>weiyy@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shengeng Tang</string-name>
          <email>tangsg@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiliang Wu</string-name>
          <email>wu_zhiliang@zju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Guo</string-name>
          <email>guodan@hfut.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Anhui Zhonghuitong Technology Co., Ltd</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CCAI, Zhejiang University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Artificial Intelligence, Hefei Comprehensive National Science Center</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>School of Computer Science and Information Engineering, School of Artificial Intelligence, Hefei University of Technology</institution>
          ,
          <addr-line>HFUT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we briefly introduce the solution developed by our team, HFUT-VUT, for the Micro-gesture Online Recognition track in the MiGA challenge at IJCAI 2024. The Micro-gesture Online Recognition task involves identifying the category and locating the start and end times of micro-gestures in video clips. Compared to the typical Temporal Action Detection task, the Micro-gesture Online Recognition task focuses more on distinguishing between micro-gestures and pinpointing the start and end times of actions. Our solution ranks 2nd in the Micro-gesture Online Recognition track.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-gesture</kwd>
        <kwd>action online recognition</kwd>
        <kwd>video understanding</kwd>
        <kwd>Mamba</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Humans can express emotions and communicate with others through various non-verbal forms,
among which gestures play a crucial role in emotional expression and communication [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1,
2, 3, 4, 5</xref>
        ]. Examples include “cover face”, “fold arms”, and “cross fingers”, which convey
human emotions to the outside world. Additionally, these micro-gestures (MGs) are often not
spontaneous but occur unconsciously in specific environments. Unlike macro gestures intended
for communication, non-spontaneous MGs better reflect genuine human emotions, making the
study of MGs more meaningful in understanding human emotions. SMG [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and iMiGUE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
are datasets for assessing and analyzing human emotional states through MG information. These
datasets provide a stronger representation of human emotions, significantly contributing to a
deeper understanding of genuine human feelings.
      </p>
      <p>Compared with recognizing common macro gestures, Micro-gesture Online Recognition is more challenging
because MGs occur more irregularly and randomly than the actions in existing action or gesture recognition
datasets. Additionally, different classes of actions may co-occur, and one MG may transition into
another. Moreover, the finer distinctions between different categories of MGs, together with their
smaller movement amplitudes, make it more difficult to determine the start and end times
of actions.</p>
      <p>
        In this challenge, we adopt PointTAD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as the baseline. The main contributions of our
method are as follows:
        <list list-type="bullet">
          <list-item><p>We introduce the Mamba-MHSA block for Micro-gesture Online Recognition, which
distinguishes and locates action categories better than the baseline model.</p></list-item>
          <list-item><p>In the Micro-gesture Online Recognition challenge, our solution achieves an F1 score
of 14.34 on the test set, securing 2nd place in the competition. The experimental results
demonstrate that our model can effectively distinguish and locate MGs.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Current research predominantly focuses on common macro gestures or actions [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], which
have limited capability in reflecting human emotions. This is because humans can subjectively
control their gestures and actions to hide their true emotions. In contrast, MGs typically
occur involuntarily and uncontrollably, providing a more accurate reflection of genuine human
emotions, which is crucial for understanding behavior and emotions. Here, we review the
related technologies: micro-gesture datasets, temporal action detection, and Mamba.
      </p>
      <p>
        Micro-gesture Datasets. The iMiGUE [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] dataset is the first publicly available dataset, aimed
at recognizing and understanding suppressed or hidden emotions through MGs. It includes 359
videos with a total duration of 2092 minutes, collected from 72 subjects from 28 countries. The
dataset is annotated with 18,499 MG samples across 32 categories, averaging 51 MG actions
per video, with each MG instance ranging from 0.18 seconds to 80.92 seconds, and an average
duration of 2.55 seconds. The SMG [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] dataset focuses on naturally occurring MGs under stress,
collected from 40 participants of various ages, genders, and racial backgrounds, divided into 16
types of MGs. The SMG dataset has been applied in various studies on micro-gesture recognition
and emotion analysis, demonstrating its utility in these research fields.
      </p>
      <p>
        Micro-gesture Online Recognition. Guo et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed a novel deep network
combining graph convolution and Transformer encoders to extract motion features from 2D skeleton
sequences. This combination leverages the strengths of both graph convolution and Transformer.
Their contributions collectively advance the state-of-the-art in micro-gesture recognition,
providing a robust framework for emotion analysis based on MGs.
      </p>
      <p>
        Temporal Action Detection. Temporal action detection has been studied as a multi-label
frame-wise classification problem in previous literature. Early models [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] mainly focused
on modeling the temporal relationships between frames using Gaussian filters in the time
dimension. Current research primarily deals with processing information at different scales
and integrating spatiotemporal attention during processing. Tirupattur et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] introduced
an attention-based Multi-label Action Dependency layer (MLAD) in their model, significantly
improving the modeling of co-occurrence dependencies and temporal dependencies of actions. Dai et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
proposed a novel ConvTransformer network named MS-TCT that incorporates global and local
time relationship encoders and a time-scale mixer for effective multi-scale feature fusion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
addressing the complexities of temporal relationships. Tan et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] presented an end-to-end
action detection model named PointTAD that leverages learnable query points for precise
localization and differentiation of actions in multi-label videos. These studies provide valuable
insights for micro-gesture online recognition.
      </p>
      <p>[Figure 1: Overall architecture of the model. Labels recovered from the figure: RGB frames; video features ($T \times D$); query points; query vectors; action decoder (×L) containing the Mamba-MHSA block (Mamba block ×M, MHSA), the Multi-level Interactive Module, and an FFN; updated query points; updated query vectors; Transform (proposals); FFN (classes).]</p>
      <p>
        Mamba. The Transformer architecture and its core self-attention mechanism [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18">15, 16, 17, 18</xref>
        ]
have achieved significant success in deep learning. However, the Transformer becomes inefficient
when processing long sequences. Structured State Space Models (SSMs) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which combine
characteristics of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks
(CNNs), have shown potential in certain data modalities. SSMs perform well on continuous signal
data but less effectively on discrete and information-dense data. To address these shortcomings,
Mamba introduces a selection mechanism that allows SSM parameters to adjust dynamically
based on input data, improving model performance on discrete modalities. Mamba has notable
advantages in inference speed and sequence length scalability. Thus, we incorporate Mamba into
our model, combining Mamba [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] with self-attention to better model different semantics.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>We formulate the Micro-gesture Online Recognition task as a set prediction problem.
Given a continuous video clip with $T$ frames, we predict a set of action instances
$\Psi = \{\psi_n = (t_n^s, t_n^e, c_n)\}_{n=1}^{N_q}$, where $N_q$ is the number of learnable queries, $t_n^s$ and $t_n^e$ are the starting
and ending timestamps of the $n$-th detected instance, and $c_n$ is its action category. The ground
truth action set to detect is denoted as $\hat{\Psi} = \{\hat{\psi}_n = (\hat{t}_n^s, \hat{t}_n^e, \hat{c}_n)\}_{n=1}^{N_g}$, where $\hat{t}_n^s$ and $\hat{t}_n^e$ are the
starting and ending timestamps of the $n$-th action, $\hat{c}_n$ is the ground truth action category, and
$N_g$ is the number of ground truth actions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Overall Architecture</title>
        <p>The overall architecture of our model is shown in Figure 1. The model consists of a video encoder
and an action decoder. For each video sequence, we select an RGB sequence of length $T$, a set
of learnable query points $P = \{P_i\}_{i=1}^{N_q}$, and query vectors $q \in \mathbb{R}^{N_q \times D}$. The learnable query
points are used to locate the positions of action boundaries, and the query vectors decode action
semantics and positions from the features input to the model. The action decoder comprises
$L$ stacked decoder layers. Each layer of the action decoder takes video features, the latest
query points $P$, and the latest query vectors $q$ as input. Each action decoder layer includes
two parts: 1) the Mamba-MHSA block models the relationships among query vectors and the
potential relationships between different action categories; 2) the Multi-level Interactive Module
dynamically models point-level and instance-level semantics based on the query vectors.
Finally, we use a Feed-Forward Network (FFN) to decode the action labels
from the query vectors and convert the query points into detection outputs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Video Encoder</title>
        <p>
          We use the I3D network [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] as our model’s video encoder, integrating the video encoder with
the action decoder for end-to-end training. To facilitate model deployment and speed up feature
extraction, we avoid using the optical flow part of the I3D backbone network. Finally, the
temporal stride of the encoded video features is 4, and the spatiotemporal representations are
compressed into temporal features through spatial average pooling.
        </p>
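        <p>As a minimal sketch of the last step, assuming the backbone yields one spatial feature map per encoded time step, spatial average pooling reduces each map to a single feature vector. The function name, the 7 × 7 spatial size, and the 1024 channels are hypothetical values chosen for illustration and are not stated in this paper.</p>
        <preformat><![CDATA[
```python
import numpy as np


def spatial_average_pool(feature_map):
    """Collapse the spatial axes of a (T', H, W, D) feature map into (T', D)."""
    return feature_map.mean(axis=(1, 2))


# Hypothetical sizes: a 128-frame window encoded with temporal stride 4.
num_input_frames, temporal_stride = 128, 4
t_out = num_input_frames // temporal_stride
feature_map = np.random.default_rng(0).normal(size=(t_out, 7, 7, 1024))

temporal_features = spatial_average_pool(feature_map)
print(temporal_features.shape)  # (32, 1024)
```
]]></preformat>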
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Learnable Query Points</title>
        <p>Using only the start and end times to represent an action instance limits its boundary and content
description. Therefore, to improve the representation flexibility, a point-based representation
method is used to learn keyframes of action boundaries and semantics within instances. For
each query, the point-based representation is $P = \{t_j\}_{j=1}^{N_s}$, where $t_j$ is the temporal position of
the $j$-th query point, and $N_s$ is the number of points per query. During training, query points
are initially placed at the midpoint of the input video sequence and are then refined iteratively
across the action decoder layers by the query vectors $q$, gradually approaching their final
positions. Specifically, at each layer, the offsets of the query points are predicted from the updated
query vectors via linear projection. In action decoder layer $l$, the query points of a query are
$P^l = \{t_j^l\}_{j=1}^{N_s}$, with the offsets denoted as $\{\Delta t_j^l\}_{j=1}^{N_s}$. This operation can be
summarized as:
<disp-formula><label>(1)</label><tex-math>P^{l+1} = \left\{ \left( t_j^l + \Delta t_j^l \cdot s^l \cdot 0.5 \right) \right\}_{j=1}^{N_s},</tex-math></disp-formula>
where $s^l = \max(\{t_j^l\}) - \min(\{t_j^l\})$. For relatively short actions, the update step size of the query
points is smaller, aiding in the localization of short actions. Additionally, the action query points
updated by the previous action decoder layer become the input to the next action decoder layer
after passing through a layer of FFN.</p>
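        <p>The update rule in Eq. (1) can be illustrated with a short sketch. The function name and the example values are hypothetical; in the model the offsets would come from a linear projection of the query vector, whereas here they are fixed by hand to show how the span of the query points scales the step size.</p>
        <preformat><![CDATA[
```python
def update_query_points(points, offsets, scale_factor=0.5):
    """Apply one decoder-layer refinement to the query points of a single query.

    points  : temporal positions of the query points at the current layer
    offsets : predicted offsets, one per query point
    """
    if len(points) != len(offsets):
        raise ValueError("points and offsets must have the same length")
    # Span currently covered by the query points (max minus min position).
    span = max(points) - min(points)
    return [t + d * span * scale_factor for t, d in zip(points, offsets)]


# Hypothetical example values (not taken from the paper).
short_action = [10.0, 11.0, 12.0]   # span of 2 frames
long_action = [10.0, 30.0, 50.0]    # span of 40 frames
offsets = [0.1, 0.0, -0.1]

print(update_query_points(short_action, offsets))
print(update_query_points(long_action, offsets))
```
]]></preformat>
        <p>With identical offsets, the points of the short action move by 0.1 frames while those of the long action move by 2 frames, which matches the observation that shorter actions receive smaller update steps.</p>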
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Mamba-MHSA Block</title>
        <p>
          Compared to Transformers [
          <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
          ], the recently proposed Mamba has demonstrated powerful
capabilities in sequence modeling. Therefore, we introduce Mamba into our model and combine
it with Multi-Head Self-Attention (MHSA) to model the relationships among query vectors,
forming the Mamba-MHSA block. Our Mamba-MHSA block consists of $M$ Mamba blocks
followed by an MHSA layer. The $m$-th Mamba block processes its input query vectors $q_m$
with a selective state space model.
        </p>
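        <p>Mamba is designed based on state space models (SSMs) and requires defining three key
parameters $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$. The SSMs are defined by the following
differential equations:
<disp-formula><label>(2)</label><tex-math>h'(t) = A h(t) + B x(t),</tex-math></disp-formula>
<disp-formula><label>(3)</label><tex-math>y(t) = C h(t).</tex-math></disp-formula></p>
        <p>We need to discretize the above equations. The discretized SSMs include a time parameter
$\Delta$, which converts the continuous parameters $A$ and $B$ into discrete parameters $\bar{A}$ and $\bar{B}$. The specific
formulas are as follows:
<disp-formula><label>(4)</label><tex-math>\bar{A} = \exp(\Delta A),</tex-math></disp-formula>
<disp-formula><label>(5)</label><tex-math>\bar{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B.</tex-math></disp-formula></p>
        <p>After discretization, the block can be expressed as:
<disp-formula><label>(6)</label><tex-math>h_t = \bar{A} h_{t-1} + \bar{B} x_t,</tex-math></disp-formula>
<disp-formula><label>(7)</label><tex-math>y_t = C h_t.</tex-math></disp-formula></p>
        <p>Next, we use a global convolution operation to obtain the output $q_{m+1}$ by convolving the
input sequence $q_m$ with a structured convolutional kernel $\bar{K}$. The convolution kernel $\bar{K}$ is
precomputed from the parameters $\bar{A}$, $\bar{B}$, and $C$, and its calculation method is as follows:
<disp-formula><label>(8)</label><tex-math>q_{m+1} = \mathrm{SSM}(q_m) = q_m \ast \bar{K} = q_m \ast \left( C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{N_q - 1}\bar{B} \right),</tex-math></disp-formula>
where the kernel length equals the length $N_q$ of the input query sequence.</p>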
        <p>After passing through the $M$ Mamba blocks, the query vectors are fed into a
Multi-Head Self-Attention block to obtain the output. With the Mamba-MHSA block, the model gains
stronger selectivity and perceptual capability for the input query vectors, allowing it to better
model the relationships between different action instances.</p>
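        <p>To make the discretization and the convolutional view concrete, the following sketch checks numerically that the recurrence and the global convolution give the same output. It is an illustration under simplifying assumptions, not the implementation used in this work: the state matrix is taken to be diagonal, the input is a scalar sequence, and the parameters are fixed rather than input-dependent (the selection mechanism of Mamba is omitted). All function names and sizes are hypothetical.</p>
        <preformat><![CDATA[
```python
import numpy as np


def discretize(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal state matrix.

    a_diag : (N,) diagonal entries of A, assumed non-zero
    b      : (N,) entries of B
    delta  : scalar step size
    """
    a_bar = np.exp(delta * a_diag)
    # For diagonal A, (delta*A)^-1 (exp(delta*A) - I) delta*B is element-wise.
    b_bar = (a_bar - 1.0) / (delta * a_diag) * (delta * b)
    return a_bar, b_bar


def ssm_recurrent(x, a_bar, b_bar, c):
    """Step-by-step form: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros_like(a_bar)
    y = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        y.append(float(c @ h))
    return np.array(y)


def ssm_convolutional(x, a_bar, b_bar, c):
    """Convolutional form with kernel (C B_bar, C A_bar B_bar, ...)."""
    length = len(x)
    kernel = np.array([c @ (a_bar ** k * b_bar) for k in range(length)])
    # Causal convolution: keep only the first `length` outputs.
    return np.convolve(x, kernel)[:length]


rng = np.random.default_rng(0)
state_dim, seq_len = 4, 16                        # hypothetical sizes
a_diag = -rng.uniform(0.5, 1.5, size=state_dim)   # negative entries keep the system stable
b = rng.normal(size=state_dim)
c = rng.normal(size=state_dim)
x = rng.normal(size=seq_len)

a_bar, b_bar = discretize(a_diag, b, delta=0.1)
y_rec = ssm_recurrent(x, a_bar, b_bar, c)
y_conv = ssm_convolutional(x, a_bar, b_bar, c)
print(np.allclose(y_rec, y_conv))
```
]]></preformat>
        <p>The equivalence holds only because the parameters do not depend on the input in this sketch; with input-dependent parameters the kernel can no longer be precomputed in this way.</p>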
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Multi-Level Interactive Module</title>
        <p>Previous temporal action detectors are often deficient in decoding sampled frames, as they
rarely aggregate semantics from different aspects and levels. Thus, we adopt
a multi-level interactive module to aggregate multi-level semantics.</p>
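        <p>
          Point-Level Local Semantic Extraction. We use deformable convolution [
          <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
          ] to
extract point-level features within a local neighborhood. More temporal offsets cover the area
around a query point more precisely and thus capture more information, but they also increase
the computational cost. For the $j$-th query point, we therefore predict four temporal offsets
$\{\Delta t_k\}_{k=1}^{4}$ and corresponding weights $\{w_k\}_{k=1}^{4}$ from the position of this point. Using the query
point at frame $t_j$ as the center point, we add the temporal offsets to form four deformable sub-points.
These sub-points represent the local area around the center point. The features at the sub-points
are extracted through bilinear interpolation and multiplied by the weight values to obtain the
point-level feature $F_j^p$. This process can be represented as:
<disp-formula><label>(9)</label><tex-math>F_j^p = \sum_{k=1}^{4} w_k \cdot F(t_j + \Delta t_k),</tex-math></disp-formula>
where $F(\cdot)$ denotes the video feature interpolated at the given temporal position.
        </p>
        <p>The offsets and weights are generated by linear projection from the query vector $q$. This
process can be represented as:
<disp-formula><label>(10)</label><tex-math>\{\Delta t_k\}_{k=1}^{4} = \mathrm{Linear}(q),</tex-math></disp-formula>
<disp-formula><label>(11)</label><tex-math>\{w_k\}_{k=1}^{4} = \mathrm{Linear}(q).</tex-math></disp-formula></p>
        <p>Instance-Level Semantic Mixing. Since actions can occur simultaneously, modeling only
the temporal aspect may cause overlapping actions to have similar representations, leading to
classification errors. Therefore, dynamic convolution is used to mix semantics across frames
and channels. The mixing operates on the point-level features of the query points, $F^p \in \mathbb{R}^{N_s \times D}$. Given the query vector
$q$, the parameters for frame mix and channel mix are generated:
<disp-formula><label>(12)</label><tex-math>M^f = \mathrm{Linear}(q) \in \mathbb{R}^{N_s \times N_s}, \quad M^{c,1} = \mathrm{Linear}(q) \in \mathbb{R}^{D \times D'}, \quad M^{c,2} = \mathrm{Linear}(q) \in \mathbb{R}^{D' \times D}.</tex-math></disp-formula></p>
        <p>Frame mix is performed by projecting and then activating with LayerNorm and ReLU across
the $N_s$ points to explore intra-instance relationships:
<disp-formula><label>(13)</label><tex-math>F^f = \mathrm{ReLU}(\mathrm{LayerNorm}((F^p)^{\top} M^f)) \in \mathbb{R}^{D \times N_s}.</tex-math></disp-formula></p>
        <p>Channel mix enhances action semantics using dynamic projection along the channel
dimension:
<disp-formula><label>(14)</label><tex-math>F^c = \mathrm{ReLU}(\mathrm{LayerNorm}(\mathrm{ReLU}(\mathrm{LayerNorm}(F^p M^{c,1})) M^{c,2})) \in \mathbb{R}^{N_s \times D}.</tex-math></disp-formula></p>
        <p>These two features are then concatenated along the channel and compressed through a linear
layer to the size of the query vector. The query vector is updated to obtain the query vector for
the next layer input, $q^{l+1}$. This process can be represented as:
<disp-formula><label>(15)</label><tex-math>q^{l+1} = q^l + \mathrm{Linear}(\mathrm{Concat}(F^f, F^c)).</tex-math></disp-formula></p>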
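        <p>The point-level sampling and the frame and channel mixing described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used here: interpolation is linear along the time axis only, LayerNorm has no learnable parameters, and the offsets, weights, and mixing matrices that the model would generate from the query vector by linear projection are replaced by hand-picked or random stand-ins. All names and sizes are hypothetical.</p>
        <preformat><![CDATA[
```python
import numpy as np


def sample_linear(features, t):
    """Linearly interpolate frame features of shape (T, D) at a fractional position t."""
    t = float(np.clip(t, 0, len(features) - 1))
    lo = int(np.floor(t))
    hi = min(lo + 1, len(features) - 1)
    frac = t - lo
    return (1.0 - frac) * features[lo] + frac * features[hi]


def point_level_feature(features, center, offsets, weights):
    """Weighted sum of features sampled at the sub-points around `center`."""
    return sum(w * sample_linear(features, center + o) for o, w in zip(offsets, weights))


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learnable scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def relu(x):
    return np.maximum(x, 0.0)


def instance_level_mixing(f_p, m_f, m_c1, m_c2):
    """Frame mix and channel mix of point features f_p with shape (N_s, D)."""
    f_frame = relu(layer_norm(f_p.T @ m_f))                            # (D, N_s)
    f_channel = relu(layer_norm(relu(layer_norm(f_p @ m_c1)) @ m_c2))  # (N_s, D)
    return f_frame, f_channel


rng = np.random.default_rng(0)
T, D, D_hidden, N_s = 32, 8, 4, 5            # hypothetical sizes
features = rng.normal(size=(T, D))           # stand-in for encoded video features
centers = np.linspace(4.0, 20.0, N_s)        # stand-in query point positions
offsets = np.array([-1.5, -0.5, 0.5, 1.5])   # stand-in for projected offsets
weights = np.full(4, 0.25)                   # stand-in for projected weights

f_p = np.stack([point_level_feature(features, c, offsets, weights) for c in centers])
m_f = rng.normal(size=(N_s, N_s))            # stand-ins for the generated mixing parameters
m_c1 = rng.normal(size=(D, D_hidden))
m_c2 = rng.normal(size=(D_hidden, D))
f_frame, f_channel = instance_level_mixing(f_p, m_f, m_c1, m_c2)

q = rng.normal(size=D)
w_out = rng.normal(size=(2 * N_s * D, D))    # stand-in for the final linear layer
q_next = q + np.concatenate([f_frame.ravel(), f_channel.ravel()]) @ w_out
print(f_p.shape, f_frame.shape, f_channel.shape, q_next.shape)
```
]]></preformat>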
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Evaluation Metric</title>
        <p>
          Dataset. The Spontaneous Micro-Gesture (SMG) dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] consists of 3,692 samples of 17
MGs. The dataset employs a cross-subject evaluation protocol by dividing the 40 subjects into a
training group consisting of long sequences from 35 subjects and a testing group of sequences
from 5 subjects. We only use RGB sequences as input.
        </p>
        <p>Evaluation Metric. We jointly evaluate the detection and classification performance of
algorithms using the $F_1$ score, defined as:
<disp-formula><label>(16)</label><tex-math>F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.</tex-math></disp-formula></p>
        <p>Given a long video sequence that needs to be evaluated, Precision is the fraction of correctly
classified MGs among all gestures retrieved in the sequence by the algorithms, while Recall (or
sensitivity) is the fraction of MGs that have been correctly retrieved over the total amount of
annotated MGs.</p>
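        <p>As a worked example of Eq. (16), the sketch below computes the F1 score from raw counts. The function name and the counts are hypothetical, and the rule that decides whether a retrieved gesture counts as correct (for example an overlap threshold) is not specified here and is left outside the sketch.</p>
        <preformat><![CDATA[
```python
def f1_score(num_correct, num_retrieved, num_annotated):
    """F1 from counts of correct, retrieved, and annotated micro-gestures."""
    if num_retrieved == 0 or num_annotated == 0:
        return 0.0
    precision = num_correct / num_retrieved
    recall = num_correct / num_annotated
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 30 correct detections out of 120 retrieved, 200 annotated.
# Precision = 0.25, Recall = 0.15, so F1 = 2 * 0.0375 / 0.4 = 0.1875.
print(f1_score(30, 120, 200))
```
]]></preformat>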
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation Details</title>
        <p>
          We extract video frames at a rate of 10 fps as input to the I3D backbone network. A sliding window
mechanism is employed to preprocess video sequences, with the window size ($T$) set to 128
frames to accommodate most action categories. During training, the overlap ratio is set to 0.75,
while for inference, the overlap ratio is 0. We set $N_q$ to 48 and $N_s$ to 30. The I3D backbone uses
pre-trained weights from Kinetics400 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. The batch size is set to 1, and the initial learning
rate is 1e-4, halved every 10 epochs, for a total of 50 epochs.
        </p>
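        <p>The sliding-window preprocessing and the step learning-rate schedule can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements from this paper: the last window is simply truncated at the end of the sequence (padding is left to the caller), and the learning rate is halved at epochs 10, 20, 30, and 40 when counting from 0.</p>
        <preformat><![CDATA[
```python
def sliding_windows(num_frames, window_size=128, overlap_ratio=0.75):
    """Return (start, end) frame indices of windows covering a sequence."""
    stride = max(1, int(window_size * (1.0 - overlap_ratio)))
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


def learning_rate(epoch, base_lr=1e-4, decay_every=10, factor=0.5):
    """Step schedule: multiply the rate by `factor` every `decay_every` epochs."""
    return base_lr * factor ** (epoch // decay_every)


# Hypothetical 300-frame sequence.
print(len(sliding_windows(300, overlap_ratio=0.75)))  # training: stride of 32 frames
print(sliding_windows(300, overlap_ratio=0.0))        # inference: non-overlapping windows
print([learning_rate(e) for e in (0, 10, 49)])
```
]]></preformat>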
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Results</title>
        <p>As shown in Table 1, we report the results of the top three teams on the SMG dataset test set.
Our team secured second place. Although a notable performance gap remains
between our method and the first-place “NPU-MUCIS” team, our method significantly exceeds
the performance of the third-place “JDY203” team by 54.52%.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Ablation Study</title>
        <p>Study on the Number of Query Points ($N_s$). In Table 2a, we conduct an ablation study on different
numbers of query points. We observe that the model’s performance improves as the number of
query points increases up to 30. However, when the number of query
points exceeds 30, the model’s performance starts to decrease. Therefore, we choose 30 as the
default number of query points for our model.</p>
        <p><sup>1</sup> The Kaggle competition page: https://www.kaggle.com/competitions/2nd-miga-ijcai-challenge-track2/leaderboard</p>
        <p>[Table 2: Ablation studies, reported as F1-score. (a) Number of query points, with evaluated values 25, 27, 30, 31, 32, and 35; (b) window size; (c) number of action decoder layers; (d) number of Mamba blocks.]</p>
        <p>Study on the Window Size ($T$). According to Table 2b, the model achieves the best
performance when the window size is set to 128. Thus, we set the window size to 128.</p>
        <p>Study on the Number of Layers in the Action Decoder ($L$). We investigate the influence of
different numbers of layers in the action decoder on the model. According to the results in
Table 2c, increasing the number of layers allows the model to capture more
information, thereby improving its performance. However, when the number of layers exceeds
4, the model’s performance begins to decrease.</p>
        <p>Study on the Number of Mamba Blocks ($M$). To balance performance against computational cost, we study
the impact of the number of Mamba blocks on the model. As indicated in Table 2d, the model
performs best when $M$ is set to 2. Additionally, when the number of Mamba blocks exceeds 2,
the model suffers from gradient explosion.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we present a solution for the Micro-gesture Online Recognition track of the MiGA challenge
at IJCAI 2024. Our approach is based on the PointTAD baseline, enhanced with the Mamba-MHSA block
to improve the model’s ability to model sequences. This module effectively enhances the model’s
capability for Micro-gesture Online Recognition, achieving an F1 score of 14.34 on
the SMG test set. In future work, we will consider incorporating skeletal data into the model to
further improve its micro-gesture online recognition performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the National Key R&amp;D Program of China (2022YFB4500601), the
National Natural Science Foundation of China (62272144, 72188101, 62020106007, and U20A20183),
the Major Project of Anhui Province (202203a05020011), and the Fundamental Research Funds
for the Central Universities (JZ2024HGTG0309).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning</article-title>
          ,
          <source>in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>131</volume>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Joint skeletal and semantic embedding loss for micro-gesture classification</article-title>
          ,
          <source>arXiv preprint arXiv:2307.10624</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Benchmarking micro-action recognition: Dataset, methods, and applications</article-title>
          ,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>34</volume>
          (
          <year>2024</year>
          )
          <fpage>6238</fpage>
          -
          <lpage>6252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Gloss semantic-enhanced network with online backtranslation for sign language production</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5630</fpage>
          -
          <lpage>5638</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10631</fpage>
          -
          <lpage>10642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Pointtad: Multi-label temporal action detection with learnable query points</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>15268</fpage>
          -
          <lpage>15280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Proposal-free video grounding with contextual pyramid network</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1902</fpage>
          -
          <lpage>1910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Vigt: proposal-free video grounding with a learnable token in the transformer</article-title>
          ,
          <source>Science China Information Sciences</source>
          <volume>66</volume>
          (
          <year>2023</year>
          )
          <fpage>202102</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Micro-gesture online recognition with graph-convolution and multiscale transformers for long sequence</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piergiovanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ryoo</surname>
          </string-name>
          ,
          <article-title>Learning latent super-events to detect multiple activities in videos</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5304</fpage>
          -
          <lpage>5313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Tirupattur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Modeling multi-label action dependencies for temporal action localization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1460</fpage>
          -
          <lpage>1470</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kahatapitiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ryoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brémond</surname>
          </string-name>
          ,
          <article-title>Ms-tct: Multi-scale temporal convtransformer for action detection</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>20041</fpage>
          -
          <lpage>20051</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Dapc-net: Deformable alignment and pyramid context completion networks for video inpainting</article-title>
          ,
          <source>IEEE Signal Processing Letters</source>
          <volume>28</volume>
          (
          <year>2021</year>
          )
          <fpage>1145</fpage>
          -
          <lpage>1149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Waveformer: Wavelet transformer for noise-robust video inpainting</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>6180</fpage>
          -
          <lpage>6188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Deep stereo video inpainting</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>5693</fpage>
          -
          <lpage>5702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling</article-title>
          ,
          <source>arXiv preprint arXiv:2406.00919</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Robust attention deraining network for synchronous rain streaks and raindrops removal</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>6464</fpage>
          -
          <lpage>6472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Efficiently modeling long sequences with structured state spaces</article-title>
          ,
          <source>arXiv preprint arXiv:2111.00396</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rudra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          ,
          <article-title>Combining recurrent, convolutional, and continuous-time models with linear state space layers</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>572</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <article-title>Mamba: Linear-time sequence modeling with selective state spaces</article-title>
          ,
          <source>arXiv preprint arXiv:2312.00752</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Dindar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mesgarani</surname>
          </string-name>
          ,
          <article-title>Ssamba: Self-supervised audio representation learning with mamba state space model</article-title>
          ,
          <source>arXiv preprint arXiv:2405.11831</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Quo vadis, action recognition? a new model and the kinetics dataset</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6299</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
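          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>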
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Eulermormer: Robust eulerian motion magnification via dynamic filtering within transformer</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>5345</fpage>
          -
          <lpage>5353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Frequency decoupling for motion magnification via multi-level isomorphic architecture</article-title>
          ,
          <source>arXiv preprint arXiv:2403.07347</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Semi-supervised video inpainting with cycle consistency constraints</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>22586</fpage>
          -
          <lpage>22595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Divide-and-conquer completion network for video inpainting</article-title>
          ,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>33</volume>
          (
          <year>2023</year>
          )
          <fpage>2753</fpage>
          -
          <lpage>2766</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Back</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natsev</surname>
          </string-name>
          , et al.,
          <article-title>The kinetics human action video dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1705.06950</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>