<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Micro-gesture Online Recognition with Graph-convolution and Multiscale Transformers for Long Sequence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>XuPeng Guo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Peng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hexiang Huang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhaoqiang Xia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Psychiatry and Behavioral Sciences, Stanford University</institution>
          ,
          <addr-line>California 94305</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Innovation Center NPU Chongqing, Northwestern Polytechnical University</institution>
          ,
          <addr-line>Chongqing 400000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Electronics and Information, Northwestern Polytechnical University</institution>
          ,
          <addr-line>Xi'an 710129</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Micro-gesture is becoming a fundamental clue for emotion analysis and has attracted increasing attention in this field. Existing studies mainly focus on micro-gesture classification, which predicts the category of a pre-segmented clip, while no works have been reported for spotting micro-gestures. As a preliminary step for classification, micro-gesture online recognition (spotting), which predicts both the temporal location and the category, has received limited attention. In this context, we propose a novel deep network for micro-gesture online recognition, which incorporates graph-convolution and multiscale Transformer encoders. Specifically, we utilize a graph-convolution-based Transformer module to extract motion features from 2D skeleton sequences, which are then processed by a feature pyramid module to obtain hierarchical multiscale features. We further employ a local Transformer module to model the similarity between micro-gesture frames, and decouple the classification and regression branches to obtain accurate locations and categories. These Transformers are trained in a two-stage strategy and combined to perform the spotting. Our proposed method is validated on the iMiGUE dataset and achieved the first ranking in the online recognition task (Track 2) of the MiGA2023 Challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-gesture online recognition</kwd>
        <kwd>Graph convolution</kwd>
        <kwd>Multiscale Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In daily interactions, humans usually rely on body gestures to perceive emotions, which
plays a crucial role in facilitating communication and understanding between individuals. With
the increasing demand for intelligent systems, such as robots and other human-computer
interaction systems, the ability to recognize and respond to users' emotions based on their body
gestures has become a critical component [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among body gestures, a micro-gesture (MiG)
is an involuntary reaction triggered by people's inner emotions. In contrast to more
overt behavioral gestures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], such as waving a hand, micro-gestures are subtler and less
conscious actions, such as biting a finger, which are performed while attempting to conceal real
feelings. As this kind of gesture is typically performed unconsciously and unintentionally,
it can reveal the hidden emotional status of human beings, which may differ from the emotional
status that people express intentionally. Psychological studies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] also show that MiGs can be more
reliable emotion indicators.
      </p>
      <p>
        Micro-gesture analysis with computer vision techniques has attracted much attention
in recent years. It can mainly be divided into two tasks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: 1) the
classification of body gestures and 2) temporal body gesture localization and recognition (online
recognition or spotting) in long sequences. Most researchers have been committed to the
former task, which conducts the classification of pre-segmented clips, and most of the
advanced technologies can achieve quite promising performance [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The latter task is to
detect the temporal frames with micro-gestures in a sequence and recognize them. Currently,
there is a lack of automatic approaches for spotting micro-gestures, highlighting the importance
of developing an automated micro-gesture detection model. Such a model would enable more accurate
and efficient analysis of micro-gestures, which is crucial for understanding and interpreting
human emotions.
      </p>
      <p>In this paper, to locate and recognize micro-gestures in a long skeleton sequence, we
propose a deep network for detecting micro-gestures by integrating graph-convolution and
multiscale Transformer encoders. We utilize a graph-convolution Transformer module based
on hypergraphs and hyperedges to extract motion features from 2D skeleton sequences. Then
hierarchical multiscale features are obtained by a feature pyramid module. We further employ
a multiscale Transformer module to model the similarity between micro-gesture frames. The
classification and regression branches are finally decoupled to obtain accurate locations and
categories. These Transformers are trained in a two-stage strategy and combined to perform the
spotting. The main contributions of this paper can be summarized as:
• We design a deep network for MiG online recognition for the first time, which integrates
graph-convolution and multiscale Transformer encoders.
• We explore a graph-convolution Transformer as a feature extractor and a combination of
a feature pyramid and a local Transformer to locate MiGs, which are trained separately in a
two-stage way.</p>
      <p>• We achieve the first ranking in Track 2 of the MiGA2023 challenge for online recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Overall Architecture</title>
        <p>For spotting the micro-gestures, we propose a Transformer based network, which mainly
consists of four important components: graph-convolution Transformer, hierarchical feature
extractor, local Transformer and micro-gesture estimator. The overall architecture is shown in
Fig. 1. Given a long sequence, the framework outputs the temporal positions (the starting and
ending indexes in a long sequence) and categories of micro-gestures. In the graph-convolution
Transformer, motion features are extracted from the long sequence by the graph convolution
on hypergraphs and hyperedges. Then, the extracted features are further processed by the
hierarchical feature extractor to obtain multiscale encoding features, which are fed to the local
Transformer to model the correlation between frames within an inner window. Finally, the
interval of each micro-gesture is predicted by the decoupled classification and regression branches.</p>
        <fig id="fig1">
          <caption>
            <p>Figure 1: The overall architecture of the proposed network: (I) graph-convolution Transformer, (II) hierarchical feature extractor, (III) local Transformer, and (IV) micro-gesture estimator.</p>
          </caption>
        </fig>
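        <p>To make the data flow concrete, the following minimal PyTorch sketch composes the four stages of Fig. 1; the module names and interfaces are our own illustrative assumptions, not the released implementation.</p>
        <preformat>
import torch.nn as nn

# A minimal sketch of the four-stage pipeline of Fig. 1 (assumed module
# names; the actual components are described in Secs. 2.2-2.5).
class MiGSpottingNet(nn.Module):
    def __init__(self, backbone, pyramid, local_transformer, estimator):
        super().__init__()
        self.backbone = backbone                     # (I) graph-convolution Transformer
        self.pyramid = pyramid                       # (II) hierarchical feature extractor
        self.local_transformer = local_transformer   # (III) local Transformer
        self.estimator = estimator                   # (IV) decoupled cls/reg heads

    def forward(self, window):
        # window: a sliding window of skeleton clips from the long sequence
        feats = self.backbone(window)                # motion features, (B, T', C')
        levels = self.pyramid(feats)                 # one feature map per pyramid level
        encoded = [self.local_transformer(p) for p in levels]
        return [self.estimator(e) for e in encoded]  # (cls, reg) outputs per scale
        </preformat>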
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Graph-convolution Transformer</title>
        <p>
          The performance of micro-gesture online recognition depends on the ability to capture subtle
motion information in spatial and temporal dimensions. Therefore, the selection of backbone
model plays a critical role in determining the detection performance. In the field of image
processing, it is widely recognized that pretrained classification models can serve as the backbone
of downstream tasks to extract features, such as object detection. Drawing inspiration from this,
we also choose a video recognition model as the backbone of our proposed method. Although
any graph-convolution network can be used as the backbone, in order to effectively process
the 2D skeleton sequences, we exploit the hypergraph Transformer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for action
recognition as the backbone to represent the micro-gesture clip, as shown in Fig. 1 (I).
        </p>
        <p>The use of hypergraphs and hyperedges in the hypergraph Transformer (Hyperformer) allows
for a more comprehensive representation of the input data, enabling the model to capture subtle
nuances in the skeletal point relationships and structures that are crucial for accurate
micro-gesture detection. Currently, the hypergraph Transformer is primarily utilized for macro-action
recognition, while there exists a significant difference between micro-gestures and behavioral
actions. To address this issue, we train the Hyperformer on the iMiGUE dataset in
the first stage, using the sequence clips of the recognition task (Track 1) to learn the
parameters. The trained model is then utilized as a one-stage feature extractor to extract the
motion-aware features, which can be matched with various micro-gesture spotting networks
to achieve precise micro-gesture location. Given a pre-segmented clip with T frames, the
feature extracted by the trained Hyperformer from the input X ∈ R^(T×N×C) can be embedded as
F ∈ R^((T/8)×C'), where N and C represent the number and characteristic dimensions of skeletal
points, respectively. We use a fixed-length sliding window to solve the problem of
varying sequence lengths. Therefore, we concatenate the features of different small fragments
in the time dimension to obtain the motion features in a sliding window, which are fed into the
subsequent module. The concatenation operation is given by:
F_s = Concat(F_1, F_2, · · · , F_n)
(1)
where n is the number of clips in a sliding window, and F_s can be embedded as F_s ∈
R^((n×T/8)×C').</p>
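        <p>As a minimal sketch of Eq. (1), assuming the frozen Hyperformer maps each 8-frame clip to a (T/8)×C' feature, the window-level feature is a temporal concatenation of clip features (function names here are illustrative):</p>
        <preformat>
import torch

def window_features(hyperformer, clips):
    # clips: list of n skeleton clips inside one sliding window,
    # each shaped as the backbone expects
    feats = [hyperformer(c) for c in clips]   # each clip -> (T/8, C')
    return torch.cat(feats, dim=0)            # F_s: (n*T/8, C'), Eq. (1)
        </preformat>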
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Hierarchical Feature Extractor</title>
        <p>As micro-gestures of different durations exist in a long sequence, a hierarchical feature
pyramid is beneficial to capture different temporal window lengths (multiscale information).
The block module of the pyramid in our network, shown in Fig. 1 (II), is similar to the C3 module
in YOLOv5 (https://github.com/ultralytics/yolov5) but with some key differences. Specifically,
we utilize 1D convolution with a kernel size of 3 × 1 and a stride of 1, followed by layer
normalization and the SiLU activation function. In order to ensure that the model can capture
micro-gesture features with a short duration, the stride of the first layer in the feature pyramid
is set to 1, and those of the other layers are set to 2. Subsequently, we acquire the multiscale
features P_i through linear upsampling and concatenation operations, which enable the integration
of rich contextual information and enhance the representation ability of the features. Given
a feature F_s ∈ R^(T'×C') extracted from the previous module, the multiscale features can be
embedded as {P_i ∈ R^((T'/2^(i−1))×C'), i = 1, 2, 3, 4}.</p>
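        <p>A minimal sketch of one pyramid level under the stated settings (Conv1d with kernel size 3, layer normalization, SiLU; stride 1 at the first level and 2 afterwards). The exact C3-like block structure is not specified in the text, so this is an illustrative simplification:</p>
        <preformat>
import torch.nn as nn

class PyramidLevel(nn.Module):
    def __init__(self, channels, stride):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              stride=stride, padding=1)
        self.norm = nn.LayerNorm(channels)
        self.act = nn.SiLU()

    def forward(self, x):                                 # x: (B, T, C)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)  # temporal conv
        return self.act(self.norm(y))

# Stacking four levels with strides (1, 2, 2, 2) yields temporal lengths
# T', T'/2, T'/4, T'/8, matching P_i in Sec. 2.3 before upsampling.
        </preformat>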
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Local Transformer</title>
        <p>
          Since the occurrence of micro-gesture is often inseparable from the contextual frames, we
employ the attention mechanism to measure the similarity between frames and model the
dependency of frames. Transformer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is utilized for similarity modeling between frames,
but the traditional transformer with global attention mechanism may not be suitable for long
sequence. It is recognized that the temporal context beyond a certain range is less informative
for micro-gesture detection, and the global attention can introduce redundant information that
interferes with the analysis. So we utilize the local Transformer by limiting attention within
a local window [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which is shown in Fig. 1 (III). A series of overlapping local windows are
generated in the time dimension of  . Then we calculate self-attention in each window. Finally,
the embedding results of each window are concatenated in the time dimension to obtain a
comprehensive representation of the micro-gesture sequence.
        </p>
        <p>Given P_i ∈ R^(T_i×C'), P_i is projected to the encoded representations of Query (Q), Key (K),
and Value (V) by using W_Q ∈ R^(C'×d_k), W_K ∈ R^(C'×d_k), and W_V ∈ R^(C'×d_v), which are given by:
Q = P_i × W_Q, K = P_i × W_K, V = P_i × W_V
(2)
Then multi-head attention (MHA) is applied in each local window by the following operations:
MHA(Q_w, K_w, V_w) = Concat(head_1, . . . , head_h) W_O
(3)
head_j = softmax(Q_w,j (K_w,j)^T / √d_k) V_w,j
(4)
where h is the number of heads, W_O is the parameter matrix, and Q_w, K_w and V_w respectively
represent Q, K and V in the w-th local window. Then the MHA results of all windows are
concatenated in the time dimension to obtain the encoded result, which is given by:
E_i = Concat_w(MHA(Q_w, K_w, V_w))
where E_i ∈ R^(T_i×C') and Concat_w(·) denotes the concatenation of MHA results in the time
dimension.</p>
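        <p>A minimal sketch of Eqs. (2)-(4) using nn.MultiheadAttention. The window size 8 and overlap 4 follow Sec. 3.2; how overlapping outputs are merged is not detailed in the text, so this sketch simply concatenates them:</p>
        <preformat>
import torch
import torch.nn as nn

def local_self_attention(x, mha, window=8, overlap=4):
    # x: (T, C) features of one pyramid level;
    # mha: nn.MultiheadAttention(embed_dim=C, num_heads=h)
    step = window - overlap
    outs = []
    for s in range(0, x.size(0) - window + 1, step):
        win = x[s:s + window].unsqueeze(1)   # (window, 1, C): one local window
        out, _ = mha(win, win, win)          # Q = K = V inside the window
        outs.append(out.squeeze(1))
    return torch.cat(outs, dim=0)            # concatenate along time

# Example: mha = nn.MultiheadAttention(embed_dim=256, num_heads=8)
        </preformat>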
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Micro-gesture Estimator</title>
        <p>
Finally, the encoded features of the local Transformer are fed to the estimator module to predict the
location and category of micro-gestures. The estimator module consists of decoupled regression
and classification branches, which are shown in Fig. 1 (IV). The former predicts the distances
to the starting and ending frames of the micro-gesture at each point in the time dimension,
while the classification branch is responsible for identifying the category to which it belongs. In
order to obtain classification and regression related feature information, we employ the channel
attention mechanism and apply it before sending the features to the head. To prevent overfitting,
we enforce weight sharing among these attention layers. To further achieve accurate localization
of the gesture interval for micro-gesture detection, we adopt the approach proposed by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
which treats the regression problem as a distribution prediction problem to model uncertainty.
        </p>
        <p>Given an encoded feature E ∈ R^(T×C'), the output of the regression branch can be embedded
as O_reg ∈ R^(T×2), and that of the classification branch as O_cls ∈ R^(T×K), where K is
the number of categories of micro-gestures. In the second stage of model learning, the local
Transformer and the estimator are jointly trained on the training data of online recognition.</p>
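        <p>A minimal sketch of the estimator under the stated design choices (channel attention shared between branches, then decoupled 1×1 heads). The squeeze-and-excitation style attention is our assumption, as the exact attention layout is not specified:</p>
        <preformat>
import torch.nn as nn

class Estimator(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        # Channel attention shared by both branches to prevent overfitting.
        self.chan_attn = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        self.cls_head = nn.Conv1d(channels, num_classes, kernel_size=1)
        self.reg_head = nn.Conv1d(channels, 2, kernel_size=1)  # start/end distances

    def forward(self, x):                         # x: (B, T, C)
        w = self.chan_attn(x.mean(dim=1))         # (B, C) channel weights
        x = (x * w.unsqueeze(1)).transpose(1, 2)  # reweight, then (B, C, T)
        return self.cls_head(x), self.reg_head(x)  # O_cls: (B, K, T), O_reg: (B, 2, T)
        </preformat>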
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Metric</title>
        <p>
          Dataset. Micro-Gesture Understanding and Emotion analysis (iMiGUE) dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is employed
to evaluate our proposed method. The dataset contains micro-gestures of 32 categories from post-match press
conference videos of famous tennis players. The micro-gestures are annotated in 359 long
video sequences, which are captured as the RGB modality and 2D skeletal joints extracted by the
OpenPose algorithm. The 2D skeletal joints consist of a total of 137 key points, including 25
body points, 42 hand points, and 70 face points. In the MiGA2023 challenge, only 2D skeletal
points are allowed to be used as model input.
        </p>
        <p>Metric. The true positive (TP) per interval in one sequence is defined based on the intersection
between the spotted interval and the ground-truth interval. The spotted interval W_spotted is
considered as a TP if it fits the following condition:
(W_spotted ∩ W_gt) / (W_spotted ∪ W_gt) ≥ k
(5)
where k takes 0.3, and W_gt represents the ground truth of the micro-gesture interval
(onset-offset). The F1-score is then used to evaluate the performance of the model, which is given
by:
F1 = 2TP / (2TP + FP + FN)
(6)
where FP and FN represent the numbers of false positives and false negatives, respectively.</p>
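        <p>A minimal sketch of Eqs. (5)-(6), treating intervals as (onset, offset) pairs; the matching of spotted to ground-truth intervals is simplified here to the IoU test alone:</p>
        <preformat>
def interval_iou(a, b):
    # a, b: (onset, offset) frame indexes of two intervals
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def f1_score(tp, fp, fn):
    # Eq. (6): F1 computed from TP/FP/FN interval counts
    return 2 * tp / (2 * tp + fp + fn)

# A spotted interval counts as a TP when
# interval_iou(spotted, ground_truth) >= 0.3 (Eq. (5) with k = 0.3).
        </preformat>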
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Implementation Details</title>
        <p>
          The hypergraph Transformer model is first trained using pre-segmented micro-gesture data
from the iMiGUE dataset for a total of 200 epochs. Then, the fully connected layer
of the hypergraph Transformer is removed, and the model is utilized as the feature extractor of the
detection network. In Fig. 1 (I), the number of layers L in the graph-convolution Transformer is
10. The length of one clip fed into the feature extractor is 8, and the overlap value is 2. We set the
length of the sliding window to 512. In the local Transformer, the local window size is set to 8, and
the overlap value is set to 4. In the second stage, the local Transformer and the estimator are trained for
200 epochs with a cosine learning rate schedule and 5 warmup epochs. We use Adam as the
optimizer, where the initial learning rate is 1e-4. The mini-batch size is 32, and the weight
decay is 5e-4. Non-Maximum Suppression (NMS) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is used to remove duplicated predictions
and obtain the final results.
        </p>
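        <p>A minimal sketch of the temporal NMS step (the interval form of NMS [12]); the IoU threshold of 0.5 is our assumption, as the value is not given in the text:</p>
        <preformat>
def temporal_nms(proposals, iou_thr=0.5):
    # proposals: list of (onset, offset, score) predicted intervals
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for p in sorted(proposals, key=lambda p: p[2], reverse=True):
        # Keep a proposal only if it does not overlap a kept one too much.
        if not any(iou(p, q) >= iou_thr for q in kept):
            kept.append(p)
    return kept
        </preformat>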
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Results</title>
        <p>
          As no baseline approach has been reported in the past, we only report the final performance on
iMiGUE to evaluate the influence of two important components, i.e., the hypergraph Transformer
and the multiscale Transformer. Table 1 presents the performance of using other components to
replace these two components for micro-gesture online recognition. When HD-GCN
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] trained on iMiGUE is used as the feature extractor to observe the impact of the hypergraph
Transformer, the result indicates that this model achieves worse performance,
as it may be unable to accurately capture motion information. Subsequently, LSSNet [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which
has demonstrated strong performance in micro-expression detection, is selected as the detection
network to observe the impact of the multiscale Transformer. The comparison also suggests that
our model is more effective in inter-frame modeling.
        </p>
        <p>The visualization results of micro-gesture online recognition are demonstrated through two
sets of skeletal sequences in Fig. 2. The first set, shown in Fig. 2 (a), displays an accurately spotted
micro-gesture, while the second set, depicted in Fig. 2 (b), shows a micro-gesture that is
spotted inaccurately, with an IoU value of 0.15. The results show that our method still faces
challenges in accurately locating some samples.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we proposed a Transformer-based model for micro-gesture online
recognition, which integrates graph-convolution and multiscale Transformers. Our proposed method
achieved excellent performance on the iMiGUE dataset, but it is important to note that
micro-gesture online recognition is still in its early stages and there is much
room for improvement in terms of detection accuracy.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partly supported by the Natural Science Foundation of Chongqing (No.
CSTB2022NSCQ-MSX0977), and the Key Research and Development Program of Shaanxi (Nos.
2021ZDLGY15-01 and 2023-ZDLGY-12).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning</article-title>
          ,
          <source>IEEE Int. Conf. Automatic Face and Gesture Recognition</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vinciarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bourlard</surname>
          </string-name>
          ,
          <article-title>Social signal processing: Survey of an emerging domain</article-title>
          ,
          <source>Image Vis. Comput</source>
          .
          <volume>27</volume>
          (
          <year>2009</year>
          )
          <fpage>1743</fpage>
          -
          <lpage>1759</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>de Gelder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Van den Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K. M.</given-names>
            <surname>Meeren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B. A.</given-names>
            <surname>Sinke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Kret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tamietto</surname>
          </string-name>
          ,
          <article-title>Standing up for the body. Recent progress in uncovering the networks involved in the perception of bodies and bodily expressions</article-title>
          ,
          <source>Neurosci. Biobehav. Rev</source>
          .
          <volume>34</volume>
          (
          <year>2010</year>
          )
          <fpage>513</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>Int. J. Comput. Vision</source>
          <volume>131</volume>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Skeleton-based action recognition with shift graph convolutional network</article-title>
          ,
          <source>Proc. Computer Vision and Pattern Recognition</source>
          (
          <year>2020</year>
          )
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <article-title>Disentangling and unifying graph convolutions for skeleton-based action recognition</article-title>
          ,
          <source>Proc. Computer Vision and Pattern Recognition</source>
          (
          <year>2020</year>
          )
          <fpage>140</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-Q.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Keuper</surname>
          </string-name>
          ,
          <article-title>Hypergraph transformer for skeleton-based action recognition</article-title>
          ,
          <source>arXiv abs/2211.09590</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Proc. Int. Conf. Neural Inf. Process. Syst</source>
          .
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Micro-expression spotting with multi-scale local transformer in long videos</article-title>
          ,
          <source>Pattern Recognit. Lett</source>
          .
          <volume>168</volume>
          (
          <year>2023</year>
          )
          <fpage>146</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dayoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sunderhauf</surname>
          </string-name>
          ,
          <article-title>VarifocalNet: An IoU-aware dense object detector</article-title>
          ,
          <source>Proc. Computer Vision and Pattern Recognition</source>
          (
          <year>2021</year>
          )
          <fpage>8510</fpage>
          -
          <lpage>8519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>Proc. Computer Vision and Pattern Recognition</source>
          (
          <year>2021</year>
          )
          <fpage>10626</fpage>
          -
          <lpage>10637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Neubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <article-title>Efficient non-maximum suppression</article-title>
          ,
          <source>Proc. Int. Conf. Pattern Recognit</source>
          .
          <volume>3</volume>
          (
          <year>2006</year>
          )
          <fpage>850</fpage>
          -
          <lpage>855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Hierarchically decomposed graph convolutional networks for skeleton-based action recognition</article-title>
          ,
          <source>arXiv abs/2208.10741</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.-W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>LSSNet: A two-stream convolutional neural network for spotting macro- and micro-expression in long videos</article-title>
          ,
          <source>Proc. ACM Int. Conf. Multimedia</source>
          (
          <year>2021</year>
          )
          <fpage>4745</fpage>
          -
          <lpage>4749</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>