<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Micro-gesture Classification Based on Ensemble Hypergraph-convolution Transformer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hexiang Huang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>XuPeng Guo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Peng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhaoqiang Xia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Psychiatry and Behavioral Sciences, Stanford University</institution>
          ,
          <addr-line>California 94305</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Innovation Center NPU Chongqing, Northwestern Polytechnical University</institution>
          ,
          <addr-line>Chongqing 400000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Electronics and Information, Northwestern Polytechnical University</institution>
          ,
          <addr-line>Xi'an 710129</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Micro-gesture classification has emerged as a significant research area within emotion analysis and human-computer interaction, garnering increasing attention. While some skeleton-based action recognition algorithms utilizing graph convolution networks have shown competence in micro-gesture classification, these deep models still face challenges in representing subtle temporal actions and handling the long-tailed distribution of samples. To address these issues, this paper proposes a deep framework with ensemble hypergraph-convolution Transformers, which fuses multiple models focused on different categories. In this framework, Transformers with hypergraph-based attention are constructed and extended to enhance the representation ability of a single model. A data grouping, training, and ensemble scheme is then employed to handle the imbalanced micro-gesture categories, yielding a significant improvement in classification accuracy over single models. Finally, our model is evaluated on the iMiGUE dataset, achieving a Top-1 accuracy of 0.6302 and ranking second in the MiGA2023 Challenge (Track 1: Micro-gesture Classification).</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-gesture classification</kwd>
        <kwd>Long-tailed distribution</kwd>
        <kwd>Graph-convolution Transformer</kwd>
        <kwd>Ensemble model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Micro-gesture (MiG) classification refers to the process of identifying and categorizing small,
subtle movements that appear on the human face and body, such as eye blinks, facial expressions,
or hand gestures. The goal of automatic MiG classification is to accurately recognize and
interpret these subtle movements, which can provide valuable insights into the understanding
of a person’s thoughts, emotions, and intentions. Deep learning algorithms and computer vision
techniques [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] are commonly employed to accomplish this task, finding frequent application
in areas like human-computer interaction, emotion recognition, and biometric identification.
      </p>
      <p>
        Due to the progress of deep learning techniques for action recognition [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], the deep models
have also been utilized to recognize the categories of MiG with the data of RGB and skeleton
modalities. In the early works for MiG [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], the RGB based methods, e.g., the temporal
segmentation network (TSN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the skeleton based methods, e.g., spatio-temporal graph
convolution network (ST-GCN) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], originally designed for action recognition, have been applied as benchmarks to
evaluate the performance of recognizing MiGs. Although the RGB modality can
provide more information about MiGs, identity-privacy concerns restrict its application.
Therefore, this study focuses on the MiG task using the skeleton modality.
      </p>
      <p>
        Despite the dataset having been released for two years, there are currently limited reported
works on skeleton-based MiG classification, owing to the challenges of modeling subtle motions
from the skeleton. However, graph convolutional networks (GCNs) are commonly used in the
task of skeleton-based action recognition [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. GCN is a graph-based representation learning
method originally designed for key-point classification tasks. In applications, the relationships
between different types of key points and edges in the graph need to be modeled, and these
relationships can be very complicated. In this case, using a standard graph structure becomes less
appropriate, because high-order semantic correlations can be far more complicated than the binary
relationships modeled by such a graph. In contrast, hypergraphs [10, 11] provide more flexible and
richer representation capabilities, which can be used to represent multiple relationships between
key points of different types; the hyperedges can construct complex relationships between key
points of different types. By mapping the key points and edges in the hypergraph to a
low-dimensional vector space, the graph neural network can not only improve its training
capabilities but also enhance its reasoning process, thus providing a GCN with stronger and more
comprehensive representation capabilities. Therefore, in order to capture the potential relationships
between the key points of the human skeleton, self-attention (SA) based on a hypergraph
[12] (called HyperSA) was proposed within a Transformer encoder, combining the Transformer
[13] with the skeleton to measure both pairwise and higher-order relationships, and was applied to
skeleton-based action recognition.
      </p>
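      <p>To make the hypergraph idea concrete, the toy sketch below (our illustration, not part of the original formulation) builds an incidence matrix that groups a reduced set of key points into overlapping part-level hyperedges and aggregates point features into hyperedge features; the grouping, feature size, and aggregation rule are assumptions for illustration only.</p>
      <preformat>
import numpy as np

# Hypothetical toy grouping: 8 key points covered by 3 overlapping hyperedges
# (torso, left arm, right arm). Column e of H marks the points that hyperedge e covers.
num_points = 8
hyperedges = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [1, 4, 5],
    "right_arm": [2, 6, 7],
}

H = np.zeros((num_points, len(hyperedges)))   # incidence matrix (points x hyperedges)
for e, members in enumerate(hyperedges.values()):
    H[members, e] = 1.0

D_e = np.diag(H.sum(axis=0))                  # hyperedge degree matrix
X = np.random.randn(num_points, 16)           # per-point features
E = np.linalg.inv(D_e) @ H.T @ X              # aggregate point features into hyperedge features
print(H.shape, E.shape)                       # (8, 3) (3, 16)
      </preformat>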
      <p>To capture the complicated relationships between different skeleton points from the face and
body for MiGs, we extend the HyperSA by enhancing the self-attention weight with the
relationships of hyperedges, which reorganizes the four parts of the SA module into different
branches. These branches are integrated during the learning process, and the results obtained
from this integration address the issue of insufficient learning from a single branch. Furthermore,
since data collected from real-world scenarios often exhibit an imbalanced, or long-tailed,
distribution, a single model trained on such unbalanced data tends to produce predictions biased
toward the head categories, resulting in poorer performance on the tail categories. To overcome
this problem, inspired by the data partitioning concept proposed by Cai et al. [14], we propose
to partition the training data into two overlapping subsets and ensemble several independently
trained models. The main contributions of this paper can be summarized as:
• We design a deep framework of ensemble hypergraph-convolution Transformer (EHCT)
for the task of MiG classification.
• We extend the HyperSA by enhancing the hyperedges of the SA module to promote the
representation ability for MiGs.
• We leverage ensemble strategies to combine several independent models to weaken
the impact of imbalanced data.
• We perform extensive experiments and achieve the second ranking in Track 1 of the
MiGA2023 Challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>The main framework of our proposed method (EHCT) is shown in Fig. 1 (a). In the
framework, we design two classifiers, namely the main classifier and the auxiliary classifier, built on the
same-architecture base model of the hypergraph-convolution Transformer (HCT), to promote the
discrimination ability and mitigate the long-tailed distribution of the data. In the base model, the
attention weight between key points and hyperedges is enhanced (eHyperSA) by considering
the relationships between individual key points in the body, face, and left and right hands. The
details of the three important components are described in the following subsections.</p>
      <sec id="sec-2-1">
        <title>2.1. Hypergraph-convolution Transformer</title>
        <p>As shown in Fig. 1 (b), the self-attention layer combined with the temporal convolution
layer in the HCT is the basic block and is stacked for L layers [12]. The skeletal input X =
{x_1, x_2, ..., x_137} comprises the key points extracted from a single frame, including those
pertaining to the body, face, and left and right hands. Each key point is presented in 2D format
as x_i = (x, y, c), where (x, y) are the image coordinates and c is the confidence score, following
the protocol of OpenPose [15]. According to the self-attention mechanism [16], a linear
transformation is applied to the input X through multiplication with three weight matrices, resulting in the
derivation of the matrices Q, K, and V.</p>
        <p>In the self-attention module of HCT shown in Fig. 1 (c), the hyperedge feature E of the
hypergraph is constructed by Eq. 1:</p>
        <p>E = D^{-1} H^T X W,   (1)
where H represents the incidence matrix of key points and hyperedges. In the matrix H, each
row represents a key point and each column represents a hyperedge. D is the diagonal matrix
representing the degree matrix of hyperedges, and W represents the projection matrix of
hyperedges. Based on the hyperedge feature E, we extend the self-attention in our model
(eHyperSA), which is expressed as follows:
A = A_a + A_b + A_c + A_d,   (2)
where A_a denotes the attention between key points, A_b the attention between key points and
hyperedges, A_c the relative positional embedding, and A_d the attention between hyperedges.</p>
        <p>In the eHyperSA, the basic attention (components A_a and A_b) and the relative positional
embedding (component A_c) are used in a similar way to [17, 12]. In contrast to the vanilla
HyperSA [12], the component A_d in the above equation is newly added; it considers the inner
product of the hyperedge feature matrix E, improving the attention between hyperedges.</p>
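        <p>The following sketch (our illustration) assembles an attention map with the four components named above. The exact operand forms, the scaled dot products, the broadcast of hyperedge features back to key points through H, and the placeholder relative-position term R are our assumptions rather than the released eHyperSA implementation.</p>
        <preformat>
import torch
import torch.nn.functional as F

def ehyper_sa_scores(X, H, Wq, Wk, We, R):
    """Toy eHyperSA scores. X: (N, d) key-point features, H: (N, M) incidence matrix,
    Wq/Wk/We: (d, d) projections, R: (N, N) relative-position term (assumed given)."""
    d = X.shape[-1]
    Q, K = X @ Wq, X @ Wk                        # queries and keys for key points
    De = torch.diag(H.sum(dim=0))                # hyperedge degree matrix
    E = torch.linalg.inv(De) @ H.T @ X @ We      # hyperedge features, as in Eq. 1
    Ej = H @ E                                   # copy each hyperedge feature to its key points
    A_a = Q @ K.T                                # (A_a) attention between key points
    A_b = Q @ Ej.T                               # (A_b) attention between key points and hyperedges
    A_c = R                                      # (A_c) relative positional embedding
    A_d = Ej @ Ej.T                              # (A_d) attention between hyperedges (newly added)
    return F.softmax((A_a + A_b + A_c + A_d) / d ** 0.5, dim=-1)

N, d, M = 137, 64, 8
X, R = torch.randn(N, d), torch.zeros(N, N)
H = torch.zeros(N, M)
H[torch.arange(N), torch.arange(N) % M] = 1.0    # every key point assigned to one hyperedge
Wq, Wk, We = (torch.randn(d, d) * 0.02 for _ in range(3))
print(ehyper_sa_scores(X, H, Wq, Wk, We, R).shape)   # torch.Size([137, 137])
        </preformat>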
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Main Classifier</title>
        <p>Given that a single base model may potentially under-weight some of the components in Eq. 2,
in the interest of enhancing their weight and efficacy in the classification task, the main classifier
explores multiple base models (HCTs) as multiple branches and integrates them directly.</p>
        <p>[Fig. 1. (a) The EHCT framework: data in all categories train the main classifier and data in the tail categories (after division) train the auxiliary classifier, each built from multiple HCT branches followed by Linear and Softmax layers that produce the prediction and the auxiliary prediction. (b) The HCT block: an eHyperSA spatial module, a temporal convolution module, and global average pooling. (c) The eHyperSA module: Q, K, V with MatMul and Softmax operations.]</p>
        <p>To execute the multi-branch integration, each branch in the main classifier emphasizes the
primacy of components A_a and A_b while selectively incorporating components A_c and A_d. The
corresponding attention of the i-th branch is shown in Eq. 3:
A_i = A_a + A_b + Σ_{T ∈ S_i} T,  with S_i ⊆ {A_c, A_d},  i = 1, ..., n,   (3)
where the parameter n denotes the number of branches.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Auxiliary Classifier</title>
        <p>The matrix V is multiplied by each branch's attention map computed by the Softmax
function, and the integration within each branch is performed as follows:
o_i = Linear(GAP(Softmax(A_i) V)),   (4)
where GAP denotes global average pooling and Linear denotes the classification layer. The final
output is obtained by taking the average of the output logits o_i of the n branches, which is
shown in Eq. 5:
o = (1/n) Σ_{i=1}^{n} o_i.   (5)</p>
        <p>
          The data used in the task of MiG classification usually exhibit an imbalanced distribution across
the different categories, i.e., a long-tailed distribution, e.g., in the iMiGUE dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In order
to mitigate the adverse impact of this sample imbalance, inspired by GoogLeNet [18] and ACE
[14], we design an auxiliary classifier.
        </p>
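        <p>A minimal sketch of Eqs. 4-5 under assumed tensor shapes (the pooling axis and the linear classification layer are placeholders we introduce for illustration).</p>
        <preformat>
import torch
import torch.nn.functional as F

def branch_logits(A_i, V, fc):
    """Eq. 4 sketch. A_i: (N, N) branch attention scores, V: (N, d) values,
    fc: linear layer mapping d features to class logits."""
    out = F.softmax(A_i, dim=-1) @ V        # attend over key points
    pooled = out.mean(dim=0)                # global average pooling over key points
    return fc(pooled)                       # per-branch logits o_i

N, d, num_classes, n_branches = 137, 64, 32, 4
fc = torch.nn.Linear(d, num_classes)
V = torch.randn(N, d)
branch_scores = [torch.randn(N, N) for _ in range(n_branches)]
o = torch.stack([branch_logits(A, V, fc) for A in branch_scores]).mean(dim=0)   # Eq. 5
print(o.shape)                              # torch.Size([32])
        </preformat>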
        <p>The data are divided, based on the number of instances per category, into two groups,
namely the head and tail categories. Subsequently, all data instances corresponding
to the tail categories are extracted, and the same number of instances as in the tail categories is
randomly selected from the head categories to form the tail training set. In this tail training set,
the labels of the instances selected from the head categories are reassigned to a single other category,
while the labels of the tail categories are one-to-one mapped from the original labels in the dataset.</p>
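        <p>A sketch of this data grouping is given below; the 1/50 threshold from Section 3.2, the random head sampling, and the label bookkeeping are our reading of the procedure, not the released training script.</p>
        <preformat>
import random
from collections import Counter

def build_tail_training_set(samples, labels):
    """Build the auxiliary (tail) training set: all tail-category samples plus an
    equal number of random head-category samples relabelled as a single 'other' class."""
    counts = Counter(labels)
    max_count = max(counts.values())
    tail_classes = {c for c, n in counts.items() if max_count >= 50 * n}   # roughly the 1/50 rule

    tail_idx = [i for i, y in enumerate(labels) if y in tail_classes]
    head_idx = [i for i, y in enumerate(labels) if y not in tail_classes]
    head_pick = random.sample(head_idx, min(len(tail_idx), len(head_idx)))

    # Tail labels are remapped one-to-one to new ids; every head pick becomes 'other'.
    remap = {c: i for i, c in enumerate(sorted(tail_classes))}
    other_id = len(remap)
    subset = tail_idx + head_pick
    new_labels = [remap.get(labels[i], other_id) for i in subset]
    return [samples[i] for i in subset], new_labels, remap

samples = list(range(195))
labels = [0] * 150 + [1] * 40 + [2] * 3 + [3] * 2     # classes 2 and 3 are tails under the 1/50 rule
subset, new_labels, remap = build_tail_training_set(samples, labels)
print(len(subset), sorted(set(new_labels)), remap)    # 10 [0, 1, 2] {2: 0, 3: 1}
        </preformat>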
        <p>With the logits from the main classifier and the auxiliary classifier, the two outputs are
combined as follows:
z = o + α · other{o_aux} + β · tail{o_aux},   (6)
where the hyperparameter α denotes the weight with which the logits of the auxiliary classifier,
when it predicts the other category, are accumulated into the logits of the main classifier, and
the hyperparameter β denotes the weight with which the logits of the auxiliary classifier,
when it predicts a tail category, are accumulated into the logits of the main classifier through the
mapping relationship. The final prediction is obtained from the following equation:
ŷ = argmax(Softmax(z)).   (7)</p>
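        <p>A sketch of the logit fusion in Eqs. 6-7 follows; the layout of the auxiliary output, the rule for spreading the other-category logit, and the default weights are assumptions for illustration.</p>
        <preformat>
import torch

def fuse_logits(o_main, o_aux, tail_to_orig, alpha=0.5, beta=0.5):
    """o_main: (C,) main-classifier logits; o_aux: (T+1,) auxiliary logits with the
    tail categories first and 'other' last; tail_to_orig maps auxiliary tail index
    to the original category id."""
    z = o_main.clone()
    aux_pred = int(torch.argmax(o_aux))
    if aux_pred == len(tail_to_orig):                     # auxiliary classifier predicts 'other'
        z = z + alpha * o_aux[aux_pred]                   # weight alpha (assumed: spread onto all logits)
    else:                                                 # auxiliary classifier predicts a tail category
        z[tail_to_orig[aux_pred]] += beta * o_aux[aux_pred]   # map it back with weight beta (Eq. 6)
    return z, torch.argmax(torch.softmax(z, dim=0))       # Eq. 7

o_main = torch.randn(32)
o_aux = torch.randn(7)                                    # 6 tail categories plus 'other'
z, pred = fuse_logits(o_main, o_aux, tail_to_orig=[3, 7, 11, 19, 25, 30])
print(int(pred))
        </preformat>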
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        In this section, we evaluate our model on the iMiGUE dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] by following the protocol of
MiGA2023 Challenge (Track 1: Micro-gesture Classification). The dataset, metrics, ablation
study and comparison experiments are reported in the following sections.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Metrics</title>
        <p>
          In this challenge, the iMiGUE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] dataset with fixed training and test samples is used to evaluate
our proposed method. This dataset includes a total of 32 categories of MiGs, and covers two
emotions as well as 72 subjects with each gender accounting for half of the total number of
subjects. It consists of 18,499 samples taken from 359 videos with a resolution of 1280 × 720.
Each video is about 0.5-25.8 minutes long. Since the iMiGUE dataset is collected in an in-the-wild
setting, the overall dataset presents a long-tailed (unbalanced) distribution.
        </p>
        <p>To evaluate the classification performance of our model, we employ Top-1 accuracy and
Top-5 accuracy as evaluation metrics; the equations of the metrics are as follows:
Top-1 = (1/N) Σ_{i=1}^{N} [argmax(P(y|x_i)) = y_i],   (8)
Top-5 = (1/N) Σ_{i=1}^{N} [y_i ∈ Top5(P(y|x_i))],   (9)
where N denotes the number of samples, x_i denotes the feature of the i-th sample, y_i denotes
the true label of the i-th sample, P(y|x_i) denotes the probability distribution obtained from
the model's predictions for the i-th sample, and Top5 denotes the top five categories with the
highest probabilities.</p>
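        <p>For completeness, the two metrics can be computed as in the sketch below; the array shapes and the random inputs are illustrative only.</p>
        <preformat>
import numpy as np

def topk_accuracy(probs, labels, k):
    """probs: (N, C) predicted class probabilities; labels: (N,) true class ids."""
    topk = np.argsort(probs, axis=1)[:, ::-1][:, :k]       # k most probable classes per sample
    hits = np.any(topk == labels[:, None], axis=1)         # indicator from Eqs. 8-9
    return hits.mean()

probs = np.random.rand(100, 32)
probs = probs / probs.sum(axis=1, keepdims=True)           # toy prediction distributions
labels = np.random.randint(0, 32, size=100)
print("Top-1:", topk_accuracy(probs, labels, 1))
print("Top-5:", topk_accuracy(probs, labels, 5))
        </preformat>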
        <p>In our experiments, the key parameter settings are configured as follows: 150 training epochs,
a batch size of 8, an initial learning rate of 0.0005, and a learning rate decay rate of 0.1. In Fig.
1 (b) HCT, the number of stacked layers L is set to 10. All experiments are performed on an
NVIDIA GeForce RTX 4090.</p>
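        <p>The stated training hyperparameters could be wired up roughly as below; the optimizer choice, the backbone stand-in, and the decay milestones are our assumptions, since only the values themselves are listed here.</p>
        <preformat>
import torch

model = torch.nn.Linear(64, 32)                            # stand-in for the HCT backbone (assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)  # initial learning rate 0.0005
scheduler = torch.optim.lr_scheduler.MultiStepLR(           # decay rate 0.1; milestone epochs assumed
    optimizer, milestones=[90, 120], gamma=0.1)
epochs, batch_size, num_layers = 150, 8, 10
        </preformat>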
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Ablation Study</title>
        <p>First, in order to verify the effectiveness of the various parts of the self-attention mechanism based
on the skeletal structure of the human body, we conduct a series of ablation experiments,
and the specific results are reported in Table 1.</p>
        <p>We use the vanilla Hyperformer model as the baseline and remove the relative position
encoding (component A_c) and the bias from the four components of the attention module in the vanilla HyperSA
to observe the role of each component. From the results, it can be observed that, compared
to the baseline, when we remove both the relative position encoding and the bias, the Top-1
accuracy is improved by 0.82%. When we remove only the relative position encoding or only the
bias, the Top-1 accuracy is improved by 1.34% and 1.56%, respectively. Therefore,
we believe that the relative position encoding and the bias in HyperSA may not have
significant effects on attention extraction.</p>
        <p>Next, in order to further improve the accuracy of the model, we replace the original bias
with the current component A_d, which is the attention between hyperedges obtained through
the inner product of the hyperedge features. By doing this, the Top-1 accuracy of the model is
increased by 1.78% compared to the baseline, indicating that the attention between hyperedges
achieves a significant effect on MiGs.</p>
        <p>A single model may overfit during training: its performance may be good on the training set
but degrade when facing new data. Moreover, when the dataset is complex, a single model often
cannot learn global patterns. Therefore, we ensemble multiple models that use different attention
components during training to improve the generalization and robustness over a single model.</p>
        <p>We employ ensemble learning with three branches, which improves the Top-1 accuracy by
3.67% compared to the baseline. To further enhance the attention weights between key points
(component A_a) and between key points and hyperedges (component A_b), we add a branch that
only utilizes components A_a and A_b, resulting in a four-branch ensemble approach.
This further improves the Top-1 accuracy by 4.37% compared to the baseline.</p>
        <p>Furthermore, we select all categories whose instance counts are at most 1/50 of the maximum instance
count as the tail categories. At the same time, we select head-category instances at a ratio of
approximately 1:1 to merge with the tail categories and construct an independent training set. On
this training set, an auxiliary classifier that uses all attention components is trained. The
model with the auxiliary classifier achieves a Top-1 accuracy of 63.02%, an improvement
of 6.01% over the baseline.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Comparison to State-of-the-art Methods</title>
        <p>
          Our proposed technique is also examined through a comparative analysis on the iMiGUE dataset,
which is shown in Table 2. We compare our proposed method with state-of-the-art methods,
including 2D CNNs with RGB data, GCNs with skeleton data,
and Transformers with skeleton data. Compared to the MS-G3D [19] method, which utilizes a 3D
GCN on skeleton data, our method demonstrates an improvement of 8.11% in Top-1 accuracy
and 1.38% in Top-5 accuracy. In comparison with the TSM [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] method, which employs 2D
convolutional networks on RGB data, our method improves Top-1 accuracy by 1.92% and Top-5
accuracy by 0.12%. Compared to the Hyperformer [12] method, which uses a Transformer on
skeleton data, our method shows a significant improvement of 6.01% in Top-1 accuracy and
3.5% in Top-5 accuracy. It is observed that our proposed method (EHCT) outperforms the other
methods, achieving the best performance on the iMiGUE dataset.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In conclusion, this paper introduces a deep framework that utilizes ensemble models based
on the hypergraph-convolution Transformer for MiG classification from human skeleton data.
The skeleton is organized by the proposed hypergraphs, which enable the capture of complex
correlations. By enhancing the attention mechanism and employing multi-model fusion techniques, the
proposed method effectively extracts subtle dynamic features from different gestures. As a
result, our designed model surpasses the state-of-the-art performance on the iMiGUE dataset,
demonstrating its effectiveness in the accurate classification of human skeleton data.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partly supported by the Natural Science Foundation of Chongqing (No.
CSTB2022NSCQ-MSX0977) and the Key Research and Development Program of Shaanxi (Nos.
2021ZDLGY15-01 and 2023-ZDLGY-12).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Q.</given-names>
            <surname>Khor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Revealing the invisible with model and data shrinking for composite-database micro-expression recognition</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          (
          <year>2020</year>
          )
          <fpage>8590</fpage>
          -
          <lpage>8605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>Micro-expression spotting with multi-scale local transformer in long videos</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          (
          <year>2023</year>
          )
          <fpage>146</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Tsm: Temporal shift module for efficient video understanding</article-title>
          ,
          <source>International Conference on Computer Vision</source>
          (ICCV) (
          <year>2019</year>
          )
          <fpage>7082</fpage>
          -
          <lpage>7092</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Mix dimension in poincaré geometry for 3d skeleton-based action recognition</article-title>
          ,
          <source>ACM International Conference on Multimedia (ACM MM)</source>
          (
          <year>2020</year>
          )
          <fpage>1432</fpage>
          -
          <lpage>1440</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>SMG: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2021</year>
          )
          <fpage>10626</fpage>
          -
          <lpage>10637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <article-title>Temporal segment networks for action recognition in videos</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2019</year>
          )
          <fpage>2740</fpage>
          -
          <lpage>2755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Spatial temporal graph convolutional networks for skeleton-based action recognition</article-title>
          ,
          <source>AAAI Conference on Artificial Intelligence (AAAI)</source>
          (
          <year>2018</year>
          )
          <fpage>7444</fpage>
          -
          <lpage>7452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Graph convolutional neural network for human action recognition: A comprehensive survey</article-title>
          ,
          <source>IEEE Transactions on Artificial Intelligence</source>
          (
          <year>2021</year>
          )
          <fpage>128</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Y. Feng, H. You, Z. Zhang, R. Ji, Y. Gao, Hypergraph neural networks, AAAI Conference on Artificial Intelligence (AAAI) (2019) 3558-3565.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Bai, F. Zhang, P. H. S. Torr, Hypergraph convolution and hypergraph attention, Pattern Recognition (2021) 107637.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Zhou, C. Li, Z.-Q. Cheng, Y. Geng, X. Xie, M. Keuper, Hypergraph transformer for skeleton-based action recognition, arXiv abs/2211.09590 (2022).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, T.-Y. Liu, Do transformers really perform bad for graph representation?, Neural Information Processing Systems (NeurIPS) (2021).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Cai, Y. Wang, J.-N. Hwang, Ace: Ally complementary experts for solving long-tailed recognition in one-shot, International Conference on Computer Vision (ICCV) (2021) 112-121.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, OpenPose: Realtime multi-person 2d pose estimation using part affinity fields, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) 172-186.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, Neural Information Processing Systems (NeurIPS) (2017).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. Wu, H. Peng, M. Chen, J. Fu, H. Chao, Rethinking and improving relative position encoding for vision transformer, International Conference on Computer Vision (ICCV) (2021) 10013-10021.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 1-9.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Z. Liu, H. Zhang, Z. Chen, Z. Wang, W. Ouyang, Disentangling and unifying graph convolutions for skeleton-based action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 140-149.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] B. Zhou, A. Andonian, A. Torralba, Temporal relational reasoning in videos, European Conference on Computer Vision (ECCV) (2018) 831-846.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>