<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Representation Learning for Topology-adaptive Micro-gesture Recognition and Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Atif Shah</string-name>
          <email>atif.shah@oulu.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoyu Chen</string-name>
          <email>chen.haoyu@oulu.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guoying Zhao</string-name>
          <email>guoying.zhao@oulu.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Machine Vision and Signal Analysis (CMVS), University of Oulu</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Human-to-human communication is greatly influenced by micro-gestures. The actions of a person inherently reveal information about their true sentiments and potential intentions. Micro-gestures are non-verbal cues that indicate a person's true feelings and intentions; however, they become more challenging to recognize than normal gestures because micro-gestures are subtle and appear for milliseconds. In this work, we propose a graph-encoding convolutional network to extract intrinsic joint representations from skeletons using a self-attention graph convolution module in the spatial domain. The multi-scale temporal convolution module extracts the temporal representation in the time domain and sends it to the classification module to recognize micro-gestures. We evaluate the proposed framework using two micro-gesture datasets, SMG and iMiGUE, and achieve state-of-the-art results.</p>
      </abstract>
      <kwd-group>
        <kwd>Analysis</kwd>
        <kwd>Micro-gestures (MG)</kwd>
        <kwd>Skeleton-based-method</kwd>
        <kwd>Graph convolutional networks</kwd>
        <kwd>Emotion recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Micro-gestures (MGs) play an important role in human-to-human communication [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. The
person’s acts aid in understanding by naturally revealing information such as actual feelings
and possible intentions. MGs are subtle movements that reveal a person’s actual emotions and
intentions. Recently, there has been a lot of interest in empowering intelligent machines with
the same capabilities for understanding human behaviors [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], which is essential for natural
human-computer interaction and many other useful applications [6].
      </p>
      <p>In recent times, modern sensor technology and human position estimation algorithms have
made it considerably simpler to extract human 2D and 3D skeletons [7, 8]. Skeletons are
tempting for MG recognition tasks because they are small in size, robust, and resistant to
changes in viewpoint and cluttered backgrounds [9, 10]. Skeletons are commonly used for
action and MG recognition via Graph convolutional networks (GCNs) [11, 12]. GCNs are an
ideal technique for extracting topological data from skeletons due to their lightweight nature
and the fact that joints and bones naturally form graphs in the human body.</p>
      <p>However, there are still some limitations with skeleton data. One notable issue is the lack
of critical interactive elements and contextual information within the skeleton representation,
which makes it dificult to distinguish between comparable activities.</p>
      <p>PE</p>
      <p>Linear</p>
      <p>Global avg
pooling
MS-TC
S-GC</p>
      <p>Z
Linear
Softmax</p>
      <p>Conv
Self-attenion graph conv</p>
      <p>Multi-head
self-attention
Input</p>
      <p>Output
Contextdependent
topology
Sharedtopology
Embedding
module</p>
      <p>Encoding
module</p>
      <p>Classification
module</p>
      <p>Self-attention graph
convolution (S-GC)</p>
      <p>To address the aforementioned issue, we present a Graph-Encoding Convolutional Network
(GE-CN) with an attention layer that can learn spatial-temporal contextual representations to
recognize MGs. The proposed framework consists of the embedding, encoding and classification
module as shown in Figure 1. The embedding module takes input skeleton sequences, integrates
them with the position embedding of joints and passes the skeleton representation for feature
extraction to the embedding module. The embedding module consists of a self-attention module
that extracts spatial representation and captures intrinsic relationships among joints in the
spatial domain and a multi-scale temporal module to extract temporal features. The learned
latent representations are forwarded to the classification module to recognize the learned gesture
categories. In summary, we have the following contributions:
• We proposed a Graph-Encoding Convolutional Network (GE-CN) framework for various
skeleton topologies using a self-attention graph convolution.
• We analyze the insight of latent space encoding by using the distribution visualization to
ifnd the semantic features.</p>
      <p>• We achieved state-of-the-art results on the two datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Recently many researchers have employed skeleton-based methods for action and gesture
recognition, which are classified as unsupervised and supervised methods.</p>
      <sec id="sec-2-1">
        <title>2.1. Skeleton-based unsupervised methods</title>
        <p>Zheng et al. [13] utilized unsupervised representation learning to capture global motion
dynamics. The generative adversarial network (GAN) is used as an encoder-decoder to model
motion dynamics and acquire discriminative features for action recognition. Similarly, Su
et al. [14] leverage encoder-decoder recurrent neural networks (RNN) for learning features
for action recognition. Lin et al. [15] used an unsupervised Bidirectional-Gated Recurrent
Unit (Bi-GRU) encoder to learn more generalized representations by combining jigsaw puzzles,
motion prediction, and contrastive learning. Li et al. [16] utilized data without labels for
learning view-invariant action and predicting 3D motion by combining RGB and depth images.
Many researchers have also utilized contrastive learning; likewise, Zhou et al. [17] added a
contrastive learning module to the framework to distinguish between confident and ambiguous
action samples. Lin et al. [18] Proposed actionlet-based contrastive learning method where a
motion-adaptive transformation strategy was designed to learn semantic consistency in actionlet
regions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Skeleton-based supervised methods</title>
        <p>In skeleton data, joints and bones naturally form graphs in the human body; therefore, most
researchers adopted GCNs. Yan et al. [11] presented a spatial-temporal GCN (STGCN) to capture
complex features from a human skeleton. Long short-term memory (LSTM) was used by Si et
al. [19] with the conjunction of convolutional networks to improve performance by employing
more discriminatory spatial and temporal features. Liu et al. [20] proposed a multiscale spatial
graph convolutional operator (MSG3D) to disentangle the skeleton, which removes redundant
features from the skeleton and aggregates efective graph relationships. Shi et al. [ 21] proposed
a two-stream adaptive graph convolutional network (2s-AGCN) for action recognition. Chi et al.
[22] presented the Infogcn framework, which consists of self-attention-based graph convolution
and representation learning to capture complementary features for action recognition.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Framework Architecture</title>
        <p>The GE-CN framework contains an embedding module followed by a stack of encoding modules
and a global average pooling layer and a classification module. The embedding module takes
a sequence of skeleton representation, converts it to initial joint representation and forwards
it to the encoder module, to capture complex spatio-temporal features. A reparametrization
step is employed that is commonly used in variational autoencoders [23]. A random noise ( ) is
introduced, followed by a normal distribution. The mean ( ) value is then added to the product
of a diagonal covariance matrix to sample a variable named  . This approach enables us to find
unbiased gradients and eficiently tune the model via gradient-based optimization strategies.
A classifier is added at the end with a single linear layer and soft-max function to convert the
learned features into class distribution categories.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Embedding module</title>
        <p>The skeleton is represented as a graph ( , ) , with  joints denoted as  and bones as edges  .
Edges are represented as  ×  dimensional adjacency matrix  , where element  , reflects
the link between joints  and  ,  , value is 1 if  and  joints are connected, otherwise 0. The
combination of skeletons makes a sequence of skeleton graphs, which is represented as a joint
feature  ∈   × × , where  shows the number of frames and  denotes the feature dimensions.</p>
        <p>The input joint representations are linearly transformed by the embedding module into  (0)
dimensional vectors while adding the positional embedding ( )
to accommodate the joint
position information. The PE is transferable and learnable across many temporal channels.
where</p>
        <p>(0) represents a hidden layer,   ∈</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Encoding module</title>
        <p>(0)
= (</p>
        <p>) +  
 × (0)
, (.) is a linear function and  is time index.</p>
        <p>The encoder module consists of two modules: Self-attention Graph Convolution (S-GC) to extract
spatial representations and Multi-Scale Temporal Convolution (MS-TC) to extract temporal
features from the skeleton. The input joints are sequentially encoded via S-GC and MS-TC,
followed by normalization. The graph convolution updates the hidden representation via the
following rules:
  (+1) = Θ( ̄

()
 () )
where  ̄ is the normalized adjacency matrix,  is the learned matrix and Θ(.)is the nonlinear
activation function. The S-GC uses the self-attention of joint features to determine the intrinsic
topology and employs the topology as a source of neighborhood vertex information for graph
convolution. The self-attention is a mechanism that links several joints of the body. S-GC takes
into account all the possible relations and estimates positive and bounded weights which are
known as self-attention maps that indicate how strong the connection is between joints. To get
self-attention maps the following mathematical definition is used:
(  ) =  (
    (    )
√
′

)
where   ,   are learnable matrics of 
S-GC learns a topology  ̈ shared over time. The self-attention map and shared topology employ
M multi-heads to enable the model to attend multiple representation subspaces at the same
time. To achieve intrinsic topology, the shared topology and self-attention maps are integrated.
′ dimensions. Aside from the self-attention map, the
(1)
(2)
(3)
(4)
(5)
 ̈ ⊗   (  ) ∈   × ×
where ⊗ is the broadcasted element-wise product. S-GC utilizes the equation 4 as neighborhood
information, and the overall joint representation update rules are formulated as [22]


  (+1) = Θ (∑( ̈() ⊗   ( 
()
))

()
  () )</p>
        <p>After extracting the spatial intrinsic features the MS-TC block extracts the temporal features
using three parallel convolution branches with various kernel sizes.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>The proposed GE-CN framework is evaluated using two MG datasets. Both datasets contain
skeleton data, which is utilized in this work. In this work, we used an SGD optimizer with
a 0.9 moment coeficient, a frame size of 64 and a batch size of 32. We used maximum-mean
discrepancy loss borrowed from [22] and used the same experimental settings.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>
          Spontaneous Micro-Gesture (SMG) dataset [
          <xref ref-type="bibr" rid="ref1 ref3">3, 1</xref>
          ] contains 3,692 samples with 17 MGs. The
dataset is obtained from 40 individuals using Kinect [8] and 25 3D joints are collected while
they are narrating a fake and true story.
        </p>
        <p>
          Micro-Gesture Understanding and Emotion Analysis (iMiGUE) dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] contains 32
MGs extracted from the post-match press conference videos. The sample is collected in RGB
and skeleton modalities using OpenPose [7]. A total of 137 key points are collected, including
70 face points, 42 hand points, and 25 body joints. Following the protocol of [11, 20, 24], we use
the 3rd dimension of the OpenPose joint as a pseudo dimension.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>We evaluated the proposed framework using two MG datasets and the results are shown in
Table 1. We compare the results with the baseline method without a self-attention layer and use
accuracy as a metric of evaluation, with top-1% and top-5% accuracy. The first row of Table 1
shows the baseline results on both datasets without a self-attention layer. Using the baseline,
the iMiGUE dataset achieved a top-1 accuracy of 53.98% and a top-5 accuracy of 90.5%. Similarly,
the SMG achieved 62.89% and 94.95% accuracies of top-1 and top-5, respectively. The second
row shows the proposed method with a self-attention layer, where the iMiGUE dataset reached
top-1 accuracy of 56.12 and top-5 accuracy of 90.01%, which is a significant improvement as
compared to the baseline. Likewise, the SMG also improved its accuracy and reached 64.26 and
95.23, top-1 and top-5 accuracies, respectively. Both datasets show that the GE-CN method
improved the results notably. One of the reasons SMG results show better performance is
because most of the person’s action skeleton data is available with full-body joints; however,
for the iMiGUE dataset, only the upper body joints are extracted because almost all of those
skeletons are extracted while the person is sitting on a chair.</p>
        <p>We compare the results with previous methods, as shown in Table 2. We achieved the best
results both on the iMiGUE and SMG datasets. If we look at the MG-G3D method in the 5ℎ row,
we improved the performance significantly using the iMiGUE dataset and the top-5% accuracy
using the SMG dataset; however, we didn’t achieve the best results in the top-1% accuracy on
the SMG dataset and placed slightly below the MS-G3D method.</p>
        <p>We visualized the latent features of all categories of iMiGUE and SMG datasets, as shown in
Figure 3, but for better visualization, we only visualized four categories from each dataset in
Figure 2. Both images show the clusters of classes in diferent color schemes. The target legend
reflects the number of classes in a dataset. The visualization was done using the T-SNE tool.
We transformed the high-dimensional latent features to two dimensions and used one epoch
samples for better visualization.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we introduced a graph-encoding convolutional network that extracts intrinsic
joint representation from skeleton data using a self-attention module in the spatial domain. The
extracted features are forwarded to the multi-scale temporal convolution module to extract
the temporal join relationship. The learned features are fed to the classification module for
micro-gesture classification. The proposed framework was evaluated on two micro-gesture
datasets, SMG and iMiGUE, which achieved state-of-the-art results with top-1% accuracy of
64.26 and 56.12, respectively.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Academy of Finland for Academy Professor project EmotionAI
(grants 336116, 345122), the University of Oulu &amp; The Academy of Finland Profi 7 (grant 352788),
by the Ministry of Education and Culture of Finland for AI forum project.</p>
      <p>The authors also wish to acknowledge CSC – IT Center for Science, Finland, for computational
resources.
[6] H. Chen, E. Tan, Y. Lee, S. Praharaj, M. Specht, G. Zhao, Developing ai into
explanatory supporting models: An explanation-visualized deep learning prototype, in: 14th
International Conference of the Learning Sciences (ICLS) 2020, 2020.
[7] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using
part afinity fields, in: Proceedings of the IEEE conference on computer vision and pattern
recognition, 2017, pp. 7291–7299.
[8] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake,
Real-time human pose recognition in parts from single depth images, in: CVPR 2011, Ieee,
2011, pp. 1297–1304.
[9] H. Chen, X. Liu, G. Zhao, Temporal hierarchical dictionary with hmm for fast gesture
recognition, in: 2018 24th international conference on pattern recognition (ICPR), IEEE,
2018, pp. 3378–3383.
[10] H. Chen, X. Liu, J. Shi, G. Zhao, Temporal hierarchical dictionary guided decoding for
online gesture segmentation and recognition, IEEE Transactions on Image Processing 29
(2020) 9689–9702.
[11] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based
action recognition, in: Proceedings of the AAAI conference on artificial intelligence,
volume 32, 2018.
[12] Y. Liu, X. Zhang, Y. Li, J. Zhou, X. Li, G. Zhao, Graph-based facial afect analysis: A review,</p>
      <p>IEEE Transactions on Afective Computing (2022).
[13] N. Zheng, J. Wen, R. Liu, L. Long, J. Dai, Z. Gong, Unsupervised representation learning
with long-term dynamics for skeleton based action recognition, in: Proceedings of the
AAAI Conference on Artificial Intelligence, volume 32, 2018.
[14] K. Su, X. Liu, E. Shlizerman, Predict &amp; cluster: Unsupervised skeleton based action
recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2020, pp. 9631–9640.
[15] L. Lin, S. Song, W. Yang, J. Liu, Ms2l: Multi-task self-supervised learning for skeleton
based action recognition, in: Proceedings of the 28th ACM International Conference on
Multimedia, 2020, pp. 2490–2498.
[16] J. Li, Y. Wong, Q. Zhao, M. S. Kankanhalli, Unsupervised learning of view-invariant action
representations, Advances in neural information processing systems 31 (2018).
[17] H. Zhou, Q. Liu, Y. Wang, Learning discriminative representations for skeleton based
action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2023, pp. 10608–10617.
[18] L. Lin, J. Zhang, J. Liu, Actionlet-dependent contrastive learning for unsupervised
skeletonbased action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, 2023, pp. 2363–2372.
[19] C. Si, W. Chen, W. Wang, L. Wang, T. Tan, An attention enhanced graph convolutional
lstm network for skeleton-based action recognition, in: proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, 2019, pp. 1227–1236.
[20] Z. Liu, H. Zhang, Z. Chen, Z. Wang, W. Ouyang, Disentangling and unifying graph
convolutions for skeleton-based action recognition, in: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, 2020, pp. 143–152.
[21] L. Shi, Y. Zhang, J. Cheng, H. Lu, Two-stream adaptive graph convolutional networks
for skeleton-based action recognition, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2019, pp. 12026–12035.
[22] H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, K. Ramani, Infogcn: Representation
learning for human skeleton-based action recognition, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2022, pp. 20186–20196.
[23] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114
(2013).
[24] A. Shah, H. Chen, H. Shi, G. Zhao, Eficient dense-graph convolutional network with
inductive prior augmentations for unsupervised micro-gesture recognition, in: 2022 26th
International Conference on Pattern Recognition (ICPR), IEEE, 2022, pp. 2686–2692.
[25] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, H. Lu, Skeleton-based action recognition
with shift graph convolutional network, in: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2020, pp. 183–192.
[26] W. Peng, X. Hong, H. Chen, G. Zhao, Learning graph convolutional network for
skeletonbased human action recognition by neural searching, in: Proceedings of the AAAI
conference on artificial intelligence, volume 34, 2020, pp. 2669–2676.
[27] B. Zhou, A. Andonian, A. Oliva, A. Torralba, Temporal relational reasoning in videos, in:</p>
      <p>Proceedings of the European conference on computer vision (ECCV), 2018, pp. 803–818.
[28] M. Xu, M. Gao, Y.-T. Chen, L. S. Davis, D. J. Crandall, Temporal recurrent networks for
online action detection, in: Proceedings of the IEEE/CVF international conference on
computer vision, 2019, pp. 5532–5541.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, Smg: A micro-gesture dataset towards spontaneous body gestures for emotional stress state analysis</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>131</volume>
          (
          <year>2023</year>
          )
          <fpage>1346</fpage>
          -
          <lpage>1366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10631</fpage>
          -
          <lpage>10642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Zhao,
          <article-title>Analyze spontaneous gestures for emotional stress state recognition: A micro-gesture dataset and analysis with deep learning</article-title>
          ,
          <source>in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG</source>
          <year>2019</year>
          ), IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, Multiscale 3d-shift graph convolution network for emotion recognition from human actions</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>37</volume>
          (
          <year>2022</year>
          )
          <fpage>103</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sebe</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Zhao,
          <article-title>Geometry-contrastive transformer for generalized 3d pose transfer</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>