<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal-Related Transformer with Depth Information for 4D Micro-Expression Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinsheng Wei</string-name>
          <email>weijs@njupt.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jialiang Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guanming Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingjie Yan</string-name>
          <email>yanjingjie@njupt.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jiangsu Key Laboratory of Intelligent Information Processing and Communication Technology, Nanjing University of Posts and Telecommunications</institution>
          ,
          <addr-line>Nanjing 210003</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications</institution>
          ,
          <addr-line>Nanjing 210003</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Micro-expressions can reflect real emotions, and recognizing micro-expressions has great potential in the fields of criminal investigation, security, and negotiation. Compared to 2D micro-expression images, 4D micro-expression data contains richer information, such as depth information. Thus, this paper explores the effectiveness of depth information in 4D micro-expression video and proposes a Depth Information Temporal Related Transformer (DITRTr) model. The DITRTr model mines temporal depth information from 4D point clouds and maps it to the micro-expression category space by modelling the temporal correlation of depth information. Firstly, the model extracts the frontal depth image from each 3D point cloud frame; then, pre-trained convolutional neural networks (CNN) are used to extract the spatial features of each depth image frame; finally, a Transformer is adopted to model the correlation between the spatial features of different frames and to extract spatiotemporal depth micro-expression features. The experimental results demonstrate the effectiveness of the proposed method, which ranked second in the 4DMR Challenge at IJCAI 2025.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Micro-expressions are subtle facial expressions that appear when people try to hide their true emotions.
Compared to ordinary facial expressions, micro-expressions have a shorter duration and lower intensity,
which makes them difficult to detect with the naked eye. Psychological research [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] has shown that
when a person attempts to conceal their true emotions, their face will unconsciously reveal
micro-expressions within a very short period of time, allowing people to interpret their true emotions by
analysing these facial micro-expressions. Therefore, accurate analysis and understanding of
micro-expressions are of great significance and are widely applied in many important fields [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], and
micro-expression recognition technology meets the key technological needs of intelligent public and social
security, national security, digital services, smart healthcare, and smart cities. For example, in public
places, judging whether members of the public have negative emotions based on their micro-expressions can
prevent violent safety incidents; in criminal investigation, judging whether a suspect
is lying through their micro-expressions can help investigators grasp the correct direction of an investigation
and accelerate the handling of criminal cases; and in negotiations, negotiators can judge the true
emotions of the other party through their micro-expressions, thereby inferring their actual intentions
and tendencies, which can assist negotiators in making more favorable decisions.
      </p>
      <p>
        Although micro-expression analysis and understanding are of great significance, recognizing
micro-expressions is a highly challenging task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The main reason is that micro-expressions occur quickly, and
the movement intensity of facial muscles is extremely subtle [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Furthermore, due to the limitations
of human sensory organs, it is difficult for the human eye to detect the occurrence of micro-expressions and
to analyze the emotions they represent. Therefore, fully extracting micro-expression-related
information from the whole face is crucial for the practical application of micro-expression recognition.
      </p>
      <p>
        Li et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explored 4-dimensional (4D, namely, 3D mesh + temporal changes) facial micro-expression
information and collected a 4D micro-expression dataset (4DME). 4D micro-expression data contains
richer clues and can provide more discriminative information for micro-expression recognition, such as
depth information and different viewpoint information.
      </p>
      <p>Compared to 2-dimensional (2D) images, depth information is one of the most distinctive features
of 4D data. The movement of facial muscles causes changes in depth information, and mining
these changes can effectively extract micro-expression features. Therefore, this paper explores 4D
micro-expression recognition from the perspective of depth information. That is, depth information is
mined from 4D micro-expression data to obtain continuous depth image frames.</p>
      <p>Micro-expressions are a dynamic process, and learning the temporal correlation between consecutive
depth image frames can effectively extract dynamic features. Furthermore, a CNN can learn spatial depth
features from depth images but cannot establish temporal correlations between continuous depth images.
A Transformer, in contrast, can effectively model the correlations between different tokens. Therefore, this paper adopts
a CNN and a Transformer to model the spatial features of single-frame depth images and the temporal
correlation features between consecutive frames, respectively.</p>
      <p>Overall, the contributions of this paper are as follows:
• 1) This paper explores the effectiveness of depth information in 4D micro-expression data for
micro-expression recognition, proposes a Depth Information Temporal Related Transformer
(DITRTr) model, and extracts discriminative depth spatiotemporal micro-expression features;
• 2) In the DITRTr, a CNN and a Transformer are introduced to extract depth spatial features and
temporal-dependency features, respectively, effectively representing the dynamic depth information of
micro-expressions;
• 3) The experimental results indicate that the proposed method is effective and ranked second in
the 1st 4D micro-expression recognition (4DMR) challenge at IJCAI 2025.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This paper focuses on micro-expression recognition based on a 4D micro-expression dataset. Therefore,
related works are introduced from two aspects, namely, the micro-expression dataset and the recognition
algorithm.</p>
      <sec id="sec-2-1">
        <title>2.1. Micro-Expression Dataset</title>
        <p>
          Early micro-expression research relied almost exclusively on 2D image or video datasets, such as CASME
II [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], SAMM [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and SMIC [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which record subtle facial movements at high frame rates but discard
any 3D structural information about the face. Based on these datasets, a large number of 2D image/video-based
micro-expression recognition methods [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15 ref16">12, 13, 14, 15, 16</xref>
          ] have been proposed, greatly promoting the
development of micro-expression recognition. However, the lack of depth cues makes these datasets
sensitive to viewpoint changes and illumination, limiting their reliability in unconstrained
criminal-investigation or surveillance scenarios.
        </p>
        <p>
          To alleviate these weaknesses, Li et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] collected a 4D micro-expression (4DME) dataset. 4DME
is the first spontaneous 4D micro-expression dataset where every frame contains a high-resolution
3D mesh face. Nevertheless, harnessing 4D data for micro-expression recognition is non-trivial: the
sheer volume of 3D point clouds, irregular mesh topologies, and subtle depth variations pose new
feature-extraction challenges.
        </p>
        <p>Consequently, this paper investigates how temporal depth cues can be exploited for 4D
micro-expression recognition.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Recognition Algorithm</title>
        <p>At present, there is a lack of work on 4D micro-expression recognition, while there are some explorations
in the field of ordinary 3D/4D expression recognition.</p>
        <p>Li et al. [<xref ref-type="bibr" rid="ref17">17</xref>] presented a comprehensive survey of 3D face recognition methods, covering both
traditional and modern approaches. Traditional methods mainly extract distinctive facial features for
matching, while modern methods rely on deep learning for end-to-end recognition. They also reviewed
challenges such as pose, illumination, and expression variations.</p>
        <p>Tian et al. [<xref ref-type="bibr" rid="ref18">18</xref>] designed a novel deep feature fusion convolutional neural network (CNN) for 3D facial
expression recognition (FER). They represented each 3D face scan as 2D facial attribute maps (including
depth, normal, and shape index values) and then combined different facial attribute maps to learn facial
representations by fine-tuning a pre-trained deep feature fusion CNN subnet. Global Average Pooling
was used to reduce overfitting.</p>
        <p>Bouzid et al. [<xref ref-type="bibr" rid="ref19">19</xref>] presented a method for dynamic facial expression recognition based on 4D facial
expression data. Their approach directly extracted spatio-temporal information from the 3D mesh
sequences. Every mesh in a sequence was fed into a spatial auto-encoder using spatial convolutions
to extract spatial embeddings, and then a temporal transformer processed the sequence of embeddings
for facial expression classification.</p>
        <p>Trimech et al. [<xref ref-type="bibr" rid="ref20">20</xref>] aimed to achieve the task using point cloud-based DNNs. They enlarged the
dataset by generating synthetic 3D facial expressions and applied a level curve-based sampling strategy
to obtain discriminative point-based representations of 3D faces, achieving promising results.</p>
        <p>Kalapala et al. [<xref ref-type="bibr" rid="ref21">21</xref>] proposed direct classification of normalised and flattened 3D facial landmarks
reconstructed from 2D images. They used a pre-trained convolutional Face Alignment Network (FAN)
for 3D projection of 2D facial landmarks and tried different classifiers in the spherical coordinate system.</p>
        <p>The above methods explore 3D/4D expression recognition. Unlike these methods, this paper proposes
a Depth Information Temporal Related Transformer to achieve the 4D micro-expression recognition task.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>As shown in Figure 1, this paper proposes a novel DITRTr that comprises depth information extraction, a
depth spatial feature extraction (DSFE) module, and a depth temporal correlation modelling (DTCM) module.</p>
      <sec id="sec-3-1">
        <title>3.1. Depth Information Extraction</title>
        <p>Depth information plays a crucial role in capturing subtle facial muscle movements. Depth maps
are extracted from the 3D meshes in order to obtain depth information invariant to illumination
and texture, which is essential for accurately representing the minor and rapid facial movements
characteristic of micro-expressions. First, the vertices of the meshes are parsed to obtain the point cloud
data:</p>
        <p>$$P = \{\, p_i \mid p_i = (x_i, y_i, z_i),\ i = 1, \dots, N \,\}. \tag{1}$$</p>
        <p>To convert the point cloud into a structured depth map representation, we adopt a distance mapping
strategy. The 2D plane is subdivided into a uniform grid with a resolution of $W \times H$, and the
Z-coordinate (depth) of the nearest point is assigned to each grid cell. The grid step sizes along the X and
Y axes are computed as
$$\Delta x = \frac{x_{\max} - x_{\min}}{W}, \tag{2}$$
$$\Delta y = \frac{y_{\max} - y_{\min}}{H}, \tag{3}$$
where $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$ represent the range of the point cloud along the X and Y dimensions,
respectively. Finally, each depth map $D \in \mathbb{R}^{W \times H}$ is constructed by projecting the point cloud onto
the grid and assigning
$$D(u, v) = z_i, \tag{4}$$
where $(u, v)$ is the grid cell index corresponding to point $p_i$.
Following this procedure, we generated the frontal-view depth maps for each ME sequence, denoted
as $D_S \in \mathbb{R}^{T \times W \times H}$, where $T$ indicates the number of normalized frames in the sequence.</p>
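        <p>As an illustrative sketch (not the authors' released code), the grid projection of Eqs. (1)–(4) can be written as follows; the function name, the 128 × 128 default resolution, and the choice of keeping the point closest to the camera per cell are assumptions:</p>
        <preformat>
import numpy as np

def point_cloud_to_depth_map(points, W=128, H=128):
    """Project an (N, 3) facial point cloud onto a W x H frontal depth map.

    Follows Eqs. (1)-(4): the XY plane is divided into a uniform grid and
    each cell stores the Z value of the point falling into it; when several
    points share a cell, the one closest to the camera (largest Z here,
    an assumed orientation) is kept.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    dx = (x.max() - x.min()) / W              # grid step along X, Eq. (2)
    dy = (y.max() - y.min()) / H              # grid step along Y, Eq. (3)
    u = np.clip(((x - x.min()) / dx).astype(int), 0, W - 1)
    v = np.clip(((y - y.min()) / dy).astype(int), 0, H - 1)

    depth = np.full((W, H), -np.inf)
    np.maximum.at(depth, (u, v), z)           # D(u, v) = z_i, Eq. (4)
    depth[np.isinf(depth)] = 0.0              # empty cells become background
    return depth
        </preformat>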
      </sec>
      <sec id="sec-3-2">
        <title>3.2. DSFE Module</title>
        <p>To extract discriminative features from the depth maps of micro-expression video sequences, we employ
a ResNet18 backbone that has proven effectiveness in visual feature extraction tasks. However,
depth maps possess fundamentally different structural and distributional properties compared to RGB
images. Therefore, directly applying a ResNet18 pretrained on RGB images may result in suboptimal
feature representations for depth inputs. To address this challenge, we fine-tune the ResNet18 model to
better adapt it to the characteristics of depth maps.</p>
        <p>Specifically, the first convolutional layer of the original ResNet18, which expects a three-channel input,
is modified to accept single-channel depth maps. The weights of the new convolutional layer are
initialized by averaging the pretrained RGB weights along the channel dimension to retain transferable
low-level features. Formally, given a depth map input $D \in \mathbb{R}^{1 \times W \times H}$, the modified ResNet18 produces
an output feature vector $F$ after global average pooling:
$$F = \mathrm{ResNet18}(D). \tag{5}$$</p>
        <p>To further adapt the feature representations for micro-expression recognition, we introduce an
MLP adapter that projects the features into a lower-dimensional space suitable for the subsequent
Transformer-based temporal modelling. The adapter operation is defined as
$$Z = \mathrm{MLP}(F), \quad Z \in \mathbb{R}^{T \times d}, \tag{6}$$
where $d$ denotes the feature dimension in the proposed DTCM module and the $T$ per-frame features are
stacked into the sequence $Z$. By fine-tuning the modified ResNet18 and training the MLP adapter jointly,
the model learns feature representations tailored to the depth properties, thereby enhancing its
capability to capture subtle micro-expression cues effectively.</p>
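        <p>A minimal PyTorch sketch of this adaptation is given below, assuming torchvision's pretrained ResNet18; the adapter's hidden width and the class name are illustrative assumptions:</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DSFE(nn.Module):
    """Sketch of the DSFE module: ResNet18 adapted to single-channel depth
    maps plus an MLP adapter, Eqs. (5)-(6)."""

    def __init__(self, d=128):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        # Replace the 3-channel stem with a 1-channel conv whose weights are
        # the channel-wise average of the pretrained RGB filters.
        old = backbone.conv1
        new = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        with torch.no_grad():
            new.weight.copy_(old.weight.mean(dim=1, keepdim=True))
        backbone.conv1 = new
        backbone.fc = nn.Identity()              # keep the 512-d GAP features
        self.backbone = backbone
        self.adapter = nn.Sequential(            # MLP adapter, Eq. (6)
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, depth_seq):                # depth_seq: (B, T, 1, H, W)
        B, T = depth_seq.shape[:2]
        f = self.backbone(depth_seq.flatten(0, 1))   # (B*T, 512), Eq. (5)
        z = self.adapter(f)                          # (B*T, d)
        return z.view(B, T, -1)                      # (B, T, d)
        </preformat>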
      </sec>
      <sec id="sec-3-3">
        <title>3.3. DTCM Module</title>
        <p>To effectively capture the temporal dynamics in depth image sequences, we adopt a
Transformer-based architecture as the depth temporal correlation modelling module. The self-attention
mechanism in the Transformer enables the model to learn global dependencies across depth image frames,
which is crucial for identifying subtle and rapid facial muscle movements.</p>
        <p>Given an input depth feature sequence $Z$, we prepend $M$ learnable class tokens to the input sequence
before feeding it into the DTCM module. Each class token is designed to represent a specific
micro-expression category, enabling the model to independently learn discriminative features for each class.
Formally, let $C \in \mathbb{R}^{M \times d}$ denote the class tokens. The concatenated Transformer input is thus
$$X = \mathrm{concat}(C, Z) \in \mathbb{R}^{(M + T) \times d}. \tag{7}$$</p>
        <p>The self-attention mechanism within the Transformer encoder then performs information integration
across both the temporal frame tokens and the class tokens. Specifically, the self-attention output is
computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{8}$$
where $Q, K, V \in \mathbb{R}^{(M + T) \times d}$ represent the query, key, and value matrices derived from $X$. This
formulation enables bidirectional interactions between each class token and the temporal feature tokens,
allowing each class token to aggregate relevant information from all frames in the sequence. After
Transformer encoding, the output representations corresponding to the first $M$ class tokens are extracted
and passed through a classification head to produce the final multi-label predictions:
$$\hat{Y} = \sigma(W E + b), \tag{9}$$
where $E \in \mathbb{R}^{(M + T) \times d}$ denotes the encoded token features, whose first $M$ rows are the class tokens,
and $\sigma$ is the sigmoid activation function producing the probability scores for each micro-expression class.</p>
        <p>By assigning each expression class a dedicated token, the model is encouraged to learn class-specific
representations, thereby enhancing its ability to distinguish between subtle and co-occurring
micro-expressions in a multi-label classification setting.</p>
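        <p>The following PyTorch sketch illustrates the class-token design described above; the number of encoder layers, the number of attention heads, and the shared per-token linear head are assumptions not specified in the text:</p>
        <preformat>
import torch
import torch.nn as nn

class DTCM(nn.Module):
    """Sketch of the DTCM module: M learnable class tokens are prepended to
    the T frame features, a Transformer encoder models their correlations
    (Eqs. (7)-(8)), and each encoded class token yields one logit whose
    sigmoid gives the class probability (Eq. (9))."""

    def __init__(self, d=128, num_classes=5, num_layers=2, num_heads=4):
        super().__init__()
        self.cls_tokens = nn.Parameter(torch.zeros(1, num_classes, d))
        nn.init.trunc_normal_(self.cls_tokens, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d, 1)   # shared per-token classification head

    def forward(self, z):                         # z: (B, T, d) from DSFE
        B = z.size(0)
        c = self.cls_tokens.expand(B, -1, -1)     # (B, M, d)
        x = torch.cat([c, z], dim=1)              # Eq. (7): (B, M+T, d)
        e = self.encoder(x)                       # self-attention, Eq. (8)
        e_cls = e[:, : self.cls_tokens.size(1)]   # first M class tokens
        return self.head(e_cls).squeeze(-1)       # (B, M) logits, Eq. (9)
        </preformat>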
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Loss Function</title>
        <p>To optimize the proposed model for multi-label recognition, we adopt the binary cross-entropy (BCE)
loss, incorporating class-specific positive weights to address class imbalance in the dataset. In
micro-expression recognition, certain classes are significantly underrepresented compared to others, which
can bias the model towards majority classes if standard loss functions are used without reweighting.
Let $\hat{y}_{i,c}$ denote the predicted logit for the $c$-th class of the $i$-th sample, and $y_{i,c} \in \{0, 1\}$ denote the
corresponding ground truth label. The weighted binary cross-entropy loss for a single sample is defined
as
$$\mathcal{L}_i = -\sum_{c=1}^{C} \left[ w_c\, y_{i,c} \log \sigma(\hat{y}_{i,c}) + (1 - y_{i,c}) \log\bigl(1 - \sigma(\hat{y}_{i,c})\bigr) \right], \tag{10}$$
where $C$ is the total number of classes, and $w_c$ is the positive weight for class $c$, computed as
$$w_c = \frac{N_{\mathrm{neg},c}}{N_{\mathrm{pos},c} + \epsilon}, \tag{11}$$
where $N_{\mathrm{pos},c}$ and $N_{\mathrm{neg},c}$ represent the number of positive and negative samples for class $c$, respectively,
and $\epsilon$ is a small constant added for numerical stability.</p>
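        <p>A minimal sketch of this weighted loss using PyTorch's built-in BCEWithLogitsLoss, whose pos_weight argument applies Eq. (11) to the positive term of Eq. (10), is shown below; the helper name and eps default are assumptions:</p>
        <preformat>
import torch
import torch.nn as nn

# `labels` is the (num_samples, C) multi-label ground-truth matrix as floats.
def make_weighted_bce(labels: torch.Tensor, eps: float = 1e-6) -> nn.Module:
    n_pos = labels.sum(dim=0)                  # positives per class
    n_neg = labels.size(0) - n_pos             # negatives per class
    pos_weight = n_neg / (n_pos + eps)         # Eq. (11)
    # BCEWithLogitsLoss fuses the sigmoid of Eq. (9) with the BCE of Eq. (10)
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Usage: criterion = make_weighted_bce(train_labels)
#        loss = criterion(logits, targets.float())
        </preformat>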
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>We conduct our experiments on the 2025 4DMR challenge dataset, a high-resolution 4D micro-expression
dataset. The dataset consists of 100 4D micro-expression samples, each annotated with multiple
expression category labels across five fine-grained emotions.</p>
        <p>Each micro-expression sequence is captured as a dynamic mesh sequence, where every frame is
represented by a 3D mesh file in OBJ format, encoding the detailed facial geometry at that moment.
Since the number of frames per sequence varies across samples, temporal normalization is required for
batch-wise processing. To this end, we analyzed the temporal distribution of the sequences and found
that the average number of frames across the 100 samples is 18.81, while the 90th percentile is 26.10.
Balancing computational efficiency and temporal coverage, we normalized all sequences to a fixed
length of 20 frames. For sequences exceeding 20 frames, we retained only the first 20 and discarded
the remaining ones; for shorter sequences, we applied zero-padding at the end to reach the required
frame count. To facilitate model training and hyperparameter tuning, the 100 samples in the dataset were
randomly divided into a training set of 90 samples and a validation set of 10 samples.</p>
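        <p>The normalization protocol can be sketched as follows; the function name and array layout are assumptions:</p>
        <preformat>
import numpy as np

def normalize_sequence(frames: np.ndarray, target_len: int = 20) -> np.ndarray:
    """Normalize a (T, W, H) depth-map sequence to a fixed length:
    truncate to the first `target_len` frames, or zero-pad at the end."""
    T = frames.shape[0]
    if T >= target_len:
        return frames[:target_len]
    pad = np.zeros((target_len - T,) + frames.shape[1:], dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)
        </preformat>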
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Metrics</title>
        <p>Following the requirements of the challenge, we adopt the F1-score as the primary
evaluation metric, since it effectively balances precision and recall,
especially in scenarios with class imbalance. The F1-score is defined as
$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{12}$$</p>
        <p>The macro-averaged F1-score, denoted as $F1_{\text{macro}}$, is computed to provide a more comprehensive
evaluation across all classes:
$$F1_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} F1_c, \tag{13}$$
where $C$ is the total number of classes, and $F1_c$ is the F1-score calculated for the $c$-th class individually.
This metric gives equal weight to each class, regardless of its sample size,
and is particularly suitable for micro-expression datasets with unbalanced class distributions.</p>
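        <p>For concreteness, a sketch of the macro-averaged F1 computation for binary multi-label predictions follows; it is equivalent to sklearn.metrics.f1_score with average="macro", and the function name is an assumption:</p>
        <preformat>
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Macro-averaged F1 (Eqs. (12)-(13)) for 0/1 multi-label arrays of
    shape (num_samples, C)."""
    f1_scores = []
    for c in range(y_true.shape[1]):
        pred_c, true_c = y_pred[:, c], y_true[:, c]
        tp = np.sum((pred_c == 1) * (true_c == 1))
        fp = np.sum((pred_c == 1) * (true_c == 0))
        fn = np.sum((pred_c == 0) * (true_c == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)     # Eq. (12)
    return float(np.mean(f1_scores))                         # Eq. (13)
        </preformat>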
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Implementation Details</title>
        <p>The proposed model was trained using the Adam optimizer with a learning rate of 1e-4 and a
weight decay of 1e-4 to prevent overfitting. The batch size was set to 32, and the model was trained
for 100 epochs to ensure convergence. For the input data, each sequence was normalized to 20 frames.
Sequences with fewer than 20 frames were zero-padded, while those with more than 20 frames were
truncated to maintain a consistent temporal dimension. The depth map resolution was set to 128
(i.e., $W = H = 128$). The feature dimension $d$ of the Transformer was set to 128.</p>
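        <p>A sketch of the corresponding training loop is given below; model, criterion, and train_loader are assumed to be defined as in the preceding sections:</p>
        <preformat>
import torch

# model, criterion (the weighted BCE above) and train_loader are assumed
# to be defined; batches are (32, 20, 1, 128, 128) depth-map sequences.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
for epoch in range(100):
    for depth_seqs, targets in train_loader:
        logits = model(depth_seqs)
        loss = criterion(logits, targets.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        </preformat>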
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Challenge Competition Results</title>
        <p>Our method is compared with the first- and third-ranked methods in the 4DMR challenge
at IJCAI 2025. The performance results are presented in Table 1. Our method secured
second rank, highlighting its strong competitiveness and overall effectiveness. Specifically, our method
is 0.018 Macro F1-score lower than the first-ranked method and 0.042 Macro F1-score higher than the
third-ranked method.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Ablation Study</title>
        <p>We evaluate the DTCM module by comparing the model's performance with and without this module.
For the model without the DTCM module, the temporal modeling and feature aggregation components
were replaced by a simpler structure consisting of linear layers followed by average pooling, thereby
removing the dedicated temporal–class token interaction mechanism introduced in DTCM.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this work, we proposed a DITRTr model for 4D micro-expression recognition, which explicitly
integrates depth information extracted from facial 3D mesh sequences. By fine-tuning the pretrained
ResNet-18, the backbone network better accommodates single-channel depth maps. Furthermore, a set of
learnable class tokens was introduced into the Transformer architecture to enhance the discriminability
of temporal representations under a multi-label setting.</p>
      <p>Our method achieves competitive performance on the 4DMR benchmark, taking second place
in the IJCAI 2025 challenge. These results underscore the effectiveness of incorporating depth spatial
information and temporal dependencies for subtle facial movement analysis.</p>
      <p>There are potential directions for further exploration. While our method extracts depth maps from
raw mesh sequences and applies downsampling to produce fixed-resolution representations, this process
may overlook fine-grained spatial cues that are critical for identifying subtle micro-expressions. More
adaptive depth feature encoding strategies, such as region-aware sampling or attention-based depth
refinement, may offer enhanced sensitivity to critical facial muscle movements.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents a novel DITRTr model that recognizes 4D micro-expressions based on depth
information. The proposed method designs the DSFE and DTCM modules to extract, respectively, the
spatial features of each depth image frame and the temporal correlations between frames, which effectively
captures subtle facial movements in facial 3D sequences (4D data). The experiments conducted on
the 4DME dataset validate the performance of our method, which ranks second in the 4DMR IJCAI
Workshop 2025 challenge.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Natural Science Foundation of the Higher Education Institutions
of Jiangsu Province (Grant No. 24KJB520022), in part by the Nanjing Science and Technology Innovation
Foundation for Overseas Students (Grant No. RK002NLX23004), and in part by the Natural Science Research
Start-up Foundation of Recruiting Talents of Nanjing University of Posts and Telecommunications
(Grant No. NY223030).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Haggard</surname>
            ,
            <given-names>Ernest A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isaacs</surname>
            ,
            <given-names>Kenneth S.</given-names>
          </string-name>
          ,
          <article-title>Micromomentary facial expressions as indicators of ego mechanisms in psychotherapy</article-title>
          , in: Methods of Research in Psychotherapy, Springer US, Boston, MA,
          <year>1966</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ekman</surname>
          </string-name>
          , Lie Catching and Microexpressions, in: C. Martin (Ed.),
          <source>The Philosophy of Deception</source>
          , Oxford University Press,
          <year>2009</year>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>136</lpage>
          . URL: https://academic.oup.com/book/6899/chapter/151126752. doi:10.1093/acprof:oso/9780195327939.003.0008.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikäinen</surname>
          </string-name>
          , Facial Micro-Expressions:
          <article-title>An Overview</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>111</volume>
          (
          <year>2023</year>
          )
          <fpage>1215</fpage>
          -
          <lpage>1235</lpage>
          . URL: https://ieeexplore.ieee.org/document/10144523/. doi:10.1109/jproc.2023.3275192. Publisher: Institute of Electrical and Electronics Engineers (IEEE).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kauttonen</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, Deep Learning for Micro-Expression Recognition: A Survey</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>2028</fpage>
          -
          <lpage>2046</lpage>
          . URL: https://ieeexplore.ieee.org/document/9915437/. doi:10.1109/TAFFC.2022.3205170.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , G. Lu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Zhang, Multi-information hierarchical fusion transformer with local alignment and global correlation for micro-expression recognition</article-title>
          ,
          <source>in: Proceedings of the 33rd ACM International Conference on Multimedia</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>5873</fpage>
          -
          <lpage>5882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kpalma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          , Y.-J. Liu,
          <article-title>Video-based Facial Micro-Expression Analysis: A Survey of Datasets, Features and Algorithms</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . URL: https://ieeexplore.ieee.org/document/9382112/. doi:10.1109/TPAMI.2021.3067464.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          , G. Lu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, Geometric Graph Representation With Learnable Graph Structure and Adaptive AU Constraint for Micro-Expression Recognition</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1343</fpage>
          -
          <lpage>1357</lpage>
          . URL: https://ieeexplore.ieee.org/document/10345706/. doi:10.1109/TAFFC.2023.3340016.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , S. Cheng,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Behzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantic</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Zhao, 4dme: A spontaneous 4d micro-expression dataset with multimodalities</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>3031</fpage>
          -
          <lpage>3047</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Casme ii: An improved spontaneous micro-expression database and the baseline evaluation</article-title>
          ,
          <source>PloS one 9</source>
          (
          <year>2014</year>
          )
          <article-title>e86041</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>A. K. Davison</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lansley</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Costen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>M. H.</given-names>
          </string-name>
          <string-name>
            <surname>Yap</surname>
          </string-name>
          ,
          <article-title>Samm: A spontaneous micro-facial movement dataset</article-title>
          ,
          <source>IEEE Transactions on Affective Computing 9</source>
          (
          <year>2016</year>
          )
          <fpage>116</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikäinen</surname>
          </string-name>
          ,
          <article-title>A spontaneous micro-expression database: Inducement, collection and baseline</article-title>
          ,
          <source>in: 2013 10th IEEE International Conference and Workshops on Automatic face and gesture recognition (fg)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>See</surname>
          </string-name>
          , R. C.-W. Phan,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <article-title>Lbp with six intersection points: Reducing redundant information in lbp-top for micro-expression recognition</article-title>
          , in: D.
          <string-name>
            <surname>Cremers</surname>
            , I. Reid,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Saito</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-H. Yang</surname>
          </string-name>
          (Eds.),
          <source>Computer Vision - ACCV 2014</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , G. Lu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          , H. Liu,
          <article-title>Micro-expression recognition using local binary pattern from five intersecting planes</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>81</volume>
          (
          <year>2022</year>
          )
          <fpage>20643</fpage>
          -
          <lpage>20668</lpage>
          . URL: https://link.springer.com/10.1007/s11042-022-12360-x. doi:10.1007/s11042-022-12360-x.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          , J. Liu,
          <article-title>CMNet: Contrastive Magnification Network for Micro-Expression Recognition</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>127</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/25083. doi:10.1609/aaai.v37i1.25083.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , G. Lu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <article-title>Learning two groups of discriminative features for micro-expression recognition</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>479</volume>
          (
          <year>2022</year>
          )
          <fpage>22</fpage>
          -
          <lpage>36</lpage>
          . URL: https://linkinghub.elsevier.com/retrieve/pii/S0925231221019433. doi:10.1016/j.neucom.2021.12.088.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Micro-expression recognition based on contextual transformer networks</article-title>
          ,
          <source>The Visual Computer</source>
          <volume>41</volume>
          (
          <year>2025</year>
          )
          <fpage>1527</fpage>
          -
          <lpage>1541</lpage>
          . URL: https://link.springer.com/10.1007/s00371-024-03443-x. doi:10.1007/s00371-024-03443-x.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Li, B. Huang, G. Tian, A comprehensive survey on 3D face recognition methods, Engineering Applications of Artificial Intelligence 110 (2022) 104669.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] K. Tian, L. Zeng, S. McGrath, Q. Yin, W. Wang, 3D facial expression recognition using deep feature fusion CNN, in: 2019 30th Irish Signals and Systems Conference (ISSC), IEEE, 2019, pp. 1-6.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] H. Bouzid, L. Ballihi, 3D facial expression recognition using spiral convolutions and transformers, in: 2023 20th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), IEEE, 2023, pp. 1-7.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] I. H. Trimech, A. Maalej, N. E. Ben Amara, Facial expression recognition using 3D points aware deep neural network, Traitement du Signal 38 (2021).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Kalapala, H. Yadav, H. Kharwar, S. Susan, Facial expression recognition from 3D facial landmarks reconstructed from images, in: 2020 IEEE International Symposium on Sustainable Energy, Signal Processing and Cyber Security (iSSSC), IEEE, 2020, pp. 1-5.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>