<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DIANet: A Phase-Aware Dual-Stream Network for Micro-Expression Recognition via Dynamic Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vu Tram Anh Khuong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luu Tu Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thi Bich Phuong Man</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh Ha Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thi Duyen Ngo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, VNU University of Engineering and Technology</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Micro-expressions are brief, involuntary facial movements that typically last less than half a second and often reveal genuine emotions. Accurately recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. However, micro-expression recognition (MER) remains a challenging task due to the subtle and transient nature of facial cues and the limited availability of annotated data. While dynamic image (DI) representations have been introduced to summarize temporal motion into a single frame, conventional DI-based methods often overlook the distinct characteristics of different temporal phases within a micro-expression. To address this issue, this paper proposes a novel dual-stream framework, DIANet, which leverages phase-aware dynamic images - one encoding the onset-to-apex phase and the other capturing the apex-to-offset phase. Each stream is processed by a dedicated convolutional neural network, and a cross-attention fusion module is employed to adaptively integrate features from both streams based on their contextual relevance. Extensive experiments conducted on three benchmark MER datasets (CASME-II, SAMM, and MMEW) demonstrate that the proposed method consistently outperforms conventional single-phase DI-based approaches. The results highlight the importance of modeling temporal phase information explicitly and suggest a promising direction for advancing MER.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-expression</kwd>
        <kwd>micro-expression recognition</kwd>
        <kwd>dynamic image</kwd>
        <kwd>dual-stream network</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Micro-expressions (MEs) are brief, involuntary facial movements that typically last less than 0.5 seconds
and reveal genuine emotions that an individual may attempt to conceal. Unlike macro-expressions,
MEs are subtle and transient, often involving only localized muscle activations. These characteristics
make MEs difficult to detect by human observers and pose significant challenges for automatic
micro-expression recognition (MER) systems. Nevertheless, accurate MER has substantial value in various
real-world applications such as lie detection, security screening, psychotherapy, and human–computer
interaction [1, 2, 3].</p>
      <p>Early research in MER primarily relied on handcrafted features designed to capture spatial texture
(e.g., Local Binary Patterns from Three Orthogonal Planes, LBP-TOP [4]) or optical strain [5], combined
with temporal analysis through optical flow [6]. While effective in controlled settings, these methods
often struggle with robustness under cross-subject variation or real-world noise and require extensive
feature engineering. In recent years, the emergence of deep learning and motion-based representations
has led to more compact and learnable frameworks for MER. Among them, the dynamic image (DI)
representation has gained popularity. Originally proposed in action recognition, DI summarizes the
temporal evolution of motion into a single image using rank pooling [7]. In the context of MER, DI-based
models such as LEARNet [8] demonstrate that DI can effectively encode subtle motion signals in a
suitable form for convolutional neural networks. Extensions of this idea include Active Image [9],
which highlights salient motion regions, and Affective Motion Imaging [10], which emphasizes
emotion-relevant motion features. These methods show that DI offers an efficient way to encode facial motion
into a single frame.</p>
      <p>However, existing DI-based approaches typically treat the entire sequence as a whole, ignoring the
distinct temporal phases of MEs (i.e., onset-apex and apex-offset). This holistic modeling overlooks the
asymmetric rise-and-fall dynamics of facial motion. Incorporating phase-specific representations could
better capture these patterns, yet remains underexplored in the literature. Phase-aware modeling has
been explored in a few recent works. For example, Liong et al. [11] emphasized the onset–apex segment
by computing optical flow between two key frames, while Zhang et al. [12] proposed apex-guided
representation learning. However, these approaches typically rely on hand-crafted segmentation or
optical flow and do not fully integrate phase modeling into a learnable deep framework. Moreover,
none of them explicitly leverage DI to encode motion separately within each phase.</p>
      <p>To address this gap, this paper proposes DIANet, the first dual-stream MER framework that uses
phase-specific dynamic images to separately model the onset–apex and apex–offset intervals. By
processing these phases in parallel, the model captures both the rising and falling motion patterns of
MEs - subtle dynamics often overlooked in single-stream approaches. This phase-aware design enhances
representation learning and leads to more accurate recognition of subtle and transient expressions. To
the best of our knowledge, this is the first work to explicitly integrate phase-specific DI representations
into an end-to-end framework for micro-expression recognition.</p>
      <p>The remainder of the paper is organized as follows. Section 2 describes the proposed DIANet
architecture in detail. Next, Section 3 describes the experimental setup and reports results on CASME-II,
SAMM, and MMEW. Finally, Section 4 concludes the paper and discusses future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed method</title>
      <p>This paper introduces a dual-stream dynamic imaging (DI) framework to address the limitations of
conventional DI-based MER. By decomposing facial motion into two distinct temporal phases (i.e.,
onset-apex and apex-offset), this paper enhances motion representation while suppressing noise. The
overall workflow is illustrated in Fig. 1.</p>
      <sec id="sec-2-1">
        <title>2.1. Phase-aware dynamic imaging</title>
        <p>Dynamic image is a technique that summarizes a video sequence into a single image, encoding both
spatial and temporal information into a static representation. It is commonly constructed using
Approximate Rank Pooling (ARP) [7], where each frame in a sequence is projected into a feature space and
combined with a weight proportional to its temporal position. Formally, given a sequence of $T$
consecutive frames $I_1, I_2, \ldots, I_T$ and corresponding features $\psi(I_t)$, the dynamic image $d^*$ is computed as
follows:</p>
        <p>$$d^* = \sum_{t=1}^{T} \alpha_t \, \psi(I_t), \quad \text{with} \quad \alpha_t = 2t - T - 1, \qquad (1)$$
where $\psi(I_t) = I_t$, since raw pixel values are directly employed as frame-level features in each color
channel.</p>
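        <p>For concreteness, a minimal NumPy sketch of Eq. (1) follows; the array layout and the min-max rescaling step are our assumptions rather than the authors' released implementation.</p>
        <preformat>
import numpy as np

def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, C) frame stack into one dynamic image via ARP, Eq. (1)."""
    T = frames.shape[0]
    # ARP coefficients: alpha_t = 2t - T - 1, giving later frames larger weights.
    alphas = 2 * np.arange(1, T + 1) - T - 1
    di = np.tensordot(alphas.astype(np.float64), frames.astype(np.float64), axes=(0, 0))
    # Rescale to 0-255 so the result can be fed to an ImageNet-style CNN (assumed step).
    di = (di - di.min()) / (di.max() - di.min() + 1e-8) * 255.0
    return di.astype(np.uint8)
        </preformat>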
        <p>The standard ARP assigns higher weights to later frames, assuming motion accumulates over time
- a valid assumption for full-length actions, but not always in the case of micro-expressions.
Micro-expressions typically follow a bell-shaped intensity curve, with motion rising to a peak (apex) and then
declining. Applying ARP over the full sequence skews the resulting dynamic image toward the offset
phase, underrepresenting early-phase motion where key emotional cues often occur.</p>
        <p>To overcome this limitation, this paper proposes a dual-phase dynamic imaging strategy that
separately models the onset–apex and apex–offset phases using a tailored ARP formulation. Each segment
is used to construct a dedicated phase-specific DI.</p>
        <p>• DI-Onset: Captures the rising motion from onset to apex using the standard ARP with increasing
weights, as shown in Eq. (2). This emphasizes frames closer to the apex, where expressive intensity
typically peaks:
$$d_{\mathrm{on}} = \sum_{t=1}^{T_{\mathrm{on}}} \alpha_t \, \psi(I_t), \quad \text{with} \quad \alpha_t = 2t - T_{\mathrm{on}} - 1, \qquad (2)$$
where $t$ indexes the frames in the onset-apex segment of length $T_{\mathrm{on}}$.</p>
        <p>• DI-Offset: Encodes the declining motion from apex to offset using reversed ARP with decreasing
weights, defined as follows:
$$d_{\mathrm{off}} = \sum_{t=1}^{T_{\mathrm{off}}} \alpha'_t \, \psi(I_t), \quad \text{with} \quad \alpha'_t = T_{\mathrm{off}} + 1 - 2t, \qquad (3)$$
Here, $t$ indexes the frames in the apex–offset segment of length $T_{\mathrm{off}}$, and higher weights are assigned
to frames near the apex to highlight the transition back to a neutral state (a brief computational sketch
of both phase-specific DIs is given after this list).</p>
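        <p>Reusing the helper above, both phase-specific dynamic images of Eqs. (2) and (3) can be sketched as follows; the 0-based apex index and the exact split of the clip around the apex frame are illustrative assumptions.</p>
        <preformat>
import numpy as np

def reversed_dynamic_image(frames: np.ndarray) -> np.ndarray:
    """ARP with decreasing weights alpha'_t = T + 1 - 2t, as in Eq. (3)."""
    T = frames.shape[0]
    alphas = T + 1 - 2 * np.arange(1, T + 1)
    di = np.tensordot(alphas.astype(np.float64), frames.astype(np.float64), axes=(0, 0))
    di = (di - di.min()) / (di.max() - di.min() + 1e-8) * 255.0
    return di.astype(np.uint8)

def phase_dynamic_images(frames: np.ndarray, apex: int):
    """Split a clip at the (0-based) apex index and build both phase-specific DIs."""
    di_onset = dynamic_image(frames[: apex + 1])       # onset-apex segment, Eq. (2)
    di_offset = reversed_dynamic_image(frames[apex:])  # apex-offset segment, Eq. (3)
    return di_onset, di_offset
        </preformat>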
        <p>This dual-phase formulation has two key advantages. First, it preserves the distinct motion
characteristics of each phase, allowing the model to learn more fine-grained, phase-specific features. Second,
it mitigates the temporal bias inherent in single-DI representations by balancing the representation
across both phases. In summary, by applying ARP in a phase-aware manner, our dual-phase dynamic
imaging method captures both the escalation and de-escalation of facial expressions, providing a more
comprehensive and balanced motion representation for MER.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dynamic Images Attention Network (DIANet)</title>
        <p>To fully exploit the asymmetric and complementary motion patterns present in micro-expression
sequences, this paper proposes a novel dual-stream classification architecture called Dynamic Images
Attention Network (DIANet) (as illustrated in Fig. 1). Unlike conventional approaches that rely on
a single dynamic image over the entire sequence, DIANet introduces a phase-aware formulation by
explicitly modeling the two key motion intervals: onset–apex (rising phase) and apex–offset (fading
phase). Each phase is encoded into a dynamic image, referred to as DI-Onset and DI-Offset, which are
processed independently in parallel streams.</p>
        <p>Each stream in DIANet employs a shared backbone based on a modified EfficientNetV2 [13]
architecture. The classification head is removed to retain only high-level feature extraction, ensuring consistency
and parameter efficiency across both branches. This backbone was selected based on its superior
performance in the ablation study (see Section 3.6), offering a strong balance between compactness and
representational power - crucial for handling subtle motion in limited micro-expression data.</p>
        <p>For each input stream, the backbone outputs a 512-dimensional feature vector: $f_1$ for the onset–apex
phase and $f_2$ for the apex–offset phase. To bridge the semantic gap and promote information exchange
between the two phases, DIANet introduces a novel cross-attention fusion module, as illustrated in
Fig. 2. This module enables bidirectional interaction between DI-Onset and DI-Offset features through
learnable query-key-value projections. Each feature vector attends to the other, allowing the model
to focus on salient motion cues across both phases. The resulting attention-enhanced features are
concatenated and passed through a projection layer to form a unified representation.</p>
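        <p>A minimal PyTorch sketch of one plausible reading of this fusion step is given below; treating the two 512-dimensional stream features as a two-token sequence and the use of nn.MultiheadAttention are our assumptions, since the exact projection layout of DIANet is not specified here.</p>
        <preformat>
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of the cross-attention fusion block.

    The two stream features are stacked as a two-token sequence so that each one
    can attend to the other through learnable query-key-value projections.
    """
    def __init__(self, dim: int = 512, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f_onset: torch.Tensor, f_offset: torch.Tensor) -> torch.Tensor:
        # f_onset, f_offset: (B, dim) features from the DI-Onset and DI-Offset streams.
        tokens = torch.stack([f_onset, f_offset], dim=1)   # (B, 2, dim)
        attended, _ = self.attn(tokens, tokens, tokens)    # each token attends to the other
        fused = attended + tokens                          # residual, attention-enhanced features
        return self.proj(fused.flatten(start_dim=1))       # concatenate and project to (B, dim)
        </preformat>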
        <p>The fused feature is then passed through a lightweight multilayer perceptron (MLP) classifier,
consisting of two fully connected layers with ReLU activation and dropout regularization, to predict the
final emotion class. This architectural design balances expressiveness and efficiency, making it suitable
for micro-expression recognition under limited data conditions.</p>
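        <p>The overall dual-stream pipeline can then be assembled roughly as follows; the torchvision EfficientNetV2-S backbone, the 1280-to-512 reduction layer, and the hidden width and dropout rate of the MLP head are illustrative assumptions rather than the exact published configuration.</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision import models

class DIANet(nn.Module):
    """Sketch of the dual-stream pipeline: shared backbone, fusion, MLP head."""
    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        backbone = models.efficientnet_v2_s(weights=None)  # pretrained weights could be loaded
        backbone.classifier = nn.Identity()                 # drop the classification head
        self.backbone = backbone                            # shared by both streams
        self.reduce = nn.Linear(1280, feat_dim)             # EfficientNetV2-S emits 1280-d features
        self.fusion = CrossAttentionFusion(feat_dim)        # module sketched in the previous block
        self.head = nn.Sequential(                          # lightweight two-layer MLP classifier
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, di_onset: torch.Tensor, di_offset: torch.Tensor):
        f1 = self.reduce(self.backbone(di_onset))   # onset-apex stream feature
        f2 = self.reduce(self.backbone(di_offset))  # apex-offset stream feature
        logits = self.head(self.fusion(f1, f2))
        return logits, f1, f2                       # f1, f2 are reused by the consistency loss
        </preformat>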
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Loss function</title>
        <p>To enhance representational coherence between the two temporal phases, this paper introduces a
consistency regularization term that encourages alignment between the features learned from the
onset–apex and apex–offset streams. Specifically, a cosine-based consistency loss is applied to penalize
discrepancies between the corresponding feature vectors. To the best of our knowledge, this is the first
application of cross-phase regularization in the context of micro-expression recognition using dynamic
image representations.</p>
        <p>The consistency loss is defined as the average cosine dissimilarity across a mini-batch:
$$\mathcal{L}_{\mathrm{cons}} = \frac{1}{B} \sum_{i=1}^{B} \left( 1 - \cos\!\left( f_1^{(i)}, f_2^{(i)} \right) \right), \qquad (4)$$
where $f_1^{(i)}$ and $f_2^{(i)}$ denote the feature vectors from the DI-Onset and DI-Offset streams for the $i$-th
sample, and $B$ is the batch size.</p>
        <p>The total training objective combines the standard cross-entropy classification loss $\mathcal{L}_{\mathrm{CE}}$ with the
proposed consistency loss:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda \cdot \mathcal{L}_{\mathrm{cons}}, \qquad (5)$$
where $\lambda$ is a regularization coefficient that balances classification accuracy and cross-phase feature
alignment. This joint objective promotes both discriminative learning and temporal consistency, leading
to more robust and generalizable micro-expression recognition.</p>
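        <p>A compact PyTorch sketch of Eqs. (4) and (5) is given below; the default value of the weight lam (standing in for $\lambda$) is a placeholder, as the coefficient is not specified here.</p>
        <preformat>
import torch.nn.functional as F

def dianet_loss(logits, labels, f1, f2, lam: float = 0.1):
    """Total objective of Eq. (5); `lam` is an assumed placeholder value."""
    ce = F.cross_entropy(logits, labels)                       # classification term
    # Eq. (4): mean cosine dissimilarity between the phase-specific feature vectors.
    cons = (1.0 - F.cosine_similarity(f1, f2, dim=1)).mean()
    return ce + lam * cons
        </preformat>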
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>To evaluate the effectiveness of the proposed method, experiments are conducted on three publicly
available benchmark datasets commonly used in micro-expression recognition:
• CASME-II [14]: This dataset consists of 255 spontaneous micro-expression samples from 26
subjects. Recordings were captured under laboratory conditions at a high frame rate of 200 fps,
allowing precise temporal localization of facial movements. All samples were manually annotated
with emotion labels and apex frames.
• SAMM [15]: The SAMM dataset contains 159 high-resolution micro-expression samples from
32 participants. Similar to CASME-II, the recordings were captured at 200 fps in a controlled
environment. Emotion annotations were provided by expert coders based on FACS (Facial Action
Coding System) criteria.
• MMEW [16]: The MMEW dataset includes 300 spontaneous micro-expression samples collected
in more naturalistic settings. Videos were recorded at 90 fps and annotated with six emotion
categories. MMEW introduces more variation in head pose and lighting, making it a challenging
benchmark for generalization.</p>
        <p>To ensure balanced evaluation, emotion classes with fewer than 10 samples are excluded. After
filtering, CASME-II is reduced to five categories (i.e., disgust, happiness, surprise, repression, others),
SAMM to five categories (i.e., happiness, surprise, anger, contempt, others) and MMEW to six categories
(i.e., happiness, sadness, fear, others, surprise, disgust). To improve generalization and address data
scarcity, standard data augmentation techniques are employed during training. These include horizontal
flipping and random in-plane rotations within the range of ±10 degrees. All video sequences are
temporally normalized and resized to a fixed spatial resolution before being processed into dynamic
image representations.</p>
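        <p>A torchvision-style sketch of this augmentation and preprocessing pipeline is shown below; the flip probability and the placement of the resize step are assumptions not stated in the text.</p>
        <preformat>
from torchvision import transforms

# Training-time augmentation applied to each dynamic image (values partly assumed).
train_transform = transforms.Compose([
    transforms.ToPILImage(),                   # input: uint8 HxWxC dynamic image
    transforms.Resize((224, 224)),             # fixed spatial resolution (Section 3.4)
    transforms.RandomHorizontalFlip(p=0.5),    # horizontal flipping
    transforms.RandomRotation(degrees=10),     # in-plane rotation within +/-10 degrees
    transforms.ToTensor(),
])
        </preformat>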
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Protocol</title>
        <p>The evaluation follows the Leave-One-Subject-Out (LOSO) cross-validation protocol, which is commonly
used in the MER field to assess subject-independent generalization. In each iteration, the samples from
one subject are reserved for testing, while the remaining data are used for training.</p>
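        <p>The protocol can be sketched with scikit-learn's LeaveOneGroupOut, as below; the toy arrays and the subject grouping are placeholders standing in for the real dynamic-image features and labels.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
features = rng.normal(size=(12, 8))      # 12 placeholder samples with 8-d features
labels = rng.integers(0, 3, size=12)     # 3 placeholder emotion classes
subjects = np.repeat([0, 1, 2, 3], 3)    # 4 subjects, 3 samples each

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(features, labels, groups=subjects)):
    # All samples of exactly one subject are held out for testing in each fold.
    held_out = set(subjects[test_idx])
    print(f"fold {fold}: testing on subject {held_out}, training on the rest")
        </preformat>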
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Metric</title>
        <p>Model performance is evaluated using overall Accuracy (Acc), which reflects the proportion of correctly
classified samples over the total number of samples, as shown in Eq. (6):</p>
        <p>$$\mathrm{Acc} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} \qquad (6)$$</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Implementation Details</title>
        <p>All experiments are conducted using PyTorch. Input dynamic images are resized to 224×224 and
normalized. Models are trained for 50 epochs using the Adam optimizer with a learning rate of 10−4
and a batch size of 32. Early stopping is applied based on validation loss. The training objective includes
the consistency loss described in Section 2.3 to encourage alignment between phase-specific feature
representations.</p>
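        <p>A minimal training-loop sketch reflecting these settings is given below; the data loaders, the validation routine, and the early-stopping patience are placeholders, as they are not specified in the text.</p>
        <preformat>
import torch

model = DIANet(num_classes=5)                        # model sketched in Section 2.2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

best_val, bad_epochs, patience = float("inf"), 0, 5  # patience value is assumed
for epoch in range(50):
    model.train()
    for di_onset, di_offset, labels in train_loader:  # placeholder DataLoader of DI pairs
        optimizer.zero_grad()
        logits, f1, f2 = model(di_onset, di_offset)
        loss = dianet_loss(logits, labels, f1, f2)     # joint objective from Section 2.3
        loss.backward()
        optimizer.step()
    val_loss = evaluate(model, val_loader)             # placeholder validation routine
    if val_loss + 1e-9 >= best_val:                    # no improvement on validation loss
        bad_epochs += 1
        if bad_epochs >= patience:                     # early stopping
            break
    else:
        best_val, bad_epochs = val_loss, 0
        </preformat>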
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Results</title>
        <p>Table 1 presents a comparative analysis of the proposed method (DIANet) against recent state-of-the-art
approaches that utilize dynamic images or DI-inspired representations. The evaluation is conducted
across three widely used micro-expression datasets: CASME-II, SAMM, and MMEW.</p>
        <p>Our proposed method DIANet achieves the highest accuracy on two of the three datasets: 68.89% on
SAMM and 64.24% on MMEW, demonstrating its robustness across both high-resolution and naturalistic
micro-expression scenarios. These results highlight the advantage of phase-aware modeling in capturing
localized motion patterns that are often overlooked in single-phase or holistic DI-based approaches. On
the CASME-II dataset, DIANet achieves a competitive accuracy of 70.00%, demonstrating performance
comparable to recent advanced methods such as GEME [17] (75.20%) and FDCN [18] (73.09%). GEME
leverages gender information as an auxiliary input to capture identity-dependent dynamics, but at the
cost of increased model complexity and reliance on demographic metadata, which may raise practical
and ethical concerns. FDCN combines dynamic images with optical flow to enrich motion features, yet
its multi-stream architecture introduces additional computational overhead and modality alignment
challenges. In contrast, DIANet relies solely on phase-specific dynamic
images derived from the original video sequence, requiring no auxiliary labels or multi-modal fusion.
By explicitly separating motion into the onset-apex and apex-offset phases, DIANet learns more
fine-grained, phase-sensitive features that are often lost in conventional DI-based approaches. This design
not only simplifies the model architecture but also improves interpretability and portability across
datasets.
(*) These results are from our re-implementation using the model provided by the authors.</p>
      <p>Compared to LEARNet [19], a widely used DI-based baseline, DIANet yields substantial improvements:
+11.22% on CASME-II, +9.63% on SAMM, and +15.28% on MMEW. These gains confirm that explicitly
modeling phase transitions enhances the model’s ability to capture expressive facial motion. Other
DI-motivated methods, such as OrigiNet [9] and AffectiveNet [10], also fall behind in generalization,
with particularly low performance on SAMM (e.g., 34.89%), suggesting that simple enhancements over
DI are insufficient for cross-dataset robustness.</p>
        <p>In summary, the consistent performance of DIANet across diverse datasets without reliance on
handcrafted rules, demographic information, or multi-modal fusion, demonstrates its effectiveness
and generalization. The results underscore the importance of phase-aware modeling as a lightweight
yet powerful strategy for advancing micro-expression recognition in both constrained and real-world
environments.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Ablation study</title>
        <sec id="sec-3-6-1">
          <title>3.6.1. Backbone selection for DIANet</title>
          <p>The performance of DIANet is influenced not only by its dual-stream, phase-aware design but also by
the feature extraction backbone used in each stream. To investigate the impact of backbone architecture,
we evaluate four representative convolutional backbones: ResNet18 [20], EfficientNetV2 [13],
ConvNeXt [21], and MobileViT [22]. These models are selected based on their widespread use, architectural
diversity, and relevance to tasks involving subtle motion recognition under data constraints. Table 2
illustrates the performance of DIANet using different backbone architectures across three benchmark
datasets.</p>
          <p>All backbones are integrated into the same DIANet framework and trained under identical
experimental settings. Each model processes a pair of dynamic images corresponding to the onset–apex
and apex–offset phases, and outputs are fused via the same cross-attention mechanism. The goal is to
isolate the effect of the backbone itself on performance across three micro-expression datasets.
• ResNet18 serves as a strong and widely adopted baseline. It provides a good trade-off between
depth and computational cost, making it suitable for MER tasks with limited training data. On
MMEW, it surprisingly achieves the highest accuracy (64.94%), possibly due to its robustness to
noisy and unconstrained environments.
• EfficientNetV2 is a recent lightweight architecture that incorporates compound scaling and
advanced training optimizations. It achieves the best performance on both CASME-II (70.00%)
and SAMM (68.89%), highlighting its superior capacity to extract discriminative features from
phase-specific DIs. Its consistent results across datasets suggest that it offers a strong balance
between efficiency and representational power.
• ConvNeXt adapts design elements from vision transformers into a CNN framework. Although it
has shown promising results in large-scale classification tasks, its relatively lower performance
here (e.g., 59.17% on CASME-II) indicates that deeper and heavier models may not always
generalize well in low-data, fine-grained settings such as MER.
• MobileViT combines the strengths of CNNs and transformers in a compact form, making it
attractive for lightweight applications. However, its performance lags behind other models across
all datasets, possibly due to underfitting or difficulties in learning temporally localized features
from DIs.</p>
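          <p>For reference, the headless backbones used in this comparison can be instantiated roughly as follows; the torchvision model names and output dimensions are assumptions about the implementation, and MobileViT is omitted since it is not shipped with torchvision (it is typically loaded from a separate library such as timm).</p>
          <preformat>
import torch.nn as nn
from torchvision import models

def build_backbone(name: str):
    """Return a (headless backbone, feature dimension) pair for the ablation."""
    if name == "resnet18":
        m = models.resnet18(weights=None)
        dim = m.fc.in_features              # 512
        m.fc = nn.Identity()
    elif name == "efficientnet_v2_s":
        m = models.efficientnet_v2_s(weights=None)
        dim = m.classifier[1].in_features   # 1280
        m.classifier = nn.Identity()
    elif name == "convnext_tiny":
        m = models.convnext_tiny(weights=None)
        dim = m.classifier[2].in_features   # 768
        m.classifier = nn.Identity()
    else:
        raise ValueError(f"unknown backbone: {name}")
    return m, dim
          </preformat>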
          <p>Overall, EfficientNetV2 provides the best trade-off between accuracy and efficiency. Its strong
performance on both controlled (CASME-II, SAMM) and in-the-wild (MMEW) datasets suggests it is
better suited for capturing the nuanced spatiotemporal features present in dynamic images of
micro-expressions. In contrast, larger or transformer-based backbones like ConvNeXt and MobileViT may
require more data or different training strategies to be effective in this domain. These findings support
the use of modern, lightweight CNNs as backbones in MER, especially when paired with task-specific
representations like phase-separated DIs.</p>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Attention block selection for DIANet</title>
        <p>To assess the effectiveness of our proposed Cross Attention Fusion Block, we compare it with a simplified
variant named Simple Attention Block, which uses a basic dot-product attention and residual fusion.</p>
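        <p>A hypothetical reading of this simplified baseline is sketched below; the scalar dot-product gating and the residual form are our interpretation of "basic dot-product attention and residual fusion", not the exact ablation code.</p>
        <preformat>
import torch
import torch.nn as nn

class SimpleAttentionBlock(nn.Module):
    """Hypothetical baseline: plain dot-product attention plus residual fusion."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # A single scalar weight from the (scaled) dot product of the two stream features.
        w = torch.sigmoid((f1 * f2).sum(dim=-1, keepdim=True) / f1.shape[-1] ** 0.5)
        f1_res = f1 + w * f2      # residual fusion, no learnable query-key-value projections
        f2_res = f2 + w * f1
        return self.proj(torch.cat([f1_res, f2_res], dim=-1))
        </preformat>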
        <p>As shown in Table 3, replacing the Cross Attention Fusion Block with Simple Attention Block
results in a notable performance drop across all datasets: from 70.00% to 66.42% on CASME-II, from
68.89% to 63.16% on SAMM, and from 64.24% to 50.08% on MMEW. The performance degradation is
especially pronounced on MMEW, which contains greater variability in pose and lighting. This suggests
that simple attention fails to capture the nuanced interactions between onset–apex and apex–offset
streams under challenging conditions. In contrast, the Cross Attention Fusion Block allows each
stream to dynamically attend to salient features in the other via learnable query-key-value projections,
promoting rich bidirectional interaction and context-aware fusion. This leads to more discriminative
and phase-sensitive feature representations, which are critical in recognizing subtle micro-expressions.</p>
        <p>These findings demonstrate that while simple attention mechanisms offer computational simplicity,
they are insufficient for modeling the asymmetric and complementary dynamics of micro-expression
phases. The proposed cross-attention strategy, albeit slightly more complex, substantially improves
recognition performance and justifies its integration into DIANet.</p>
        <sec id="sec-3-7-1">
          <title>3.7.1. Effect of phase-wise dynamic images</title>
          <p>
To assess the contribution of phase-specific motion modeling, we conduct an ablation study comparing
different input configurations within the DIANet framework, as summarized in Table 4. Specifically, we
evaluate three types of inputs: (1) standard dynamic images generated from the entire sequence (as
used in LEARNet [19]), (2) dynamic images computed separately for each temporal phase (DI-Onset and
DI-Offset), and (3) a dual-stream setting where both DI-Onset and DI-Offset are processed in parallel.
(*) These results are from our re-implementation using the model provided by the authors.
          </p>
          <p>In the single-stream setting, using phase-specific DIs (either DI-Onset or DI-Offset) yields better
performance than using a holistic dynamic image. For instance, on CASME-II, DI-Onset improves
accuracy from 56.67% (standard DI) to 60.00%, while on MMEW, it raises accuracy from 48.96% to 52.78%.
Similarly, DI-Offset provides the best single-phase result on SAMM (61.48%), outperforming both the
standard DI and DI-Onset. These improvements indicate that motion restricted to a specific phase,
either the rising (onset–apex) or falling (apex–offset) interval, contains more discriminative features for
recognizing micro-expressions than aggregating motion over the entire sequence. This supports our
initial hypothesis that the dynamics of MEs follow an asymmetric temporal pattern, and treating the
sequence holistically may dilute critical motion cues.</p>
          <p>In the dual-stream configuration, DIANet processes both DI-Onset and DI-Offset simultaneously,
with a cross-attention fusion module learning to integrate complementary information from the two
phases. This setup achieves the best performance across all datasets: 70.00% on CASME-II, 68.89% on
SAMM, and 64.24% on MMEW. Notably, the dual-stream DIANet outperforms all single-stream variants,
including both phase-specific and standard DI inputs.</p>
          <p>These results provide strong empirical evidence for the effectiveness of phase-aware modeling in
MER. The dual-stream input not only leverages the unique temporal properties of each phase but also
enables the model to learn richer and more balanced representations of facial motion. The significant
performance gap between the single-phase and dual-phase settings further validates the design choice
of using DI-Onset and DI-Offset as complementary inputs within the DIANet architecture.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper has proposed DIANet, the first dual-stream micro-expression recognition framework that
leverages phase-aware dynamic images to model facial motion with greater temporal precision. By
explicitly separating the onset–apex and apex–offset phases, the proposed approach captures
complementary motion dynamics that are often overlooked in holistic or single-phase representations.
The two streams are integrated through a cross-attention mechanism and guided by a consistency
objective, enabling the network to learn more discriminative and temporally aligned features. Extensive
experiments conducted on three benchmark datasets (i.e., CASME-II, SAMM, and MMEW) demonstrate
the effectiveness of the proposed method. DIANet achieves 70.00%, 68.89%, and 64.24% accuracy on these
datasets, respectively, outperforming or matching state-of-the-art approaches without requiring
auxiliary modalities or complex fusion schemes. These results underscore the importance of phase-aware
modeling in micro-expression recognition. The proposed framework offers a lightweight yet powerful
solution that generalizes well across both controlled and unconstrained settings. Future work may
explore extending phase-aware modeling to other motion representations and incorporating temporal
uncertainty in apex estimation.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work has been supported by HORIZON-MSCA-SE-2022 PhySU-Net 241 project ACMod (grant
101130271).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Grammar and spelling
check, Paraphrase and reword, Improve writing style. After using this tool/service, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[6] T. Pfister, X. Li, G. Zhao, M. Pietikäinen, Recognising spontaneous facial micro-expressions, in: International Conference on Computer Vision (ICCV), 2011, pp. 1449–1456.</p>
      <p>[7] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, Dynamic image networks for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3034–3042.</p>
      <p>[8] X. Li, Q. Wu, G. Zhao, X. Hong, Deep learning for micro-expression recognition: A survey, IEEE Transactions on Affective Computing (2022).</p>
      <p>[9] G. Verma, A. Dhall, A non-ordinal approach to micro-expression recognition, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 2436–2445.</p>
      <p>[10] G. Verma, A. Dhall, AffectiveNet: Affective-motion feature learning for micro-expression recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 2644–2651.</p>
      <p>[11] S. Liong, J. See, K. Wong, R. Phan, N. Cheung, Less is more: Micro-expression recognition with apex frame, Image and Vision Computing 59 (2017) 29–39.</p>
      <p>[12] K. Zhang, F. Wang, G. Liu, H. Li, Y. Fu, Facial micro-expression recognition based on apex frame-guided cross-attention network, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 14361–14370.</p>
      <p>[13] M. Tan, Q. Le, EfficientNetV2: Smaller models and faster training, in: Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR, 2021.</p>
      <p>[14] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, X. Fu, CASME II: An improved spontaneous micro-expression database and the baseline evaluation, PLoS ONE 9 (2014) e86041.</p>
      <p>[15] A. K. Davison, C. Lansley, N. Costen, K. Tan, M. H. Yap, SAMM: A spontaneous micro-facial movement dataset, IEEE Transactions on Affective Computing 9 (2018) 116–129. doi:10.1109/TAFFC.2016.2573832.</p>
      <p>[16] X. Ben, Y. Ren, J. Zhang, S.-J. Wang, K. Kpalma, W. Meng, Y.-J. Liu, Video-based facial micro-expression analysis: A survey of datasets, features and algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021) 5826–5846.</p>
      <p>[17] X. Nie, M. A. Takalkar, M. Duan, H. Zhang, M. Xu, GEME: Dual-stream multi-task gender-based micro-expression recognition, Neurocomputing 427 (2021) 13–28.</p>
      <p>[18] J. Tang, L. Li, M. Tang, J. Xie, A novel micro-expression recognition algorithm using dual-stream combining optical flow and dynamic image convolutional neural networks, Signal, Image and Video Processing 17 (2023) 769–776.</p>
      <p>[19] M. Verma, S. K. Vipparthi, G. Singh, S. Murala, LEARNet: Dynamic imaging network for micro expression recognition, IEEE Transactions on Image Processing 29 (2019) 1618–1627.</p>
      <p>[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.</p>
      <p>[21] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A ConvNet for the 2020s, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 11976–11986.</p>
      <p>[22] S. Mehta, M. Rastegari, MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer, in: International Conference on Learning Representations (ICLR), 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhushan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Goecke</surname>
          </string-name>
          ,
          <article-title>Spotting facial micro-expressions in long videos using spatiotemporal recurrent convolutional networks</article-title>
          ,
          <source>in: International Conference on Automatic Face and Gesture Recognition (FG)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Fast facial expression recognition using local binary patterns, directional map features and a boosted classifier</article-title>
          ,
          <source>in: IEEE Transactions on Systems, Man, and Cybernetics</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Polikovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kameda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <article-title>Facial micro-expressions recognition using high speed camera and 3d-gradient descriptor</article-title>
          ,
          <source>in: 3rd International Conference on Crime Detection and Prevention (ICDP)</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pietikainen</surname>
          </string-name>
          ,
          <article-title>Spontaneous micro-expression recognition using lbp-top</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>4</volume>
          (
          <year>2013</year>
          )
          <fpage>146</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>See</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <article-title>Optical strain based recognition of subtle emotions</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>175</volume>
          (
          <year>2016</year>
          )
          <fpage>830</fpage>
          -
          <lpage>838</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>