<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Micro-Expression Recognition Method Based on an Uncertainty-Aware Mixing Strategy and Multimodal Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qian Gao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weijia Feng</string-name>
          <email>weijiafeng@tjnu.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jia Guo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiayi An</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaofeng Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuanxu Chen</string-name>
          <email>chenyuanxu641@pingan.com.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ping An Technology Shenzhen</institution>
          ,
          <addr-line>Rm1201, Bld B, Pingan IFC, Xinyuan South Rd, Chaoyang District, Beijing, 100027</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tianjin Key Laboratory for Advanced Mechatronic System Design and Intelligent Control, Tianjin University of Technology</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Tianjin Normal University</institution>
          ,
          <addr-line>No. 393, Binshui West Road, Xiqing District, Tianjin, 300387</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Micro-expression recognition, a critical research direction in affective computing, holds significant value due to its wide-ranging applications in real-world scenarios such as interrogations, clinical diagnostics, and business negotiations. Currently, micro-expression datasets are limited in scale and challenging to annotate. The significant imbalance in sample sizes across different types of micro-expressions causes models to bias toward majority classes during training, while minority-class samples receive insufficient attention. This results in overfitting and poor generalization performance in existing recognition methods. Furthermore, most methods rely solely on local information from micro-expression sequences, overlooking certain dynamic features, which adversely impacts recognition performance. To address potential overfitting issues, we propose a micro-expression recognition method based on uncertainty awareness and multimodal fusion. By integrating uncertainty estimation to weight mixed samples, our approach guides the multimodal model to focus more on underperforming samples. Additionally, recognition efficiency is further enhanced by incorporating optical flow parameters from micro-expression images. Experimental validation demonstrates that our method achieves significant improvements across multiple key metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Micro-expression</kwd>
        <kwd>Uncertainty</kwd>
        <kwd>Multimodal fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Micro-expression recognition (MER) is a key area in affective computing, holding great significance
due to its wide applications in fields such as interrogation, clinical diagnosis, and business negotiations.
Micro-expressions are brief and subtle facial expressions that typically occur when individuals attempt
to conceal their true emotions. They last for a very short duration (usually no more than 0.5 seconds),
which makes MER an especially challenging task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In recent years, with the development of deep
learning, this field has made remarkable progress. However, current micro-expression datasets are
limited in size and difficult to annotate, making model training prone to overfitting and poor
generalization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For instance, datasets like CASME, CASME II, and SMIC contain a limited number of samples,
which constrains the performance of deep learning models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additionally, some micro-expression
samples are difficult to classify due to indistinct features or high similarity with other classes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Therefore, developing more effective data augmentation methods and feature learning strategies to
improve recognition accuracy and robustness is a key research objective. For example, the MR-UAMF
method addresses class imbalance through uncertainty-aware mixing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and significantly improves
the recognition accuracy of minority classes. Meanwhile, other studies have introduced attention
mechanisms or improved neural network architectures [6] to enhance focus on critical features and
thus improve recognition performance.
      </p>
      <p>
        Existing micro-expression datasets are small and hard to annotate, causing overfitting and limited
model generalization [7]. Many current methods rely on handcrafted features like LBP and LBP-TOP.
While simple and effective, they struggle with the subtle dynamics of micro-expressions [6], limiting
their ability to leverage deep learning's full potential [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Furthermore, existing approaches often fail to
adequately extract dynamic features, as they make little use of temporal sequence information,
which negatively impacts recognition accuracy and robustness.
      </p>
      <p>
        Micro-expression recognition (MER) methods can be classified into three categories. The first is
traditional feature-based methods, which rely on handcrafted features such as LBP, LBP-TOP, Histogram
of Oriented Gradients (HOG), and optical flow. While these approaches are straightforward and efficient,
they struggle to capture subtle and dynamic changes. For example, Li et al. [8] proposed an
LBP-TOP-based method that integrates temporal and spatial features but performs poorly with complex dynamic
changes. The second is deep learning-based methods, which mainly utilize Convolutional Neural
Networks (CNNs) and their variants, such as 3D CNNs, to automatically learn features. These methods
perform well in feature learning but require large amounts of annotated data, making them prone to
overfitting due to limited dataset sizes. For instance, Zhang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced a 3D CNN method using
a multi-stream structure to capture spatiotemporal features but encountered overfitting on small datasets.
The last category is data augmentation methods, including Generative Adversarial Networks (GANs)
and synthetic data generation techniques. While they expand datasets, issues such as authenticity
and potential data bias remain. For example, Wang et al. [9] used GANs to generate synthetic
micro-expression data, effectively expanding the dataset, though the realism of the generated data still
requires improvement. Some studies have explored attention mechanisms in MER. For instance, Hao
et al. proposed a hierarchical spatiotemporal attention mechanism that automatically focuses on key
regions and time segments of micro-expressions, significantly improving accuracy. Despite this progress,
research gaps remain. Dataset sizes are still small and annotation is difficult, leading to overfitting and
limited generalization. Additionally, dynamic features are underutilized, and data imbalance remains a
challenge. MR-UAMF, for example, addresses imbalance through uncertainty-aware mixing, improving
minority-class accuracy [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">1, 2, 4</xref>
        ].
      </p>
      <p>To address the aforementioned challenges, we propose a micro-expression recognition method based
on uncertainty awareness and multimodal fusion (MR-UAMF). This approach weights mixed samples
based on their uncertainty, encouraging the model to focus more on samples with lower performance,
thereby mitigating overfitting. A multimodal model is used to process optical flow and image features
separately, fully leveraging the dynamic and spatial information of micro-expressions to enhance
accuracy and robustness. By integrating uncertainty quantification, micro-attention mechanisms,
and a 3D CNN, this approach offers a novel perspective and method for micro-expression recognition,
advancing the field. Extensive experiments across multiple datasets validate the effectiveness and
superiority of our approach.</p>
      <p>The remainder of this paper is structured as follows: Section 2 describes the proposed framework
of MR-UAMF, including the overall structure, core algorithms, and implementation details. Section 3
presents the experimental setup, dataset descriptions, and evaluation metrics. Section 4 interprets experimental results and discusses academic
implications and potential applications. Section 5 summarizes our contributions and outlines future
research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Overall Framework</title>
        <p>Our proposed framework integrates optical flow feature extraction, a focused uncertainty-aware
mixing strategy (FU-MIX), a micro-attention mechanism, and a shallow triple-stream 3D CNN. The
goal is to effectively capture both spatial and temporal features of micro-expressions while addressing
data imbalance and overfitting.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Optical Flow Feature Extraction</title>
        <p>Given the brief and subtle nature of micro-expressions, we first extract optical flow features from the
video sequence to obtain motion information.</p>
        <p>We compute optical flow-guided features using the onset frame and the apex frame. The optical flow
field between the two frames is represented as a tuple:
$$F = \{(u(x, y), v(x, y)) \mid x = 1, 2, \ldots, W;\ y = 1, 2, \ldots, H\},$$
where $W$ and $H$ denote the width and height of the frame, and $u(x, y)$ and $v(x, y)$ are the horizontal
and vertical components of $F$, respectively.</p>
        <p>We also compute optical strain to approximate the intensity of facial deformation. With the displacement vector $\mathbf{u} = [u, v]^{\mathsf{T}}$, the optical strain tensor is
$$\varepsilon = \frac{1}{2}\left[\nabla \mathbf{u} + (\nabla \mathbf{u})^{\mathsf{T}}\right],$$
and the magnitude of the optical strain is
$$|\varepsilon| = \sqrt{\left(\frac{\partial u}{\partial x}\right)^{2} + \left(\frac{\partial v}{\partial y}\right)^{2} + \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right)^{2}}.$$
Appending the optical strain to the optical flow field, we form the triplet
$$\Theta = \{u, v, \varepsilon\} \in \mathbb{R}^{3}.$$
Each video thus yields three types of optical flow-based representations: the horizontal component $u$,
the vertical component $v$, and the optical strain $\varepsilon$.</p>
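        <p>As an illustrative sketch (not the exact pipeline used in our experiments), the flow field and strain magnitude can be computed with OpenCV and NumPy as follows; the Farneback estimator and its parameters are assumptions, since the text does not fix a particular flow algorithm:</p>
        <preformat>
# Sketch: compute (u, v) between onset and apex frames plus the optical
# strain magnitude. The Farneback estimator is an assumed choice; the text
# does not specify which flow algorithm is used.
import cv2
import numpy as np

def flow_triplet(onset_gray: np.ndarray, apex_gray: np.ndarray) -> np.ndarray:
    """Return the (H, W, 3) cube Theta = {u, v, epsilon}."""
    flow = cv2.calcOpticalFlowFarneback(
        onset_gray, apex_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    u, v = flow[..., 0], flow[..., 1]

    # Spatial derivatives of the flow (np.gradient returns d/dy, then d/dx).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)

    # |eps| = sqrt((du/dx)^2 + (dv/dy)^2 + 0.5 * (du/dy + dv/dx)^2)
    eps = np.sqrt(du_dx**2 + dv_dy**2 + 0.5 * (du_dy + dv_dx)**2)
    return np.stack([u, v, eps], axis=-1)
        </preformat>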
      </sec>
      <sec id="sec-2-3">
        <title>2.3. FU-MIX: Focused Uncertainty-Aware Mixing</title>
        <p>Uncertainty Estimation. To enhance robustness and generalization, we estimate sample uncertainty
via Bayesian sampling from the model's posterior $p(\theta; D)$. For a given sample $x_i$, its uncertainty $u_i$ is defined as
$$u_i = \int \mathbb{1}\big(y_i \neq \hat{y}_{\theta}(x_i)\big)\, p(\theta; D)\, d\theta,$$
where $\mathbb{1}\big(y_i \neq \hat{y}_{\theta}(x_i)\big)$ indicates misclassification. We approximate this integral using Monte Carlo sampling:
$$u_i \approx \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\big(y_i \neq \hat{y}_{\theta_t}(x_i)\big),$$
where $\theta_t$ is sampled by minimizing expected risk. In practice, historical training trajectory information
is used for the approximation.</p>
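        <p>A minimal sketch of this trajectory-based approximation follows, assuming checkpoints saved at past epochs stand in for posterior draws $\theta_t$; the helper name and loader conventions are illustrative:</p>
        <preformat>
# Sketch: Monte Carlo uncertainty from the training trajectory. Each saved
# checkpoint plays the role of one posterior draw theta_t; u_i is the
# fraction of checkpoints that misclassify sample i.
import torch

@torch.no_grad()
def trajectory_uncertainty(checkpoints, model, loader, device="cpu"):
    """checkpoints: list of state_dicts from past epochs (assumed to
    approximate samples from p(theta; D)). loader must not shuffle."""
    wrong = None
    for state in checkpoints:
        model.load_state_dict(state)
        model.eval()
        errs = []
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            errs.append((pred != y).float())
        errs = torch.cat(errs)
        wrong = errs if wrong is None else wrong + errs
    return wrong / len(checkpoints)  # u_i in [0, 1] per sample
        </preformat>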
        <p>Weighted Mixed Sample Generation. Each sample is assigned a weight $w_i$ proportional to its
uncertainty:
$$w_i = \gamma u_i + \epsilon,$$
where $\gamma$ is a hyperparameter and $\epsilon$ is a small constant ensuring $w_i > 0$. Mixed samples are generated as
$$\tilde{x}_{i,j} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y}_{i,j} = \lambda y_i + (1 - \lambda) y_j,$$
where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. The weighted loss function is
$$\mathbb{E}_{(x_i, y_i), (x_j, y_j)}\big[\, w_i \lambda\, \ell(f(\tilde{x}_{i,j}), y_i) + w_j (1 - \lambda)\, \ell(f(\tilde{x}_{i,j}), y_j) \,\big].$$</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Micro-Attention Mechanism</title>
        <p>We adopt a parameter-efficient residual architecture with self-learned multi-scale features to compute
attention maps. Given input $X \in \mathbb{R}^{C \times H \times W}$, three convolutional layers ($1 \times 1$, $3 \times 3$, and $5 \times 5$) produce
feature maps $\{F_1, F_2, F_3\}$.</p>
        <p>These are concatenated to form $F' \in \mathbb{R}^{(C_1 + C_2 + C_3) \times H \times W}$, and the average feature map is generated by
$$A(X) = \frac{1}{C'} \sum_{c=1}^{C'} \big(F'_{c} * k\big),$$
where $k$ is a $1 \times 1$ convolution kernel. The residual attention map is
$$R(X) = 1 + \sigma\big(A(X)\big),$$
and the final output with attention is
$$H(X) = F(X) \cdot \big(1 + \sigma(A(X))\big),$$
where $\sigma$ is a normalization function. If $\sigma(A(X)) \approx 0$, the attention influence is minimized and the
original features pass through largely unchanged.</p>
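        <p>A compact sketch of this attention block follows, with the branch channel counts and the choice of sigmoid for the normalization $\sigma$ as assumptions:</p>
        <preformat>
# Sketch: multi-scale residual micro-attention block.
import torch
import torch.nn as nn

class MicroAttention(nn.Module):
    def __init__(self, channels: int, branch_ch: int = 8):
        super().__init__()
        # Three parallel convolutions yield multi-scale maps F1, F2, F3.
        self.b1 = nn.Conv2d(channels, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(channels, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(channels, branch_ch, kernel_size=5, padding=2)
        # A learned 1x1 kernel k reduces the concatenated maps to A(X).
        self.k = nn.Conv2d(3 * branch_ch, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_cat = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)  # F'
        a = self.k(f_cat)                      # A(X), one map per position
        return x * (1.0 + torch.sigmoid(a))    # H(X) = F(X) * (1 + sigma(A))

# Usage: att = MicroAttention(16); y = att(torch.randn(2, 16, 28, 28))
        </preformat>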
      </sec>
      <sec id="sec-2-5">
        <title>2.5. 3D CNN</title>
        <p>We design a shallow triple-stream 3D CNN to learn from the optical flow cube Θ. The input is resampled
to 28 × 28 × 3. Each stream includes:
• One 3D convolutional layer with kernel counts of 3, 5, and 8, respectively;
• One max-pooling layer.</p>
        <p>Outputs from the three streams are concatenated along the channel axis, followed by a 2 × 2 average
pooling layer. A fully connected layer with 400 nodes abstracts the features, and a softmax layer
classifies the output into three compound emotion categories.</p>
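        <p>The following sketch mirrors this triple-stream design; treating the 28 × 28 × 3 cube as a three-channel 2D input, along with the specific kernel sizes and pooling strides, are assumptions where the text leaves them open:</p>
        <preformat>
# Sketch: shallow triple-stream network over the optical flow cube Theta.
import torch
import torch.nn as nn

class TripleStreamNet(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        def stream(out_ch):  # one conv layer + one max-pool per stream
            return nn.Sequential(
                nn.Conv2d(3, out_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=3))
        self.s1, self.s2, self.s3 = stream(3), stream(5), stream(8)
        self.avg = nn.AvgPool2d(kernel_size=2, stride=2)
        # 28 -> 9 after the 3x3 max-pool, 9 -> 4 after the 2x2 avg-pool.
        self.fc = nn.Linear((3 + 5 + 8) * 4 * 4, 400)
        self.out = nn.Linear(400, num_classes)

    def forward(self, theta: torch.Tensor) -> torch.Tensor:
        # theta: (B, 3, 28, 28) with channels {u, v, strain}.
        z = torch.cat([self.s1(theta), self.s2(theta), self.s3(theta)], dim=1)
        z = self.avg(z).flatten(1)
        return self.out(torch.relu(self.fc(z)))  # softmax applied in the loss
        </preformat>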
      </sec>
      <sec id="sec-2-5">
        <title>2.6. Implementation Details</title>
        <p>We use the CASME, CASME II, SAMM, and SMIC datasets for training and evaluation, splitting each
dataset into 80% training and 20% testing. The FU-MIX strategy is employed as a data augmentation technique,
assigning greater weights to underperforming samples based on uncertainty estimates. This mitigates
overfitting and enhances recognition of minority expression categories.</p>
        <p>The 3D CNN extracts optical flow features through three parallel convolutional streams, and the
micro-attention mechanism highlights important regions via adaptive residual weighting. These modules
work synergistically to improve feature discriminability, thereby enhancing the model’s performance
and robustness in micro-expression recognition tasks.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets and Evaluation Metrics</title>
        <p>The datasets used in this paper are as follows. The first dataset is SMIC. It is the earliest spontaneous
micro-expression dataset, featuring recordings from three camera types: High-Speed (HS) at 100 fps,
and Visual (VIS) and Near-Infrared (NIR) at 25 fps. This study uses only HS camera data, with 164
samples from 16 participants across three classes: Negative, Positive, and Surprise.</p>
        <p>The second dataset is CASME. It was collected in a controlled lab at 60 fps and has 195 samples from
19 subjects. However, it suffers from class imbalance. We use 154 samples across four classes: Disgust, Repression,
Surprise, and Tense.</p>
        <p>The third dataset is CASME II. It enhances CASME with higher resolution (200 fps, 280 × 340 pixels),
featuring 248 samples from 26 subjects across five classes: Disgust, Happiness, Repression, Surprise, and
Others.</p>
        <p>The last dataset is SAMM. Gathered in a well-lit, stable setting with a grayscale camera at 200 fps
and a 2040 × 1088 resolution, it comprises 159 samples from 32 diverse participants. After excluding
classes with under 10 samples, such as Fear and Sadness, we utilize 134 samples spanning five classes:
Anger, Contempt, Happiness, Surprise, and Others.</p>
        <p>To comprehensively evaluate the model's performance, we adopt common classification metrics
including Accuracy, Precision, Recall, and F1-score. Given the class imbalance inherent in
micro-expression datasets, we also report two additional metrics: the Unweighted F1-score (UF1), the average of
class-wise F1-scores, and the Unweighted Average Recall (UAR), the average of class-wise recall rates. These
metrics offer a more balanced view of model performance, especially in imbalanced settings.</p>
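        <p>For reference, UF1 and UAR can be computed from per-class counts as in the sketch below (plain NumPy; the function name is illustrative):</p>
        <preformat>
# Sketch: unweighted F1 (UF1) and unweighted average recall (UAR); both
# average over classes with equal weight, regardless of class size.
import numpy as np

def uf1_uar(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    f1s, recalls = [], []
    for c in range(num_classes):
        tp = int(np.sum(np.logical_and(y_pred == c, y_true == c)))
        fp = int(np.sum(np.logical_and(y_pred == c, y_true != c)))
        fn = int(np.sum(np.logical_and(y_pred != c, y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
        recalls.append(recall)
    return float(np.mean(f1s)), float(np.mean(recalls))
        </preformat>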
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Comparative Experiments</title>
        <p>This section compares the proposed method with existing micro-expression recognition (MER) methods
across four widely used datasets: SMIC, CASME, CASME II, and SAMM. The compared methods
are categorized as follows: LBP-TOP (Local Binary Patterns from Three Orthogonal Planes), 3DHOG
(Three-Dimensional Histogram of Oriented Gradients), HOOF (Histogram of Oriented Optical Flow),
OFF-ApexNet, STSTNet, Dual-Inception, MACNN, Micro-Attention, and Mini-AORCNN.</p>
        <p>All methods were evaluated under identical settings to ensure fairness: the same number of samples,
class labels, and K-fold cross-validation protocols were used.</p>
        <p>
          We reproduced results for LBP-TOP, 3DHOG, and HOOF under consistent experimental conditions.
We also reproduced the performance of recent deep learning-based MER methods. The handcrafted
feature baselines used Support Vector Machine (SVM) classifiers. Our method outperforms these handcrafted
baselines significantly, demonstrating the advantages of deep learning and uncertainty-aware modeling
in handling the subtle and dynamic nature of micro-expressions [
          <xref ref-type="bibr" rid="ref1 ref2 ref4">1, 2, 4</xref>
          ]. These results demonstrate
that our focused uncertainty-aware method consistently outperforms or matches the state-of-the-art
across all datasets. The simplicity of our design enables robust and discriminative learning even under
limited data scenarios—a significant advantage for micro-expression datasets, which are typically small
in size. Moreover, uncertainty modeling improves generalization by enhancing the model’s ability to
handle non-linearities and minority class variations in real-world applications.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ablation Studies</title>
        <p>To evaluate the contribution of each component in our proposed micro-expression recognition (MER)
framework, we conducted a series of ablation experiments.</p>
        <p>The first configuration is the baseline model. Without any enhancements, it captures only fundamental
features and struggles with the complex variations inherent in micro-expressions. It achieves an
accuracy of only 69%, highlighting its limited capability in handling subtle emotional cues.</p>
        <p>The second configuration adds the focused uncertainty-aware mixing strategy (FU-MIX). Integrating this
strategy improves the model's adaptability to complex expression dynamics. The accuracy
increases by 10%, demonstrating the effectiveness of uncertainty awareness in enhancing model
robustness and mitigating overfitting.</p>
        <p>The third configuration adds multimodal feature fusion. When the multimodal fusion module is
added—combining both optical flow and spatial image features—the accuracy further improves by
5%, achieving the best performance. This indicates that multimodal fusion enriches feature
representations by integrating both temporal and spatial cues.</p>
        <p>The last configuration incorporates additional modules. Gradually adding auxiliary components such as
the micro-attention mechanism and the shallow triple-stream 3D CNN architecture leads to incremental
gains in accuracy. These modules help the model focus on salient spatiotemporal features and extract
multi-scale representations effectively.</p>
        <p>The ablation study clearly shows that each module contributes positively to the final performance;
the uncertainty-aware mixing strategy plays a central role in improving both accuracy and generalization; and multimodal
fusion and attention mechanisms further boost model performance by enhancing feature expressiveness.
These results validate the rationality and synergy of the components in our framework.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>By examining the table data, it is evident that the proposed method with MR-UAMF excels on the
SMIC dataset, achieving significantly higher metrics than other deep learning methods, while on the other
three datasets its performance is comparable to or better than those methods. Specifically, it achieves
an accuracy of 0.8497 on SMIC and the highest accuracy of 0.8388 on SAMM. Table 2 shows it attains
the highest accuracy of 0.8117 on CASME. On CASME II, as per Table 3, the accuracy reaches 0.8359.
In summary, compared to state-of-the-art methods, the proposed MR-UAMF performs on par with or
better than existing algorithms in most cases, validating its effectiveness for micro-expression recognition.
MR-UAMF is thus well suited to nonlinear problems like micro-expression image recognition, enhancing
the model's nonlinear fitting ability and supporting superior real-world performance.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper proposes a micro-expression recognition method based on uncertainty awareness and
multimodal fusion (MR-UAMF), aiming to address key challenges in the field of micro-expression recognition,
including limited dataset sizes, class imbalance, overfitting, and insufficient utilization of dynamic
features. By introducing a focused uncertainty-aware mixing strategy (FU-MIX), our method weights
samples based on their uncertainty, guiding the model to focus more on underperforming samples,
thereby effectively mitigating overfitting and enhancing recognition performance for minority classes.
Furthermore, by integrating optical flow parameters and spatial features from micro-expression images,
our approach fully leverages both dynamic and static information, further improving recognition
accuracy and model robustness. The synergistic effect of the micro-attention mechanism and a shallow
triple-stream 3D convolutional neural network enables the model to efficiently extract multi-scale
spatiotemporal features, achieving superior performance on complex micro-expression data. Experimental
results demonstrate that MR-UAMF achieves significant performance improvements across four widely
used micro-expression datasets: SMIC, CASME, CASME II, and SAMM. Ablation studies further validate
the contributions of each component, with the uncertainty-aware mixing strategy and multimodal
feature fusion contributing accuracy improvements of 10% and 5%, respectively, confirming the
positive synergistic impact of these components on overall performance.</p>
      <p>Moving forward, we plan to explore micro-expression recognition in video data, investigating the
impact of temporal sequence information on recognition performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This study was funded by the NSFC (Grant Nos. 61602345, 62002263); the National Key Research and
Development Program (Grant No. 2019YFB2101900); the TianKai Higher Education Innovation Park
Enterprise R&amp;D Special Project (Grant No. 23YFZXYC00046); and the Tianjin Science and Technology
Program (Grant No. 24YDTPJC00630).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration of Generative AI</title>
      <p>During the preparation of this work, the authors utilized KIMI and DeepSeek for grammar and
spell checking, as well as for text translation and paraphrasing. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kauttonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Deep learning for micro-expression recognition: A survey</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>2028</fpage>
          -
          <lpage>2046</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <article-title>A review of research on micro-expression recognition algorithms based on deep learning</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>17787</fpage>
          -
          <lpage>17828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sankaranarayana</surname>
          </string-name>
          ,
          <article-title>HTNet for micro-expression recognition</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>602</volume>
          (
          <year>2024</year>
          )
          <fpage>128196</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Arandjelović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Short and long range relation based spatio-temporal transformer for micro-expression recognition</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>1973</fpage>
          -
          <lpage>1985</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>UMix: Improving importance weighting for subpopulation shift via uncertainty-aware mixup</article-title>
          ,
          <source>arXiv preprint arXiv:2209.08928</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2209.08928.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. Song, K. Li, Y. Zong, J. Zhu, W. Zheng, Recognizing spontaneous micro-expression using a three-stream convolutional neural network, IEEE Access 7 (2019) 184537–184551.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] B. Xia, W. Wang, S. Wang, E. Chen, Learning from macro-expression: A micro-expression recognition framework, Proceedings of the 28th ACM International Conference on Multimedia (2020) 2936–2944.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Y. Liu, et al., A main directional mean optical flow feature for spontaneous micro-expression recognition, IEEE Transactions on Affective Computing 7 (2015) 299–310.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Peng, Z. Wu, Z. Zhang, T. Chen, From macro to micro expression recognition: Deep learning on small datasets using transfer learning, 2018 13th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2018) (2018) 657–661.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 915–928.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Polikovsky, Y. Kameda, Y. Ohta, Facial micro expressions recognition using high speed camera and 3d-gradient descriptor, in: IET Conference, 2009, p. 5.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y.-J. Liu, J.-K. Zhang, W.-J. Yan, S.-J. Wang, G. Zhao, X. Fu, A main directional mean optical flow feature for spontaneous micro-expression recognition, IEEE Transactions on Affective Computing 7 (2016) 299–310. doi:10.1109/TAFFC.2015.2485205.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. S. Gan, S.-T. Liong, W.-C. Yau, Y.-C. Huang, L.-K. Tan, OFF-ApexNet on micro-expression recognition system, Signal Processing: Image Communication 74 (2019) 129–139. doi:10.1016/j.image.2019.02.005.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S.-T. Liong, Y. S. Gan, J. See, H.-Q. Khor, Y.-C. Huang, Shallow triple stream three-dimensional CNN (STSTNet) for micro-expression recognition, in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–5.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. Zhou, Q. Mao, L. Xue, Dual-inception network for cross-database micro-expression recognition, in: 2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2019), IEEE, 2019, pp. 1–5.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Z. Lai, R. Chen, J. Jia, Y. Qian, Real-time micro expression recognition based on ResNet and atrous convolutions, Journal of Ambient Intelligence and Humanized Computing (2020). doi:10.1007/s12652-020-01779-5.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] C. Wang, M. Peng, T. Bi, T. Chen, Micro-attention for micro-expression recognition, Neurocomputing 410 (2020) 354–362. doi:10.1016/j.neucom.2020.06.005.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] L. Feng, Z. Jiahao, Q. Jiayin, Lightweight micro-expression recognition architecture based on bottleneck transformer, Computer Science 49 (2022) 370–377. doi:10.11896/jsjkx.210500023.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>