<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Foundation Models and 3D Facial Reconstruction for Micro-Expression Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Le Cong Thuong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Chau Nguyen-Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tu Nguyen Luu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thi Duyen Ngo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh Ha Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Engineering and Technology, Vietnam National University</institution>
          ,
          <addr-line>144 Xuan Thuy Street, Cau Giay, Hanoi 11300</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Micro-expressions, characterized by their subtle intensity and brief duration, present significant challenges for automatic recognition systems. Traditional 2D image-based methods often struggle with variations in illumination, pose, and occlusions, limiting their effectiveness. To address these challenges, we propose a hybrid framework that integrates large-scale vision foundation models with advanced 3D facial reconstruction techniques. By combining transferable visual embeddings from models such as RADIOv2.5 and SigLIPv2 with low-dimensional expression coefficients from 3D pipelines like SMIRK, FaceVersev4, and 3DDFAv3, our approach is designed to capture both the rich appearance information from 2D frames and the explicit geometric information from 3D reconstructions crucial for micro-expression analysis. Evaluated on the low-data regime of the 4DME dataset of the public Kaggle Micro-Expression Challenge, the proposed method outperforms every single-modality baseline; one configuration achieves a top-three leaderboard ranking. These findings underscore the synergy between appearance-centric pre-training and geometry-aware modelling, establishing a robust baseline for multimodal micro-expression analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>micro-expression classification</kwd>
        <kwd>vision foundation models</kwd>
        <kwd>3d facial reconstruction models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The automatic recognition of micro-expressions—fleeting, involuntary facial movements that betray
genuine human emotion—presents a formidable challenge with profound implications for domains
ranging from clinical psychology to national security [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Their notoriously short duration, subtle
intensity, and the inherent scarcity of high-quality annotated data have historically impeded the
development of robust recognition systems. Traditional approaches, which primarily operate on
2D image sequences, are often brittle, struggling to disentangle meaningful expressive cues from
confounding variations in head pose, illumination, and occlusions, thus limiting their accuracy and
generalizability in real-world settings.
      </p>
      <p>
        Two powerful yet largely independent streams of research offer a promising path forward. First,
advances in 3D facial modeling provide a robust mechanism to overcome the limitations of 2D analysis.
Parametric models like FLAME [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], built from thousands of 3D scans [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], enable the decomposition
of faces into interpretable shape, expression, and pose parameters. State-of-the-art reconstruction
techniques such as SMIRK [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], FaceVerse [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and 3DDFA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] leverage these models to infer detailed
3D geometry and dynamics from standard 2D images, effectively normalizing for pose and lighting
variations. The emergence of specialized datasets like 4DME [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which provides high-fidelity 4D
captures of spontaneous micro-expressions, is crucial for training and validating these geometry-aware
methods.
      </p>
      <p>
        Second, the paradigm of foundation vision models, including powerful architectures like CLIP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
DINOv2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], RADIOv2.5 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and SigLIPv2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], has revolutionized visual representation learning.
Pre-trained on web-scale datasets, these models learn exceptionally rich and generalizable features
capable of capturing nuanced visual patterns, making them ideal candidates for detecting the subtle
facial movements characteristic of micro-expressions.
      </p>
      <p>While both 3D reconstruction and foundation models offer compelling advantages, they have largely
been explored in isolation for this task. The central hypothesis of this work is that a symbiotic fusion
of these two paradigms can unlock new levels of performance in micro-expression recognition. We
posit that by conditioning powerful foundation models on explicit 3D facial geometry, we can create a
system that is not only more accurate but also more robust. This leads to our primary research question:
Can the integration of foundation vision models with explicit 3D facial attributes significantly improve
the accuracy and generalizability of micro-expression classification?</p>
      <p>To this end, our primary contributions are as follows:
1. We present a comprehensive benchmark that systematically evaluates the interplay between
leading foundation vision models (RADIOv2.5, SigLIPv2) and state-of-the-art 3D facial reconstruction
techniques (SMIRK, FaceVerse, 3DDFA). Using the F1-score on the 4DME dataset, our analysis
provides critical insights into the most effective combinations for micro-expression classification.
2. We introduce an integrated framework that achieves high performance, validated by a top-three
placement in a recent Kaggle micro-expression classification challenge. This result underscores the
competitive edge of our proposed fusion of geometric and visual representation learning.</p>
      <p>This study marks a significant step towards developing more reliable and principled micro-expression
recognition systems, paving the way for more robust systems suitable for applications in
human-computer interaction and mental health diagnostics.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent progress in computer vision, particularly foundation models and 3D facial reconstruction
techniques, significantly advances micro-expression recognition by addressing challenges related to
subtle visual cues and pose variations.</p>
      <sec id="sec-2-1">
        <title>Foundation Models for Image Representation</title>
        <p>
          Modern foundation models, pre-trained on large-scale datasets, provide robust generalizable features
critical for micro-expression analysis. While models such as CLIP [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], DINOv2 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and SigLIP [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
demonstrate strong general performance, recent advances such as RADIOv2.5 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and SigLIPv2 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
stand out as current state-of-the-art (SOTA) approaches. RADIOv2.5 employs a powerful Vision
Transformer (ViT) to extract holistic and detailed dense visual embeddings, whereas SigLIPv2 enhances
multilingual vision-language alignment, excelling in zero-shot classification and feature transfer, and thus
may help to effectively capture subtle micro-expression cues.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>3D Facial Reconstruction Models</title>
        <p>
          Advanced 3D reconstruction methods mitigate issues inherent in 2D analyses, such as illumination
and pose variations. Prominent models like SMIRK [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], employing a self-supervised neural synthesis
approach, and FaceVerse [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], with its fine-grained detail-controllable 3D Morphable Model (3DMM), are
particularly effective for capturing subtle facial expressions. Additionally, robust frameworks such as
3DDFA [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], using cascaded CNNs, provide consistent and precise 3D face alignment critical for analyzing
subtle facial deformations and micro-expressions.
        </p>
        <p>While these two fields have progressed in parallel, the optimal strategy for fusing state-of-the-art
foundation models with diverse 3D reconstruction pipelines for micro-expression recognition remains
an open question. Our work directly addresses this gap by systematically evaluating and proposing an
effective fusion architecture.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our framework tackles micro-expression recognition in two stages (Fig. 1). First, we distil a hybrid
representation that marries geometry-aware 3D expression coefficients with appearance-rich 2D
embeddings. Second, a lightweight attention-based temporal encoder models the evolution of these features
and yields the final class probabilities.</p>
      <sec id="sec-3-1">
        <title>3.1. Hybrid Feature Extraction</title>
        <p>To faithfully capture the low-intensity, transient muscle activations that characterise micro-expressions,
we fuse (i) 3D expression parameters that preserve subtle geometric displacements and (ii) 2D
appearance cues that encode photometric and texture patterns.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. 3D expression coefficients</title>
          <p>
            Given an input frame, a generic 3D Morphable Model (3DMM) decomposes the face into shape (β),
expression (ψ), and pose (θ). Because micro-expressions are chiefly conveyed through deformations of
the facial musculature, we retain only the expression vector ψ ∈ R^d_3D. To evaluate the influence of the
underlying reconstruction engine, we extract ψ with three complementary 3DMM pipelines:
• FaceVersev4 [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] (d_3D = 171) — a high-capacity PCA model that excels at fine-grained geometry.
• 3DDFAv3 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] (d_3D = 64) — a cascaded CNN regressor designed for real-time tracking.
• SMIRK [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] (d_3D = 50) — a self-supervised network explicitly optimised for micro-expression cues.
          </p>
          <p>While these pipelines produce expression vectors (ψ) of varying dimensions, they are all designed
to represent facial muscle activations as low-dimensional blendshape coefficients. Our framework is
designed to be agnostic to the specific 3D basis, learning modality-specific dynamics for each.</p>
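          <p>For illustration, a minimal sketch of how the per-frame coefficients can be organised into a sequence tensor, assuming each pipeline already returns its expression vector per frame; the helper names below are illustrative and do not correspond to the actual SMIRK, FaceVerse, or 3DDFA APIs.</p>
          <preformat>
import numpy as np

# Only the dimensionality of the expression vector differs across pipelines
# (FaceVersev4: 171, 3DDFAv3: 64, SMIRK: 50), so each clip becomes a (T, d_3D)
# array and the downstream temporal encoder is instantiated with the matching d_3D.
EXPRESSION_DIMS = {"faceverse_v4": 171, "3ddfa_v3": 64, "smirk": 50}

def stack_expression_sequence(per_frame_psi: list[np.ndarray], pipeline: str) -> np.ndarray:
    """Stack per-frame expression coefficients psi into a (T, d_3D) array."""
    seq = np.stack(per_frame_psi, axis=0)
    assert seq.shape[1] == EXPRESSION_DIMS[pipeline], "unexpected coefficient size"
    return seq
          </preformat>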
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. 2D appearance embeddings</title>
          <p>
            We complement geometry with holistic appearance features extracted by Vision Transformers (ViTs).
For each frame, the [CLS] token is harvested from two large-scale, pre-trained models:
• RADIOv2.5 [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] (d_2D = 1538) — trained on a broad corpus, capturing long-range dependencies.
• SigLIPv2 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] (d_2D = 1024) — jointly vision-language pre-trained, sensitive to localised
appearance changes.
          </p>
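          <p>A minimal sketch of this per-frame embedding extraction, using a generic ViT from the timm library as a stand-in; the actual RADIOv2.5 and SigLIPv2 checkpoints, their loaders, and input resolutions are not reproduced here and would need to be substituted.</p>
          <preformat>
import torch
import timm
from timm.data import resolve_data_config, create_transform
from PIL import Image

# Generic pre-trained ViT as a placeholder backbone; num_classes=0 makes the
# model return the pooled [CLS]-style representation instead of class logits.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

@torch.no_grad()
def frame_embeddings(frame_paths):
    """Return a (T, d_2D) tensor of per-frame appearance embeddings."""
    feats = []
    for path in frame_paths:
        img = transform(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(model(img))  # pooled [CLS] representation for this frame
    return torch.cat(feats, dim=0)
          </preformat>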
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Temporal Pooling and Fusion</title>
          <p>
            Let the per-frame appearance feature be f_2D ∈ R^d_2D and the geometric feature be f_3D ∈ R^d_3D. Each
modality’s sequence of features, F = {f_1, …, f_T}, is processed by a dedicated Multi-Query Attention (MQA)
block [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] to model temporal dynamics.
          </p>
          <p>Multi-Query Attention. We use Multi-Query Attention (MQA) to efficiently summarize the temporal
dynamics within each feature sequence. MQA reduces the computational complexity of standard
attention by utilizing a single key (K) and value (V) projection, which is shared across all query heads.</p>
          <p>Given an input sequence F ∈ R^(T × d), where T is the sequence length, it is first projected into key and
value matrices, K = FW_K and V = FW_V, where W_K, W_V ∈ R^(d × d_k). A separate set of learnable
query vectors, forming a query matrix Q ∈ R^(q × d_k), then interacts with this shared representation. The
number of queries, q, is a modality-specific hyperparameter, denoted as q_3D for geometry and q_2D
for appearance.</p>
          <p>Attention(Q, K, V) = softmax(QKᵀ / √d_k) V (1)</p>
          <p>The output is a matrix of q feature vectors, with each vector representing a different learned summary
of the sequence. We then apply mean pooling across these vectors to produce a single, fixed-size feature
descriptor (z_2D and z_3D) for each modality, effectively capturing its temporal characteristics.
Learnable Gating. To fuse the modalities, two trainable gating parameters α = (α_3D, α_2D) are used
to compute a dynamic weighting. These are softmax-normalised, α̂ = softmax(α), and the final
descriptor is the weighted sum: z̃ = α̂_3D z_3D + α̂_2D z_2D.</p>
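          <p>A minimal PyTorch sketch of this per-modality temporal pooling, assuming a single shared key/value projection and mean pooling over the learned query summaries; the head dimension and query count below are placeholders rather than the values used in our experiments.</p>
          <preformat>
import torch
import torch.nn as nn

class MQATemporalPool(nn.Module):
    """Multi-Query Attention pooling: q learnable queries attend over the frame
    sequence through one shared key/value projection, and the q summaries are
    mean-pooled into a single fixed-size descriptor z."""
    def __init__(self, dim: int, num_queries: int, dim_head: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim_head))
        self.to_k = nn.Linear(dim, dim_head, bias=False)   # shared key projection W_K
        self.to_v = nn.Linear(dim, dim_head, bias=False)   # shared value projection W_V
        self.scale = dim_head ** -0.5

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (B, T, dim) -> descriptor z: (B, dim_head)
        k, v = self.to_k(seq), self.to_v(seq)               # (B, T, dim_head)
        attn = torch.softmax(self.queries @ k.transpose(1, 2) * self.scale, dim=-1)
        summaries = attn @ v                                # (B, q, dim_head)
        return summaries.mean(dim=1)                        # mean-pool the q summaries

# one pool per modality, instantiated with that modality's feature size
pool_3d = MQATemporalPool(dim=171, num_queries=4)   # e.g. FaceVersev4 coefficients
pool_2d = MQATemporalPool(dim=1024, num_queries=4)  # e.g. SigLIPv2 embeddings
          </preformat>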
          <p>Prediction head. Finally, a lightweight MLP takes the fused descriptor z̃ and produces class logits:
ŷ = Linear(Drop(LN(GELU(Linear(z̃))))). (2)</p>
          <p>Here LN and Drop denote Layer Normalisation and dropout, respectively. The trainable components of
this fusion architecture are highlighted in Fig. 1.</p>
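          <p>A corresponding sketch of the learnable gate and the prediction head, assuming both pooled descriptors share the same dimensionality; the hidden width and dropout rate are illustrative placeholders.</p>
          <preformat>
import torch
import torch.nn as nn

class GatedFusionHead(nn.Module):
    """Softmax-normalised gating of the two pooled descriptors followed by the
    lightweight MLP head y = Linear(Drop(LN(GELU(Linear(z)))))."""
    def __init__(self, dim: int, num_classes: int, hidden: int = 256, p_drop: float = 0.3):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(2))            # (alpha_3D, alpha_2D)
        self.head = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z3d: torch.Tensor, z2d: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate, dim=0)                 # dynamic modality weights
        z = w[0] * z3d + w[1] * z2d                         # fused descriptor
        return self.head(z)                                 # class logits
          </preformat>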
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental setup</title>
        <p>
          Dataset. We follow the official protocol of the 4DMR subset of 4DME [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]: 100 micro-expression
sequences (train) and 28 sequences (test) from 24 culturally diverse subjects, each already trimmed to
the active interval. To manage class imbalance within the multi-label setting, we followed a stratified
3-fold cross-validation strategy that preserves the distribution of label combinations in each split [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Specifically, we used the MultilabelStratifiedKFold implementation from the iterative-stratification Python library, available
at https://github.com/trent-b/iterative-stratification.
        </p>
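        <p>A minimal sketch of the fold construction with the cited library; the feature placeholder and the random label matrix below are purely illustrative.</p>
        <preformat>
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# X: one row per sequence (an index placeholder here);
# Y: multi-hot label matrix of shape (num_sequences, num_classes).
X = np.arange(100).reshape(-1, 1)
Y = np.random.randint(0, 2, size=(100, 3))   # placeholder labels

mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(mskf.split(X, Y)):
    # the distribution of label combinations is preserved across the three folds
    print(fold, len(train_idx), len(val_idx))
        </preformat>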
        <p>
          Pre-processing. Each frame is vertically halved to expose left–right asymmetry, then a MediaPipe
face detector [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] on the first frame fixes a square crop that is reused for the entire clip, guaranteeing
temporal alignment. Crops are resized to 1024 × 1024 (Lanczos) before 2D feature extraction.
        </p>
        <p>Metric. Unless stated otherwise we report Macro-F1—the unweighted mean of per-class F1—averaged
over three folds.</p>
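        <p>A minimal sketch of the fixed-crop pre-processing described above, assuming default MediaPipe detector settings and a simple square enlargement of the detected box; the left-right halving step is omitted.</p>
        <preformat>
import mediapipe as mp
import numpy as np
from PIL import Image

def fixed_crop_from_first_frame(frames_rgb):
    """Detect the face on the first frame, fix one square crop, and reuse it
    for every frame of the clip before the Lanczos resize to 1024 x 1024."""
    h, w, _ = frames_rgb[0].shape
    with mp.solutions.face_detection.FaceDetection(min_detection_confidence=0.5) as det:
        box = det.process(frames_rgb[0]).detections[0].location_data.relative_bounding_box
    # convert the relative box to pixels and enlarge it to a square
    side = int(max(box.width * w, box.height * h))
    x0, y0 = int(box.xmin * w), int(box.ymin * h)
    crops = []
    for f in frames_rgb:
        crop = Image.fromarray(f[y0:y0 + side, x0:x0 + side])
        crops.append(crop.resize((1024, 1024), Image.LANCZOS))
    return crops
        </preformat>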
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation details</title>
        <p>All models share the optimiser and schedule of Table 1. Sequences are uniformly resampled to 18
frames; nearest-frame duplication fills shortages. Feature-level dropout regularises both modalities. All
models were trained on a single NVIDIA RTX 3090 GPU.</p>
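        <p>A minimal sketch of the uniform temporal resampling, where rounding to the nearest frame index duplicates neighbouring frames whenever a clip is shorter than the target length.</p>
        <preformat>
import numpy as np

def resample_indices(num_frames: int, target_len: int = 18) -> np.ndarray:
    """Uniformly resample a clip to target_len frame indices; nearest-frame
    duplication fills the shortage for clips with fewer frames."""
    positions = np.linspace(0, num_frames - 1, num=target_len)
    return np.round(positions).astype(int)

# e.g. a 7-frame clip is stretched to 18 indices with duplicated neighbours
print(resample_indices(7))
        </preformat>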
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>Our primary results on the 4DMR are summarized in Table 2. The findings clearly demonstrate the
superiority of our proposed hybrid fusion model.
Unimodal baselines. Within the 3D family, FaceVersev4 dominates, underscoring the value of a
rich PCA basis for capturing sub-millimetre vertex motion. For the appearance branch, RADIOv2.5
leads. This is further supported by our ablation study (Table 3), where a naïve concatenation of features
processed by a single temporal encoder performs poorly (36.23% F1), likely due to over-parameterisation
and the model’s inability to learn modality-specific temporal dynamics.</p>
        <p>Hybrid fusion. Combining the two best unimodal encoders with our dual-pool and softmax gate
architecture achieves the highest mean F1-score (54.79%) and, more importantly, demonstrates superior
training stability. The variance is reduced to nearly a third of that of the FaceVerse-only model,
confirming that appearance features add complementary information and lead to a more robust system,
even in the low-data regime.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Fusion–strategy ablation</title>
        <p>Take-aways. (i) Processing each modality with its own temporal attention pool before fusion provides a
massive performance gain over a naïve concatenation. (ii) Adding a simple, learnable gating mechanism
provides a further significant bump in accuracy while substantially stabilising training (i.e., reducing
variance), yielding the best accuracy-variance trade-off with minimal computational overhead.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This work has tackled a longstanding challenge in micro-expression recognition by rigorously assessing
the integration of high-fidelity 3D reconstruction with state-of-the-art vision-foundation models. Our
extensive benchmark delivers a critical insight: 3D geometric parameters and 2D appearance
embeddings offer complementary, rather than overlapping, information, forming a cornerstone for robust
classification. This finding emphasizes the value of a multimodal strategy to fully capture the nuanced
dynamics of micro-expressions.</p>
      <p>
        While our study benchmarks two powerful foundation models, a limitation is that we have not
explored the full breadth of available 2D feature extractors. A valuable direction for future work is a
more systematic investigation across different families of models. This could include other
general-purpose self-supervised models (e.g., SimCLR [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], MAE [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], DINOv2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), supervised models (e.g.,
ViT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], ConvNeXt [19]). Furthermore, incorporating features from models pre-trained specifically on
the face and expression domain, such as FaRL [20] and SVFAP [21], could yield significant performance
gains by leveraging domain-specific knowledge.
      </p>
      <p>Future work will also focus on enhancing model interpretability by visualizing attention maps against
psychological cues like FACS Action Units. We will extend this framework to other subtle behavior
analysis tasks, such as pain or deception detection, and evaluate its robustness in data-scarce, zero-shot
learning contexts.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was supported by the HORIZON-MSCA-SE-2022 PhySU-Net 241 project ACMod (grant
101130271).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-o3 and Gemini-2.5 to rephrase
sentences and paragraphs in order to improve clarity, conciseness, and style. After using these tools, the
author(s) carefully reviewed and edited the content as needed and take full responsibility for the final
version of the publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. J. R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhanu</surname>
          </string-name>
          ,
          <article-title>Micro-expression classification based on landmark relations with graph attention convolutional network</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1511</fpage>
          -
          <lpage>1520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <article-title>Learning a model of facial shape and expression from 4d scans</article-title>
          .,
          <source>ACM Trans. Graph</source>
          .
          <volume>36</volume>
          (
          <year>2017</year>
          )
          <fpage>194</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Egger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tewari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wuhrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zollhoefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Beeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kortylewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Romdhani</surname>
          </string-name>
          , et al.,
          <article-title>3d morphable face models-past, present, and future</article-title>
          ,
          <source>ACM Transactions on Graphics (ToG) 39</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Retsinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Filntisis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Danecek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. F.</given-names>
            <surname>Abrevaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roussos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bolkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maragos</surname>
          </string-name>
          ,
          <article-title>3d facial expressions through analysis-by-neural-synthesis</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2490</fpage>
          -
          <lpage>2501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Faceverse: a fine-grained and detail-controllable 3d face morphable model from a hybrid dataset</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>20333</fpage>
          -
          <lpage>20342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <article-title>3d face reconstruction with the geometric guidance of facial part segmentation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1672</fpage>
          -
          <lpage>1682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , S. Cheng,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Behzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zafeiriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>4dme: A spontaneous 4d micro-expression dataset with multimodalities</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>3031</fpage>
          -
          <lpage>3047</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: International conference on machine learning,
          <source>PmLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oquab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darcet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moutakanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szafraniec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khalidov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Haziza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          , et al.,
          <article-title>Dinov2: Learning robust visual features without supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2304.07193</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hongxu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <article-title>Radiov2.5: Improved baselines for agglomerative vision foundation models</article-title>
          ,
          <source>in: Proc. CVPR</source>
          , volume
          <volume>2</volume>
          ,
          <year>2025</year>
          , p.
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Naeem</surname>
          </string-name>
          , I. Alabdulmohsin,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          , et al.,
          <article-title>Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features</article-title>
          ,
          <source>arXiv preprint arXiv:2502.14786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , L. Beyer,
          <article-title>Sigmoid loss for language image pre-training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11975</fpage>
          -
          <lpage>11986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <article-title>Fast transformer decoding: One write-head is all you need</article-title>
          ,
          <source>arXiv preprint arXiv:1911.02150</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sechidis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <surname>I. Vlahavas</surname>
          </string-name>
          ,
          <article-title>On the stratification of multi-label data</article-title>
          ,
          <source>in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lugaresi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McClanahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Uboweja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-L.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>Mediapipe: A framework for building perception pipelines</article-title>
          ,
          <source>arXiv preprint arXiv:1906.08172</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>A simple framework for contrastive learning of visual representations</article-title>
          ,
          <source>in: International conference on machine learning</source>
          ,
          <source>PmLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1597</fpage>
          -
          <lpage>1607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Masked autoencoders are scalable vision learners</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>16000</fpage>
          -
          <lpage>16009</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11976-11986.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Zheng, H. Yang, T. Zhang, J. Bao, D. Chen, Y. Huang, L. Yuan, D. Chen, M. Zeng, F. Wen, General facial representation learning in a visual-linguistic manner, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18697-18709.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Sun, Z. Lian, K. Wang, Y. He, M. Xu, H. Sun, B. Liu, J. Tao, Svfap: Self-supervised video facial affect perceiver, IEEE Transactions on Affective Computing (2024).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>