<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hierarchical transformer based learning for robust face anti-spoofing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhanseri Ikram</string-name>
          <email>ikram.zhanseri@outlook.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bauyrzhan Omarov</string-name>
          <email>bauyrzhan313@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Batyrkhan Omarov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Al-Farabi Kazakh National University</institution>
          ,
          <addr-line>Almaty</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>34/1 Manas St., Almaty, 050000</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Face anti-spoofing is a critical challenge in biometric authentication systems, requiring robust methods to effectively distinguish between genuine and fraudulent attempts. The current study presents a Hierarchical Transformer-Based Learning (HTBL) framework designed to tackle challenges across diverse environmental conditions and attack modalities. The architecture combines a Vision Transformer encoder for global context capture with a Swin Transformer for local feature refinement, supported by intermediate convolutional layers. The evaluations on the OULU-NPU dataset validate the HTBL framework across standardized protocols assessing generalization to new environments, attack instruments, and sensor inputs. The method achieves state-of-the-art performance, particularly in complex generalization scenarios. Feature visualization using Principal Component Analysis supports the quantitative results, illustrating the refinement of discriminative capabilities throughout the network stages. The HTBL framework demonstrates strong generalization across varied conditions, addressing a significant limitation in current face anti-spoofing systems. Additionally, its balanced performance across error rate metrics indicates practical applicability, positioning the HTBL framework as a promising advancement in face anti-spoofing technology with important implications for biometric authentication security in real-world scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Face Anti-Spoofing</kwd>
        <kwd>face liveness</kwd>
        <kwd>machine learning</kwd>
        <kwd>computer vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Face anti-spoofing systems have emerged as a critical component in biometric authentication,
addressing the escalating threat of presentation attacks in facial recognition technologies. The
proliferation of sophisticated spoofing techniques, including printed photographs, digital displays,
and 3D masks, has necessitated the development of robust countermeasures to safeguard the
integrity of facial authentication systems. In recent years, deep learning approaches have
demonstrated remarkable efficacy in discerning genuine faces from fraudulent presentations,
surpassing traditional handcrafted feature-based methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Among the many deep learning architectures, Convolutional Neural Networks (CNNs) have
been predominantly employed for face anti-spoofing tasks, leveraging their capacity to extract spatial
features from facial images [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, CNNs exhibit limitations in capturing long-range
dependencies and hierarchical relationships within facial structures, which are crucial for
distinguishing subtle spoofing artifacts. To address these shortcomings, attention mechanisms and
transformer architectures have been introduced, revolutionizing various computer vision tasks,
including face anti-spoofing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The transformer architecture, initially proposed for natural
language processing tasks, has demonstrated exceptional performance in modeling sequential data
and capturing global contextual information [4]. The self-attention mechanism inherent in
transformers enables the model to weigh the importance of different facial regions dynamically,
potentially enhancing the detection of spoofing cues across diverse attack types. Nevertheless, the
application of transformers to face anti-spoofing presents unique challenges, particularly in terms of
computational efficiency and the need for hierarchical feature representation to capture both
fine-grained textures and global facial structures.
      </p>
      <p>The present study introduces a novel HTBL framework for robust face anti-spoofing. The
proposed approach uses a hierarchical transformer architecture to capture multi-scale facial features
and long-range dependencies, facilitating the detection of sophisticated presentation attacks. By
integrating a hierarchical structure, the model efficiently processes facial images at different
resolutions, enabling the extraction of both local and global spoofing cues. Furthermore, the HTBL
framework uses a robust training pipeline and data augmentation techniques to improve
generalization across diverse spoofing scenarios and environmental conditions.</p>
      <p>The remainder of this paper is organized as follows: Section 2 provides an overview of related
works in face anti-spoofing and transformer-based approaches. Section 3 delineates the proposed
HTBL framework, elucidating its architectural components, training methodology, dataset
information and experimental setup. Section 4 presents the experimental results, followed by an
in-depth analysis and discussion in Section 5. Finally, the last section concludes the study and outlines
future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Face anti-spoofing has been an active area of research in biometric security, with numerous
approaches proposed to combat evolving presentation attack techniques. The literature in this
domain can be broadly categorized into traditional handcrafted feature-based methods and deep
learning approaches.</p>
      <p>Early face anti-spoofing techniques relied on handcrafted features to distinguish between genuine
and spoofed faces. Texture analysis played a pivotal role in these approaches, with Local Binary
Patterns (LBP) emerging as a popular descriptor for capturing micro-texture variations [5].
Extensions of LBP, such as LBP-TOP for spatio-temporal analysis, were proposed to use motion cues
in video-based anti-spoofing [6]. Additionally, color space analysis and image quality assessment
metrics were explored to detect artifacts introduced by printing or display devices [7]. While
handcrafted features demonstrated efficacy in controlled environments, their performance degraded
significantly under varying illumination conditions and against high-quality spoofing attacks.
Moreover, the manual design of features limited their adaptability to novel attack types, necessitating
the exploration of more sophisticated approaches. The advent of deep learning ushered in a new era
of face anti-spoofing research, with Convolutional Neural Networks (CNNs) at the forefront. CNNs
exhibited remarkable performance in learning discriminative features directly from facial images,
obviating the need for manual feature engineering [8]. Various CNN architectures, including
AlexNet, VGGNet, and ResNet, were adapted for face anti-spoofing tasks, demonstrating superior
performance compared to traditional methods [9]. To improve the temporal modeling capabilities of
CNNs, researchers incorporated recurrent architectures such as Long Short-Term Memory (LSTM)
networks for video-based anti-spoofing [10]. These hybrid CNN-LSTM models captured both spatial
and temporal cues, improving robustness against video replay attacks. Despite their success,
CNN-based approaches faced challenges in capturing long-range dependencies and hierarchical
relationships within facial structures. To address these limitations, attention mechanisms were
introduced to focus on salient facial regions and potential spoofing artifacts [11, 12, 13], and Vision
Transformer (ViT) models were subsequently applied to the task. While ViT-based models
demonstrated promising results, they faced challenges in capturing fine-grained facial textures
crucial for spoofing detection. To address these limitations, hybrid CNN-transformer
architectures were proposed, combining the strengths of both paradigms [14]. These models applied
CNNs for low-level feature extraction and transformers for high-level semantic modeling. However,
the computational complexity of full-image transformer processing remained a significant challenge,
particularly for real-time anti-spoofing applications. Recent advancements in efficient transformer
designs, such as the Swin Transformer [15], have paved the way for more effective hierarchical
processing of visual data. Domain shift, arising from variations in camera devices, lighting
conditions, and presentation attack instruments, often leads to performance degradation when
models are deployed in real-world scenarios [16, 17, 18].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <p>This section describes the methodological architecture applied in the development of an advanced
face anti-spoofing system for biometric authentication. The proposed approach is visually
represented in Figure 1, which illustrates a structured flowchart encompassing each stage from input
image processing to the final classification decision.</p>
      <sec id="sec-3-1">
        <title>3.1. Problem statement</title>
        <p>The goal of the research is to develop robust mechanisms that can accurately differentiate between
genuine face presentations and spoofing attempts. Effective face anti-spoofing architectures are
crucial to improve the security of facial recognition systems, ensuring that only genuine users are
authenticated while preventing unauthorized access by impostors using fake representations.
Given an input face image I ∈ R^(H×W×C), where H, W, and C denote the height, width, and number
of channels, the objective is to learn a function f : R^(H×W×C) → {0, 1} such that

f(I) = 1, if I is a genuine face presentation; f(I) = 0, if I is a spoofing attempt. (1)

The function f must effectively map the high-dimensional input space of facial images to a binary
decision space, distinguishing between authentic biometric samples and fraudulent presentations.</p>
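        <p>As a minimal illustration of the decision function f, a model's output probability can be
thresholded into the binary label. The 0.5 threshold and the function name below are illustrative
assumptions, not part of the proposed method.

```python
def f(probability, threshold=0.5):
    """Binary decision rule: 1 = genuine face presentation, 0 = spoofing attempt.
    The 0.5 threshold is an illustrative assumption."""
    return 1 if probability >= threshold else 0

print(f(0.93))  # 1 (classified as genuine)
print(f(0.12))  # 0 (classified as a spoofing attempt)
```
</p>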
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Proposed method</title>
        <p>The proposed HTBL framework in Figure 1 addresses the face anti-spoofing problem through a novel
architecture combining hierarchical feature extraction, transformer-based processing, and
multi-scale fusion. The method comprises several key components.</p>
        <p>
          The input image I of size 224 × 224 × 3 is divided into a grid of N = 784 non-overlapping
patches, each of size P × P = 8 × 8. These patches are flattened and linearly projected [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to obtain a sequence of patch embeddings X ∈ R^(N×D), where D = 768 is the embedding
dimension:

X = [x1; x2; …; xN], whose i-th row xi ∈ R^D is the embedding of the i-th patch. (2)
        </p>
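        <p>The patch-extraction and linear-projection step can be sketched in NumPy as follows; the
random projection matrix stands in for the learnable weights, and the helper name patchify is
our own.

```python
import numpy as np

def patchify(image, patch_size=8):
    """Split an H x W x C image into non-overlapping flattened patches."""
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    # Gather each patch_size x patch_size block, then flatten it to one row.
    x = image.reshape(ph, patch_size, pw, patch_size, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(ph * pw, patch_size * patch_size * C)
    return x  # shape: N x (P*P*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patchify(image)                        # 784 x 192 flattened patches
W_proj = rng.standard_normal((192, 768)) * 0.02  # stand-in for the learnable projection
X = patches @ W_proj                             # 784 x 768 patch embeddings
```
</p>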
        <p>Learnable position embeddings E_pos ∈ R^(N×D) are added to integrate spatial information:

X' = X + E_pos (3)

The embedded sequence X' is processed by a transformer encoder consisting of L = 12 layers. Each
layer applies multi-head self-attention (MSA) and feed-forward network (FFN) modules with layer
normalization (LN) and residual connections:

Y_l = MSA(LN(X_{l-1})) + X_{l-1} (4)

X_l = FFN(LN(Y_l)) + Y_l (5)

The transformer encoder outputs are used to construct feature maps. The low-level feature block
F_low is obtained by concatenating the outputs of layers [2, 3, 4, 5], and the high-level feature block
F_high by concatenating the outputs of layers [6, 7, 8, 9]; after concatenation, each block has size
batch × 784 × 3072.</p>
        <p>S_low = SwinTransformer(Conv(F_low))</p>
        <p>S_high = SwinTransformer(Conv(F_high))

Each block is first passed through a Conv2d layer. Two separate Swin Transformer modules then
process the low-level and high-level feature maps, capturing multi-scale contextual information. The
Swin Transformer uses shifted windows for efficient self-attention computation and hierarchical
feature learning. S_low and S_high yield batch × 784 × 768 feature maps. The mean of each map is
computed, the two results are averaged, and the outcome is passed through a Sigmoid function to
obtain the decision probability.</p>
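        <p>A single encoder layer of the kind described above, with pre-norm MSA and FFN modules and
residual connections, can be sketched in NumPy. This simplified version is single-head with a ReLU
FFN and random weights, purely for illustration; the actual model uses 12 multi-head layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(-1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(-1, keepdims=True)              # softmax over tokens
    return attn @ v

def encoder_layer(x, params):
    # MSA module with residual connection
    x = x + self_attention(layer_norm(x), *params["attn"])
    # FFN module (ReLU here; GELU in practice) with residual connection
    h = np.maximum(layer_norm(x) @ params["W1"], 0.0)
    return x + h @ params["W2"]

rng = np.random.default_rng(1)
D = 768
params = {
    "attn": [rng.standard_normal((D, D)) * 0.02 for _ in range(3)],
    "W1": rng.standard_normal((D, 4 * D)) * 0.02,
    "W2": rng.standard_normal((4 * D, D)) * 0.02,
}
X = rng.standard_normal((784, D))
out = encoder_layer(X, params)  # token count and dimension are preserved: 784 x 768
```
</p>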
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Metrics</title>
        <p>APCER, BPCER, and ACER are metrics used in biometric system performance evaluation, especially
in systems involving fingerprint recognition, facial recognition, or other biometric authentication
methods [19]. This research also applies these metrics to report the model results:

APCER = Number of False Accepts / Total Number of Attack Presentations (6)

BPCER = Number of False Rejects / Total Number of Genuine Presentations (7)

ACER = (APCER + BPCER) / 2 (8)</p>
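        <p>These three metrics follow directly from the confusion counts; the counts below are
hypothetical, purely to illustrate the arithmetic.

```python
def apcer(false_accepts, total_attack_presentations):
    """Attack Presentation Classification Error Rate."""
    return false_accepts / total_attack_presentations

def bpcer(false_rejects, total_genuine_presentations):
    """Bona fide Presentation Classification Error Rate."""
    return false_rejects / total_genuine_presentations

def acer(apcer_value, bpcer_value):
    """Average Classification Error Rate: the mean of APCER and BPCER."""
    return (apcer_value + bpcer_value) / 2

# Hypothetical counts for illustration.
a = apcer(2, 200)   # 0.01
b = bpcer(3, 150)   # 0.02
print(acer(a, b))   # ≈ 0.015
```
</p>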
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Dataset</title>
        <p>The OULU-NPU dataset [20] is a widely recognized benchmark in face anti-spoofing research,
designed to address the challenges of generalization across different environmental conditions and
attack types. The dataset consists of 4950 video recordings of genuine face presentations and
presentation attacks, captured using six different mobile devices with front-facing cameras. The
dataset includes 55 subjects, with recordings conducted in three distinct sessions with varying
illumination and background conditions. The presentation attacks in OULU-NPU encompass two
types: print attacks and video-replay attacks. Print attacks use high-quality printed photographs of
the subjects, while video-replay attacks employ high-resolution digital videos displayed on electronic
screens. OULU-NPU is structured around four protocols, each designed to evaluate specific aspects of
face anti-spoofing systems. Each video in the dataset is 5 seconds long, recorded at 30 frames per
second, resulting in 150 frames per video. However, the current research uses only a few of those
frames during inference, namely the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th, and 100th, and
averages their results to obtain the final decision probability.</p>
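        <p>The sampling-and-averaging inference step can be sketched as follows; we assume the listed
frame positions index into the 150 per-frame scores, and the function name is our own.

```python
def video_score(frame_probs, sample_idx=(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)):
    """Average the per-frame liveness probabilities over the ten sampled frames."""
    sampled = [frame_probs[i] for i in sample_idx]
    return sum(sampled) / len(sampled)

# 150 hypothetical per-frame probabilities for one 5 s, 30 fps video.
frame_probs = [0.8] * 150
print(video_score(frame_probs))  # ≈ 0.8
```
</p>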
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Experimental setup</title>
        <p>In this study, we conducted our experiments using an NVIDIA RTX 3090 GPU with 24GB of VRAM,
which provided the necessary computational power for efficient training and inference. The training
was performed using a batch size of 8, a configuration chosen to balance between memory usage and
training efficiency. For optimization, we employed the AdamW optimizer. The model was trained for
a total of 40 epochs, which was determined to be sufficient for convergence based on preliminary
experiments. The learning rate was initialized at 0.00001 and was adjusted during training using a
CosineAnnealingLR scheduler. The scheduler gradually reduces the learning rate,
thereby facilitating smooth convergence and helping to avoid local minima. To improve
generalization, Horizontal Flip, Random Contrast, Random Gamma, Random Brightness, and
distortion-based geometric augmentations were applied.</p>
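        <p>The cosine annealing schedule follows the standard cosine formula, reproduced here with the
settings above (initial learning rate 1e-5, 40 epochs); the minimum learning rate of 0 is an
assumption, as it is not stated in the setup.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=40, lr_init=1e-5, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing."""
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealing_lr(0))   # 1e-05: full learning rate at the start
print(cosine_annealing_lr(20))  # ≈ 5e-06: halfway through training
print(cosine_annealing_lr(40))  # ≈ 0: annealed to the minimum at the end
```
</p>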
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment results</title>
      <p>The proposed method was evaluated on the OULU-NPU dataset using its four standard protocols,
which assess different aspects of face anti-spoofing generalization. Table 1 presents a comparison of
our approach against four other methods: Auxiliary [21], Disentangle [22], DC-CDN [23], and
NAS-FAS [24].</p>
      <p>In Protocol 1, which evaluates generalization across unseen environmental conditions, our
method demonstrates high performance with an ACER of 0.8%, outperforming the next best method
NAS-FAS by a significant margin. Notably, our approach achieves a perfect APCER of 0%, indicating
excellent capability in detecting presentation attacks, albeit with a slightly higher BPCER compared
to some competitors. For Protocol 2, which tests generalization across unseen attack devices, our
method shows competitive performance with an ACER of 1.4%. While not achieving the lowest error
rates, it maintains a balanced performance across both APCER and BPCER, suggesting robust
generalization capabilities. Protocol 3 assesses generalization across unseen input sensors,
presenting a more challenging scenario reflected in the higher error rates across all methods. Our
approach achieves the lowest ACER of 1.5±0.6%, demonstrating significant cross-sensor
generalization compared to other methods. The results demonstrate particular strengths in handling
unseen environmental conditions (Protocol 1) and the most challenging combined scenario (Protocol
4).</p>
      <p>The efficiency of our hierarchical transformer-based approach for face anti-spoofing is further
evidenced through principal component analysis (PCA) visualizations of the feature representations
at two critical stages of the network. Figure 2 illustrates the PCA projection of features immediately
after the ViT encoder. The plot reveals a curved manifold structure, with live faces (blue) distinctly
separated from various types of fake faces. However, there is notable overlap among different
spoofing attack types (printed, video replay). This suggests that while the ViT encoder successfully
distinguishes genuine from spoofed presentations, it struggles to differentiate between specific attack
modalities. Figure 2 presents the PCA visualization after processing through the convolutional block
and Swin Transformer. The transformation in feature space is striking. Live faces are now tightly
clustered and distinctly separated from all spoofing attacks. Moreover, there is improved
discrimination between different types of fake faces, although some overlap persists.</p>
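      <p>The PCA projection behind these visualizations can be reproduced with a standard SVD; the
offset random vectors below merely stand in for live and spoof embeddings, since the actual
features are not published.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project feature vectors onto their top principal components via SVD."""
    centered = features - features.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T

rng = np.random.default_rng(2)
# Hypothetical 768-d embeddings: two offset clusters standing in for live vs. spoof.
live = rng.standard_normal((50, 768)) + 3.0
spoof = rng.standard_normal((50, 768)) - 3.0
proj = pca_project(np.vstack([live, spoof]))  # 100 x 2, ready for a scatter plot
```
</p>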
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The findings from our investigation into the HTBL framework substantiate its effectiveness in
tackling the complex challenges of face anti-spoofing. Analyzing these results in the context of the
broader landscape of anti-spoofing research reveals several critical insights and potential areas for
future exploration. Our method demonstrated strong performance across diverse protocols,
particularly in unseen environments and combined challenge scenarios, highlighting its exceptional
generalization abilities. The success of our model in generalizing across different scenarios can be
largely attributed to the combined strengths of global context capture by the ViT encoder and local
feature refinement by the Swin Transformer. Previous approaches often struggled to simultaneously
capture both global and local spoofing cues under varying environmental conditions [25], which was
addressed by our architectural synergy approach. Its ability to maintain high performance across
different environmental conditions, attack types, and sensor inputs aligns with the industry's
growing demand for adaptive and resilient security solutions [26].</p>
      <p>In summary, the HTBL framework represents a significant leap forward in face anti-spoofing
technology, effectively applying the strengths of transformer architectures and multi-scale feature
learning to overcome key challenges in generalization and robustness. While the results are highly
encouraging, they also point to new avenues for further research and refinement.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The HTBL framework presented in our study represents significant progress in face anti-spoofing
technology. By synergistically combining Vision Transformer encoder with Swin Transformers, our
approach effectively addresses the complex challenges of distinguishing genuine face presentations
from sophisticated spoofing attempts across diverse environmental conditions and attack modalities.
The framework's ability to generalize across such varied conditions underscores its potential for
real-world deployment in biometric authentication systems. The progressive refinement of feature
representations, as visualized through Principal Component Analysis, provides insight into the
hierarchical learning process. The clear separation between live and fake face representations in the
final stages of our network architecture corroborates the quantitative performance metrics and
highlights the effectiveness of our multi-scale feature extraction approach. Future research directions
include the integration of temporal information for video-based anti-spoofing and exploration of
cross-dataset generalization.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Google Gemini for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
      <p>[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &amp; Polosukhin, I.
(2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp.
5998-6008).</p>
      <p>[5] Määttä, J., Hadid, A., &amp; Pietikäinen, M. (2011). Face spoofing detection from single images using
micro-texture analysis. In 2011 International Joint Conference on Biometrics (IJCB) (pp. 1-7). IEEE.</p>
      <p>[6] de Freitas Pereira, T., Anjos, A., De Martino, J. M., &amp; Marcel, S. (2014). LBP-TOP based
countermeasure against face spoofing attacks. In Asian Conference on Computer Vision (pp.
121-132). Springer, Cham.</p>
      <p>[7] Galbally, J., Marcel, S., &amp; Fierrez, J. (2014). Image quality assessment for fake biometric detection:
Application to iris, fingerprint, and face recognition. IEEE Transactions on Image Processing,
23(2), 710-724.</p>
      <p>[8] Yang, J., Lei, Z., &amp; Li, S. Z. (2014). Learn convolutional neural network for face anti-spoofing.
arXiv preprint arXiv:1408.5601.</p>
      <p>[9] Nagpal, C., &amp; Dubey, S. R. (2019). A performance evaluation of convolutional neural networks
for face anti spoofing. In 2019 International Joint Conference on Neural Networks (IJCNN) (pp.
1-8). IEEE.</p>
      <p>[10] Xu, Z., Li, S., &amp; Deng, W. (2015). Learning temporal features using LSTM-CNN architecture for
face anti-spoofing. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR) (pp.
141-145). IEEE.</p>
      <p>[11] George, A., &amp; Marcel, S. (2021). Learning one class representations for face presentation attack
detection using multi-channel convolutional neural networks. IEEE Transactions on
Information Forensics and Security, 16, 361-375.</p>
      <p>[12] Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., &amp; Shah, M. (2021). Transformers in
vision: A survey. ACM Computing Surveys (CSUR), 54(10), 1-41.</p>
      <p>[13] Yu, Z., Qin, Y., Li, X., Wang, Z., Zhao, C., Lei, Z., &amp; Zhao, G. (2021). Multi-modal face
anti-spoofing based on central difference networks and dual-cross pattern attention. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6488-6497).</p>
      <p>[14] Liu, S. I., Yeh, P. C., Fu, X., &amp; Wu, H. T. (2022). Transformer-based multi-scale feature fusion for
face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 19228-19237).</p>
      <p>[15] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... &amp; Guo, B. (2021). Swin transformer:
Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (pp. 10012-10022).</p>
      <p>[16] Shao, R., Lan, X., Li, J., &amp; Yuen, P. C. (2019). Multi-adversarial discriminative deep domain
generalization for face presentation attack detection. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 10023-10031).</p>
      <p>[17] Jia, Y., Zhang, J., Shan, S., &amp; Chen, X. (2020). Single-side domain generalization for face
anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 8484-8493).</p>
      <p>[18] Shao, R., Lan, X., &amp; Yuen, P. C. (2019). Regularized fine-grained meta face anti-spoofing. In
Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 4804-4811).</p>
      <p>[19] ISO/IEC 30107-3:2023. (2023). Information technology — Biometric presentation attack detection
— Part 3: Testing and reporting (Edition 2). International Organization for Standardization.</p>
      <p>[20] Boulkenafet, Z., Komulainen, J., Li, L., Feng, X., &amp; Hadid, A. (2017). OULU-NPU: A mobile face
presentation attack database with real-world variations. In Proceedings of the 12th IEEE
International Conference on Automatic Face &amp; Gesture Recognition (FG 2017) (pp. 612-618).
https://doi.org/10.1109/FG.2017.77.</p>
      <p>[21] Liu, Y., Jourabloo, A., &amp; Liu, X. (2018). Learning deep models for face anti-spoofing: Binary or
auxiliary supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (pp. 389-398).</p>
      <p>[22] Zhang, K.-Y., Yao, T., Zhang, J., Tai, Y., Ding, S., Li, J., Huang, F., Song, H., &amp; Ma, L. (2020). Face
anti-spoofing via disentangled representation learning. In Proceedings of the European
Conference on Computer Vision (ECCV) (pp. 1-6).</p>
      <p>[23] Yu, Z., Qin, Y., Zhao, H., Li, X., &amp; Zhao, G. (2021). Dual-cross central difference network for face
anti-spoofing. In Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI) (pp. 1281-1287). https://doi.org/10.24963/ijcai.2021/178.</p>
      <p>[24] Yu, Z., Wan, J., Qin, Y., Li, X., Li, S. Z., &amp; Zhao, G. (2021). NAS-FAS: Static-dynamic central
difference network search for face anti-spoofing. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 43(9), 3005-3023. https://doi.org/10.1109/TPAMI.2020.3009123.</p>
      <p>[25] George, A., &amp; Marcel, S. (2021). On the effectiveness of vision transformers for zero-shot face
anti-spoofing. arXiv:2011.08019v2 [cs.CV]. https://doi.org/10.48550/arXiv.2011.08019.</p>
      <p>[26] Huang, H.-P., Sun, D., Liu, Y., Chu, W.-S., Xiao, T., Yuan, J., Adam, H., &amp; Yang, M.-H. (2023).
Adaptive transformers for robust few-shot cross-domain face anti-spoofing. arXiv:2203.12175v2
[cs.CV]. https://doi.org/10.48550/arXiv.2203.12175.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Searching central difference convolutional networks for face anti-spoofing</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          (pp.
          <fpage>5295</fpage>
          -
          <lpage>5305</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Deep spatial gradient and temporal depth learning for face anti-spoofing</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          (pp.
          <fpage>5042</fpage>
          -
          <lpage>5051</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yeh</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>Dual-stream transformer for face anti-spoofing</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          (pp.
          <fpage>19218</fpage>
          -
          <lpage>19227</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>