Cross-Database Liveness Detection: Insights from Comparative Biometric Analysis

Oleksandr Kuznetsov 1,2, Dmytro Zakharov 2, Emanuele Frontoni 1,3, Andrea Maranesi 3 and Serhii Bohucharskyi 2

1 Department of Political Sciences, Communication and International Relations, University of Macerata, Via Crescimbeni, 30/32, 62100 Macerata, Italy
2 Department of Information and Communication Systems Security, School of Computer Sciences, V. N. Karazin Kharkiv National University, 4 Svobody Sq., 61022 Kharkiv, Ukraine
3 Department of Information Engineering, Marche Polytechnic University, Via Brecce Bianche 12, 60131 Ancona, Italy

SCIA-2023: 2nd International Workshop on Social Communication and Information Activity in Digital Humanities, November 9, 2023, Lviv, Ukraine
EMAIL: kuznetsov@karazin.ua (O. Kuznetsov); zamdmytro@gmail.com (D. Zakharov); emanuele.frontoni@unimc.it (E. Frontoni); andrea.maranesi99@gmail.com (A. Maranesi); sbogucharskiy@karazin.ua (S. Bohucharskyi)
ORCID: 0000-0003-2331-6326 (O. Kuznetsov); 0000-0001-9519-2444 (D. Zakharov); 0000-0002-8893-9244 (E. Frontoni); 0009-0007-8104-7263 (A. Maranesi); 0000-0003-4971-4314 (S. Bohucharskyi)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
In an era where biometric security serves as a keystone of modern identity verification systems, ensuring the authenticity of biometric samples is paramount. Liveness detection, the capability to differentiate between genuine and spoofed biometric samples, stands at the forefront of this challenge. This research presents a comprehensive evaluation of liveness detection models, with a particular focus on their performance in cross-database scenarios, a test paradigm notorious for its complexity and real-world relevance. Our study commenced by meticulously assessing models on individual datasets, revealing the nuances in their performance metrics. Delving into metrics such as the Half Total Error Rate, False Acceptance Rate, and False Rejection Rate, we unearthed invaluable insights into the models' strengths and weaknesses. Crucially, our exploration of cross-database testing provided a unique perspective, highlighting the chasm between training on one dataset and deploying on another. Comparative analysis with extant methodologies, ranging from convolutional networks to more intricate strategies, enriched our understanding of the current landscape. The variance in performance, even among state-of-the-art models, underscored the inherent challenges in this domain. In essence, this paper serves as both a repository of findings and a clarion call for more nuanced, data-diverse, and adaptable approaches in biometric liveness detection. In the dynamic dance between authenticity and deception, our work offers a blueprint for navigating the evolving rhythms of biometric security.

Keywords
Spoofing Attacks, Liveness Detection, Biometric Security, Cross-Database Testing, Biometric Authenticity

1. Introduction
In the contemporary digital age, where vast arrays of information converge and intermingle within the virtual realm, the secure identification and authentication of individuals have ascended to paramount importance [1]. Biometric systems, harnessing physiological or behavioral attributes – from fingerprints to facial patterns, voice modulations to iris intricacies – promise a level of security that traditional alphanumeric passwords often fail to deliver. They purport to offer a more foolproof method of
identification, one inherently linked to the individual, and ostensibly resistant to theft, duplication, or subversion [2–4]. Yet, as with every technological advancement, there arises a counter-movement seeking to exploit potential vulnerabilities. Spoofing attacks, wherein malicious entities present synthetic or altered biometric data to deceitfully gain access, have emerged as a significant threat to biometric systems [5–7]. To counteract these sophisticated maneuvers, the realm of liveness detection has evolved, aiming to discern real biometric traits from forged or replayed ones [8,9].
However, the real litmus test for these liveness detection mechanisms is not simply their efficacy within the confines of a singular dataset or environment, but their robustness and adaptability across diverse scenarios. The potential for a model trained on one dataset to retain or even amplify its accuracy on a disparate dataset remains an underexplored, yet critical area of inquiry. This cross-database testing paradigm offers insights not just into the generalizability of models but also into the intrinsic challenges and opportunities in bridging the gap between varied biometric landscapes [10,11].
The present study embarks on this very exploration, delving deep into the performance metrics of various liveness detection models, particularly in cross-database contexts. Through meticulous analyses, comparisons with existing methodologies, and a persistent commitment to understanding the underlying dynamics, this paper aims to shed light on the path forward for liveness detection in biometric systems – a path replete with challenges but also rife with potential. In the subsequent sections, we delineate our methodologies, elucidate the metrics employed, present our results, and engage in comprehensive discussions and conclusions, all with the overarching objective of navigating the intricate nexus of biometrics, security, and authenticity in today's digital realm.

2. Methodology
2.1. AttackNet v2.2: Deep Learning Model Architecture
In the evolving landscape of deep learning, Convolutional Neural Networks (CNNs) have established themselves as the forefront methodology in image processing, pattern recognition, and a myriad of related applications. In our previous work [12], we introduced a lineage of CNN architectures culminating in AttackNet v2.2, a model tailored to combat spoofing attacks in biometric systems.

2.1.1. Model Description
The architecture of AttackNet v2.2 is predicated on layer-wise refinement, with a methodical buildup from low-level feature extraction to high-level pattern discernment. The structure can be broadly demarcated into three phases (refer to Fig. 1 for the comprehensive visual representation; an illustrative code sketch is also provided at the end of Section 2.1):
1. Initial Convolutional Phase:
• The network ingests input through a two-dimensional convolutional layer with 16 filters of size 3x3. The 'same' padding ensures spatial dimensions are maintained post-convolution.
• This is followed by a Leaky Rectified Linear Unit (LeakyReLU) with an alpha value of 0.2, allowing a minor gradient when the unit is not active and mitigating the risk of dead neurons during training.
• Batch normalization is then applied along the feature map channel to stabilize the activations and accelerate convergence.
• An ensemble of convolutional layers ensues, terminated with a skip connection (a residual link) that merges the original input (y) and the resultant output (z). This residual addition aids in avoiding vanishing gradient problems in deeper networks.
• The phase culminates with a 2x2 max-pooling layer, halving the spatial dimensions, and is immediately followed by a dropout of 25% to prevent overfitting.
2. Second Convolutional Phase:
• Much like its predecessor, this phase initiates with a convolutional layer, but with a doubled depth of 32 filters. The subsequent layers mirror the earlier phase in functionality, ensuring deeper and more intricate pattern recognition. The skip connection, again, plays a pivotal role in reinforcing learned features and preserving gradient flow.
3. Dense Phase:
• The flattened output from the preceding convolutional layers feeds into a dense layer with 128 units. The tanh activation function is employed here, primarily to ensure the output range between -1 and 1, offering a normalized and centered activation spectrum.
• A substantial dropout of 50% follows, offering a rigorous regularization step before the final softmax layer.
• The terminating softmax layer comprises 2 units, offering probability scores for the binary classification task at hand.

Figure 1: AttackNet v2.2 Architecture [12]

2.1.2. Architectural Justifications
• LeakyReLU Activation: Traditional ReLU units can sometimes cause neurons to "die", ceasing to adjust during training due to consistently receiving non-positive inputs. The introduction of a leak factor, even if minuscule, ensures gradient flow, enhancing the robustness of the training process.
• Residual Connections: Deep networks, while powerful, can sometimes become victims of vanishing or exploding gradients, hampering their ability to learn. The residual connections (or skip connections) in our model aid in mitigating this issue by providing a direct path for the gradient to flow.
• Dropout Layers: Overfitting remains a pertinent concern in deep architectures. Strategic placement of dropout layers in our model ensures generalizability by randomly deactivating certain neurons during training, forcing the network to learn redundant representations.

In essence, AttackNet v2.2 stands as a testament to meticulous design, iterative refinement, and architectural prudence. It aims to provide a robust solution in the domain of biometric security, making strides in both liveness detection accuracy and generalizability across diverse datasets.
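To make the three-phase description above concrete, the following is a minimal Keras sketch of the layout. It is illustrative rather than the authoritative AttackNet v2.2 definition from [12]: the input resolution, the exact number of convolutions inside each residual block, and the training configuration are assumptions chosen for brevity.

```python
# Illustrative Keras sketch of the three-phase layout described above.
# The input resolution and the number of convolutions inside each residual
# block are assumptions; the authoritative definition is given in [12].
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """Convolutional phase: conv -> LeakyReLU -> BatchNorm, a residual
    merge, then 2x2 max-pooling and 25% dropout."""
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.LeakyReLU(alpha=0.2)(y)
    y = layers.BatchNormalization()(y)
    # Additional convolutions, terminated by a skip connection back to y.
    z = layers.Conv2D(filters, (3, 3), padding="same")(y)
    z = layers.LeakyReLU(alpha=0.2)(z)
    z = layers.BatchNormalization()(z)
    z = layers.Add()([y, z])          # residual link preserving gradient flow
    z = layers.MaxPooling2D((2, 2))(z)
    return layers.Dropout(0.25)(z)

def build_attacknet_sketch(input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)
    x = conv_block(inputs, 16)        # initial convolutional phase
    x = conv_block(x, 32)             # second phase with doubled depth
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="tanh")(x)   # dense phase
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # bonafide vs. attacker
    return models.Model(inputs, outputs)

if __name__ == "__main__":
    model = build_attacknet_sketch()
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
```

The functional API is used here so that the residual additions within each phase can be expressed directly.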
2.2. Datasets Used
In our pursuit of advancing cross-database testing in Liveness Detection, we utilized five distinct datasets designed to evaluate face presentation attack detection. These datasets represent different modalities and scenarios essential for comprehensive analysis. The following outlines the details of the datasets, including their sources, types of biometric data, and any preprocessing performed.

2.2.1. The Custom Silicone Mask Attack Dataset (CSMAD)
Source: Collected at the Idiap Research Institute [13].
Type of Biometric Data: The CSMAD consists of face-biometric data derived from 14 subjects, encompassing bona fide presentations as well as custom-made silicone mask attacks.
Content and Preprocessing: Videos were captured under various lighting conditions: fluorescent ceiling light only, halogen lamps illuminating from either side, and both sides simultaneously. A green uniform background was used for all recordings, organized into 'attack', 'bonafide', and 'protocols' directories. Videos were categorized as 'WEAR' (108 videos) and 'STAND' (51 videos) for attack presentations.

2.2.2. The 3D Mask Attack Database (3DMAD)
Source: 3DMAD is a specialized biometric (face) spoofing database [14].
Type of Biometric Data: This dataset contains 76,500 frames of 17 individuals, recorded using Kinect for both genuine access and 3D mask spoofing attacks.
Content and Preprocessing: The data, collected across three different sessions, include a depth image, the corresponding RGB image, and manually annotated eye positions. Real-size masks were obtained using "ThatsMyFace.com", with paper-cut masks also included. The database is maintained under controlled conditions, with frontal view and neutral expression.

2.2.3. The Multispectral-Spoof Face Spoofing Database (MSSpoof)
Source: Created at the Idiap Research Institute [15].
Type of Biometric Data: A spoofing attack database consisting of VIS and NIR spectrum images for 21 clients.
Content and Preprocessing: Real accesses and spoofing attacks were recorded using a uEye camera with an 800 nm NIR filter. Different lighting conditions and environmental settings were employed, resulting in a total of 70 real accesses per client (35 VIS and 35 NIR) and 144 spoofing attacks per client. The database is divided into training, development, and test subsets, with manually annotated key points on the face for each sample.

2.2.4. The Replay-Attack Database
Source: The Replay-Attack Database was produced at the Idiap Research Institute [16].
Type of Biometric Data: 2D facial video.
Content and Preprocessing: This database contains 1,300 video clips of real-access and attack attempts from 50 clients under various lighting conditions. The data includes training, development, test, and enrollment sets. Attack attempts utilize high-resolution photos and videos of each client. Methods of attack include mobile displays (iPhone 3GS), high-resolution screen displays (first-generation iPad), and hard-copy prints. The database offers 18 different protocols for evaluating countermeasures to spoof attacks, and annotated face locations are provided. The database structure allows the study of countermeasures against 2D face spoofing attacks.

2.2.5. Our Own Dataset
Source: Videos taken with smartphones or downloaded from the internet [12].
Type of Biometric Data: 2D facial images.
Content and Preprocessing: The dataset is divided into bona fide images and attacker images. Bona fide images are sourced from videos that depict real people, either taken directly with a smartphone or downloaded online. Attacker images are derived from videos captured using a laptop webcam that played back the bona fide videos from a smartphone screen, or vice versa. The dataset contains 4,656 images with a 50/50 class distribution and a 48/52 training/validation split. The videos were mainly sourced from YouTube. A total of 84 videos are included, 40 for training and 25 for testing, with images extracted from each video (an illustrative frame-extraction sketch is given at the end of Section 2.2). The dataset aims to represent, and allow for the detection of, face spoofing attempts.

By integrating these diversified datasets, this study aims to offer a robust examination of cross-database testing in Liveness Detection. The selected datasets cover various facets of biometric data, providing a valuable foundation for our investigation.
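The bona fide and attacker classes in our own dataset are built by sampling frames from short videos. The snippet below is a minimal, hypothetical illustration of that preprocessing step using OpenCV; the directory names, sampling stride, and output size are assumptions made for this example, not the exact pipeline of [12].

```python
# Hypothetical frame-extraction step: sample frames from each video and
# store them under a per-class directory ("bonafide" or "attacker").
# Paths, sampling stride, and output size are illustrative assumptions.
import cv2
from pathlib import Path

def extract_frames(video_path: Path, out_dir: Path, stride: int = 10,
                   size: tuple = (224, 224)) -> int:
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:                      # keep every N-th frame
            frame = cv2.resize(frame, size)
            cv2.imwrite(str(out_dir / f"{video_path.stem}_{index:05d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example usage: videos/bonafide/*.mp4 -> frames/bonafide/*.png
for label in ("bonafide", "attacker"):
    for video in Path("videos", label).glob("*.mp4"):
        extract_frames(video, Path("frames", label))
```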
2.3. Evaluation Metrics
In our investigation, it is pivotal not only to generate results but also to evaluate the efficacy of those results comprehensively. To this end, a suite of performance metrics was employed to assess the performance of our model across two classes: Bonafide and Attacker. The following is a thorough examination of the metrics utilized.

2.3.1. Precision
Precision, also known as the positive predictive value, measures the fraction of correctly identified positive instances from all the instances predicted as positive. In the context of our study:

Precision = TP / (TP + FP),  (1)

where TP denotes True Positives and FP denotes False Positives. For the Bonafide class, precision represents the accuracy of genuine identity recognitions, while for the Attacker class, it indicates the accuracy with which fraudulent attempts are identified. A high precision is crucial to ensure that legitimate users are not mislabeled as attackers and vice versa.

2.3.2. Recall
Often termed sensitivity or the true positive rate, recall denotes the fraction of positive instances that were correctly identified from all actual positive instances. Mathematically:

Recall = TP / (TP + FN),  (2)

where FN denotes False Negatives. In our study, a high recall for the Bonafide class signifies that a significant number of genuine users are recognized correctly. For the Attacker class, it implies that the vast majority of spoofing attacks are detected. Recall is especially vital in security-sensitive applications to ensure that attacks are not overlooked.

2.3.3. F1 Score
The F1 Score is the harmonic mean of precision and recall. It is particularly beneficial when there is an uneven class distribution:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall).  (3)

A high F1 Score suggests a balanced identification mechanism, where both false alarms (false positives) and missed detections (false negatives) are minimized.

2.3.4. HTER (Half Total Error Rate)
This metric is the average of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR). Mathematically:

HTER = (FAR + FRR) / 2,  (4)

where

FAR = False Positives / Total Actual Negatives,  (5)
FRR = False Negatives / Total Actual Positives.  (6)

HTER provides a balanced overview of the system's performance, considering both the cases when a genuine user is incorrectly rejected and when an attacker is wrongly accepted. A lower HTER signifies a more robust and reliable biometric system.

2.3.5. Significance
The ensemble of these metrics enables us to draw holistic insights into the system's behavior. While metrics like precision and recall give insights into specific types of errors, the F1 Score and HTER provide a summarized view of the overall performance. Employing these metrics ensures that our evaluation is not only thorough but also aligned with contemporary best practices in biometric system performance assessment. A short code sketch that computes FAR, FRR, and HTER from binary predictions is given below.
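For clarity, the snippet below computes FAR, FRR, and HTER from binary decisions, following Eqs. (4)–(6). The label convention (1 = bonafide, 0 = attacker) and the toy data are assumptions made purely for illustration.

```python
# Minimal sketch: FAR, FRR and HTER from binary labels, following Eqs. (4)-(6).
# Convention assumed here: 1 = bonafide (positive / should be accepted),
# 0 = attacker (negative / should be rejected).
import numpy as np

def far_frr_hter(y_true, y_pred):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    fp = np.sum(~y_true & y_pred)        # attackers accepted as bonafide
    fn = np.sum(y_true & ~y_pred)        # bonafide users rejected
    negatives = np.sum(~y_true)          # total actual attackers
    positives = np.sum(y_true)           # total actual bonafide
    far = fp / negatives if negatives else 0.0   # Eq. (5)
    frr = fn / positives if positives else 0.0   # Eq. (6)
    hter = (far + frr) / 2.0                     # Eq. (4)
    return float(far), float(frr), float(hter)

# Toy example: 4 bonafide and 4 attacker samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(far_frr_hter(y_true, y_pred))   # -> (0.25, 0.25, 0.25)
```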
2.4. Cross-Database Testing: Evaluating Liveness Detection Robustness
Cross-database testing, often deemed a gold standard for evaluating the generalization ability of a machine learning model, involves training the model on one database (or dataset) and subsequently testing its performance on an entirely different database. This methodology is paramount for assessing the capability of a model to adapt and perform reliably in real-world scenarios where it encounters data distributions that it has not been directly exposed to during training.

Rationale for Cross-Database Testing:
1. Model Generalization: Traditional training and testing on the same dataset can sometimes lead to overfitting, where the model memorizes the specific characteristics of the training set, leading to excellent training performance but poor generalization to new data. Cross-database testing helps in ensuring that the model's performance is genuinely reflective of its ability to generalize across various data sources.
2. Diversity in Data: Different databases can often have diverse data collection protocols, varied demographic distributions, and even different types of spoofing attacks. Testing a model across such diverse datasets can provide insights into its robustness and adaptability.
3. Benchmarking: Cross-database testing also sets a benchmark for comparing different liveness detection algorithms. A model that consistently performs well across multiple datasets is typically considered more robust and reliable.

In the context of our study, with a total of five datasets at our disposal, we embarked on a rigorous cross-database testing regime. The model was systematically trained on each dataset in turn and then tested on the remaining four, resulting in a comprehensive matrix of training-testing combinations (a schematic sketch of this protocol is given at the end of this subsection). This procedure facilitated a meticulous investigation into the reliability of liveness detection under diverse conditions and challenges.

For biometric systems, the stakes are exceptionally high. A system trained exclusively on one database might perform flawlessly on that particular data but might falter when encountering a slightly different spoofing technique or a different demographic distribution. Thus, cross-database testing is not merely an academic exercise; it directly impacts the real-world reliability and security of biometric systems. By gauging performance across multiple datasets, we ensure that our liveness detection model is not only precise but also resistant to diverse spoofing challenges.

In conclusion, while single-database evaluations provide valuable insights into model performance, cross-database testing unveils the broader picture, shedding light on the robustness and generalization capability of the model. This comprehensive assessment is instrumental in advancing the state-of-the-art in liveness detection, ensuring the development of systems that are both secure and inclusive.
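The following schematic sketch mirrors that protocol: it trains a fresh model on each dataset in turn and evaluates it on every other one. It reuses the illustrative build_attacknet_sketch and far_frr_hter functions from the earlier sketches and substitutes synthetic random tensors for the real datasets, so it shows the loop structure only and is not the training configuration actually used in this study.

```python
# Schematic version of the cross-database protocol: train on each dataset in
# turn, then evaluate on every other one.  build_attacknet_sketch() and
# far_frr_hter() are the illustrative sketches shown earlier; the loader below
# returns synthetic random data purely so the loop is runnable end-to-end.
import numpy as np

DATASETS = ["MSSpoof", "3DMAD", "CSMAD", "ReplayAttack", "OurDataset"]

def load_dataset(name: str, split: str, n: int = 64):
    """Placeholder loader: random images plus balanced bonafide/attacker labels."""
    rng = np.random.default_rng(abs(hash((name, split))) % 2**32)
    x = rng.random((n, 224, 224, 3), dtype=np.float32)
    y = np.tile([0, 1], n // 2)          # 0 = attacker, 1 = bonafide
    return x, y

results = {}
for train_name in DATASETS:
    x_train, y_train = load_dataset(train_name, split="train")
    model = build_attacknet_sketch()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x_train, y_train, epochs=2, batch_size=16, verbose=0)

    for test_name in DATASETS:
        if test_name == train_name:
            continue                      # keep only cross-database pairs
        x_test, y_test = load_dataset(test_name, split="test")
        y_pred = model.predict(x_test, verbose=0).argmax(axis=1)
        results[(train_name, test_name)] = far_frr_hter(y_test, y_pred)

for (tr, te), (far, frr, hter) in sorted(results.items()):
    print(f"trained on {tr:>12} | tested on {te:>12} | "
          f"FAR={far:.2f}  FRR={frr:.2f}  HTER={hter:.3f}")
```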
3. Testing Results
In this section, we present the research findings of our liveness detection study across various datasets. The results are summarized in terms of standard metrics such as Precision, Recall, F1 score, FAR, FRR, and HTER.

3.1. Training Performance Analysis
Before delving into the testing results, it is pertinent to examine the model's performance during the training phase. This approach helps us to understand not only the model's behavior and learning efficacy but also to preemptively identify and mitigate any potential issues, such as overfitting or underfitting, that are often illuminated by training dynamics.
During the training phase, we monitored key performance metrics, including Precision, Recall, and F1 Score, for both 'Bonafide' and 'Attacker' classifications. We also kept a close eye on the False Acceptance Rate (FAR) and False Rejection Rate (FRR), as these metrics provide additional insight into the model's reliability. In this scenario, learning performance was slightly higher than testing performance, which is a common finding. These results suggest that the model was trained effectively on the training data, but not to the extent of perfectly fitting (and therefore potentially overfitting) the data.
Thus, an analysis of the training metrics indicates that the model exhibits a robust learning pattern, with no significant discrepancies between performance on training data versus validation data. This consistency suggests that our model has not experienced overfitting during the training process.

3.2. Testing Performance Analysis
The testing phase evaluated the model's ability to generalize its learning from the training datasets to unseen data. Table 1 below illustrates the performance metrics observed during this phase. The testing results, particularly when compared with the training performance, confirm the model's effective generalization. The consistency across key metrics between both phases underscores the model's reliability in diverse real-world scenarios.

Table 1
Performance Metrics across Datasets
Dataset         Precision (B)  Precision (A)  Recall (B)  Recall (A)  F1 score (B)  F1 score (A)  FAR   FRR   HTER
MSSpoof         0.96           0.93           0.92        0.96        0.94          0.94          0.08  0.04  0.06
3DMAD           1.0            1.0            1.0         1.0         1.0           1.0           0.0   0.0   0.0
CSMAD           0.56           1.0            1.0         0.23        0.72          0.37          0.0   0.77  0.385
Replay Attack   0.97           0.93           0.93        0.98        0.95          0.95          0.07  0.02  0.045
Our Dataset     0.8            0.89           0.9         0.77        0.85          0.83          0.2   0.23  0.215
Note: B – Bonafide; A – Attacker

3.3. MSSpoof Dataset
The results exhibit a commendable balance between precision and recall for both classes. This implies that the model not only makes accurate predictions but also captures most of the genuine and attack instances. However, there is a slight increase in the FAR, signifying a minor vulnerability to false acceptance.

3.4. 3DMAD Dataset
The model demonstrates impeccable performance across all metrics. Such an outcome might imply an excellent alignment between training and testing distributions or might hint towards potential overfitting. Though optimistic, it is crucial to verify the authenticity of these results in real-world scenarios.

3.5. CSMAD Dataset
This dataset posed significant challenges. While the model identified bona fide instances with impeccable precision, it faltered with the attacker class. The substantial FRR indicates the model's inclination to classify many attacker instances incorrectly, which can be a significant security concern. The reasons can range from data variability and novel spoofing techniques to a distinct distribution not seen during training.

3.6. Replay Attack Dataset
Comparable to the MSSpoof dataset in performance, the model shows a slight vulnerability in FAR but excels in detecting attacker instances with high recall. This suggests that while it might occasionally admit a spoof, it rarely fails to identify an authentic attempt.

3.7. Our Dataset
Results show a balanced but slightly lowered performance, with the largest FAR value among all datasets. The dataset's inherent diversity, or the potential novelties it introduces, can challenge the model, making it more cautious and sometimes erring on the side of false acceptance.

3.8. Findings
The variation in performance across datasets underscores the criticality of diverse data representation in training robust liveness detection models. While some datasets like 3DMAD show near-perfect results, others like CSMAD reveal potential vulnerabilities. Our findings emphasize the importance of comprehensive evaluations and the necessity of cross-database testing.
A model's efficacy should not be gauged by its performance on one dataset but should be benchmarked across a plethora of datasets, ensuring readiness for real-world challenges and diverse spoofing attempts.

4. Cross-database Testing Results
In the domain of liveness detection, one of the most challenging and revealing evaluations is cross-database testing. It assesses how a model, trained on one dataset, generalizes across the different data distributions encountered in other datasets. Here, we present the outcomes of this rigorous evaluation by showcasing the results of the confusion matrices and calculating the Half Total Error Rate (HTER) for each scenario.
Table 2 gives a comprehensive representation of how each model trained on one dataset performed on the others. From the FAR and FRR values, we can understand the type of errors our models are more prone to. A high FAR indicates that the model might be too lenient, granting access to potential threats. On the other hand, a high FRR reveals that genuine attempts might be unnecessarily thwarted. For instance, when the MSSpoof model was tested on our dataset, a high FAR value of 0.71 emerged, indicating potential vulnerabilities in its authentication mechanism.

Table 2
Cross-database Testing Metrics
Trained on     Tested on      FAR   FRR   HTER
MSSpoof        3DMAD          0.16  0.81  0.485
MSSpoof        CSMAD          0.53  0.00  0.27
MSSpoof        Replay Attack  0.46  0.37  0.399
MSSpoof        Our Dataset    0.71  0.26  0.391
3DMAD          MSSpoof        0.67  0.11  0.347
3DMAD          CSMAD          0.21  0.00  0.055
3DMAD          Replay Attack  0.41  0.10  0.301
3DMAD          Our Dataset    0.62  0.02  0.194
CSMAD          MSSpoof        0.10  0.90  0.514
CSMAD          3DMAD          0.25  0.91  0.207
CSMAD          Replay Attack  0.08  0.84  0.441
CSMAD          Our Dataset    0.38  0.61  0.473
Replay Attack  MSSpoof        0.93  0.02  0.395
Replay Attack  3DMAD          0.00  0.69  0.125
Replay Attack  CSMAD          0.00  0.29  0.071
Replay Attack  Our Dataset    0.47  0.23  0.213
Our Dataset    MSSpoof        0.87  0.03  0.369
Our Dataset    3DMAD          0.00  0.97  0.168

In contrast, a relatively low FRR means that most genuine attempts were correctly identified. In this cross-database analysis, several patterns emerge. The 3DMAD-trained model, for instance, had consistently low FRR across different datasets, indicating its robustness in recognizing genuine attempts. The Replay Attack model, on the other hand, exhibited a high FAR when tested on the MSSpoof dataset, pointing towards its vulnerabilities. Examining FAR and FRR alongside HTER offers a more nuanced perspective on the strengths and weaknesses of each model across different data distributions (a short sketch that arranges these training-testing pairs into a matrix is given below).
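The following optional snippet shows one way to arrange the per-pair results from the earlier cross-database loop into a trained-on by tested-on HTER matrix. The results dictionary is the one produced by that sketch, and the use of pandas here is an assumption made purely for presentation.

```python
# Arrange the (train, test) -> (FAR, FRR, HTER) results from the earlier
# sketch into a trained-on x tested-on HTER matrix for quick inspection.
import pandas as pd

rows = [{"trained_on": tr, "tested_on": te, "HTER": hter}
        for (tr, te), (_far, _frr, hter) in results.items()]
matrix = pd.DataFrame(rows).pivot(index="trained_on",
                                  columns="tested_on", values="HTER")
print(matrix.round(3))   # diagonal entries are absent by construction
```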
4.1. Models on 3DMAD
The 3DMAD-trained model exhibited the lowest HTER when tested on CSMAD, which underscores the similarity in distribution or potentially shared spoofing techniques. However, when tested on MSSpoof, there was a rise in HTER, suggesting the two datasets might have different data characteristics.

4.2. Models on MSSpoof
The highest HTER was observed when the MSSpoof model was tested on 3DMAD, suggesting possible disparities between the two datasets. A relatively lower HTER on CSMAD highlights that certain shared characteristics could benefit the model's generalization.

4.3. Models on CSMAD
Training on CSMAD and testing on MSSpoof produced the highest HTER, suggesting that the CSMAD model might not generalize well to the MSSpoof distribution. Conversely, it performed reasonably better on 3DMAD, which may indicate some shared nuances between these datasets.

4.4. Models on Replay Attack
Surprisingly, this model, when tested on CSMAD, showcased one of the lowest HTER values. This indicates that, despite the datasets' differences, the model captures some essential liveness characteristics that are generalizable.

4.5. Models on Our Dataset
The model generalized well across all datasets with relatively low HTER values. The lowest HTER on the Replay Attack dataset suggests potential similarities in spoofing techniques or data distribution.

4.6. Findings
The rigorous process of cross-database testing sheds light on critical aspects of biometric authentication systems, particularly the robustness and adaptability of liveness detection models amidst varied data distributions. Through this methodology, our study underscores several key insights:
• Evaluating Generalization Capabilities: To quantify a model's ability to generalize, we extended our analysis beyond mere performance metrics. We introduced scenarios encompassing unfamiliar data, assessing the model's predictive accuracy and consistency across diverse datasets. This approach illuminated the model's resilience, or lack thereof, to variations and anomalies not present in the training data, thereby providing a tangible measure of its generalization capabilities.
• Inconsistency in Cross-Database Performance: Remarkably, several models exhibiting high efficacy on their native datasets encountered significant challenges when subjected to data from external sources. This inconsistency is indicative of a common pitfall: models, if overly tuned or biased towards specific dataset characteristics, may fail to maintain performance parity across broader biometric variations. Such deficiencies become apparent only through meticulous cross-database testing.
• Implications for Model Training Strategies: The evident fluctuations in performance across different databases underscore the imperative of incorporating diverse, multifaceted datasets into the training phase. This diversity safeguards against overfitting and cultivates a more holistic, adaptable model. Specifically, systems trained on a richer mixture of biometric data exhibit enhanced robustness, mitigating the risk of accuracy degradation when transitioning to unfamiliar environments.
• Strengthening Liveness Detection Systems: Our findings advocate for a paradigm shift in developing liveness detection models. Moving forward, emphasis must be placed on constructing datasets with comprehensive real-world variabilities and on devising testing protocols that simulate diverse adversarial conditions. These strategies ensure that future models are equipped with genuine resilience against spoofing techniques, irrespective of their nature or origin.
In conclusion, this analysis accentuates the necessity of an exhaustive, cross-database testing approach, one that transcends conventional evaluation methods. By exposing models to an array of biometric datasets, we unearth indispensable insights into their true robustness and generalization prowess, informing more reliable, secure biometric verification systems for the future.

5. Comparative Analysis
In this section, we examine the outcomes of our approach juxtaposed with results from other pivotal studies, notably those employing the Fully Convolutional Network (FCN) combined with the Spatial Aggregation of Pixel-level Local Classifiers (SAPLC) strategy and the Convolutional Neural Network (CNN) approach [10,11]. To comprehensively compare the performance of liveness detection models, it is essential to contrast our findings with existing literature.
The table below aggregates results from various studies, including our own, focusing on the HTER metric across datasets and different training strategies.

Table 3
Comparative HTER Results
Source      Model & Strategy  Trained on     Tested on      HTER
This study  Our Model         Replay Attack  Replay Attack  0.045
This study  Our Model         Replay Attack  3DMAD          0.125
This study  Our Model         3DMAD          3DMAD          0.000
This study  Our Model         3DMAD          Replay Attack  0.301
[10]        FCN + SAPLC       Replay Attack  Replay Attack  0.004 to 0.132
[10]        FCN + SAPLC       Replay Attack  CASIA-FASD     0.375
[10]        FCN + SAPLC       CASIA-FASD     Replay Attack  0.273
[11]        CNN               Replay Attack  Replay Attack  0.039
[11]        CNN               3DMAD          3DMAD          0.000
[11]        CNN               3DMAD          CASIA          0.399
[11]        CNN               CASIA          Replay Attack  0.414

In summation, while each approach brings its own set of strengths to the table, our model showcases promising results, especially when considering its robustness in cross-database scenarios:
• Within-Database Results: When training and testing on the same dataset (intra-dataset), our model achieved impressive HTER scores, especially for the 3DMAD dataset (0.000). This parallels the results of the CNN from [11] for the 3DMAD dataset, which also exhibited a perfect HTER of 0.000. On the Replay Attack dataset, our model achieved an HTER of 0.045, competitive with the 0.004 to 0.132 range reported for the FCN+SAPLC strategy of [10].
• Cross-Database Results: In cross-database scenarios, where the model is trained on one dataset and tested on another, our model demonstrated competitive, if not superior, results. Particularly for training on Replay Attack and testing on 3DMAD, our model's HTER of 0.125 closely trailed the performance of the CNN strategy from [11], but notably outperformed the FCN+SAPLC strategy of [10] for analogous cross-database settings.
• Generalizability: A closer analysis of cross-database HTERs elucidates that our model offers commendable generalizability across diverse datasets. This is especially evident when comparing our results to those of [10] and [11], where our approach consistently performs on par with or surpasses the reported outcomes, particularly in cases where training and testing datasets were diverse.
• Novelty of Our Approach: The performance of our model, especially in cross-database scenarios, underscores the robustness and generalizability ingrained in our method. It implies that our model is equipped with the capacity to learn more generic features, which, in the field of liveness detection, is paramount.
• Comparison to State-of-the-Art: Both the traditional CNN and the FCN combined with SAPLC have been recognized as benchmark methods in liveness detection. Our model's ability to produce competitive results in direct juxtaposition with these techniques is testament to its efficacy.
The findings accentuate the need for continued exploration and refinement in the domain, particularly in optimizing models for generalizability across diverse real-world scenarios.

6. Discussion
In the rapidly evolving domain of liveness detection, the development of robust and adaptable models remains a paramount pursuit. Our study presented a detailed analysis, offering insights into the effectiveness of various strategies, especially when applied across different datasets. This section delves deeper, examining the broader implications of our findings, potential limitations, and avenues for future research.

6.1. Implications
• Model Generalizability: One of the salient observations from our research is the importance of model generalizability.
In real-world applications, a model trained on one dataset may encounter inputs that belong to a different distribution. Our findings underscore the significance of designing models capable of maintaining high performance across such scenarios.
• Metric Significance: While metrics like HTER, FAR, and FRR are invaluable in assessing model efficacy, our study emphasizes the intricate balance that must be achieved. A model with low HTER but high variability in FAR and FRR might be less desirable in certain practical applications than one with slightly higher HTER but consistent FAR and FRR.
• Impact of Diverse Data: The variability observed in the performance of models across different datasets illuminates the challenge posed by diverse data. It not only accentuates the complexity of the problem at hand but also highlights the necessity of diverse training data to encompass possible real-world scenarios.

6.2. Limitations
• Data Constraints: Our study was bound by the datasets available. While our datasets are comprehensive, they may not capture all possible presentation attack instruments or scenarios. This poses a limitation to the generalizability of our findings.
• Computational Resources: Like many deep learning approaches, our model's training and evaluation are computationally intensive, which might pose challenges in real-time deployment scenarios or on devices with constrained resources.
• Absence of Adversarial Testing: While our study delved deeply into cross-database testing, we did not explore the model's resilience against adversarial attacks, which is becoming increasingly relevant in the realm of biometric security.

6.3. Future Directions
• Incorporation of Adversarial Techniques: Given the escalating sophistication of spoofing attacks, future research should investigate the incorporation of adversarial training techniques to enhance model robustness further.
• Optimization for Real-time Deployment: The model's architecture can be further refined, and techniques like model quantization or pruning could be employed to make it more conducive to real-time applications.
• Expansion of Dataset Diversity: To further bolster the model's generalizability, future studies should consider amassing and utilizing more diverse datasets, possibly even crowd-sourcing real-world data, which might offer richer and more unpredictable variations.

In conclusion, our study represents a step forward in the quest for reliable liveness detection, shedding light on various nuances of the challenge. However, as with all scientific endeavors, it also underscores the ever-present need for continued research, refinement, and evolution in the domain.

7. Conclusion
The landscape of biometric security has experienced rapid advancements in recent years, underscored by an escalating arms race between state-of-the-art detection mechanisms and increasingly sophisticated spoofing attacks. Liveness detection, in this context, emerges not merely as a feature but as a necessity, pivotal in ensuring the integrity and reliability of biometric systems. Our study was rooted in this paradigm, endeavoring to discern the effectiveness, nuances, and potential avenues for improvement in liveness detection models, particularly when confronted with the challenges of cross-database testing. Our findings elucidated several key insights.
Firstly, the performance variability across different datasets underscores the intricacies involved in modeling and emphasizes the quintessential role of data diversity in training robust models. Additionally, while metrics such as HTER provide a comprehensive measure of a model's performance, delving deeper into the balance between FAR and FRR unveils critical nuances that have profound implications, especially in real-world deployment scenarios. Comparative analysis with previous studies revealed both the progress made in the domain and the areas where challenges remain. While our model exhibited commendable performance in certain scenarios, the inconsistencies observed in cross-database testing illuminate the path for future research. There are a few takeaways from our research. The journey towards perfecting liveness detection is ongoing, replete with challenges yet filled with opportunities. The richness of data, the adaptability of models, and the continuous evolution of techniques are the keystones upon which the edifice of reliable biometric security will be built. As spoofing techniques evolve, so must our defense mechanisms, making this a perpetually dynamic field of study. In closing, our research contributes to the broader dialogue on liveness detection, offering a synthesis of findings, methodologies, and reflections that we hope will serve as a foundation for future endeavors in this domain. The nexus of technology, security, and human identity is a complex tapestry, and it is our fervent hope that our work adds a meaningful thread to this ever-evolving narrative. 8. Acknowledgements • This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101007820 - TRUST. • This publication reflects only the author’s view and the REA is not responsible for any use that may be made of the information it contains. 9. References [1] T. Van hamme, G. Garofalo, S. Joos, D. Preuveneers, W. Joosen, AI for Biometric Authentication Systems, in: L. Batina, T. Bäck, I. Buhan, S. Picek (Eds.), Security and Artificial Intelligence: A Crossdisciplinary Approach, Springer International Publishing, Cham, 2022, pp. 156–180. doi: https://doi.org/10.1007/978-3-030-98795-4_8. [2] G. Hua, Facial Recognition Technologies, in: L.A. Schintler, C.L. McNeely (Eds.), Encyclopedia of Big Data, Springer International Publishing, Cham, 2022, pp. 475–479. doi: https://doi.org/10.1007/978-3-319-32010-6_93. [3] S. Marcel, J. Fierrez, N. Evans (Eds.), Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment, Springer Nature, Singapore, 2023. doi: https://doi.org/10.1007/978-981-19-5288-3. [4] C. Lucia, G. Zhiwei, N. Michele, Biometrics for Industry 4.0: a survey of recent applications, Journal of Ambient Intelligence and Humanized Computing 14 (2023) 11239–11261. doi: https://doi.org/10.1007/s12652-023-04632-7. [5] A. George, S. Marcel, Robust Face Presentation Attack Detection with Multi-channel Neural Networks, in: S. Marcel, J. Fierrez, N. Evans (Eds.), Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment, Springer Nature, Singapore, 2023, pp. 261–286. doi: https://doi.org/10.1007/978-981-19-5288-3_11. [6] J. Hernandez-Ortega, J. Fierrez, A. Morales, J. Galbally, Introduction to Presentation Attack Detection in Face Biometrics and Recent Advances, in: S. Marcel, J. Fierrez, N. 
Evans (Eds.), Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment, Springer Nature, Singapore, 2023, pp. 203–230. doi: https://doi.org/10.1007/978-981-19-5288-3_9. [7] S.-Q. Liu, P. C. Yuen, Recent Progress on Face Presentation Attack Detection of 3D Mask Attack, in: S. Marcel, J. Fierrez, N. Evans (Eds.), Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment, Springer Nature, Singapore, 2023, pp. 231–259. doi: https://doi.org/10.1007/978-981-19-5288-3_10. [8] Z. Rui, Z. Yan, A Survey on Biometric Authentication: Toward Secure and Privacy-Preserving Identification, IEEE Access 7 (2019) 5994–6009. doi: https://doi.org/10.1109/ACCESS.2018.2889996. [9] S. Chakraborty, D. Das, An Overview of Face Liveness Detection, arXiv:1405.2227 [cs.CV] (2014). URL: http://arxiv.org/abs/1405.2227. [10] S. Arora, M.P.S. Bhatia, V. Mittal, A robust framework for spoofing detection in faces using deep learning, The Visual Computer 38 (2022) 2461–2472. doi: https://doi.org/10.1007/s00371-021-02123-4. [11] W. Sun, Y. Song, C. Chen, J. Huang, A.C. Kot, Face Spoofing Detection Based on Local Ternary Label Supervision in Fully Convolutional Networks, IEEE Transactions on Information Forensics and Security 15 (2020) 3181–3196. doi: https://doi.org/10.1109/TIFS.2020.2985530. [12] A. Kuznetsov, M. Andrea, M. Alessandro, L. Romeo, R. Rosati, K. Davyd, Deep Learning Based Face Liveliness Detection, in: 2022 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology, 2022. [13] Custom Silicone Mask Attack Dataset (CSMAD), Idiap Research Institute, Artificial Intelligence for Society. URL: https://www.idiap.ch/en/dataset/csmad/index_html. [14] 3DMAD, Idiap Research Institute, Artificial Intelligence for Society. URL: https://www.idiap.ch/en/dataset/3dmad/index_html. [15] Multispectral-Spoof (MSSpoof), Idiap Research Institute, Artificial Intelligence for Society. URL: https://www.idiap.ch/en/dataset/msspoof/index_html. [16] Replay-Attack, Idiap Research Institute, Artificial Intelligence for Society. URL: https://www.idiap.ch/en/dataset/replayattack/index_html.