                         Contribution of residual signals to the detection of face
                         swapping in deepfake videos
                         Paul Tessé1,† , Emmanuel Giguet1,† and Christophe Charrier1,*,†
                         1
                             Université de Caen Normandie, ENSICAEN, CNRS, GREYC, Normandie Univ, F-14000 Caen, France


                                       Abstract
                                       The remarkable ascent of Deep Learning, particularly with the emergence of generative adversarial networks,
                                       has transformed the landscape of Deepfake technology. These fabrications are evolving into remarkably realistic
                                       renditions, posing greater challenges for detection. Verifying the authenticity of video content has become
increasingly delicate. Moreover, the widespread availability of forgery tools is a growing concern. Although numerous detection techniques have been proposed, determining their efficacy in the face of rapid advancements is challenging. Hence, this paper introduces an approach to detect face swapping in videos using residual signal analysis.

                                       Keywords
                                       Deepfake videos, Face swapping, Residual error, Digital forensics, Deep learning




                         1. Introduction
                         In today’s hyper-connected society, billions of units of data traverse our networks daily. Regrettably,
                         there’s a growing uncertainty regarding the reliability and safety of this data. This concern is particularly
                         pronounced when considering the vast dissemination and sharing of videos and images, which account
                         for a substantial portion of this data flow. With around 5.35 billion internet users worldwide, each
                         person can potentially generate approximately 15.87 TB of data daily. From those data, approximately
                         28.08 billion photos were stored online daily in 2023. Moreover, accessing video content has become
                         exceedingly convenient with the prevalence of mobile devices, streaming services, the internet, and
                         social media platforms. For instance, videos constituted 28% of all internet traffic in 2022, and 625
                         million videos are viewed on TikTok every internet minute, that’s up from 167 million just two years
                         ago [2].
                            The rapid growth in Deep Learning schemes has led to the emergence of numerous efficient models
                         for generating false images or videos. As a matter of fact, while these models are becoming increasingly
                         powerful, they are also becoming more and more accessible to the public thanks to the internet and
                         social media. We are observing a notable surge in the proliferation of counterfeit multimedia content,
                         notably hyper-realistic videos, commonly referred to as "deepfakes". For instance, AI-powered software
                         applications such as FaceApp [3] and FakeApp [4] have been employed for the creation of convincingly
                         realistic face swaps in both images and videos. This capability enables users to modify facial appearances,
                         hairstyles, genders, ages, and other personal features. The dissemination of such manipulated videos
                         has raised significant concerns and has gained notoriety as Deepfake technology. The main threats are
the increasing difficulty of distinguishing real from fake, even for informed people, and the resulting inability to
                         trust images and videos even though they are ubiquitous media. Detecting these manipulated videos
                         has become a crucial societal challenge.
As a consequence, many researchers have studied deepfake detection, proposing methods mainly based on the use of Deep Learning [5]. However, although the state-of-the-art

CVCS2024: the 12th Colour and Visual Computing Symposium, September 5–6, 2024, Gjøvik, Norway
* Corresponding author.
† These authors contributed equally.
Email: emmanuel.giguet@unicaen.fr (E. Giguet); christophe.charrier@unicaen.fr (C. Charrier)
Web: https://giguete.users.greyc.fr (E. Giguet); https://charrierc.users.greyc.fr/ (C. Charrier)
ORCID: 0000-0001-8617-0091 (E. Giguet); 0009-0004-6427-2694 (C. Charrier)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


models show very good performance, their durability and their ability to generalize to any database are lacking. Moreover, the main drawback of these models lies in the fact that they are usually designed and used as black boxes: it is not really possible to justify their verdicts, which makes these detectors unusable in practice.
   In this paper, an explainable model that takes a video as input and returns a verdict on its authenticity
as output is proposed. The problem is therefore characterized as a binary classification problem where
the classes are defined as "authentic" and "forged". The secondary objectives considered in this paper
are the following:

    • the system must be able to be used to support the judicial system, and a particular attention must
      be paid to the explainability of the results;
    • the model must work without any reference to pronounce its diagnosis;
    • the model must be as robust and generalizable as possible;
    • the approach focuses on face swapping detection, and a face detection mechanism is required in
      order to target the area to be studied;
    • the model must work regardless of the length of the video and regardless of the position or the
      moment where the fake part appears.

  Ultimately, the aim is to achieve the best possible trade-off between efficiency and explainability. The aim is not to come up with a perfect solution, but rather a proof of concept to determine whether or not the proposed approach is viable.


2. State-of-the-art methods
Although deepfake technology has the potential for beneficial applications, such as in film-making and
virtual reality, it is predominantly used with malicious intent [6, 7]. Among all manipulated videos, our focus is on face swapping. Various methods have been proposed to detect forged videos. Detection based on a general Convolutional Neural Network (CNN) is common in the literature, where deepfake detection is treated as a classification task. In those methods, face images are extracted from the suspicious video and used to train the CNN. The learned model is then applied to predict whether the video is real or fake.
   In [11], Güera and Delp present a deep architecture split into a feature extractor part and a classifier part. The feature extractor uses convolutional networks to extract spatial features from images, while an LSTM is used to extract temporal features. For the classifier, the authors use a likelihood estimation to predict the probability of a video being real or fake. They report very good performance with this method, with a detection accuracy of 97.1%. However, these results remain poorly explainable: it is difficult to interpret the verdict since it is based on features that are not easily explainable. Whatever the scheme based on this strategy, the detection accuracy heavily relies on both the chosen neural network and the training dataset, with no attempt to exploit specific distinguishable features.
   Recently, many methods have sought to analyze what are known as residual signals. These are characteristics intrinsic to an image that are generated during the acquisition process and that can be altered during the face swapping process. Some methods based on mid-level manipulation traces are devoted to finding inconsistencies in fake content. In [1], Gao et al. explored the inconsistency of identity information between inner and outer faces. In [8], Wu et al. regarded DeepFake detection as a source detection task and utilized the multi-scale spatiotemporal photoplethysmography map from multiple facial regions to capture counterfeit cues. According to Li et al. [9], fake faces typically involve a fusion step. Based on this assumption, they proposed Face X-ray to detect the presence of face fusion boundaries in images and use it as a metric to measure authenticity.
   Despite the above papers reporting good generalization performance, the extracted cues only provide
limited global information which, combined with complex learning models, increases computational
complexity and training difficulty. To address these challenges more effectively, this paper introduces a
Figure 1: Synopsis of the proposed architecture. Faces are extracted from the video frames; the Quality Features Extractor FE1 produces quality features $q_1, \ldots, q_{37}$ and the Frequency Features Extractor FE2 produces the frequency feature $f_1$; the concatenated feature vector is passed to a CNN that classifies the frame as Fake or Real.


straightforward deepfake detection method whose features can all be explained, thus combining explainability and performance.
   Considering a video as a succession of images, we investigate how these residual signals can be analyzed frame by frame in both the spatial and frequency domains. These signals, which are invisible to the naked eye and act like a hidden signature, are varied and interpretable, enabling us to extract explainable features that can be used for prediction.


3. The proposed approach
To address both explainability and performance, we propose a lightweight architecture in which all the features used can be explained.
   Based on a classical two-part architecture (extractor and classifier), the proposed architecture is
depicted in Fig. 1. First, the input video is split into frames from which the faces are extracted. The
obtained faces are then analyzed by a set of designed Features Extractors (FE), which generate a vector
of explainable features. Finally, these explainable features are concatenated and passed on to the deep
classifier, which gives its verdict at frame level.
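
As a minimal sketch of the first stage of this pipeline (frame extraction), assuming OpenCV; the function name and subsampling parameter are illustrative, not from the paper:

```python
import cv2

def video_to_frames(path: str, step: int = 1):
    """Yield RGB frames from a video file, keeping one frame out of every `step`."""
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of stream
            break
        if idx % step == 0:
            # OpenCV decodes to BGR; convert to RGB for downstream face detection.
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        idx += 1
    cap.release()
```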

3.1. Face extraction
For each video, a face detector is used to locate the face area in the image. Among all solutions, we used
the Multi-task Cascaded Convolutional Neural Networks (MTCNN) introduced by Zhang et al. [10].
  Face swapping generates artifacts around the area where the face has been swapped that are invisible to the naked eye. Extracting only the face therefore does not guarantee that such hidden artifacts are captured, since it focuses only on the face whereas the artifacts arise in a larger area around it. A margin is thus applied when faces are extracted.
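
A minimal sketch of this step using the facenet-pytorch implementation of MTCNN [10]; the crop size and margin values here are illustrative assumptions, not taken from the paper:

```python
from facenet_pytorch import MTCNN
from PIL import Image

# `margin` adds extra pixels around the detected bounding box so that the
# blending region around a swapped face is included in the crop.
mtcnn = MTCNN(image_size=224, margin=40, post_process=False)

img = Image.open("frame_0001.png")
face = mtcnn(img)  # tensor of shape (3, 224, 224), or None if no face is found
```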

3.2. Features FE𝑖
Face swapping entails seamlessly replacing a face from a source image with another face in a target image, achieving a natural blend between the replacement and the target image. Since the source and target images come from two distinct videos, their quality is not necessarily the same. Assessing the quality of the face with regard to its neighborhood can therefore provide relevant information. In addition, pasting a face into an image may introduce artificial high frequencies into its neighborhood. Capturing such a variation may help to detect face swapping. In this paper, we investigate the residual signals alluded to above.
Figure 2: Illustration of face extraction using MTCNN, (a) without and (b) with a margin around the face.


3.2.1. Image Quality Assessment Features




Figure 3: Quality features extraction process [26]


   The first residual signal we focus on is image quality. Indeed, it can be observed that image falsification processes tend to reduce the quality of the original images by introducing artifacts. In [13], a detection method based on image quality analysis was presented. This work was taken up more recently in [14], where the authors greatly increased the number of quality measurements carried out and tested their regression model on state-of-the-art databases.
   Since we only have access to the video and process it frame by frame to decide if it is a forged or real
video, we consider No-Reference Image Quality Assessment (NR-IQA) algorithms. Among all available
schemes, we selected the Blind/Referenceless Image Spatial Quality Evaluator also known as BRISQUE
[15] due to its high correlation with human judgments. The principle of this method is schematized in
Fig. 3.
   BRISQUE analyzes spatial-domain features within the image, such as the local mean and standard deviation. A machine learning model is then trained on a dataset of natural images with mean opinion scores to predict the image quality score. Higher BRISQUE scores indicate lower image quality, while lower scores suggest higher quality. Furthermore, BRISQUE has very low computational complexity and is a distortion-agnostic NR-IQA metric, which is useful for our purpose since we do not know which kinds of distortion may appear in deepfake video frames.
   BRISQUE is based on a two-scale approach where 18 features are computed per scale, resulting in a total of 36 quality features. In our approach, we choose to compute 37 features associated with image quality: the 36 BRISQUE features along with the predicted overall quality score.
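
As an illustrative sketch of this spatial analysis, the core of BRISQUE is the computation of mean-subtracted contrast-normalized (MSCN) coefficients from a local Gaussian-weighted mean and standard deviation; a minimal NumPy/SciPy version follows (the Gaussian width is a common choice; BRISQUE [15] then fits generalized Gaussian models to these coefficients and their pairwise products to obtain the 36 features):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(gray: np.ndarray, sigma: float = 7 / 6, c: float = 1.0) -> np.ndarray:
    """Mean-subtracted contrast-normalized coefficients, the basis of BRISQUE [15]."""
    gray = gray.astype(np.float64)
    mu = gaussian_filter(gray, sigma)                    # local Gaussian-weighted mean
    var = gaussian_filter(gray * gray, sigma) - mu * mu  # local variance
    sigma_map = np.sqrt(np.maximum(var, 0.0))            # local standard deviation
    return (gray - mu) / (sigma_map + c)                 # c avoids division by zero
```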

3.2.2. Frequency spectrum analysis
Given the advantages of fast operation speed and energy concentration, Discrete Cosine Transform
(DCT) is often used for many tasks. Altering the structure of an image when faces are swapped for
instance introduces artificial high frequencies. In [16], the authors investigated the impact of the
deepfake generation process on the frequency spectrum of images. Their results, illustrated in Fig. 4, clearly indicate that deepfakes have a higher high-frequency intensity than real images and that it
is possible to use this property to detect deepfakes. The origin of this phenomenon lies in the use of
Generative Adversarial Networks (GANs), which are the keystone of modern deepfake models. These
models are forced to use upsampling to generate images, which introduces more high frequencies due
to interpolation.
   We propose to extract features in the DCT domain to deal with the problem of artificial high frequencies. At this step, the image is divided into equally sized $n \times n$ blocks and a local 2-dimensional DCT transform is computed for each of these blocks. For each frequency block $F$, we compute the ratio between the standard deviation and the mean of the absolute frequency coefficients, defined as:

$$ r = \frac{\sigma_{|F|}}{\mu_{|F|}} \qquad (1) $$

The average of the highest 10th percentile of the local block ratios across the image is then computed. This pooling result is used as the frequency feature.
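
A minimal sketch of this frequency feature (the block size $n = 8$ is an assumption, since the paper does not specify it; SciPy provides the 2-D DCT):

```python
import numpy as np
from scipy.fft import dctn

def frequency_feature(gray: np.ndarray, n: int = 8, top: float = 0.10) -> float:
    """Mean of the highest 10th percentile of per-block ratios r = sigma(|F|) / mu(|F|)."""
    h, w = gray.shape
    ratios = []
    for i in range(0, h - n + 1, n):
        for j in range(0, w - n + 1, n):
            block = gray[i:i + n, j:j + n].astype(np.float64)
            f = np.abs(dctn(block, norm="ortho"))  # |F|: magnitudes of the local 2-D DCT
            mu = f.mean()
            if mu > 0:
                ratios.append(f.std() / mu)        # Eq. (1)
    ratios = np.sort(np.asarray(ratios))
    k = max(1, int(np.ceil(top * len(ratios))))    # top 10% of block ratios
    return float(ratios[-k:].mean())               # pooled frequency feature
```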
  Finally, a comprehensive set of 38 explainable features is computed, encompassing the 37 quality features along with the frequency trait.

3.3. The designed deep classifier
The designed lightweight classifier is mainly based on a succession of linear layers and ReLU activations. Fig. 5 provides a detailed description of its characteristics. A batch normalization is applied to the input in order to re-center and re-scale the input features. Then a succession of two blocks, each consisting of a linear layer and a ReLU activation, is applied. A dropout regularization with a rate of 0.33 follows to prevent overfitting. A further block consisting of a linear layer and a ReLU activation is then applied, followed by a final linear layer and a sigmoid activation function. The loss used is the Binary Cross Entropy with Logits function, since we are operating within a binary classification framework with relatively evenly distributed classes.

Figure 4: Frequency spectrum analysis

Figure 5: The designed deep classifier

   The proposed classifier has been trained using a decreasing learning rate starting at 0.005, with 100 epochs and a batch size of 1024. The Early Stopping technique is used for regularization and to enhance model generalization.
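
A minimal PyTorch sketch of such a classifier follows; the hidden-layer widths, the optimizer, and the exact learning-rate decay schedule are assumptions, since those details appear only in Fig. 5:

```python
import torch
import torch.nn as nn

class LightClassifier(nn.Module):
    """Light deep classifier: 38 explainable features in, one fake/real logit out."""
    def __init__(self, n_features: int = 38, hidden: int = 64):  # hidden width assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(n_features),              # re-center and re-scale the inputs
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Dropout(0.33),                        # regularization against overfitting
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                    # logit; sigmoid applied at inference
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = LightClassifier()
loss_fn = nn.BCEWithLogitsLoss()  # binary cross entropy with the sigmoid built in
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
# decreasing learning rate; the exact decay schedule is an assumption
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
```

With BCEWithLogitsLoss, the final sigmoid is applied only at inference time, since the loss applies it internally during training.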


4. Results
4.1. Experimental setup
To evaluate the performance of the proposed method, four databases have been selected:
   1. VidTIMIT dataset [17], which comprises video and corresponding audio recordings of 43 people reciting short sentences. This database serves as the source of real samples.
   2. DeepFakeTIMIT dataset [14], which contains videos where faces are swapped using an open-source GAN-based approach which, in turn, was developed from the original autoencoder-based Deepfake algorithm. A total of 620 face-swapped videos is provided.
   3. FF++ dataset [18] consisting of 1,000 original video sequences that have been manipulated with
      four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures.
      The data has been sourced from 977 YouTube videos, and all videos contain a trackable, mostly
      frontal face without occlusions which enables automated tampering methods to generate realistic
      forgeries.
   4. Celeb-DF [19] is a large-scale challenging dataset for deepfake forensics. It includes 590 original
      videos collected from YouTube with subjects of different ages, ethnic groups and genders, and
      5,639 corresponding DeepFake videos.
  From the databases listed above, we generate one new database containing 79,385 real and 85,826 fake frames extracted from

    • 300 real videos randomly selected from the VidTIMIT database,
    • 320 fake videos randomly selected from the DeepFakeTIMIT database,
    • 200 real and 600 fake videos both randomly selected from the FF++ dataset,
    • 50 real and 50 fake videos both randomly selected from the Celeb-DF database.

   From this new database, frames were then randomly separated into 4 different sets used during the learning and evaluation process of the proposed light CNN model:
   1. a training set containing 31,627 True (real) frames and 34,826 False (fake) frames,
   2. a validation set containing 13,474 True (real) frames and 15,629 False (fake) frames,
   3. a test set containing 13,590 True (real) frames and 14,466 False (fake) frames,
   4. a generalization set containing 20,694 True (real) frames and 20,905 False (fake) frames.
  The results were computed with a 5-fold process.

4.2. Performance evaluation
In order to evaluate the performance of the proposed scheme, five different measures have been used:
1) Accuracy, 2) Recall, 3) Precision, 4) F1-score and 5) AUC (Area Under the ROC Curve).
   $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively.
                           Set              Accuracy     F1     AUC    Precision   Recall
                           Train              0.96      0.96    0.99     0.97       0.95
                           Validation         0.82      0.84    0.88     0.83       0.86
                           Test               0.84      0.85    0.89     0.83       0.87
                           Generalization     0.49      0.60    0.52     0.50       0.75

Table 1
Performance evaluation of the proposed scheme.



  The Accuracy is the fraction of predictions correctly identified by the model, and is defined as:

$$ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2) $$
  The Recall is the percentage of positives well predicted by the model, defined as:

$$ \text{Recall} = \frac{TP}{TP + FN} \qquad (3) $$

The higher it is, the more the model maximizes the number of True Positives. When recall is high, the model misses few positives. However, it gives no indication of the model's predictive quality on negatives.
  The Precision is the fraction of positive predictions that are correct, defined as:

$$ \text{Precision} = \frac{TP}{TP + FP} \qquad (4) $$

  The higher the precision, the more the model minimizes the number of False Positives. When precision is high, the majority of the model's positive predictions are well-predicted positives.
  The F1-score is the harmonic mean of Precision and Recall and provides a relatively balanced assessment of the model's performance. It is defined as:

$$ \text{F1-score} = 2 \times \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5) $$

The higher the F1-score, the better the model's performance.
  The AUC is the area under the ROC curve, obtained by plotting the true positive rate against the false positive rate. It represents the overall performance of the model, i.e., the probability that a randomly chosen positive sample will be ranked higher by the model than a randomly chosen negative sample. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.
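
For reference, all five measures can be computed with scikit-learn; a self-contained sketch with illustrative labels (`y_true`) and sigmoid outputs (`y_score`):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1]               # 1 = fake, 0 = real (illustrative labels)
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8]  # classifier sigmoid outputs
y_pred = [int(s >= 0.5) for s in y_score] # threshold the sigmoid output at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # uses scores, not thresholded labels
```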

4.3. Results
Table 1 displays the obtained results. Whatever the considered measure, we obtained very good performance both in validation and testing, although, with respect to the training performance, one notes a drop in performance, a classic symptom of overfitting in Deep Learning. The results for the generalization set (Celeb-DF) are much lower, which shows that our model is not yet sufficiently robust. These results can nevertheless be explained: Celeb-DF uses the latest deepfake models, i.e., unseen data, whose quality is superior to that used in the three other sets. There is therefore a significant gap, which may explain this drop. To address this drawback, the amount of data for the training process is increased by adding samples that are more difficult to diagnose from the DFDC database [20].
  To evaluate the performance of the proposed strategy against state-of-the-art approaches, we selected
four methods. The first method, Xception [23], is an architecture based on Inception, but it replaces
the traditional modules with depthwise separable convolutions. The second method, DSP-FWA [24],
                        Xception     DSP-FWA      EfficientNetB4     EfficientNetB4ATTST     Ours
            Accuracy      0.80         0.70             0.98                  0.69           0.84
               F1         0.85         0.83             0.89                  0.72           0.85
              AUC         0.87         0.88             0.92                  0.82           0.89

Table 2
Performance evaluation obtained from the test set.


                      Xception     DSP-FWA      EfficientNetB4     EfficientNetB4ATTST      Ours
              Size   22,855,952    25,636,712     19,341,616             21,995,642        16,730

Table 3
Number of trainable parameters of each trial scheme.



uses convolutional neural networks (CNNs) to capture artifacts from the warping process needed to
adapt the new face to the source image. The third method, EfficientNetB4 [25], is currently one of the
leading networks in deepfake detection, as identified in a broader study. Lastly, EfficientNetB4ATTST
[25] builds on EfficientNetB4 by integrating an attention mechanism and siamese training to enhance
the model’s generalization capability.
   The performance of the proposed method is competitive with the analyzed state-of-the-art techniques.
Although it does not surpass the best performer, EfficientNetB4, the proposed strategy outperforms
the remaining three methods across the three performance measures used (Table 2). Overall, the
performance is generally strong on face swap generation techniques.
  Table 3 displays the number of trainable parameters for each trial state-of-the-art scheme and the proposed one. One may note that the introduced light CNN has significantly fewer parameters than the state-of-the-art methods. Although the number of parameters is limited, the results achieved are competitive with those of state-of-the-art networks. Furthermore, the proposed scheme can be trained in just a few minutes (less than 5 minutes on a Dell SP15 laptop with an NVIDIA GeForce RTX 4070, 8 GB GDDR6).
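
The trainable-parameter counts reported in Table 3 can be reproduced for any PyTorch model with a one-line sum; a sketch, reusing the LightClassifier sketched after Section 3.3:

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Count the parameters updated by the optimizer (the quantity in Table 3)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. with the classifier sketched earlier:
# print(count_trainable_parameters(LightClassifier()))
```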

4.4. Discussion
To conclude, the performance of our model is very encouraging, even if it does not outperform all of the evaluated state-of-the-art schemes.
   Potential improvements include incorporating more advanced feature extraction techniques, such as
multi-scale feature aggregation or hybrid models that combine convolutional layers with transformers.
These techniques can capture fine-grained details crucial for detecting subtle manipulations in face-
swapping.
   Another approach is the use of perceptual loss functions to enhance the model’s ability to detect
subtle artifacts introduced by face-swapping. By focusing on high-level features from both authentic
and manipulated faces, the model can better distinguish real faces from swapped ones.
   Additionally, it is essential to assess and address potential biases in the model to ensure it performs
well across different demographic groups. Implementing fairness-aware training strategies can help
achieve this goal.


5. Conclusion
This work has enabled us to experiment with a hybrid approach between traditional forensic methods and state-of-the-art deepfake detection methods based on Deep Learning. Our architecture is more explainable and therefore viable in an integration context. Moreover, its good performance with just 38 features and little data suggests that, by digging deeper in this direction, it would be possible to achieve very good results. Finally, our architecture also has the advantage of being lightweight and quick to train and use, which is increasingly rare with the development of Deep Learning. It also makes it easy to integrate new feature extraction modules, which guarantees its durability. Future work will investigate how new residual signals computed in the spatial, frequency, and color domains may help to increase the performance of the approach.


References
[1] Gao, J., Concas, S., Orrù, G., Feng, X., Marcialis, G.L., Roli, F. Generalized Deepfake Detection Algorithm Based on Inconsistency Between Inner and Outer Faces. In Image Analysis and Processing - ICIAP 2023 Workshops, Vol. 14365. Springer (2023)
[2] Datareportal, https://datareportal.com/reports/digital-2024-global-overview-report (2024)
[3] FaceApp, https://www.faceapp.com/ (2016)
[4] FakeApp, https://www.faceapp.com/ (2018)
[5] M. S. Rana, M. N. Nobi, B. Murali and A. H. Sung. Deepfake Detection: A Systematic Literature Review, in
    IEEE Access, vol. 10, pp. 25494-25513 (2022), doi: 10.1109/ACCESS.2022.3154404.
[6] Delfino, R.: Pornographic Deepfakes – Revenge Porn's Next Tragic Act – The Case for Federal Criminalization. 88 Fordham L. Rev. 887 (2019).
[7] Dixon, H.B., Jr.: Deepfakes: More frightening than photoshop on steroids. Judges J. 58(3), 35–37 (2019)
[8] Wu, Jiahui and Zhu, Yu and Jiang, Xiaoben and Liu, Yatong and Lin, Jiajun. Local attention and long-distance interaction of rPPG for deepfake detection. The Visual Computer, 40(2) (2024), pp. 1083-1094
[9] Li, Lingzhi and Bao, Jianmin and Zhang, Ting and Yang, Hao and Chen, Dong and Wen, Fang. Face X-ray for more general face forgery detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2020), pp. 5001-5010
[10] Zhang, K. and Zhang, Z. and Li, Z. and Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded
    Convolutional Networks. IEEE Signal Processing Letters 23(10), 1499–1503 (Oct 2016).
[11] David Güera and Edward J. Delp, Deepfake Video Detection Using Recurrent Neural Networks, 15th IEEE
    International Conference on Advanced Video and Signal Based Surveillance (AVSS), 1-6, 2018
[12] Verdoliva, Luisa (2020). Media Forensics and DeepFakes: An Overview. IEEE Journal of Selected Topics in Signal Processing. doi: 10.1109/JSTSP.2020.3002101.
[13] J. Galbally and S. Marcel, "Face Anti-spoofing Based on General Image Quality Assessment," 2014
    22nd International Conference on Pattern Recognition, Stockholm, Sweden, 2014, pp. 1173-1178, doi:
    10.1109/ICPR.2014.211.
[14] Korshunov, Pavel and Marcel, Sébastien. (2018). DeepFakes: a New Threat to Face Recognition? Assessment
    and Detection.
[15] A. Mittal, A. K. Moorthy and A. C. Bovik, "No-Reference Image Quality Assessment in the Spatial Domain," in
    IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, Dec. 2012, doi: 10.1109/TIP.2012.2214050.
[16] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. 2020.
    Leveraging frequency analysis for deep fake image recognition. In Proceedings of the 37th International
    Conference on Machine Learning (ICML’20), Vol. 119. JMLR.org, Article 304, 3247–3258.
[17] Sanderson, Conrad and Lovell, Brian. (2009). Multi-Region Probabilistic Histograms for Robust and Scalable
    Identity Inference. LNCS. 5558. 10.1007/978-3-642-01793-3-21.
[18] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies and M. Niessner, "FaceForensics++: Learning to
    Detect Manipulated Facial Images," 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
    Seoul, Korea (South), 2019, pp. 1-11, doi: 10.1109/ICCV.2019.00009.
[19] Y. Li, X. Yang, P. Sun, H. Qi and S. Lyu, "Celeb-DF: A Large-Scale Challenging Dataset for DeepFake
    Forensics," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA,
    USA, 2020, pp. 3204-3213, doi: 10.1109/CVPR42600.2020.00327.
[20] Dolhansky, Brian and Bitton, Joanna and Pflaum, Ben and Lu, Jikuo and Howes, Russ and Wang, Menglin
    and Ferrer, Cristian. (2020). The DeepFake Detection Challenge (DFDC) Dataset. (arXiv:2006.07397)
[21] Li, L., Bao, J., Yang, H., Chen, D., and Wen, F. (2020). Advancing high fidelity identity swapping for forgery
    detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp.
    5074-5083).
[22] Ding, X., Raziei, Z., Larson, E. C., Olinick, E. V., Krueger, P., and Hahsler, M. (2020). Swapped face detection
    using deep learning and subjective assessment. EURASIP Journal on Information Security, 2020(1), 1-12.
[23] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. Proceedings of the IEEE
    conference on computer vision and pattern recognition. pp. 1251–1258 (2017)
[24] Li, Y., Lyu, S. Exposing deepfake videos by detecting face warping artifacts. IEEE Conference on Computer
    Vision and Pattern Recognition Workshops (CVPRW) (2019)
[25] Bonettini, N., Cannas, E.D., Mandelli, S., Bondi, L., Bestagini, P., Tubaro, S. Video face manipulation detection
    through ensemble of CNNs. 2020 25th international conference on pattern recognition (ICPR). pp. 5012–5019.
    IEEE (2021)
[26] Chanda, K., Ahmed, W., Banik, S. (2024). Deepfake Image Detection for Low and High Quality Images
    for Biometric Face Recognition. In: Nayak, R., Mittal, N., Kumar, M., Polkowski, Z., Khunteta, A. (eds)
    Recent Advancements in Artificial Intelligence . ICRAAI 2023. Innovations in Sustainable Technologies and
    Computing. Springer, Singapore.