Fake Face Image Detection Using Deep Learning-Based Local and Global Matching

Margarita Favorskaya and Anton Yakimchuk
Reshetnev Siberian State University of Science and Technology, 31 Krasnoyarsky Rabochy ave., Krasnoyarsk, 660037, Russia
EMAIL: favorskaya@sibsau.ru (M. Favorskaya); yakimchuk_aa@sibsau.ru (A. Yakimchuk)
ORCID: 0000-0002-2181-0454 (M. Favorskaya); 0000-0002-6654-9122 (A. Yakimchuk)
SibDATA 2021: The 2nd Siberian Scientific Workshop on Data Analysis Technologies with Applications, June 25, 2021, Krasnoyarsk, Russia

Abstract
The widespread adoption of face recognition systems in practice has provoked multiple attempts to deceive these systems in order to impersonate another person. The range of such fake attacks is wide, and methods that counter one type of attack are often not adapted to other attacks. In this study, we propose a method for detecting fake face images based on local and global matching provided by deep neural networks. We also retain background analysis as a pre-processing stage. The idea is to assess the depth of the face in a still image as one of the main features of liveness, which is not an easy task. The proposed method is directed against presentation attacks and attacks of adversarial perturbations. The experiments were conducted with and without deep neural networks. The use of deep learning increased the true accept rate and significantly reduced the error values.

Keywords
Fake face detection, presentation attacks, attacks of adversarial perturbations, deep learning, local matching, global matching

1. Introduction

Face recognition is one of the best-known biometric methods of identity authentication. It is widely used for the security of organizations and enterprises and for safety in public places such as airport terminals, train stations, stadiums and outdoor surveillance. Research in this area began in the 1990s with traditional machine learning methods (principal component analysis, Bayesian classification and metric models), methods for detecting local features (Gabor filters and Local Binary Patterns (LBPs)) and methods for detecting generalized features, and has since advanced to deep learning techniques. Currently, the accuracy of deep learning-based face recognition has reached 99.80%, while human vision is believed to achieve an accuracy of 97.53% [1].

Since it is quite easy to substitute a face image or present a short video impersonating another person, face recognition systems must include a fake face detection module. This module is usually placed after the face detection and alignment module, but before the visual processing and recognition modules. It is worth noting that fake face detection and face recognition have different target functions. Detection of forgery is associated with the search for artifacts of the "liveness" of the face; therefore, lighting, shadows, glare, scene depth, etc. are of great importance. Face recognition, by contrast, involves minimizing these artifacts and extracting features that are invariant to lighting, pose, emotions, occluding objects, etc.

The aim of our study is to develop a method for detecting fake faces using a single photograph. Our objective is an approach that takes into account the background analysis of an image and the extraction of pseudo-depth parameters from a single photograph using local and global matching provided by deep neural networks. Of course, accurate depth parameters can be estimated with additional expensive devices requiring a fusion of visual, thermal and/or depth information. Our method instead applies algorithmic solutions to the complex case of fake face detection.
The paper is structured as follows. A short literature review is given in Section 2. Section 3 describes the proposed method for detecting fake faces in images based on local and global matching. The results of the conducted experiments are discussed in Section 4. Section 5 concludes the paper.

2. Related work

Currently, there are two widespread types of attacks on face recognition systems: presentation (spoofing) attacks and attacks of adversarial perturbations [2]. Presentation attacks include presenting fake printed images, smartphone images or short video sequences to a facial recognition camera, or disguising a person using cosmetics, makeup or a 3D mask. Masking is the most difficult presentation attack to recognize; attacks with a 3D mask are nearly impossible to identify without additional modalities.

Since the 2010s, most countermeasures against presentation attacks have relied on deep neural networks (earlier, features were extracted manually). Thus, Yang et al. [3] trained a convolutional neural network (CNN) pre-trained on ImageNet to distinguish fake faces from genuine ones using both one frame and five scaled frames. This algorithm required preliminary image alignment using facial landmarks. Binary classification (spoof/genuine) was performed on the CNN output using a support vector machine (SVM). In [4], a two-stream CNN was proposed, where one stream analyzed local fragments of the face, assigning spoofing estimates, and another stream was trained to estimate the depth of the scene using 3D samples. Li et al. [5] proposed a CNN with a more complex architecture, called deep part features from CNN: features partially extracted by the first VGG (Visual Geometry Group) CNN were fed to a second, fine-tuned VGG CNN for classification. An original way to decompose an image into a genuine face and spoofing noise using a CNN was proposed in [6]; in that work, the classification of genuine images was implemented using the noise. The analysis of video sequences provides better detection of fake face images, since in this case artifacts of the "liveness" of the face are available, for example, blinking [7], simple head movements, and so on. Note that CNNs with LSTM layers are traditionally utilized for the analysis of spatio-temporal structures; such an architecture is applied to recognize genuine video sequences in [8]. Some research is aimed at detecting 3D masks [9, 10].

Attacks of adversarial perturbations exploit deep learning models and have therefore appeared relatively recently. An adversarial perturbation is a slight distortion of the input image (for example, in brightness) that is not perceptible to human vision but causes the deep network to produce an incorrect classification.
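To make this notion concrete, the sketch below (our illustration, not part of the surveyed works) generates such a perturbation with the fast gradient sign method; model, image and label are assumed to be PyTorch objects.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, eps=2 / 255):
    """Fast gradient sign method: shift every pixel by eps in the
    direction that increases the classification loss. For small eps the
    change is invisible to a human, yet can flip the network's decision."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + eps * image.grad.sign()   # one signed gradient step
    return adversarial.clamp(0.0, 1.0).detach()     # keep a valid pixel range
```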
Goswami et al. [11] suggested detecting such attacks by analyzing the responses of filters in the hidden layers and eliminating the most problematic filters. The SmartBox software tool for benchmarking algorithms that detect and mitigate adversarial attacks in face recognition systems is presented in [12]. SmartBox supports several attack algorithms, for example, DeepFool and Elastic-Net, as well as defenses against gradient and L2 attacks. Despite some success in confronting this type of attack, adversarial perturbations are constantly becoming more sophisticated and require further improvement of the defense algorithms.

Other, more specific types of attacks can be noted, namely, stealing deep face templates for the purpose of manipulation by third parties. The deconvolutional neural network NbNet was proposed to confront such attacks [13]. Digital manipulation attacks using generative adversarial networks can generate fully or partially modified photorealistic facial images by altering an emotional expression, manipulating attributes or completely synthesizing a face. Thus, adversarial perturbation attacks are directed against the deep neural networks that have proved successful in the face recognition problem. The necessity to protect deep neural networks and deep templates remains a major challenge in face recognition systems.

3. The proposed method

The proposed method is based on several verifications, because different attacks lead to different consequences. The method verifies the face image in two stages before it enters the recognition system. Note that verifying the genuineness of a single face image is a more difficult task than verifying a short video. The background analysis and the local and global matching are described in Sections 3.1 and 3.2, respectively.

3.1. Background analysis

Background analysis is required to assess whether the global brightness and color parameters of a face image correspond to the entire scene or diverge from it. It is difficult to cut a face image out of a photograph without capturing some of its background, so a background near the face that differs from the rest of the scene is a good reason to conduct a more detailed genuineness analysis. For this purpose, a sufficiently large fragment of the scene is segmented, in which the face image occupies no more than 25-30%. The assumption is that while it is quite simple to change the parameters of the face image, it is difficult to change the parameters of the scene background, taking into account the geometric binding of the camera, which is unknown to the attacker. Figure 1 shows examples of capturing faces against the background of the scene; in Figure 1b, the background near the face does not match the background of the scene.

Figure 1: Capturing face images considering the background of the scene: a) without artifacts, b) with artifacts.

CCTV cameras are usually stationary. Therefore, for constructing the scene background model, we can use a Gaussian mixture model (GMM) adapted to changes in lighting and shadows, as well as to temporal/seasonal/meteorological characteristics [14]. In the GMM, the intensity of a pixel is modeled by a mixture of K Gaussian distributions, where K is a small number, and each Gaussian is associated with its own weight. The GMM parameters are updated recursively with every incoming sample. The pixel probability P(X_t) is estimated by Eq. (1), where X_t is the pixel value at time t, K is the number of Gaussian distributions taken into account, w_{j,t} is the weight, \mu_{j,t} is the mean, \Sigma_{j,t} is the covariance matrix of the j-th Gaussian at time t, and \eta is the Gaussian probability density function (PDF):

$$P(X_t) = \sum_{j=1}^{K} w_{j,t}\, \eta\!\left(X_t, \mu_{j,t}, \Sigma_{j,t}\right). \qquad (1)$$

The probability density function \eta is defined by Eq. (2), where n is the dimensionality of X_t:

$$\eta\!\left(X_t, \mu_{j,t}, \Sigma_{j,t}\right) = \frac{1}{(2\pi)^{n/2}\, \left|\Sigma_{j,t}\right|^{1/2}}\; e^{-\frac{1}{2}\left(X_t - \mu_{j,t}\right)^{\mathrm{T}} \Sigma_{j,t}^{-1} \left(X_t - \mu_{j,t}\right)}. \qquad (2)$$

For simplicity, the covariance matrix \Sigma_{j,t} of the j-th component is defined as \sigma_{j,t}^2 I, where I is the identity matrix, under the assumption that the components of X_t (red, green and blue) are independent and have the same deviations. The background distributions have higher probability and lower standard deviations, because the background colors remain the same for a longer time than the foreground objects. This observation drives the GMM update: each incoming pixel is checked against the existing GMM components. If the pixel value is within 2.5 standard deviations of some weighted Gaussian distribution, that distribution is updated. Otherwise, the distribution with the minimum weight is replaced by a new distribution with a high initial variance and a low prior weight.
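A minimal per-pixel sketch of this match-and-update rule is given below for the grayscale case. The learning rate alpha, the initial variance and the prior weight of a new component are illustrative values not specified in the paper, and a full Stauffer-Grimson implementation would additionally rank components by weight/deviation.

```python
import numpy as np

class PixelGMM:
    """Gaussian mixture background model for a single grayscale pixel."""

    def __init__(self, k=3, alpha=0.01, init_var=900.0):
        self.alpha = alpha                    # recursive learning rate (assumed value)
        self.init_var = init_var              # high variance for new components
        self.w = np.full(k, 1.0 / k)          # weights w_{j,t}
        self.mu = np.linspace(0.0, 255.0, k)  # means mu_{j,t}
        self.var = np.full(k, init_var)       # variances sigma^2_{j,t}

    def update(self, x):
        """Match x against the mixture, update it, and report whether x
        was explained by an existing (background) component."""
        matched = np.abs(x - self.mu) < 2.5 * np.sqrt(self.var)
        if matched.any():
            j = int(np.argmax(matched))       # first matching component
            self.mu[j] += self.alpha * (x - self.mu[j])
            self.var[j] += self.alpha * ((x - self.mu[j]) ** 2 - self.var[j])
            self.w *= 1.0 - self.alpha
            self.w[j] += self.alpha
        else:
            j = int(np.argmin(self.w))        # replace the weakest component
            self.mu[j], self.var[j], self.w[j] = x, self.init_var, 0.05
        self.w /= self.w.sum()                # keep the weights normalized
        return bool(matched.any())
```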
3.2. Detecting local and global matching

The analysis of local areas near the face is close to the approach used in [4], but in contrast to it, we use a grid representation of the face image with the size of 3×3 elements. This yields 9 patches, which are analyzed by 9 sub-streams in the form of the simplest CNNs. At the outputs of these CNNs, the values of the entropy and loss functions are estimated for each of the 9 patches, forming the general assessment of the genuineness of the face image. Such local matching is a countermeasure against gradient attacks, which are usually local in nature, and partly against attacks of adversarial perturbations.

The global matching performs the global assessment of the entire face image. Its purpose is to identify 3D features, for which different hardware and software solutions can be used. Hardware solutions include a 3D scanner (for example, Microsoft Kinect) or a stereo camera, which is not always possible in practical applications. Therefore, it is better to focus on software solutions, in particular on a CNN trained to classify the depth of the scene. The local and global matching is performed only if the image has passed the first stage (which rejects the crudest fakes). Moreover, this stage can be implemented as a single network with two streams.

Presentation attacks usually distort image details. Therefore, special attention should be paid to the areas around the eyes, because these areas contain the most detailed information. Our local matching approach is close to [15] and is based on the fully convolutional network (FCN) proposed by Long et al. in 2014 [16]. The FCN is widely used in semantic image segmentation and differs from traditional CNNs in having convolutional layers instead of fully connected layers. Such an architecture turns the network output into a heat map. The loss function has the form of Eq. (3), where p_{i,j}(k) ∈ {0, 1} is the prior (ground-truth) probability, q_{i,j}(k) is the predicted probability, and k is the class label (0 for a genuine image, 1 for a fake one):

$$L_{i,j} = -\sum_{k=0}^{1} p_{i,j}(k) \log q_{i,j}(k). \qquad (3)$$

The general loss function is defined as the sum of the local loss functions over the grid. The CNN builds a 2×n×n probability map, and after summing the values of each n×n map, a 1×2 vector is formed to predict the class. In this case, the decision takes into account the predictions of every local region rather than relying on any dominant region.
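A possible PyTorch rendering of this loss and decision rule is sketched below; broadcasting the image-level label to every grid cell and the tensor shapes are our assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def grid_loss(logits, labels):
    """Sum of the local cross-entropy losses (Eq. 3) over the grid.

    logits: (B, 2, n, n) raw scores per cell (genuine/fake);
    labels: (B,) long tensor, 0 = genuine, 1 = fake.
    """
    b, _, n, _ = logits.shape
    target = labels.view(b, 1, 1).expand(b, n, n)    # one label per grid cell
    # cross_entropy on a 4-D input applies Eq. (3) at every grid cell;
    # reduction='sum' adds up the local losses over the whole grid.
    return F.cross_entropy(logits, target, reduction='sum') / b

def grid_predict(logits):
    """Sum each n x n map into a 1 x 2 score vector and take the argmax,
    so every local region contributes to the decision."""
    return logits.sum(dim=(2, 3)).argmax(dim=1)      # 0 = genuine, 1 = fake
```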
The global matching is the assessment of the entire face image, which partly serves to validate the previous decision. Various representations of the input image are possible, for example, the YCbCr color space, LBP maps, high-frequency components, training on 3D models, etc. The experiments have shown good results for the models based on the transition to the YCbCr color space and the analysis of the high-frequency components of genuine and fake images. For the global matching, an FCN with 6 convolutional layers and 2 pooling layers is also used, and an SVM serves as the classifier. The results of the two streams are then combined, and the final decision on the genuineness of the face image is made.

4. Experimental results

For the experiments, the OULU-NPU dataset [17] and our own dataset were used. The OULU-NPU dataset contains 4950 videos recorded with 6 smartphones. Our own dataset includes around 420 short videos with real faces, printed face images and videos replayed from a tablet. The presentation attacks are of two types: print attacks and replay attacks; for the experiments, print attacks were simulated. The dataset was divided into a training set and a test set in a 70%/30% ratio. The proposed method showed robustness to the presentation attacks and even to the attacks based on adversarial examples.

According to ISO/IEC 30107-3:2017 [18], we calculated the following metrics: the true accept rate (TAR), the attack presentation classification error rate (APCER), which plays the role of the false accept rate (FAR), and the bona fide presentation classification error rate (BPCER), which plays the role of the false reject rate (FRR) in face recognition terms. They are given by Eqs. (4) and (5), where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively:

$$\mathrm{APCER} = \frac{FP}{FP + TN}, \qquad (4)$$

$$\mathrm{BPCER} = \frac{FN}{FN + TP}. \qquad (5)$$
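These rates reduce to simple ratios over the confusion matrix. The helper below is a sketch in which the bona fide (genuine) class is treated as "positive"; that convention is our reading of Eqs. (4)-(5), not stated explicitly in the standard's notation.

```python
def presentation_attack_metrics(tp, fp, tn, fn):
    """TAR, APCER and BPCER from confusion-matrix counts, where the
    positive class is a bona fide (genuine) presentation."""
    apcer = fp / (fp + tn)   # Eq. (4): attacks accepted as genuine (~FAR)
    bpcer = fn / (fn + tp)   # Eq. (5): genuine faces rejected (~FRR)
    tar = tp / (tp + fn)     # true accept rate on genuine presentations
    return tar, apcer, bpcer
```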
Table 1 compares the estimates obtained without and with the deep learning approach; the difference is significant.

Table 1
Estimates of the fake image detection

Types of attacks                         TAR, %      APCER, %    BPCER, %
Without CNN
  Print attacks                          59.3–65.1   10.4–15.7   8.4–9.2
With CNN
  Print attacks                          82.4–89.1   3.6–7.1     1.9–3.5
  Attacks of adversarial perturbations   69.5–75.2   7.5–8.7     4.7–6.2

The experiments show that the accuracy of detecting fake face images reached 82.4–89.1% for the presentation attacks (print attacks) and 69.5–75.2% for the attacks of adversarial perturbations.

Augmentation, i.e. the generation of new data from the existing dataset, makes it quite easy to expand the training set. We applied data augmentation "on-the-fly": new distorted samples were created directly during the training process between learning epochs, without increasing the amount of initial data. The augmentation was carefully implemented using slight distortions of shooting conditions, affine deformations of objects, blur and reflection. This procedure improved the quality of the model and its robustness to noise in the input data. Using augmentation without changing the network architecture, it was possible to increase the accuracy of the fake face detection by 3.4% for print attacks.
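Such an on-the-fly pipeline can be expressed, for instance, with torchvision transforms; the specific parameter values below are illustrative and are not those used in our experiments.

```python
import torchvision.transforms as T

# Applied anew to every sample at every epoch, so no extra data is stored.
train_transform = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2),       # shooting conditions
    T.RandomAffine(degrees=5, translate=(0.05, 0.05),
                   scale=(0.95, 1.05)),                # affine deformation
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),   # blur
    T.RandomHorizontalFlip(p=0.5),                     # reflection
    T.ToTensor(),
])
```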
5. Conclusion

At present, fake face image detection is a necessary procedure for the normal functioning of face recognition systems. This study has shown that there are different approaches to solving the problem; however, protection against various types of attacks reasonably requires a combination of several methods. We offer a two-stage method for verifying the genuineness of a face image before it enters the face recognition system. The first stage is the background analysis, while the second stage is the local and global matching. For the background estimation, a Gaussian mixture model is built, and a two-stream deep neural network is created to assess local and global features. The experiments conducted on the OULU-NPU dataset and our own dataset show an accuracy of 82.4–89.1% for the presentation attacks and 69.5–75.2% for the attacks of adversarial perturbations. Using data augmentation, it was possible to increase the accuracy of detecting the presentation attacks to 85.7–92.5%. However, the processing time does not yet meet real-time requirements, and the algorithms require further refinement.

6. References

[1] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, 2014, pp. 1701–1708.
[2] M. Wang, W. Deng, Deep face recognition: A survey, Neurocomputing 429 (2021) 215–244.
[3] J. Yang, Z. Lei, S. Z. Li, Learn convolutional neural network for face anti-spoofing, arXiv preprint arXiv:1408.5601, 2014, pp. 1–8.
[4] Y. Atoum, Y. Liu, A. Jourabloo, X. Liu, Face anti-spoofing using patch and depth-based CNNs, in: 2017 IEEE International Joint Conference on Biometrics (IJCB), IEEE, Denver, CO, USA, 2017, pp. 319–328.
[5] L. Li, X. Feng, Z. Boulkenafet, Z. Xia, M. Li, A. Hadid, An original face anti-spoofing approach using partial convolutional neural network, in: 6th International Conference on Image Processing Theory, Tools and Applications (IPTA), IEEE, Oulu, Finland, 2016, pp. 1–6.
[6] A. Jourabloo, Y. Liu, X. Liu, Face de-spoofing: Anti-spoofing via noise modeling, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV 2018, LNCS, volume 11217, 2018, pp. 297–315.
[7] K. Patel, H. Han, A. K. Jain, Cross-database face antispoofing with robust feature representation, in: Z. You, J. Zhou, Y. Wang, Z. Sun, S. Shan, W. Zheng, J. Feng, Q. Zhao (Eds.), Biometric Recognition (CCBR 2016), LNCS, volume 9967, 2016, pp. 611–619.
[8] Z. Xu, S. Li, W. Deng, Learning temporal features using LSTM-CNN architecture for face anti-spoofing, in: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), IEEE, Kuala Lumpur, Malaysia, 2015, pp. 141–145.
[9] R. Shao, X. Lan, P. C. Yuen, Deep convolutional dynamic texture learning with adaptive channel-discriminability for 3D mask face anti-spoofing, in: 2017 IEEE International Joint Conference on Biometrics (IJCB), IEEE, Denver, CO, USA, 2017, pp. 748–755.
[10] R. Shao, X. Lan, P. C. Yuen, Joint discriminative learning of deep dynamic textures for 3D mask face anti-spoofing, IEEE Transactions on Information Forensics and Security 14(4) (2019) 923–938.
[11] G. Goswami, N. Ratha, A. Agarwal, R. Singh, M. Vatsa, Unravelling robustness of deep learning based face recognition against adversarial attacks, in: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, volume 32, 2018, pp. 6829–6836.
[12] A. Goel, A. Singh, A. Agarwal, M. Vatsa, R. Singh, SmartBox: Benchmarking adversarial detection and mitigation algorithms for face recognition, in: 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), IEEE, Redondo Beach, CA, USA, 2018, pp. 1–7.
[13] G. Mai, K. Cao, P. C. Yuen, A. K. Jain, On the reconstruction of face images from deep face templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 41(5) (2018) 1188–1202.
[14] M. N. Favorskaya, V. V. Buryachenko, Background extraction method for analysis of natural images captured by camera traps, Information and Control Systems 6 (2018) 35–45.
[15] Y. Ma, L. Wu, Z. Li, F. Liu, A novel face presentation attack detection scheme based on multi-regional convolutional neural networks, Pattern Recognition Letters 131 (2020) 261–267.
[16] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4) (2017) 640–651.
[17] OULU-NPU – a mobile face presentation attack database with real-world variations, URL: https://sites.google.com/site/oulunpudatabase.
[18] ISO/IEC 30107-3:2017 Information technology – Biometric presentation attack detection – Part 3: Testing and reporting, URL: https://www.iso.org/standard/67381.html.