DeepFakesON-Phys: DeepFakes Detection based on Heart Rate Estimation

Javier Hernandez-Ortega, Ruben Tolosana, Julian Fierrez, Aythami Morales
Biometrics and Data Pattern Analytics Lab - BiDA Lab, Universidad Autonoma de Madrid
{javier.hernandezo, ruben.tolosana, julian.fierrez, aythami.morales}@uam.es

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This work introduces a novel DeepFake detection framework based on physiological measurement. In particular, we consider information related to the heart rate using remote photoplethysmography (rPPG). rPPG methods analyze video sequences looking for subtle color changes in the human skin, revealing the presence of human blood under the tissues. In this work we investigate to what extent rPPG is useful for the detection of DeepFake videos. The proposed fake detector, named DeepFakesON-Phys, uses a Convolutional Attention Network (CAN), which extracts spatial and temporal information from video frames, analyzing and combining both sources to better detect fake videos. DeepFakesON-Phys has been experimentally evaluated using the latest public databases in the field: Celeb-DF and DFDC. The results achieved, above 98% AUC (Area Under the Curve) on both databases, outperform the state of the art and prove the success of fake detectors based on physiological measurement in detecting the latest DeepFake videos.

Introduction

DeepFakes have become a great public concern recently (Citron 2019; Cellan-Jones 2019). The very popular term "DeepFake" usually refers to a deep learning based technique able to create fake videos by swapping the face of a person with the face of another person. This type of digital manipulation is also known in the literature as Identity Swap, and it is moving forward very fast (Tolosana et al. 2020b).

Currently, most face manipulations are based on popular machine learning techniques such as AutoEncoders (AE) (Kingma and Welling 2013) and Generative Adversarial Networks (GAN) (Goodfellow et al. 2014), achieving in general very realistic visual results, especially in the latest generation of public DeepFakes (Tolosana et al. 2020a) and the present trends (Karras et al. 2020). However, despite the impressive visual results, are current face manipulations also considering the physiological aspects of the human being in the synthesis process?

Physiological measurement has provided very valuable information to many different tasks such as e-learning (Hernandez-Ortega et al. 2020a), health care (McDuff et al. 2015), human-computer interaction (Tan and Nijholt 2010), and security (Marcel et al. 2019), among many others.

In physical face attacks, a.k.a. Presentation Attacks (PAs), real subjects are often impersonated using artifacts such as photographs, videos, and masks (Marcel et al. 2019). Face recognition systems are known to be vulnerable to these attacks unless proper detection methods are implemented (Galbally, Marcel, and Fierrez 2014; Hernandez-Ortega et al. 2019). Some of these detection methods are based on liveness detection, using information such as eye blinking or natural facial micro-expressions (Bharadwaj et al. 2013). Specifically for detecting 3D mask impersonation, one of the most challenging types of attack, estimating the pulse from face videos using remote photoplethysmography (rPPG) has been shown to be an effective countermeasure (Hernandez-Ortega et al. 2018). When applying this technique to a video sequence with a fake face, the estimated heart rate signal is significantly different from the heart rate extracted from a real face (Erdogmus and Marcel 2014).

Seeing the good results achieved by rPPG techniques when dealing with physical 3D face mask attacks, and since DeepFakes are digital manipulations somehow similar to them, in this work we hypothesize that fake detectors based on physiological measurement can also be used against DeepFakes after adapting them properly. DeepFake generation methods have historically tried to mimic the visual appearance of genuine faces.
However, to the best of our knowledge, they do not emulate the physiology of human beings, e.g., heart rate, blood oxygenation, or breath rate, so estimating that type of signal from the video could be a powerful tool for the detection of DeepFakes.

The novelty of this work consists in using rPPG features previously learned for the task of heart rate estimation and adapting them for the detection of DeepFakes by means of a knowledge-transfer process, thus obtaining a novel fake detector based on physiological measurement named DeepFakesON-Phys. In particular, information related to the heart rate is considered to decide whether a video is real or fake. Our physiological detector intends to be a robust solution to the weaknesses of most state-of-the-art DeepFake detectors based on the visual features existing in fake videos (Matern, Riess, and Stamminger 2019; Agarwal and Farid 2019) and also on the artifacts/fingerprints inserted during the synthesis process (Neves et al. 2020), which are highly dependent on a specific fake manipulation technique.

[Figure 1: DeepFakesON-Phys architecture. It comprises two stages: i) a preprocessing step to normalize the video frames, and ii) a Convolutional Attention Network composed of Motion and Appearance Models to better detect fake videos. Both models receive 36x36x3 inputs (the Motion Model the normalized frame difference I(t) - I(t-1), the Appearance Model the normalized frame I(t)) and stack 3x3 2D convolutions with tanh activations (32, 32, then 64 filters, with pool size 2); 1x1 convolutions with sigmoid activation produce attention masks that are combined by element-wise multiplication, followed by average pooling and a final output score in [0,1].]

DeepFakesON-Phys is based on DeepPhys (Chen and McDuff 2018), a deep learning model trained for heart rate estimation from face videos based on rPPG. DeepPhys showed high accuracy even when dealing with challenging conditions such as heterogeneous illumination or low resolution, outperforming classic hand-crafted approaches. We used the architecture of DeepPhys, but made changes to make it suitable for DeepFake detection.
We initialized the weights of the layers of DeepFakesON-Phys with the ones from DeepPhys (meant for heart rate estimation based on rPPG) and adapted them to the new task using fine-tuning. This process allowed us to train our detector without the need for a large number of samples (compared to training it from scratch). Fine-tuning also helped us obtain a model that detects DeepFakes by looking at rPPG-related features in the frames of the face videos.

In this context, the main contributions of our work are:

• An in-depth literature review of DeepFake detection approaches with special emphasis on physiological techniques, including the key aspects of the detection systems, the databases used, and the main results achieved.

• An approach based on physiological measurement to detect DeepFake videos: DeepFakesON-Phys¹. Fig. 1 graphically summarizes the proposed fake detection approach based on the original architecture DeepPhys (Chen and McDuff 2018), a Convolutional Attention Network (CAN) composed of two parallel Convolutional Neural Networks (CNN) able to extract spatial and temporal information from video frames. This architecture is adapted for the detection of DeepFake videos by means of a knowledge-transfer process.

• A thorough experimental assessment of the proposed DeepFakesON-Phys, considering the latest public databases of the 2nd DeepFake generation such as Celeb-DF v2 and DFDC Preview. DeepFakesON-Phys achieves high-accuracy results, outperforming the state of the art. In addition, the results achieved prove that current face manipulation techniques do not pay attention to the heart-rate-related physiological information of the human being when synthesizing fake videos.

¹ https://github.com/BiDAlab/DeepFakesON-Phys

The remainder of the paper is organized as follows. Related Works summarizes previous studies focused on the detection of DeepFakes. Proposed Method: DeepFakesON-Phys describes the proposed DeepFakesON-Phys fake detection approach. Databases summarizes all databases considered in the experimental framework of this study. Experiments describes the experimental protocol and the results achieved in comparison with the state of the art. Finally, Conclusions draws the final conclusions and points out future research lines.

Related Works

Different approaches have been proposed in the literature to detect DeepFake videos. Table 1 shows a comparison of the most relevant approaches in the area, paying special attention to the fake detectors based on physiological measurement. For each study we include information related to the method, classifiers, best performance, and databases used for research. It is important to remark that in some cases different evaluation metrics are considered, e.g., Area Under the Curve (AUC) and Equal Error Rate (EER), which complicates the comparison among studies. Finally, the results highlighted in italics indicate the generalization ability of the detectors against unseen databases, i.e., those databases were not considered for training. Most of these results are extracted from (Li et al. 2020).
Table 1: Comparison of different state-of-the-art fake detectors. Results in italics indicate the generalization capacity of the detectors against unseen databases. FF++ = FaceForensics++, AUC = Area Under the Curve, Acc. = Accuracy, EER = Equal Error Rate.

Study | Method | Classifiers | Best Performance (Database)
(Matern, Riess, and Stamminger 2019) | Visual Features | Logistic Regression, MLP | AUC = 85.1% (Own); AUC = 78.0% (FF++ / DFD); AUC = 66.2% (DFDC Preview); AUC = 55.1% (Celeb-DF)
(Li and Lyu 2019; Li et al. 2020) | Face Warping Features | CNN | AUC = 97.7% (UADFV); AUC = 93.0% (FF++ / DFD); AUC = 75.5% (DFDC Preview); AUC = 64.6% (Celeb-DF)
(Rössler et al. 2019) | Mesoscopic / Steganalysis / Deep Learning Features | CNN | Acc. ≈ 94.0% (FF++ DeepFake, LQ); Acc. ≈ 98.0% (FF++ DeepFake, HQ); Acc. ≈ 100.0% (FF++ DeepFake, RAW); Acc. ≈ 93.0% (FF++ FaceSwap, LQ); Acc. ≈ 97.0% (FF++ FaceSwap, HQ); Acc. ≈ 99.0% (FF++ FaceSwap, RAW)
(Nguyen, Yamagishi, and Echizen 2019) | Deep Learning Features | Capsule Networks | AUC = 61.3% (UADFV); AUC = 96.6% (FF++ / DFD); AUC = 53.3% (DFDC Preview); AUC = 57.5% (Celeb-DF)
(Dang et al. 2020) | Deep Learning Features | CNN + Attention Mechanism | AUC = 99.4%, EER = 3.1% (DFFD)
(Dolhansky et al. 2019) | Deep Learning Features | CNN | Precision = 93.0%, Recall = 8.4% (DFDC Preview)
(Sabir et al. 2019) | Image + Temporal Features | CNN + RNN | AUC = 96.9% (FF++ DeepFake, LQ); AUC = 96.3% (FF++ FaceSwap, LQ)
(Tolosana et al. 2020a) | Facial Regions Features | CNN | AUC = 100.0% (UADFV); AUC = 99.5% (FF++ FaceSwap, HQ); AUC = 91.1% (DFDC Preview); AUC = 83.6% (Celeb-DF)
(Conotter et al. 2014) | Physiological Features | - | Acc. = 100% (Own)
(Li, Chang, and Lyu 2018) | Physiological Features | LRCN | AUC = 99.0% (UADFV)
(Agarwal and Farid 2019) | Physiological Features | SVM | AUC = 96.3% (Own, FaceSwap, HQ)
(Ciftci, Demir, and Yin 2020) | Physiological Features | SVM/CNN | Acc. = 94.9% (FF++ DeepFakes); Acc. = 91.5% (Celeb-DF)
(Jung, Kim, and Kim 2020) | Physiological Features | Distance | Acc. = 87.5% (Own)
(Qi et al. 2020) | Physiological Features | CNN + Attention Mechanism | Acc. = 100.0% (FF++ FaceSwap); Acc. = 100.0% (FF++ DeepFake); Acc. = 64.1% (DFDC Preview)
DeepFakesON-Phys [Ours] | Physiological Features | CAN | AUC = 99.9% (Celeb-DF v2); AUC = 98.2% (DFDC Preview)

The first studies in the area focused on the visual artifacts existing in the 1st generation of fake videos. The authors of (Matern, Riess, and Stamminger 2019) proposed fake detectors based on simple visual artifacts such as eye colour, missing reflections, and missing details in the teeth areas, achieving a final 85.1% AUC.

Approaches based on the detection of face warping artifacts have also been studied in the literature. For example, (Li and Lyu 2019; Li et al. 2020) proposed detection systems based on CNNs to detect the presence of such artifacts in the face and the surrounding areas, being one of the most robust detection approaches against unseen face manipulations.

Undoubtedly, fake detectors based on pure deep learning features are the most popular ones: feeding the networks with as many real/fake videos as possible and letting the networks automatically extract the discriminative features. In general, these fake detectors have achieved very good results using popular network architectures such as Xception (Rössler et al. 2019; Dolhansky et al. 2019), novel ones such as Capsule Networks (Nguyen, Yamagishi, and Echizen 2019), and novel training techniques based on attention mechanisms (Dang et al. 2020).

Fake detectors based on the image and temporal discrepancies across frames have also been proposed in the literature. (Sabir et al. 2019) proposed a Recurrent Convolutional Network similar to (Güera and Delp 2018), trained end-to-end instead of using a pre-trained model. Their proposed detection approach was tested on the FaceForensics++ database (Rössler et al. 2019), achieving AUC results above 96%.

Although most approaches are based on the detection of fake videos using the whole face, in (Tolosana et al. 2020a) the authors evaluated the discriminative power of each facial region using state-of-the-art network architectures, achieving interesting results on DeepFake databases of the 1st and 2nd generations.

Finally, we pay special attention to the fake detectors based on physiological information. The eye blinking rate was studied in (Li, Chang, and Lyu 2018; Jung, Kim, and Kim 2020). (Li, Chang, and Lyu 2018) proposed Long-term Recurrent Convolutional Networks (LRCN) to capture the temporal dependencies existing in human eye blinking.
Their method was evaluated on the UADFV database, achieving a final 99.0% AUC. More recently, (Jung, Kim, and Kim 2020) proposed a different approach named DeepVision. They fused the Fast-HyperFace (Ranjan, Patel, and Chellappa 2017) and EAR (Soukupova and Cech 2016) algorithms to track blinking, achieving an accuracy of 87.5% over an in-house database.

Fake detectors based on the analysis of the way we speak were studied in (Agarwal and Farid 2019), focusing on distinct facial expressions and movements. These features were considered in combination with Support Vector Machines (SVM), achieving a 96.3% AUC over their own database.

Finally, fake detection methods based on the heart rate have also been studied in the literature. One of the first studies in this regard was (Conotter et al. 2014), where the authors preliminarily evaluated the potential of blood flow changes in the face to distinguish between computer generated and real videos. Their proposed approach was evaluated using 12 videos (6 real and 6 fake), concluding that it is possible to use this metric to detect computer generated videos.

Changes in the blood flow have also been studied in (Ciftci, Demir, and Yin 2020; Qi et al. 2020) using DeepFake videos. In (Ciftci, Demir, and Yin 2020), the authors considered rPPG techniques to extract robust biological features. Classifiers based on SVM and CNN were analyzed, achieving final accuracies of 94.9% and 91.5% for the DeepFake videos of FaceForensics++ and Celeb-DF, respectively.
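To make the kind of signal these rPPG-based detectors build on concrete, the following is a minimal sketch (our own illustration, not code from any of the cited works) of a classic hand-crafted rPPG pipeline: spatially average the green channel of the face crops, band-pass the resulting trace to plausible heart-rate frequencies, and read the dominant spectral peak. Function and variable names are ours; only NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_heart_rate(face_frames, fps):
    """Hand-crafted rPPG sketch: mean green-channel trace -> band-pass -> FFT peak.

    face_frames: array of shape (T, H, W, 3), RGB face crops from one video.
    fps: frames per second of the video. Returns the estimate in beats per minute.
    """
    # 1) Spatially average the green channel (strongest blood-volume signal).
    trace = face_frames[:, :, :, 1].mean(axis=(1, 2))
    # 2) Remove the mean level and normalize.
    trace = (trace - trace.mean()) / (trace.std() + 1e-8)
    # 3) Band-pass to the plausible heart-rate range, 0.7-4 Hz (42-240 bpm).
    b, a = butter(3, [0.7, 4.0], btype="bandpass", fs=fps)
    pulse = filtfilt(b, a, trace)
    # 4) Dominant frequency of the filtered signal -> heart rate.
    spectrum = np.abs(np.fft.rfft(pulse))
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]
```

Detectors of this family feed such pulse signals, or features derived from them, to a classifier such as an SVM or a CNN.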
Recently, in (Qi et al. 2020) a more sophisticated fake detector named DeepRhythm was presented. This approach was also based on features extracted using rPPG techniques. DeepRhythm was enhanced through two modules: i) motion-magnified spatial-temporal representation, and ii) dual-spatial-temporal attention. These modules were incorporated in order to provide a better adaptation to dynamically changing faces and various fake types. In general, good results with accuracies of 100% were achieved on the FaceForensics++ database. However, this method suffers from a demanding preprocessing stage, needing a precise detection of 81 facial landmarks and the use of a color magnification algorithm prior to fake detection. Also, poor results were achieved on databases of the 2nd generation such as DFDC Preview (Acc. = 64.1%).

In the present work, in addition to proposing a different DeepFake detection architecture, we enhance previous approaches, e.g., (Qi et al. 2020), by keeping the preprocessing stage as light and robust as possible, composed only of a face detector and frame normalization. To provide an overall picture, we include in Table 1 the results achieved with our proposed DeepFakesON-Phys in comparison with key related works, which shows that we outperform the state of the art on the Celeb-DF v2 and DFDC Preview databases.

Proposed Method: DeepFakesON-Phys

Fig. 1 graphically summarizes the architecture of DeepFakesON-Phys, the proposed fake detector based on heart rate estimation. We hypothesize that rPPG methods should obtain significantly different results when trying to estimate the subjacent heart rate from a video containing a real face compared with a fake face. Since the changes in color and illumination due to oxygen concentration are subtle and invisible to the human eye, we think that most of the existing DeepFake manipulation methods do not consider the physiological aspects of the human being yet.

The initial architecture of DeepFakesON-Phys is based on the DeepPhys model described in (Chen and McDuff 2018), whose objective was to estimate the human heart rate using facial video sequences. The model is based on deep learning and was designed to extract spatio-temporal information from videos, mimicking the behavior of traditional hand-crafted rPPG techniques. Features are extracted through the color changes in users' faces that are caused by the variation of oxygen concentration in the blood. Signal processing methods are also used to isolate the color changes caused by blood from other changes that may be caused by factors such as external illumination, noise, etc.

As can be seen in Fig. 1, after the first preprocessing stage, the Convolutional Attention Network (CAN) is composed of two different CNN branches:

• Motion Model: it is designed to detect changes between consecutive frames, i.e., performing a short-time analysis of the video for detecting fakes. To accomplish this task, the input at a time t consists of a frame computed as the normalized difference of the current frame I(t) and the previous one I(t − 1).

• Appearance Model: it focuses on the analysis of the static information in each video frame. It has the target of providing the Motion Model with information about which points of the current frame may contain the most relevant information for detecting DeepFakes, i.e., a batch of attention masks that are shared at different layers of the CNN. The input of this branch at time t is the raw frame of the video I(t), normalized to zero mean and unit standard deviation.

The attention masks coming from the Appearance Model are shared with the Motion Model at two different points of the CAN. Finally, the output layer of the Motion Model is also the final output of the entire CAN.

In the original architecture (Chen and McDuff 2018), the output stage consisted of a regression layer for estimating the time derivative of the subject's heart rate. In our case, as we do not aim to estimate the pulse of the subject but the presence of a fake face, we change the final regression layer to a classification layer, using a sigmoid activation function to obtain a final score in the [0,1] range for each instant t of the video, related to the probability of the face being real.
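As an illustration of this two-branch design, the following is a minimal PyTorch sketch of a CAN in the spirit of Fig. 1: 36x36x3 inputs, 3x3 convolutions with tanh activations (32 and 64 filters), 1x1 sigmoid attention convolutions shared by element-wise multiplication at two depths, average pooling, and a sigmoid output. It is a sketch under these assumptions, not the released DeepFakesON-Phys implementation; in particular, the original DeepPhys additionally L1-normalizes its attention masks, a detail omitted here.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with tanh activations, as in Fig. 1.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.Tanh(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.Tanh(),
    )

class DeepFakesONPhysSketch(nn.Module):
    """Two-branch CAN sketch: the Appearance branch produces sigmoid attention
    masks (1x1 convolutions) that gate the Motion branch at two depths."""

    def __init__(self):
        super().__init__()
        self.app1, self.app2 = conv_block(3, 32), conv_block(32, 64)
        self.mot1, self.mot2 = conv_block(3, 32), conv_block(32, 64)
        self.att1 = nn.Conv2d(32, 1, kernel_size=1)  # attention mask shared at depth 1
        self.att2 = nn.Conv2d(64, 1, kernel_size=1)  # attention mask shared at depth 2
        self.pool = nn.AvgPool2d(2)
        self.head = nn.Sequential(                   # classification head replacing
            nn.Flatten(),                            # DeepPhys' regression output
            nn.Linear(64 * 9 * 9, 128), nn.Tanh(),
            nn.Linear(128, 1), nn.Sigmoid(),         # score in [0,1]: probability of "real"
        )

    def forward(self, motion, appearance):
        # motion: normalized difference of I(t) and I(t-1); appearance: I(t)
        # standardized to zero mean and unit standard deviation. Both 36x36x3.
        a = self.app1(appearance)                            # (B, 32, 36, 36)
        m = self.mot1(motion) * torch.sigmoid(self.att1(a))  # element-wise gating #1
        a = self.app2(self.pool(a))                          # (B, 64, 18, 18)
        m = self.mot2(self.pool(m)) * torch.sigmoid(self.att2(a))  # gating #2
        return self.head(self.pool(m))                       # (B, 1) per-frame score
```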
Since the original DeepPhys model from (Chen and McDuff 2018) is not publicly available, instead of training a new CAN from scratch we decided to initialize DeepFakesON-Phys with the weights of the model pre-trained for heart rate estimation presented in (Hernandez-Ortega et al. 2020b), which is also an adaptation of DeepPhys but trained using the COHFACE database (Heusch, Anjos, and Marcel 2017). This model also showed high accuracy in the heart rate estimation task using real face videos, so our idea is to take advantage of that acquired knowledge to better train DeepFakesON-Phys through a proper fine-tuning process.

Once we initialized DeepFakesON-Phys with the mentioned weights, we froze the weights of all the layers of the original CAN model apart from the new classification layer and the last fully-connected layer, and we retrained the model. Thanks to this fine-tuning process we take advantage of the weights learned for heart rate estimation, just adapting them for the DeepFake detection task. This way, we make sure that the weights of the convolutional layers keep looking for information related to the heart rate, while the last layers learn how to use that information to detect the existence of DeepFakes.
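A sketch of this knowledge-transfer step, reusing the DeepFakesONPhysSketch module from the previous listing (the checkpoint file name and hyperparameters are hypothetical):

```python
import torch

model = DeepFakesONPhysSketch()

# Initialize with weights learned for heart-rate estimation on COHFACE
# (checkpoint file name is hypothetical). strict=False because the new
# classification head has no counterpart in the heart-rate model.
state = torch.load("heart_rate_can_cohface.pt")
model.load_state_dict(state, strict=False)

# Freeze every layer except the final fully-connected layers ("head").
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

# Fine-tune only the unfrozen parameters with a binary real/fake target per frame.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = torch.nn.BCELoss()
```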
Databases

Two different public databases are considered in the experimental framework of this study: Celeb-DF v2 and DFDC Preview, the two most challenging DeepFake databases to date. Their videos exhibit a large range of variations in aspects such as face sizes (in pixels), lighting conditions (i.e., day, night, etc.), backgrounds, different acquisition scenarios (i.e., indoors and outdoors), distances from the person to the camera, and pose variations, among others. These databases provide enough images (fake and genuine) to fine-tune the original weights meant for heart rate estimation, obtaining new weights also based on rPPG features but adapted for DeepFake detection. Table 2 summarizes the main characteristics of the databases.

Table 2: Identity swap publicly available databases of the 2nd generation considered in our experimental framework.

Database | Real Videos | Fake Videos
Celeb-DF v2 (Li et al. 2020) | 590 (Youtube) | 5,639 (DeepFake)
DFDC Preview (Dolhansky et al. 2019) | 1,131 (Actors) | 4,119 (Unknown)

Celeb-DF v2

The aim of the Celeb-DF v2 database (Li et al. 2020) was to generate fake videos of better visual quality compared with the previous UADFV database. This database consists of 590 real videos extracted from Youtube, corresponding to celebrities with a diverse distribution in terms of gender, age, and ethnic group. Regarding fake videos, a total of 5,639 videos were created by swapping faces using DeepFake technology. The final videos are in MPEG4.0 format.

DFDC Preview

The DFDC database (Dolhansky et al. 2019) is one of the latest public databases, released by Facebook in collaboration with other companies and academic institutions such as Microsoft, Amazon, and MIT. In the present study we consider the DFDC Preview dataset, consisting of 1,131 real videos from 66 paid actors, ensuring realistic variability in gender, skin tone, and age. It is important to remark that no publicly available data or data from social media sites were used to create this dataset, unlike other popular databases. Regarding fake videos, a total of 4,119 videos were created using two different unknown approaches for fake generation. Fake videos were generated by swapping subjects with similar appearances, i.e., similar facial attributes such as skin tone, facial hair, glasses, etc. After a given pairwise model was trained on two identities, the identities were swapped onto the other's videos.

Experiments

Experimental Protocol

The Celeb-DF v2 and DFDC Preview databases have been divided into non-overlapping development and evaluation datasets. It is important to remark that each dataset comprises videos from different identities (both real and fake), unlike some previous studies. This aspect is very important in order to perform a fair evaluation and predict the generalization ability of the fake detection systems against unseen identities. Also, it is important to remark that the evaluation is carried out at frame level as in most previous studies (Tolosana et al. 2020b), not at video level, using the popular AUC and accuracy metrics.

For the Celeb-DF v2 database, we consider real/fake videos of 40 and 19 different identities for the development and evaluation datasets, respectively, whereas for the DFDC Preview database we follow the same experimental protocol proposed in (Dolhansky et al. 2019), as the authors already considered this concern.

Fake Detection Results: DeepFakesON-Phys

This section evaluates the ability of DeepFakesON-Phys to detect the most challenging DeepFake videos of the 2nd generation. Table 3 shows the fake detection performance results achieved in terms of AUC and accuracy over the final evaluation datasets of Celeb-DF v2 and DFDC Preview. It is important to highlight that a separate fake detector is trained for each database.

Table 3: Fake detection performance results in terms of AUC and Accuracy over the final evaluation datasets.

Database | AUC Results (%) | Acc. Results (%)
Celeb-DF v2 | 99.9 | 98.7
DFDC Preview | 98.2 | 94.4

In general, very good results are achieved on both DeepFake databases. For the Celeb-DF v2 database, DeepFakesON-Phys achieves an accuracy of 98.7% and an AUC of 99.9%. Regarding the DFDC Preview database, the results achieved are 94.4% accuracy and 98.2% AUC, similar to those obtained for the Celeb-DF database.

Observing the results, it seems clear that the fake detectors have learnt to distinguish the spatio-temporal differences between the real/fake faces of the Celeb-DF v2 and DFDC Preview databases. Since all the convolutional layers of the proposed fake detector are frozen (the network was originally initialized with the weights of the model trained to predict the heart rate (Hernandez-Ortega et al. 2020b)) and we only train the last fully-connected layers, we can conclude that the proposed detection approach based on physiological measurement successfully uses pulse-related features to distinguish between real and fake faces. These results prove that current face manipulation techniques do not pay attention to the heart-rate-related physiological information of the human being when synthesizing fake videos.
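For reference, the frame-level metrics of Table 3 can be computed as in the following sketch (scikit-learn assumed; labels use 1 for real and 0 for fake, matching a score that grows with the probability of the face being real):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_frame_level(scores, labels):
    """scores: per-frame outputs in [0,1]; labels: 1 = real face, 0 = fake face.

    Returns the AUC, the best accuracy over a grid of thresholds, and the
    threshold that attains it (chosen to maximize evaluation accuracy).
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc = roc_auc_score(labels, scores)
    thresholds = np.linspace(0.0, 1.0, 1001)
    accs = [accuracy_score(labels, (scores >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(accs))
    return auc, accs[best], thresholds[best]
```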
[Figure 2: Examples of successful and failed DeepFake detections on three sample videos (real classified as real, fake classified as real, fake classified as fake). Top: sample frames of the evaluated videos. Bottom: DeepFake score over time [s] for each sample video, against the detection threshold (0.579). For the fake video misclassified as containing a real face, the DeepFake detection scores present a higher mean compared to the case of the fake video correctly classified as a fake.]

Fig. 2 shows some examples of successful and failed detections when evaluating the proposed approach with real/fake faces of Celeb-DF v2. In particular, all the failures correspond to fake faces generated from a particular video, misclassifying them as real faces. Fig. 2 shows a frame from the original real video (top-left), one from a misclassified fake video generated using that scenario (top-middle), and another from a fake video correctly classified as fake and generated using the same real and fake identities but from other source videos (top-right). The detection threshold is the same for all the testing databases and videos, and it has been selected to maximize the accuracy in the evaluation.

Looking at the score distributions along time of the three examples (Fig. 2, bottom), it can be seen that for the real face video (left) the scores are 1 most of the time and always over the detection threshold. However, for the fake videos considered (middle and right), the score changes constantly, making the scores of some fake frames cross the detection threshold, and consequently those frames are misclassified as real. Nevertheless, it is important to remark that these mistakes only happen if we analyze the results at frame level (the traditional approach followed in the literature (Tolosana et al. 2020b)). If we considered an evaluation at video level, DeepFakesON-Phys would be able to detect fake videos by integrating the temporal information available in short-time segments, e.g., in a similar way as described in (Hernandez-Ortega et al. 2018) for continuous face anti-spoofing; see the sketch below.
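As a hint of what such a video-level decision could look like, the following sketch (our own illustrative rule, not an evaluated part of this work) smooths the per-frame scores over short-time windows before thresholding, so that isolated frames crossing the threshold no longer flip the decision:

```python
import numpy as np

def smooth_scores(frame_scores, fps, window_s=2.0):
    """Sliding-window mean of the per-frame scores over `window_s` seconds."""
    w = max(1, int(window_s * fps))
    return np.convolve(frame_scores, np.ones(w) / w, mode="same")

def classify_video(frame_scores, fps, threshold=0.579):
    # Declare the video real if the median smoothed score clears the same
    # frame-level threshold used in Fig. 2 (0.579); the median makes isolated
    # threshold crossings irrelevant to the final decision.
    return np.median(smooth_scores(frame_scores, fps)) >= threshold
```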
We believe that the failures produced in this particular case are caused by interference from external illumination. rPPG methods that use handcrafted features are usually fragile against external artificial illumination in the frequency and power ranges of the normal human heart rate, making it difficult to distinguish those illumination changes from the color changes caused by blood perfusion. In any case, the physiological approach presented in this work is more robust to this kind of illumination perturbation than hand-crafted methods, thanks to the fact that the training process is data-driven, making it possible to identify those interferences through their presence in the training data.

Comparison with the State of the Art

Finally, we compare in Table 4 the results achieved in the present work with other state-of-the-art DeepFake detection approaches: head pose variations (Yang, Li, and Lyu 2019), face warping artifacts (Li et al. 2020), mesoscopic features (Afchar et al. 2018), pure deep learning features (Dang et al. 2020; Tolosana et al. 2020a), and physiological features (Qi et al. 2020; Ciftci, Demir, and Yin 2020). The best results achieved for each database are remarked in bold. Results in italics indicate that the evaluated database was not used for training. Some of these results are extracted from (Li et al. 2020).

Table 4: Comparison of different state-of-the-art fake detectors with our proposed DeepFakesON-Phys. The best results achieved for each database are remarked in bold. Results in italics indicate that the evaluated database (Celeb-DF or DFDC) was not used for training. Values are AUC (%) unless noted as Acc.

Study | Method | Classifiers | Celeb-DF (Li et al. 2020) | DFDC (Dolhansky et al. 2019)
(Yang, Li, and Lyu 2019) | Head Pose Features | SVM | 54.6 | 55.9
(Li et al. 2020) | Face Warping Features | CNN | 64.6 | 75.5
(Afchar et al. 2018) | Mesoscopic Features | CNN | 54.8 | 75.3
(Dang et al. 2020) | Deep Learning Features | CNN + Attention Mechanism | 71.2 | -
(Tolosana et al. 2020a) | Deep Learning Features | CNN | 83.6 | 91.1
(Qi et al. 2020) | Physiological Features | CNN + Attention Mechanism | - | Acc. = 64.1
(Ciftci, Demir, and Yin 2020) | Physiological Features | SVM/CNN | Acc. = 91.5 | -
DeepFakesON-Phys [Ours] | Physiological Features | CNN + Attention Mechanism | AUC = 99.9, Acc. = 98.7 | AUC = 98.2, Acc. = 94.4

Note that the comparison in Table 4 is not always under the same datasets and protocols, and therefore it must be interpreted with care. Despite that, it is patent that the proposed DeepFakesON-Phys has achieved state-of-the-art results on both the Celeb-DF and DFDC Preview databases. In particular, it has further outperformed popular fake detectors based on pure deep learning approaches such as Xception and Capsule Networks (Tolosana et al. 2020a), and also other recent physiological approaches based on SVM/CNN (Ciftci, Demir, and Yin 2020).

Conclusions

This work has evaluated the potential of physiological measurement to detect DeepFake videos. In particular, we have proposed a novel DeepFake detector named DeepFakesON-Phys based on a Convolutional Attention Network (CAN) originally trained for heart rate estimation using remote photoplethysmography (rPPG). The proposed CAN approach consists of two parallel CNN networks that extract and share temporal and spatial information from video frames.

DeepFakesON-Phys has been evaluated using the Celeb-DF v2 and DFDC Preview databases, two of the latest and most challenging DeepFake video databases. Regarding the experimental protocol, each database was divided into development and evaluation datasets, considering different identities in each dataset in order to perform a fair evaluation of the technology.

The soundness and competitiveness of DeepFakesON-Phys have been proven by the very good results achieved, with AUC values of 99.9% and 98.2% for the Celeb-DF and DFDC databases, respectively. These results outperform other state-of-the-art fake detectors based on face warping and pure deep learning features, among others. Finally, the experimental results of this study reveal that current face manipulation techniques do not pay attention to the heart-rate-related or blood-related physiological information when synthesizing fake videos.

Immediate future work may consist in replicating state-of-the-art DeepFake detection works and training them with the same databases as the ones used to train DeepFakesON-Phys, in order to make a fair comparison of accuracy and show the actual performance of our method. Another future line of work will be oriented to the analysis of the robustness of the proposed fake detection approach against face manipulations unseen during the training process (Tolosana et al. 2020b), the temporal integration of frame data (Hernandez-Ortega et al. 2018), and the application of the proposed physiological approach to other face manipulation techniques such as face morphing (Raja et al. 2020).

Acknowledgments

This work has been supported by projects: IDEA-FAST (IMI2-2018-15-two-stage-853981), PRIMA (ITN-2019-860315), TRESPASS-ETN (ITN-2019-860813), BIBECA (RTI2018-101248-B-I00 MINECO/FEDER), and edBB (Universidad Autonoma de Madrid, UAM). J. H.-O. is supported by a PhD fellowship from UAM. R. T. is supported by a Postdoctoral fellowship from CAM/FSE.
References

Afchar, D.; Nozick, V.; Yamagishi, J.; and Echizen, I. 2018. MesoNet: a Compact Facial Video Forgery Detection Network. In Proc. IEEE Int. Workshop on Information Forensics and Security.
Agarwal, S.; and Farid, H. 2019. Protecting World Leaders Against Deep Fakes. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops.
Bharadwaj, S.; Dhamecha, T. I.; Vatsa, M.; and Singh, R. 2013. Computationally Efficient Face Spoofing Detection with Motion Magnification. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops.
Cellan-Jones, R. 2019. Deepfake Videos Double in Nine Months. URL https://www.bbc.com/news/technology-49961089.
Chen, W.; and McDuff, D. 2018. DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks. In Proc. European Conf. on Computer Vision, 349–365.
Ciftci, U. A.; Demir, I.; and Yin, L. 2020. FakeCatcher: Detection of Synthetic Portrait Videos Using Biological Signals. IEEE Trans. on Pattern Analysis and Machine Intelligence.
Citron, D. 2019. How DeepFakes Undermine Truth and Threaten Democracy. URL https://www.ted.com.
Conotter, V.; Bodnari, E.; Boato, G.; and Farid, H. 2014. Physiologically-Based Detection of Computer Generated Faces in Video. In Proc. IEEE Int. Conf. on Image Processing.
Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; and Jain, A. 2020. On the Detection of Digital Face Manipulation. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition.
Dolhansky, B.; Howes, R.; Pflaum, B.; Baram, N.; and Ferrer, C. C. 2019. The Deepfake Detection Challenge (DFDC) Preview Dataset. arXiv preprint:1910.08854.
Erdogmus, N.; and Marcel, S. 2014. Spoofing Face Recognition with 3D Masks. IEEE Transactions on Information Forensics and Security 9(7): 1084–1097.
Galbally, J.; Marcel, S.; and Fierrez, J. 2014. Biometric Anti-Spoofing Methods: A Survey in Face Recognition. IEEE Access 2: 1530–1552.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proc. Advances in Neural Information Processing Systems.
Güera, D.; and Delp, E. 2018. Deepfake Video Detection Using Recurrent Neural Networks. In Proc. Int. Conf. on Advanced Video and Signal Based Surveillance.
Hernandez-Ortega, J.; Daza, R.; Morales, A.; Fierrez, J.; and Tolosana, R. 2020a. Heart Rate Estimation from Face Videos for Student Assessment: Experiments on edBB. In Proc. IEEE Comp. Software and Applications Conf.
Hernandez-Ortega, J.; Fierrez, J.; Morales, A.; and Diaz, D. 2020b. A Comparative Evaluation of Heart Rate Estimation Methods using Face Videos. In Proc. IEEE Intl. Workshop on Medical Computing.
Hernandez-Ortega, J.; Fierrez, J.; Morales, A.; and Galbally, J. 2019. Introduction to Face Presentation Attack Detection. In Handbook of Biometric Anti-Spoofing, 187–206. Springer.
Hernandez-Ortega, J.; Fierrez, J.; Morales, A.; and Tome, P. 2018. Time Analysis of Pulse-Based Face Anti-Spoofing in Visible and NIR. In Proc. IEEE Conf. on Comp. Vision and Pattern Recognition Workshops.
Heusch, G.; Anjos, A.; and Marcel, S. 2017. A Reproducible Study on Remote Heart Rate Measurement. arXiv preprint:1709.00962.
Jung, T.; Kim, S.; and Kim, K. 2020. DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern. IEEE Access 8: 83144–83154.
Karras, T.; et al. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition.
Kingma, D. P.; and Welling, M. 2013. Auto-Encoding Variational Bayes. In Proc. Int. Conf. on Learning Representations.
Li, Y.; Chang, M.; and Lyu, S. 2018. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. In Proc. IEEE Int. Workshop on Information Forensics and Security.
Li, Y.; and Lyu, S. 2019. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops.
Li, Y.; Yang, X.; Sun, P.; Qi, H.; and Lyu, S. 2020. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition.
Marcel, S.; Nixon, M.; Fierrez, J.; and Evans, N. 2019. Handbook of Biometric Anti-Spoofing (2nd Edition). Springer.
Matern, F.; Riess, C.; and Stamminger, M. 2019. Exploiting Visual Artifacts to Expose DeepFakes and Face Manipulations. In Proc. IEEE Winter Applications of Comp. Vision Workshops.
McDuff, D. J.; Estepp, J. R.; Piasecki, A. M.; and Blackford, E. B. 2015. A Survey of Remote Optical Photoplethysmographic Imaging Methods. In Proc. Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society.
Neves, J.; et al. 2020. GANprintR: Improved Fakes and Evaluation of the State of the Art in Face Manipulation Detection. IEEE Journal of Selected Topics in Signal Processing 14(5): 1038–1048.
Nguyen, H. H.; Yamagishi, J.; and Echizen, I. 2019. Use of a Capsule Network to Detect Fake Images and Videos. arXiv preprint:1910.12467.
Qi, H.; Guo, Q.; Juefei-Xu, F.; Xie, X.; Ma, L.; Feng, W.; Liu, Y.; and Zhao, J. 2020. DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms. arXiv preprint:2006.07634.
Raja, K.; et al. 2020. Morphing Attack Detection - Database, Evaluation Platform and Benchmarking. IEEE Transactions on Information Forensics and Security.
Ranjan, R.; Patel, V. M.; and Chellappa, R. 2017. HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 41(1): 121–135.
Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proc. IEEE/CVF Int. Conf. on Comp. Vision.
Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; and Natarajan, P. 2019. Recurrent Convolutional Strategies for Face Manipulation Detection in Videos. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops.
Soukupova, T.; and Cech, J. 2016. Real-Time Eye Blink Detection Using Facial Landmarks. In Proc. Comp. Vision Winter Workshop.
Tan, D.; and Nijholt, A. 2010. Brain-Computer Interfaces and Human-Computer Interaction. In Brain-Computer Interfaces, 3–19. Springer.
Tolosana, R.; Romero-Tapiador, S.; Fierrez, J.; and Vera-Rodriguez, R. 2020a. DeepFakes Evolution: Analysis of Facial Regions and Fake Detection Performance. In Proc. International Conference on Pattern Recognition Workshops.
Tolosana, R.; Vera-Rodriguez, R.; Fierrez, J.; Morales, A.; and Ortega-Garcia, J. 2020b. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. Information Fusion 64: 131–148.
Yang, X.; Li, Y.; and Lyu, S. 2019. Exposing Deep Fakes Using Inconsistent Head Poses. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing.