DeepFakesON-Phys: DeepFakes Detection based on Heart Rate Estimation

Javier Hernandez-Ortega, Ruben Tolosana, Julian Fierrez, Aythami Morales
Biometrics and Data Pattern Analytics Lab - BiDA Lab, Universidad Autonoma de Madrid
{javier.hernandezo, ruben.tolosana, julian.fierrez, aythami.morales}@uam.es

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This work introduces a novel DeepFake detection framework based on physiological measurement. In particular, we consider information related to the heart rate using remote photoplethysmography (rPPG). rPPG methods analyze video sequences looking for subtle color changes in the human skin, revealing the presence of human blood under the tissues. In this work we investigate to what extent rPPG is useful for the detection of DeepFake videos. The proposed fake detector, named DeepFakesON-Phys, uses a Convolutional Attention Network (CAN), which extracts spatial and temporal information from video frames, analyzing and combining both sources to better detect fake videos. DeepFakesON-Phys has been experimentally evaluated using the latest public databases in the field: Celeb-DF and DFDC. The results achieved, above 98% AUC (Area Under the Curve) on both databases, outperform the state of the art and prove the success of fake detectors based on physiological measurement in detecting the latest DeepFake videos.

Introduction

DeepFakes have become a great public concern recently (Citron 2019; Cellan-Jones 2019). The very popular term "DeepFake" usually refers to a deep learning based technique able to create fake videos by swapping the face of a person with the face of another person. This type of digital manipulation is also known in the literature as Identity Swap, and it is moving forward very fast (Tolosana et al. 2020b).

Currently, most face manipulations are based on popular machine learning techniques such as AutoEncoders (AE) (Kingma and Welling 2013) and Generative Adversarial Networks (GAN) (Goodfellow et al. 2014), achieving in general very realistic visual results, especially in the latest generation of public DeepFakes (Tolosana et al. 2020a) and the present trends (Karras et al. 2020). However, despite the impressive visual results, are current face manipulations also considering the physiological aspects of the human being in the synthesis process?

Physiological measurement has provided very valuable information to many different tasks such as e-learning (Hernandez-Ortega et al. 2020a), health care (McDuff et al. 2015), human-computer interaction (Tan and Nijholt 2010), and security (Marcel et al. 2019), among many others.

In physical face attacks, a.k.a. Presentation Attacks (PAs), real subjects are often impersonated using artifacts such as photographs, videos, and masks (Marcel et al. 2019). Face recognition systems are known to be vulnerable to these attacks unless proper detection methods are implemented (Galbally, Marcel, and Fierrez 2014; Hernandez-Ortega et al. 2019). Some of these detection methods are based on liveness detection, using information such as eye blinking or natural facial micro-expressions (Bharadwaj et al. 2013). Specifically for detecting 3D mask impersonation, one of the most challenging types of attack, estimating the pulse from face videos using remote photoplethysmography (rPPG) has been shown to be an effective countermeasure (Hernandez-Ortega et al. 2018). When applying this technique to a video sequence with a fake face, the estimated heart rate signal is significantly different from the heart rate extracted from a real face (Erdogmus and Marcel 2014).

Seeing the good results achieved by rPPG techniques when dealing with physical 3D face mask attacks, and since DeepFakes are digital manipulations somehow similar to them, in this work we hypothesize that fake detectors based on physiological measurement can also be used against DeepFakes after adapting them properly. DeepFake generation methods have historically tried to mimic the visual appearance of genuine faces.
However, to the best of our knowledge, they do not emulate the physiology of human beings, e.g., heart rate, blood oxygenation, or breath rate, so estimating that type of signal from the video could be a powerful tool for the detection of DeepFakes.

The novelty of this work consists in using rPPG features previously learned for the task of heart rate estimation and adapting them for the detection of DeepFakes by means of a knowledge-transfer process, thus obtaining a novel fake detector based on physiological measurement named DeepFakesON-Phys. In particular, information related to the heart rate is considered to decide whether a video is real or fake. Our physiological detector intends to be a robust solution to the weaknesses of most state-of-the-art DeepFake detectors based on the visual features existing in fake videos (Matern, Riess, and Stamminger 2019; Agarwal and Farid 2019) and also on the artifacts/fingerprints inserted during the synthesis process (Neves et al. 2020), which are highly dependent on a specific fake manipulation technique.

[Figure 1: DeepFakesON-Phys architecture. It comprises two stages: i) a preprocessing step to normalize the video frames, and ii) a Convolutional Attention Network composed of Motion and Appearance Models to better detect fake videos. Both models receive 36x36x3 inputs (the Motion Model the normalized frame difference I(t) - I(t-1), the Appearance Model the normalized frame I(t)) and stack 3x3 2D convolutions with tanh activations (32, 32, then 64 filters, with pool size 2); 1x1 convolutions with sigmoid activation produce attention masks that are combined by element-wise multiplication, followed by average pooling and a final output score in [0,1].]

DeepFakesON-Phys is based on DeepPhys (Chen and McDuff 2018), a deep learning model trained for heart rate estimation from face videos based on rPPG. DeepPhys showed high accuracy even when dealing with challenging conditions such as heterogeneous illumination or low resolution, outperforming classic hand-crafted approaches. We used the architecture of DeepPhys, but made changes to make it suitable for DeepFake detection.
We initialized the weights of the layers of DeepFakesON-Phys with the ones from DeepPhys (meant for heart rate estimation based on rPPG) and adapted them to the new task using fine-tuning. This process allowed us to train our detector without the need for a large number of samples (compared to training it from scratch). Fine-tuning also helped us obtain a model that detects DeepFakes by looking at rPPG-related features in the frames of the face videos.

In this context, the main contributions of our work are:

• An in-depth literature review of DeepFake detection approaches with special emphasis on physiological techniques, including the key aspects of the detection systems, the databases used, and the main results achieved.

• An approach based on physiological measurement to detect DeepFake videos: DeepFakesON-Phys¹. Fig. 1 graphically summarizes the proposed fake detection approach based on the original architecture DeepPhys (Chen and McDuff 2018), a Convolutional Attention Network (CAN) composed of two parallel Convolutional Neural Networks (CNN) able to extract spatial and temporal information from video frames. This architecture is adapted for the detection of DeepFake videos by means of a knowledge-transfer process.

• A thorough experimental assessment of the proposed DeepFakesON-Phys, considering the latest public databases of the 2nd DeepFake generation such as Celeb-DF v2 and DFDC Preview. DeepFakesON-Phys achieves high-accuracy results, outperforming the state of the art. In addition, the results achieved prove that current face manipulation techniques do not pay attention to the heart-rate-related physiological information of the human being when synthesizing fake videos.

¹ https://github.com/BiDAlab/DeepFakesON-Phys

The remainder of the paper is organized as follows. Related Works summarizes previous studies focused on the detection of DeepFakes. Proposed Method: DeepFakesON-Phys describes the proposed DeepFakesON-Phys fake detection approach. Databases summarizes all databases considered in the experimental framework of this study. Experiments describes the experimental protocol and the results achieved in comparison with the state of the art. Finally, Conclusions draws the final conclusions and points out future research lines.

Related Works

Different approaches have been proposed in the literature to detect DeepFake videos. Table 1 shows a comparison of the most relevant approaches in the area, paying special attention to the fake detectors based on physiological measurement. For each study we include information related to the method, classifiers, best performance, and databases used for research. It is important to remark that in some cases different evaluation metrics are considered, e.g., Area Under the Curve (AUC) and Equal Error Rate (EER), which complicates the comparison among studies. Finally, the results highlighted in italics indicate the generalization ability of the detectors against unseen databases, i.e., those databases were not considered for training. Most of these results are extracted from (Li et al. 2020).
Table 1: Comparison of different state-of-the-art fake detectors. Results in italics indicate the generalization capacity of the detectors against unseen databases. FF++ = FaceForensics++, AUC = Area Under the Curve, Acc. = Accuracy, EER = Equal Error Rate.

Study | Method | Classifiers | Best Performance (Database)
(Matern, Riess, and Stamminger 2019) | Visual Features | Logistic Regression, MLP | AUC = 85.1% (Own); AUC = 78.0% (FF++ / DFD); AUC = 66.2% (DFDC Preview); AUC = 55.1% (Celeb-DF)
(Li and Lyu 2019; Li et al. 2020) | Face Warping Features | CNN | AUC = 97.7% (UADFV); AUC = 93.0% (FF++ / DFD); AUC = 75.5% (DFDC Preview); AUC = 64.6% (Celeb-DF)
(Rössler et al. 2019) | Mesoscopic / Steganalysis / Deep Learning Features | CNN | Acc. ≈ 94.0% (FF++ DeepFake, LQ); Acc. ≈ 98.0% (FF++ DeepFake, HQ); Acc. ≈ 100.0% (FF++ DeepFake, RAW); Acc. ≈ 93.0% (FF++ FaceSwap, LQ); Acc. ≈ 97.0% (FF++ FaceSwap, HQ); Acc. ≈ 99.0% (FF++ FaceSwap, RAW)
(Nguyen, Yamagishi, and Echizen 2019) | Deep Learning Features | Capsule Networks | AUC = 61.3% (UADFV); AUC = 96.6% (FF++ / DFD); AUC = 53.3% (DFDC Preview); AUC = 57.5% (Celeb-DF)
(Dang et al. 2020) | Deep Learning Features | CNN + Attention Mechanism | AUC = 99.4%, EER = 3.1% (DFFD)
(Dolhansky et al. 2019) | Deep Learning Features | CNN | Precision = 93.0%, Recall = 8.4% (DFDC Preview)
(Sabir et al. 2019) | Image + Temporal Features | CNN + RNN | AUC = 96.9% (FF++ DeepFake, LQ); AUC = 96.3% (FF++ FaceSwap, LQ)
(Tolosana et al. 2020a) | Facial Regions Features | CNN | AUC = 100.0% (UADFV); AUC = 99.5% (FF++ FaceSwap, HQ); AUC = 91.1% (DFDC Preview); AUC = 83.6% (Celeb-DF)
(Conotter et al. 2014) | Physiological Features | - | Acc. = 100% (Own)
(Li, Chang, and Lyu 2018) | Physiological Features | LRCN | AUC = 99.0% (UADFV)
(Agarwal and Farid 2019) | Physiological Features | SVM | AUC = 96.3% (Own, FaceSwap, HQ)
(Ciftci, Demir, and Yin 2020) | Physiological Features | SVM/CNN | Acc. = 94.9% (FF++ DeepFakes); Acc. = 91.5% (Celeb-DF)
(Jung, Kim, and Kim 2020) | Physiological Features | Distance | Acc. = 87.5% (Own)
(Qi et al. 2020) | Physiological Features | CNN + Attention Mechanism | Acc. = 100.0% (FF++ FaceSwap); Acc. = 100.0% (FF++ DeepFake); Acc. = 64.1% (DFDC Preview)
DeepFakesON-Phys [Ours] | Physiological Features | CAN | AUC = 99.9% (Celeb-DF v2); AUC = 98.2% (DFDC Preview)

The first studies in the area focused on the visual artifacts existing in the 1st generation of fake videos. The authors of (Matern, Riess, and Stamminger 2019) proposed fake detectors based on simple visual artifacts such as eye colour, missing reflections, and missing details in the teeth areas, achieving a final 85.1% AUC.

Approaches based on the detection of face warping artifacts have also been studied in the literature. For example, (Li and Lyu 2019; Li et al. 2020) proposed detection systems based on CNNs to detect the presence of such artifacts in the face and the surrounding areas, being one of the most robust detection approaches against unseen face manipulations.

Undoubtedly, fake detectors based on pure deep learning features are the most popular ones: feeding the networks with as many real/fake videos as possible and letting the networks automatically extract the discriminative features. In general, these fake detectors have achieved very good results using popular network architectures such as Xception (Rössler et al. 2019; Dolhansky et al. 2019), novel ones such as Capsule Networks (Nguyen, Yamagishi, and Echizen 2019), and novel training techniques based on attention mechanisms (Dang et al. 2020).

Fake detectors based on the image and temporal discrepancies across frames have also been proposed in the literature. (Sabir et al. 2019) proposed a Recurrent Convolutional Network similar to (Güera and Delp 2018), trained end-to-end instead of using a pre-trained model. Their proposed detection approach was tested on the FaceForensics++ database (Rössler et al. 2019), achieving AUC results above 96%.

Although most approaches are based on the detection of fake videos using the whole face, in (Tolosana et al. 2020a) the authors evaluated the discriminative power of each facial region using state-of-the-art network architectures, achieving interesting results on DeepFake databases of the 1st and 2nd generations.

Finally, we pay special attention to the fake detectors based on physiological information. The eye blinking rate was studied in (Li, Chang, and Lyu 2018; Jung, Kim, and Kim 2020). (Li, Chang, and Lyu 2018) proposed Long-term Recurrent Convolutional Networks (LRCN) to capture the temporal dependencies existing in human eye blinking.
Their method was evaluated on the UADFV database, achieving a final 99.0% AUC. More recently, (Jung, Kim, and Kim 2020) proposed a different approach named DeepVision. They fused the Fast-HyperFace (Ranjan, Patel, and Chellappa 2017) and EAR (Soukupova and Cech 2016) algorithms to track blinking, achieving an accuracy of 87.5% over an in-house database.

Fake detectors based on the analysis of the way we speak were studied in (Agarwal and Farid 2019), focusing on distinct facial expressions and movements. These features were considered in combination with Support Vector Machines (SVM), achieving a 96.3% AUC over their own database.

Finally, fake detection methods based on the heart rate have also been studied in the literature. One of the first studies in this regard was (Conotter et al. 2014), where the authors preliminarily evaluated the potential of blood flow changes in the face to distinguish between computer generated and real videos. Their proposed approach was evaluated using 12 videos (6 real and 6 fake), concluding that it is possible to use this metric to detect computer generated videos.

Changes in the blood flow have also been studied in (Ciftci, Demir, and Yin 2020; Qi et al. 2020) using DeepFake videos. In (Ciftci, Demir, and Yin 2020), the authors considered rPPG techniques to extract robust biological features. Classifiers based on SVM and CNN were analyzed, achieving final accuracies of 94.9% and 91.5% for the DeepFake videos of FaceForensics++ and Celeb-DF, respectively.
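To make the kind of signal these rPPG-based detectors build on concrete, the following is a minimal sketch (our own illustration, not code from any of the cited works) of a classic hand-crafted rPPG pipeline: spatially average the green channel of the face crops, band-pass the resulting trace to plausible heart-rate frequencies, and read the dominant spectral peak. Function and variable names are ours; only NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_heart_rate(face_frames, fps):
    """Hand-crafted rPPG sketch: mean green-channel trace -> band-pass -> FFT peak.

    face_frames: array of shape (T, H, W, 3), RGB face crops from one video.
    fps: frames per second of the video. Returns the estimate in beats per minute.
    """
    # 1) Spatially average the green channel (strongest blood-volume signal).
    trace = face_frames[:, :, :, 1].mean(axis=(1, 2))
    # 2) Remove the mean level and normalize.
    trace = (trace - trace.mean()) / (trace.std() + 1e-8)
    # 3) Band-pass to the plausible heart-rate range, 0.7-4 Hz (42-240 bpm).
    b, a = butter(3, [0.7, 4.0], btype="bandpass", fs=fps)
    pulse = filtfilt(b, a, trace)
    # 4) Dominant frequency of the filtered signal -> heart rate.
    spectrum = np.abs(np.fft.rfft(pulse))
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]
```

Detectors of this family feed such pulse signals, or features derived from them, to a classifier such as an SVM or a CNN.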
Recently, in (Qi et al. 2020) a more sophisticated fake detector named DeepRhythm was presented. This approach was also based on features extracted using rPPG techniques. DeepRhythm was enhanced through two modules: i) motion-magnified spatial-temporal representation, and ii) dual-spatial-temporal attention. These modules were incorporated in order to provide a better adaptation to dynamically changing faces and various fake types. In general, good results with accuracies of 100% were achieved on the FaceForensics++ database. However, this method suffers from a demanding preprocessing stage, needing a precise detection of 81 facial landmarks and the use of a color magnification algorithm prior to fake detection. Also, poor results were achieved on databases of the 2nd generation such as DFDC Preview (Acc. = 64.1%).

In the present work, in addition to proposing a different DeepFake detection architecture, we enhance previous approaches, e.g., (Qi et al. 2020), by keeping the preprocessing stage as light and robust as possible, composed only of a face detector and frame normalization. To provide an overall picture, we include in Table 1 the results achieved with our proposed DeepFakesON-Phys in comparison with key related works, which shows that we outperform the state of the art on the Celeb-DF v2 and DFDC Preview databases.

Proposed Method: DeepFakesON-Phys

Fig. 1 graphically summarizes the architecture of DeepFakesON-Phys, the proposed fake detector based on heart rate estimation. We hypothesize that rPPG methods should obtain significantly different results when trying to estimate the subjacent heart rate from a video containing a real face compared with a fake face. Since the changes in color and illumination due to oxygen concentration are subtle and invisible to the human eye, we think that most of the existing DeepFake manipulation methods do not consider the physiological aspects of the human being yet.

The initial architecture of DeepFakesON-Phys is based on the DeepPhys model described in (Chen and McDuff 2018), whose objective was to estimate the human heart rate using facial video sequences. The model is based on deep learning and was designed to extract spatio-temporal information from videos, mimicking the behavior of traditional hand-crafted rPPG techniques. Features are extracted through the color changes in users' faces that are caused by the variation of oxygen concentration in the blood. Signal processing methods are also used to isolate the color changes caused by blood from other changes that may be caused by factors such as external illumination, noise, etc.

As can be seen in Fig. 1, after the first preprocessing stage, the Convolutional Attention Network (CAN) is composed of two different CNN branches:

• Motion Model: it is designed to detect changes between consecutive frames, i.e., performing a short-time analysis of the video for detecting fakes. To accomplish this task, the input at a time t consists of a frame computed as the normalized difference of the current frame I(t) and the previous one I(t − 1).

• Appearance Model: it focuses on the analysis of the static information in each video frame. It has the target of providing the Motion Model with information about which points of the current frame may contain the most relevant information for detecting DeepFakes, i.e., a batch of attention masks that are shared at different layers of the CNN. The input of this branch at time t is the raw frame of the video I(t), normalized to zero mean and unit standard deviation.

The attention masks coming from the Appearance Model are shared with the Motion Model at two different points of the CAN. Finally, the output layer of the Motion Model is also the final output of the entire CAN.

In the original architecture (Chen and McDuff 2018), the output stage consisted of a regression layer for estimating the time derivative of the subject's heart rate. In our case, as we do not aim to estimate the pulse of the subject but the presence of a fake face, we change the final regression layer to a classification layer, using a sigmoid activation function to obtain a final score in the [0,1] range for each instant t of the video, related to the probability of the face being real.
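As an illustration of this two-branch design, the following is a minimal PyTorch sketch of a CAN in the spirit of Fig. 1: 36x36x3 inputs, 3x3 convolutions with tanh activations (32 and 64 filters), 1x1 sigmoid attention convolutions shared by element-wise multiplication at two depths, average pooling, and a sigmoid output. It is a sketch under these assumptions, not the released DeepFakesON-Phys implementation; in particular, the original DeepPhys additionally L1-normalizes its attention masks, a detail omitted here.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with tanh activations, as in Fig. 1.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.Tanh(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.Tanh(),
    )

class DeepFakesONPhysSketch(nn.Module):
    """Two-branch CAN sketch: the Appearance branch produces sigmoid attention
    masks (1x1 convolutions) that gate the Motion branch at two depths."""

    def __init__(self):
        super().__init__()
        self.app1, self.app2 = conv_block(3, 32), conv_block(32, 64)
        self.mot1, self.mot2 = conv_block(3, 32), conv_block(32, 64)
        self.att1 = nn.Conv2d(32, 1, kernel_size=1)  # attention mask shared at depth 1
        self.att2 = nn.Conv2d(64, 1, kernel_size=1)  # attention mask shared at depth 2
        self.pool = nn.AvgPool2d(2)
        self.head = nn.Sequential(                   # classification head replacing
            nn.Flatten(),                            # DeepPhys' regression output
            nn.Linear(64 * 9 * 9, 128), nn.Tanh(),
            nn.Linear(128, 1), nn.Sigmoid(),         # score in [0,1]: probability of "real"
        )

    def forward(self, motion, appearance):
        # motion: normalized difference of I(t) and I(t-1); appearance: I(t)
        # standardized to zero mean and unit standard deviation. Both 36x36x3.
        a = self.app1(appearance)                            # (B, 32, 36, 36)
        m = self.mot1(motion) * torch.sigmoid(self.att1(a))  # element-wise gating #1
        a = self.app2(self.pool(a))                          # (B, 64, 18, 18)
        m = self.mot2(self.pool(m)) * torch.sigmoid(self.att2(a))  # gating #2
        return self.head(self.pool(m))                       # (B, 1) per-frame score
```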
Since the original DeepPhys model from (Chen and McDuff 2018) is not publicly available, instead of training a new CAN from scratch we decided to initialize DeepFakesON-Phys with the weights of the model pre-trained for heart rate estimation presented in (Hernandez-Ortega et al. 2020b), which is also an adaptation of DeepPhys but trained using the COHFACE database (Heusch, Anjos, and Marcel 2017). This model also showed high accuracy in the heart rate estimation task using real face videos, so our idea is to take advantage of that acquired knowledge to better train DeepFakesON-Phys through a proper fine-tuning process.

Once we initialized DeepFakesON-Phys with the mentioned weights, we froze the weights of all the layers of the original CAN model apart from the new classification layer and the last fully-connected layer, and we retrained the model. Thanks to this fine-tuning process we take advantage of the weights learned for heart rate estimation, just adapting them for the DeepFake detection task. This way, we make sure that the weights of the convolutional layers keep looking for information related to the heart rate, while the last layers learn how to use that information to detect the existence of DeepFakes.
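A sketch of this knowledge-transfer step, reusing the DeepFakesONPhysSketch module from the previous listing (the checkpoint file name and hyperparameters are hypothetical):

```python
import torch

model = DeepFakesONPhysSketch()

# Initialize with weights learned for heart-rate estimation on COHFACE
# (checkpoint file name is hypothetical). strict=False because the new
# classification head has no counterpart in the heart-rate model.
state = torch.load("heart_rate_can_cohface.pt")
model.load_state_dict(state, strict=False)

# Freeze every layer except the final fully-connected layers ("head").
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("head")

# Fine-tune only the unfrozen parameters with a binary real/fake target per frame.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = torch.nn.BCELoss()
```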
Databases

Two different public databases are considered in the experimental framework of this study: Celeb-DF v2 and DFDC Preview, the two most challenging DeepFake databases to date. Their videos exhibit a large range of variations in aspects such as face sizes (in pixels), lighting conditions (i.e., day, night, etc.), backgrounds, different acquisition scenarios (i.e., indoors and outdoors), distances from the person to the camera, and pose variations, among others. These databases provide enough images (fake and genuine) to fine-tune the original weights meant for heart rate estimation, obtaining new weights also based on rPPG features but adapted for DeepFake detection. Table 2 summarizes the main characteristics of the databases.

Table 2: Identity swap publicly available databases of the 2nd generation considered in our experimental framework.

Database | Real Videos | Fake Videos
Celeb-DF v2 (Li et al. 2020) | 590 (Youtube) | 5,639 (DeepFake)
DFDC Preview (Dolhansky et al. 2019) | 1,131 (Actors) | 4,119 (Unknown)

Celeb-DF v2

The aim of the Celeb-DF v2 database (Li et al. 2020) was to generate fake videos of better visual quality compared with the previous UADFV database. This database consists of 590 real videos extracted from Youtube, corresponding to celebrities with a diverse distribution in terms of gender, age, and ethnic group. Regarding fake videos, a total of 5,639 videos were created by swapping faces using DeepFake technology. The final videos are in MPEG4.0 format.

DFDC Preview

The DFDC database (Dolhansky et al. 2019) is one of the latest public databases, released by Facebook in collaboration with other companies and academic institutions such as Microsoft, Amazon, and MIT. In the present study we consider the DFDC Preview dataset, consisting of 1,131 real videos from 66 paid actors, ensuring realistic variability in gender, skin tone, and age. It is important to remark that no publicly available data or data from social media sites were used to create this dataset, unlike other popular databases. Regarding fake videos, a total of 4,119 videos were created using two different unknown approaches for fake generation. Fake videos were generated by swapping subjects with similar appearances, i.e., similar facial attributes such as skin tone, facial hair, glasses, etc. After a given pairwise model was trained on two identities, the identities were swapped onto the other's videos.

Experiments

Experimental Protocol

The Celeb-DF v2 and DFDC Preview databases have been divided into non-overlapping development and evaluation datasets. It is important to remark that each dataset comprises videos from different identities (both real and fake), unlike some previous studies. This aspect is very important in order to perform a fair evaluation and predict the generalization ability of the fake detection systems against unseen identities. Also, it is important to remark that the evaluation is carried out at frame level as in most previous studies (Tolosana et al. 2020b), not at video level, using the popular AUC and accuracy metrics.

For the Celeb-DF v2 database, we consider real/fake videos of 40 and 19 different identities for the development and evaluation datasets, respectively, whereas for the DFDC Preview database we follow the same experimental protocol proposed in (Dolhansky et al. 2019), as the authors already considered this concern.

Fake Detection Results: DeepFakesON-Phys

This section evaluates the ability of DeepFakesON-Phys to detect the most challenging DeepFake videos of the 2nd generation. Table 3 shows the fake detection performance results achieved in terms of AUC and accuracy over the final evaluation datasets of Celeb-DF v2 and DFDC Preview. It is important to highlight that a separate fake detector is trained for each database.

Table 3: Fake detection performance results in terms of AUC and Accuracy over the final evaluation datasets.

Database | AUC Results (%) | Acc. Results (%)
Celeb-DF v2 | 99.9 | 98.7
DFDC Preview | 98.2 | 94.4

In general, very good results are achieved on both DeepFake databases. For the Celeb-DF v2 database, DeepFakesON-Phys achieves an accuracy of 98.7% and an AUC of 99.9%. Regarding the DFDC Preview database, the results achieved are 94.4% accuracy and 98.2% AUC, similar to those obtained for the Celeb-DF database.

Observing the results, it seems clear that the fake detectors have learnt to distinguish the spatio-temporal differences between the real/fake faces of the Celeb-DF v2 and DFDC Preview databases. Since all the convolutional layers of the proposed fake detector are frozen (the network was originally initialized with the weights of the model trained to predict the heart rate (Hernandez-Ortega et al. 2020b)) and we only train the last fully-connected layers, we can conclude that the proposed detection approach based on physiological measurement successfully uses pulse-related features to distinguish between real and fake faces. These results prove that current face manipulation techniques do not pay attention to the heart-rate-related physiological information of the human being when synthesizing fake videos.
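For reference, the frame-level metrics of Table 3 can be computed as in the following sketch (scikit-learn assumed; labels use 1 for real and 0 for fake, matching a score that grows with the probability of the face being real):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_frame_level(scores, labels):
    """scores: per-frame outputs in [0,1]; labels: 1 = real face, 0 = fake face.

    Returns the AUC, the best accuracy over a grid of thresholds, and the
    threshold that attains it (chosen to maximize evaluation accuracy).
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc = roc_auc_score(labels, scores)
    thresholds = np.linspace(0.0, 1.0, 1001)
    accs = [accuracy_score(labels, (scores >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(accs))
    return auc, accs[best], thresholds[best]
```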
[Figure 2: Examples of successful and failed DeepFake detections on three sample videos (real classified as real, fake classified as real, fake classified as fake). Top: sample frames of the evaluated videos. Bottom: DeepFake score over time [s] for each sample video, against the detection threshold (0.579). For the fake video misclassified as containing a real face, the DeepFake detection scores present a higher mean compared to the case of the fake video correctly classified as a fake.]

Fig. 2 shows some examples of successful and failed detections when evaluating the proposed approach with real/fake faces of Celeb-DF v2. In particular, all the failures correspond to fake faces generated from a particular video, misclassifying them as real faces. Fig. 2 shows a frame from the original real video (top-left), one from a misclassified fake video generated using that scenario (top-middle), and another from a fake video correctly classified as fake and generated using the same real and fake identities but from other source videos (top-right). The detection threshold is the same for all the testing databases and videos, and it has been selected to maximize the accuracy in the evaluation.

Looking at the score distributions along time of the three examples (Fig. 2, bottom), it can be seen that for the real face video (left) the scores are 1 most of the time and always over the detection threshold. However, for the fake videos considered (middle and right), the score changes constantly, making the scores of some fake frames cross the detection threshold, and consequently those frames are misclassified as real. Nevertheless, it is important to remark that these mistakes only happen if we analyze the results at frame level (the traditional approach followed in the literature (Tolosana et al. 2020b)). If we considered an evaluation at video level, DeepFakesON-Phys would be able to detect fake videos by integrating the temporal information available in short-time segments, e.g., in a similar way as described in (Hernandez-Ortega et al. 2018) for continuous face anti-spoofing; see the sketch below.
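As a hint of what such a video-level decision could look like, the following sketch (our own illustrative rule, not an evaluated part of this work) smooths the per-frame scores over short-time windows before thresholding, so that isolated frames crossing the threshold no longer flip the decision:

```python
import numpy as np

def smooth_scores(frame_scores, fps, window_s=2.0):
    """Sliding-window mean of the per-frame scores over `window_s` seconds."""
    w = max(1, int(window_s * fps))
    return np.convolve(frame_scores, np.ones(w) / w, mode="same")

def classify_video(frame_scores, fps, threshold=0.579):
    # Declare the video real if the median smoothed score clears the same
    # frame-level threshold used in Fig. 2 (0.579); the median makes isolated
    # threshold crossings irrelevant to the final decision.
    return np.median(smooth_scores(frame_scores, fps)) >= threshold
```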
We believe that the failures produced in this particular case are caused by interference from external illumination. rPPG methods that use handcrafted features are usually fragile against external artificial illumination in the frequency and power ranges of the normal human heart rate, making it difficult to distinguish those illumination changes from the color changes caused by blood perfusion. In any case, the physiological approach presented in this work is more robust to this kind of illumination perturbation than hand-crafted methods, thanks to the fact that the training process is data-driven, making it possible to identify those interferences through their presence in the training data.

Comparison with the State of the Art

Finally, we compare in Table 4 the results achieved in the present work with other state-of-the-art DeepFake detection approaches: head pose variations (Yang, Li, and Lyu 2019), face warping artifacts (Li et al. 2020), mesoscopic features (Afchar et al. 2018), pure deep learning features (Dang et al. 2020; Tolosana et al. 2020a), and physiological features (Qi et al. 2020; Ciftci, Demir, and Yin 2020). The best results achieved for each database are remarked in bold. Results in italics indicate that the evaluated database was not used for training. Some of these results are extracted from (Li et al. 2020).

Table 4: Comparison of different state-of-the-art fake detectors with our proposed DeepFakesON-Phys. The best results achieved for each database are remarked in bold. Results in italics indicate that the evaluated database (Celeb-DF or DFDC) was not used for training. Values are AUC (%) unless noted as Acc.

Study | Method | Classifiers | Celeb-DF (Li et al. 2020) | DFDC (Dolhansky et al. 2019)
(Yang, Li, and Lyu 2019) | Head Pose Features | SVM | 54.6 | 55.9
(Li et al. 2020) | Face Warping Features | CNN | 64.6 | 75.5
(Afchar et al. 2018) | Mesoscopic Features | CNN | 54.8 | 75.3
(Dang et al. 2020) | Deep Learning Features | CNN + Attention Mechanism | 71.2 | -
(Tolosana et al. 2020a) | Deep Learning Features | CNN | 83.6 | 91.1
(Qi et al. 2020) | Physiological Features | CNN + Attention Mechanism | - | Acc. = 64.1
(Ciftci, Demir, and Yin 2020) | Physiological Features | SVM/CNN | Acc. = 91.5 | -
DeepFakesON-Phys [Ours] | Physiological Features | CNN + Attention Mechanism | AUC = 99.9, Acc. = 98.7 | AUC = 98.2, Acc. = 94.4

Note that the comparison in Table 4 is not always under the same datasets and protocols, and therefore it must be interpreted with care. Despite that, it is patent that the proposed DeepFakesON-Phys has achieved state-of-the-art results on both the Celeb-DF and DFDC Preview databases. In particular, it has further outperformed popular fake detectors based on pure deep learning approaches such as Xception and Capsule Networks (Tolosana et al. 2020a), and also other recent physiological approaches based on SVM/CNN (Ciftci, Demir, and Yin 2020).

Conclusions

This work has evaluated the potential of physiological measurement to detect DeepFake videos. In particular, we have proposed a novel DeepFake detector named DeepFakesON-Phys based on a Convolutional Attention Network (CAN) originally trained for heart rate estimation using remote photoplethysmography (rPPG). The proposed CAN approach consists of two parallel CNN networks that extract and share temporal and spatial information from video frames.

DeepFakesON-Phys has been evaluated using the Celeb-DF v2 and DFDC Preview databases, two of the latest and most challenging DeepFake video databases. Regarding the experimental protocol, each database was divided into development and evaluation datasets, considering different identities in each dataset in order to perform a fair evaluation of the technology.

The soundness and competitiveness of DeepFakesON-Phys have been proven by the very good results achieved, with AUC values of 99.9% and 98.2% for the Celeb-DF and DFDC databases, respectively. These results outperform other state-of-the-art fake detectors based on face warping and pure deep learning features, among others. Finally, the experimental results of this study reveal that current face manipulation techniques do not pay attention to the heart-rate-related or blood-related physiological information when synthesizing fake videos.

Immediate future work may consist in replicating state-of-the-art DeepFake detection works and training them with the same databases as the ones used to train DeepFakesON-Phys, in order to make a fair comparison of accuracy and show the actual performance of our method. Another future line of work will be oriented to the analysis of the robustness of the proposed fake detection approach against face manipulations unseen during the training process (Tolosana et al. 2020b), the temporal integration of frame data (Hernandez-Ortega et al. 2018), and the application of the proposed physiological approach to other face manipulation techniques such as face morphing (Raja et al. 2020).

Acknowledgments

This work has been supported by projects: IDEA-FAST (IMI2-2018-15-two-stage-853981), PRIMA (ITN-2019-860315), TRESPASS-ETN (ITN-2019-860813), BIBECA (RTI2018-101248-B-I00 MINECO/FEDER), and edBB (Universidad Autonoma de Madrid, UAM). J. H.-O. is supported by a PhD fellowship from UAM. R. T. is supported by a Postdoctoral fellowship from CAM/FSE.
References

Afchar, D.; Nozick, V.; Yamagishi, J.; and Echizen, I. 2018. MesoNet: a Compact Facial Video Forgery Detection Network. In Proc. IEEE Int. Workshop on Information Forensics and Security.
Agarwal, S.; and Farid, H. 2019. Protecting World Leaders Against Deep Fakes. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition Workshops.
Bharadwaj, S.; Dhamecha, T. I.; Vatsa, M.; and Singh, R. 2013. Computationally Efficient Face Spoofing Detection with Motion Magnification. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops.
Cellan-Jones, R. 2019. Deepfake Videos Double in Nine Months. URL https://www.bbc.com/news/technology-49961089.
Chen, W.; and McDuff, D. 2018. DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks. In Proc. European Conf. on Computer Vision, 349–365.
Ciftci, U. A.; Demir, I.; and Yin, L. 2020. FakeCatcher: Detection of Synthetic Portrait Videos Using Biological Signals. IEEE Trans. on Pattern Analysis and Machine Intelligence.
Citron, D. 2019. How DeepFakes Undermine Truth and Threaten Democracy. URL https://www.ted.com.
Conotter, V.; Bodnari, E.; Boato, G.; and Farid, H. 2014. Physiologically-Based Detection of Computer Generated Faces in Video. In Proc. IEEE Int. Conf. on Image Processing.
Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; and Jain, A. 2020. On the Detection of Digital Face Manipulation. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition.
Dolhansky, B.; Howes, R.; Pflaum, B.; Baram, N.; and Ferrer, C. C. 2019. The Deepfake Detection Challenge (DFDC) Preview Dataset. arXiv preprint:1910.08854.
Erdogmus, N.; and Marcel, S. 2014. Spoofing Face Recognition with 3D Masks. IEEE Transactions on Information Forensics and Security 9(7): 1084–1097.
Galbally, J.; Marcel, S.; and Fierrez, J. 2014. Biometric Anti-Spoofing Methods: A Survey in Face Recognition. IEEE Access 2: 1530–1552.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proc. Advances in Neural Information Processing Systems.
Güera, D.; and Delp, E. 2018. Deepfake Video Detection Using Recurrent Neural Networks. In Proc. Int. Conf. on Advanced Video and Signal Based Surveillance.
Hernandez-Ortega, J.; Daza, R.; Morales, A.; Fierrez, J.; and Tolosana, R. 2020a. Heart Rate Estimation from Face Videos for Student Assessment: Experiments on edBB. In Proc. IEEE Comp. Software and Applications Conf.
Hernandez-Ortega, J.; Fierrez, J.; Morales, A.; and Diaz, D. 2020b. A Comparative Evaluation of Heart Rate Estimation Methods using Face Videos. In Proc. IEEE Intl. Workshop on Medical Computing.
Hernandez-Ortega, J.; Fierrez, J.; Morales, A.; and Galbally, J. 2019. Introduction to Face Presentation Attack Detection. In Handbook of Biometric Anti-Spoofing, 187–206. Springer.
Hernandez-Ortega, J.; Fierrez, J.; Morales, A.; and Tome, P. 2018. Time Analysis of Pulse-Based Face Anti-Spoofing in Visible and NIR. In Proc. IEEE Conf. on Comp. Vision and Pattern Recognition Workshops.
Heusch, G.; Anjos, A.; and Marcel, S. 2017. A Reproducible Study on Remote Heart Rate Measurement. arXiv preprint:1709.00962.
Jung, T.; Kim, S.; and Kim, K. 2020. DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern. IEEE Access 8: 83144–83154.
Karras, T.; et al. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition.
Kingma, D. P.; and Welling, M. 2013. Auto-Encoding Variational Bayes. In Proc. Int. Conf. on Learning Representations.
Li, Y.; Chang, M.; and Lyu, S. 2018. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. In Proc. IEEE Int. Workshop on Information Forensics and Security.
Li, Y.; and Lyu, S. 2019. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops.
Li, Y.; Yang, X.; Sun, P.; Qi, H.; and Lyu, S. 2020. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition.
Marcel, S.; Nixon, M.; Fierrez, J.; and Evans, N. 2019. Handbook of Biometric Anti-Spoofing (2nd Edition). Springer.
Matern, F.; Riess, C.; and Stamminger, M. 2019. Exploiting Visual Artifacts to Expose DeepFakes and Face Manipulations. In Proc. IEEE Winter Applications of Comp. Vision Workshops.
McDuff, D. J.; Estepp, J. R.; Piasecki, A. M.; and Blackford, E. B. 2015. A Survey of Remote Optical Photoplethysmographic Imaging Methods. In Proc. Annual Int. Conf. of the IEEE Engineering in Medicine and Biology Society.
Neves, J.; et al. 2020. GANprintR: Improved Fakes and Evaluation of the State of the Art in Face Manipulation Detection. IEEE Journal of Selected Topics in Signal Processing 14(5): 1038–1048.
Nguyen, H. H.; Yamagishi, J.; and Echizen, I. 2019. Use of a Capsule Network to Detect Fake Images and Videos. arXiv preprint:1910.12467.
Qi, H.; Guo, Q.; Juefei-Xu, F.; Xie, X.; Ma, L.; Feng, W.; Liu, Y.; and Zhao, J. 2020. DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms. arXiv preprint:2006.07634.
Raja, K.; et al. 2020. Morphing Attack Detection - Database, Evaluation Platform and Benchmarking. IEEE Transactions on Information Forensics and Security.
Ranjan, R.; Patel, V. M.; and Chellappa, R. 2017. HyperFace: A Deep Multi-Task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 41(1): 121–135.
Rössler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; and Nießner, M. 2019. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proc. IEEE/CVF Int. Conf. on Comp. Vision.
Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; and Natarajan, P. 2019. Recurrent Convolutional Strategies for Face Manipulation Detection in Videos. In Proc. IEEE/CVF Conf. on Comp. Vision and Pattern Recognition Workshops.
Soukupova, T.; and Cech, J. 2016. Real-Time Eye Blink Detection Using Facial Landmarks. In Proc. Comp. Vision Winter Workshop.
Tan, D.; and Nijholt, A. 2010. Brain-Computer Interfaces and Human-Computer Interaction. In Brain-Computer Interfaces, 3–19. Springer.
Tolosana, R.; Romero-Tapiador, S.; Fierrez, J.; and Vera-Rodriguez, R. 2020a. DeepFakes Evolution: Analysis of Facial Regions and Fake Detection Performance. In Proc. International Conference on Pattern Recognition Workshops.
Tolosana, R.; Vera-Rodriguez, R.; Fierrez, J.; Morales, A.; and Ortega-Garcia, J. 2020b. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. Information Fusion 64: 131–148.
Yang, X.; Li, Y.; and Lyu, S. 2019. Exposing Deep Fakes Using Inconsistent Head Poses. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing.