=Paper=
{{Paper
|id=Vol-3084/paper5
|storemode=property
|title=Detecting Deepfakes with Multi-Metric Loss
|pdfUrl=https://ceur-ws.org/Vol-3084/paper5.pdf
|volume=Vol-3084
|authors=Ziwei Zhang,Xin Li,Rongrong Ni,Yao Zhao
}}
==Detecting Deepfakes with Multi-Metric Loss==
Detecting Deepfakes with Multi-Metric Loss

Ziwei Zhang, Xin Li, Rongrong Ni and Yao Zhao
Beijing Jiaotong University

Abstract
In recent years, DeepFake techniques have advanced to generate forged content so realistic that it could jeopardize personal privacy and national security. We observe a distribution discrepancy between genuine faces and faces manipulated by DeepFake techniques: the embedding vectors of genuine faces are tightly distributed in the embedding space, while tampered faces are comparatively scattered. We therefore propose a novel DeepFake detection method based on Multi-metric Loss. Specifically, real and fake faces are mapped onto an embedding space characterized by intra-class compactness and inter-class separation. Then, by adding Weight-Center Loss to project genuine faces onto a more compact region of the embedding space, the distance between the two types of sample clusters is further expanded, thereby improving the separability of genuine and tampered samples. Moreover, an Adaptive Hardness-aware Expander is designed to further improve the feature description ability of the model, because the metric is always challenged with proper difficulty. Extensive experiments show that our approach achieves state-of-the-art performance on current datasets.

Keywords: Deepfakes, Multi-metric Loss, Adaptive Hardness-aware Expander

1. Introduction

Of all digital media, videos containing digital human faces, especially those involving personal identification information, are the most vulnerable to attack. These attacks are collectively referred to as DeepFake manipulations. Developing effective methods capable of detecting DeepFake videos therefore carries substantial weight. Since existing manipulations tamper with specific areas frame by frame, artifacts and noise appear in the spurious videos, so previous researchers have proposed many handcrafted methods [1, 2, 3, 4] and data-driven methods [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] to find manipulation traces.

Figure 1: DFDC and Celeb-DF dataset distribution visualization by t-SNE. (a) DFDC, (b) Celeb-DF. The projections of real face features are tightly distributed, while the fakes are comparatively scattered.

Due to the uncertain counterfeit methods and manipulation quality in DeepFake videos, spurious data are scattered across the whole feature space. In contrast, genuine human faces concentrate close to a non-linear low-dimensional manifold [17] in the feature space. As shown in Figure 1, the vectors of real faces are tightly distributed, while the fakes are comparatively scattered. We therefore consider that this distribution discrepancy also exists in the embedding space obtained by mapping from the feature space. Existing detection schemes, however, do not consider the distribution discrepancy between the two types of samples.
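For readers who want to reproduce a Figure-1-style visualization from their own embedding vectors, the sketch below shows one way to do it with scikit-learn's t-SNE; the function and variable names are illustrative and are not taken from the paper.

```python
# Minimal sketch of a Figure-1-style visualization: project embedding vectors
# of real and fake faces to 2-D with t-SNE and scatter-plot them.
# Variable names and the data source are illustrative, not the authors' code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_distribution(embeddings: np.ndarray, labels: np.ndarray, title: str):
    """embeddings: (N, D) embedding vectors; labels: (N,) with 0 = real, 1 = fake (assumed convention)."""
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for value, name in [(0, "real"), (1, "fake")]:
        mask = labels == value
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.title(title)
    plt.show()
```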
To this end, we propose a DeepFake detection framework with Multi-metric Loss, as shown in Figure 2. Triplet Loss, Cross-Entropy Loss and Weight-Center Loss together constitute the Multi-metric Loss, acting at different levels and on face sample clusters with different labels (real/fake). Under the restriction of Triplet Loss and Cross-Entropy Loss, real faces and fake faces are mapped onto an embedding space characterized by intra-class compactness and inter-class separation. Then, by adding Weight-Center Loss, the real faces are projected onto a more compact region. The way we excavate the fundamental distinction between the two types of samples is therefore to extend the distance between the two sample clusters in the embedding space, thereby improving the separability of genuine and spurious videos. In the end stage of training, in order to further improve the feature description ability of the model, we design the Adaptive Hardness-aware Expander (AHE). Rigorous experiments on the FaceForensics++ [6], DFDC [18] and Celeb-DF [19] datasets show that the proposed method based on Multi-metric Loss is highly effective and achieves state-of-the-art performance.

International Workshop on Safety & Security of Deep Learning, August 19th–26th, 2021, Montreal-themed Virtual Reality. Contact: rrni@bjtu.edu.cn (R. Ni), yzhao@bjtu.edu.cn (Y. Zhao). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 2: The framework of our method. MTCNN crops the video frames into facial area maps, which are sent to the backbone network to obtain embedding vectors. The Cross-Entropy Loss and Triplet Loss of all original embedding vectors, and the Weight-Center Loss of the genuine embedding vectors, are calculated. In the end stage of training, the Adaptive Hardness-aware Expander continuously synthesizes samples with adaptive hardness.

2. Related works

With the huge risks posed by face forgery technology, there is currently an urgent need for effective DeepFake detection methods. Existing detection techniques mainly fall into two categories: handcrafted and data-driven methods.

Handcrafted Methods. Given the limited face manipulation techniques available at the time, early works achieved DeepFake detection through handcrafted features. These methods mainly exploit eye blinking [1], incomplete details in the eyes and teeth [2], face warping [3] and head poses [4]. With the development of generative adversarial networks (GAN) [20], a variety of tampering technologies have emerged and forged faces have become more realistic. The effectiveness of the earlier handcrafted methods has therefore gradually weakened.

Data-driven Methods. Given the powerful feature representation capabilities of deep neural networks, data-driven methods have received widespread attention. First, classification networks such as MesoNet [5], XceptionNet [6], Capsule network [7], and R3D and C3D [8] were applied to detect fake faces. Zhou et al. [9] then proposed a two-stream neural network to capture tampering artifacts and local noise residuals. An adaptive face weighting layer [10] was designed to focus on forgery details. The model in [11] was trained to mark the blending boundary of forged images. Considering the inconsistent warping left by manipulation across frames, the methods in [12, 13, 14] were proposed. The methods in [15, 16] introduced Deep Metric Learning to DeepFake detection for the first time. Kumar et al. [15] mainly explored the method's effectiveness for detecting videos with a high compression factor. Feng et al. [16] used the difference of the full face image in videos as the feature for DeepFake detection. Although they also mapped data onto the embedding space based on Deep Metric Learning, they simply followed the traditional metric strategy and imposed the same constraint on both types of samples. In our work, considering the distribution discrepancy of real and fake data, different levels of classification constraints are imposed on the two kinds of sample clusters. Specifically, we design the Multi-metric Loss to further widen the distance between the real cluster and the fakes by capturing the fundamental distinction between spurious videos and genuine videos, and the Adaptive Hardness-aware Expander to further improve the feature description ability of the model.
3. Proposed Approach

In this section, we give an overview of our framework. As mentioned above, the embedding vectors of real faces are aggregated in the embedding space, while the fakes are relatively scattered. Motivated by this observation, two key components are integrated into the framework: 1) the Multi-metric Loss is designed to mine the fundamental distinction between real and fake faces so as to improve separability; 2) the Adaptive Hardness-aware Expander is used to further improve the feature description ability of the model. The framework is depicted in Figure 2.

3.1. Multi-metric Loss

Let $\mathcal{X}$ denote the data space, from which we sample a set of facial area maps $X = [x_1, x_2, \cdots, x_N]$. Each data point $x_i$ has a label $l_i \in \{0, 1\}$ representing real or fake. Let $h : \mathcal{X} \rightarrow \mathcal{Y}$ be the mapping from the data space to the feature space, where the extracted feature $y_i$ preserves the semantic characteristics of its corresponding data point $x_i$. The feature is then projected onto the embedding space $\mathcal{Z}$ with the mapping $g : \mathcal{Y} \rightarrow \mathcal{Z}$. Since the projection can be incorporated into the deep network, we can directly learn the mapping $f(\cdot\,; \theta) = g \circ h : \mathcal{X} \rightarrow \mathcal{Z}$ from the data space to the embedding space, where $\theta$ denotes the network parameters.
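As a concrete illustration of the learned mapping $f(\cdot\,; \theta) = g \circ h$, the following minimal PyTorch sketch stacks a ResNet-34 backbone (the backbone reported in Section 4.1) with a linear projection head. The embedding dimension, the L2 normalization of the embeddings, and the auxiliary classification head for the CE branch are our own assumptions, not details given in the paper.

```python
# Sketch of f(.; theta) = g ∘ h: a backbone h extracts features and a
# projection head g maps them to the embedding space Z. Only the overall
# structure (ResNet-34 backbone, Section 4.1) follows the paper.
import torch
import torch.nn as nn
import torchvision.models as models

class EmbeddingNet(nn.Module):
    def __init__(self, embed_dim: int = 128):          # embed_dim is an assumption
        super().__init__()
        backbone = models.resnet34(weights="IMAGENET1K_V1")  # ImageNet pre-training, as in Section 4.1
        backbone.fc = nn.Identity()                     # drop the ImageNet classifier; h: data -> feature space
        self.backbone = backbone
        self.projection = nn.Linear(512, embed_dim)     # g: feature space -> embedding space
        self.classifier = nn.Linear(embed_dim, 2)       # real/fake logits for the CE branch (assumed design)

    def forward(self, x: torch.Tensor):
        features = self.backbone(x)                     # (B, 512)
        embeddings = nn.functional.normalize(self.projection(features), dim=1)  # normalization is an assumption
        logits = self.classifier(embeddings)
        return embeddings, logits
```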
Figure 3: The proposed Multi-metric Loss with Triplet Loss, Cross-Entropy Loss and Weight-Center Loss. Weight-Center Loss, which only acts on the cluster of real samples, imposes a larger penalty on samples that deviate from the center and a smaller penalty on adjacent samples, while continuously updating the center of the real sample cluster.

Our starting point is the data distribution discrepancy: the embedding vectors of real faces are tightly distributed, while the fakes are comparatively scattered. We deem that different levels of classification constraints should be imposed, so as to mine the fundamental distinction between spurious videos and genuine videos, as shown in Figure 3. The Multi-metric Loss is formulated as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{Weight\text{-}Center}} + \beta\, \mathcal{L}_{\mathrm{Triplet}} + \lambda\, \mathcal{L}_{\mathrm{CE}} \qquad (1)$$

3.1.1. Triplet Loss

Under the constraint of Triplet Loss, the mapping from high-dimensional sparse features into low-dimensional dense vectors is learned. Reflected in the embedding space, the distribution of the data is characterized by intra-class compactness and inter-class separation. Let $f(x_a; \theta)$ be the anchor embedding vector. The embedding vectors with the same and different labels relative to $f(x_a; \theta)$ are defined as $f(x_p; \theta)$ and $f(x_n; \theta)$, respectively. Triplet Loss is formulated as follows:

$$\mathcal{L}_{\mathrm{Triplet}} := \left[ S_{an} - S_{ap} + m \right]_{+} \qquad (2)$$

where $S_{ap} = \langle f(x_a; \theta), f(x_p; \theta) \rangle$ is the similarity of the positive pair, $S_{an} = \langle f(x_a; \theta), f(x_n; \theta) \rangle$ is the similarity of the negative pair, $\langle \cdot, \cdot \rangle$ denotes the dot product, and $m$ is the metric margin.

3.1.2. Cross-Entropy Loss

In our approach, Cross-Entropy (CE) Loss and Triplet Loss act jointly. Specifically, CE Loss encourages the separation of real embedding vectors from the fakes. Simultaneously, Triplet Loss is used to achieve intra-class compactness and inter-class separation, so as to initially separate the two types of sample clusters.
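A minimal sketch of Eq. (2), together with the weighted combination of Eq. (1), could look as follows. The batch-averaging and function names are our own choices, while the default coefficients follow the values reported in Section 4.1; the Weight-Center term is sketched in the next subsection.

```python
# Sketch of Eq. (2): a similarity-based triplet loss on dot products of
# embedding vectors, plus the weighted sum of Eq. (1). Illustrative only.
import torch

def triplet_loss(anchor, positive, negative, margin: float = 1.0):
    """Eq. (2): [S_an - S_ap + m]_+ with dot-product similarities, averaged over the batch."""
    s_ap = (anchor * positive).sum(dim=1)   # similarity of the positive pairs
    s_an = (anchor * negative).sum(dim=1)   # similarity of the negative pairs
    return torch.clamp(s_an - s_ap + margin, min=0.0).mean()

def multi_metric_loss(wc_loss, trip_loss, ce_loss, beta: float = 2.0, lam: float = 1.0):
    """Eq. (1) with the coefficients reported in Section 4.1 (beta = 2.0, lambda = 1.0)."""
    return wc_loss + beta * trip_loss + lam * ce_loss
```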
3.1.3. Weight-Center Loss

Considering the distribution discrepancy of genuine and tampered data, we hope to further widen the distance between the two categories of sample clusters by capturing the fundamental distinction between real videos and fake videos. Under the action of Triplet Loss and CE Loss, the network acquires a preliminary classification capability. On this basis, we design the Weight-Center Loss for the real sample cluster to capture the fundamental distinction between the two types of samples.

Some embedding vectors lie far from the center of the real sample cluster; this may be due to certain interference that has nothing to do with judging real and fake videos. Therefore, the Weight-Center Loss is proposed, which only acts on the cluster of real samples. We define a sample that is far from the center of the real sample cluster, compared to the surrounding samples, as a deviating sample. The loss adaptively imposes a larger penalty on deviating samples and a smaller penalty on adjacent samples, while the center of the real sample cluster is continuously updated. Based on these operations, real faces are projected onto a more compact region of the embedding space, so as to broaden the distance between the real sample cluster and the fake sample cluster. The Weight-Center Loss is formulated as follows:

$$\mathcal{L}_{\mathrm{Weight\text{-}Center}} = \frac{1}{\alpha} \log \left[ 1 + \sum_{i \in \mathcal{R}} e^{-\alpha \left( S_{ic} - \lambda \right)} \right] \qquad (3)$$

where $\mathcal{R}$ is the collection of real embedding vectors, $S_{ic}$ is the similarity of the center–sample pair $\{f(x_i; \theta), f(x_c; \theta)\}$, $f(x_i; \theta)$ and $f(x_c; \theta)$ are a real embedding vector and the iterative center, and $\alpha$, $\lambda$ are fixed hyperparameters. It is worth noting that the center is iterated continuously.

Based on [21], we can obtain a generic definition of the penalty weight of a sample pair. The penalty weight of the center–sample pair $\{f(x_i; \theta), f(x_c; \theta)\}$ in $\mathcal{L}_{\mathrm{Weight\text{-}Center}}$ is calculated as follows:

$$w_{ic} = \frac{1}{e^{-\alpha \left( \lambda - S_{ic} \right)} + \sum_{j \in \mathcal{R}, j \neq i} e^{-\alpha \left( S_{jc} - S_{ic} \right)}} \qquad (4)$$

where $f(x_j; \theta)$ is any other embedding vector in the set besides $f(x_i; \theta)$, and $S_{ic}$ and $S_{jc}$ are the similarities of the pairs $\{f(x_i; \theta), f(x_c; \theta)\}$ and $\{f(x_j; \theta), f(x_c; \theta)\}$. Eq. (4) shows that the penalty weight of a sample pair is determined by its relative similarity, obtained by comparing its distance from the center with that of the surrounding samples, which is fundamentally different from Center Loss [22]. According to the relative position relationship within the set, there are two situations, as shown in Figure 3. First, the embedding vector $f(x_i; \theta)$ may be far from the center of the set relative to the other samples $f(x_j; \theta)$, described as the deviating sample in Figure 3 and expressed as $S_{jc} > S_{ic}$. We consider that such an embedding vector contains certain interference that has nothing to do with judging real and fake videos, so a larger penalty weight is imposed. When the embedding vector is closer to the center of the set relative to the other samples, described as the adjacent sample in Figure 3 and expressed as $S_{jc} \leq S_{ic}$, a smaller penalty weight is imposed and the network parameters are fine-tuned to find the features that best represent the fundamental distinction between spurious videos and authentic videos.
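Eq. (3) can be sketched directly as below; the pair weights of Eq. (4) need not be computed explicitly, since they emerge as the gradient weights of Eq. (3) with respect to the similarities, as in [21]. The label convention (0 = real) and the exponential-moving-average update of the center are assumptions on our part; the paper only states that the center is iterated continuously.

```python
# Sketch of the Weight-Center Loss of Eq. (3), acting only on real samples.
# The center update rule (EMA) and the label convention are assumptions.
import torch

class WeightCenterLoss:
    def __init__(self, embed_dim: int, alpha: float = 2.0, lam: float = 1.0, momentum: float = 0.9):
        # alpha follows Section 4.1; lam and momentum are illustrative values
        self.center = torch.zeros(embed_dim)   # iterative center of the real cluster
        self.alpha, self.lam, self.momentum = alpha, lam, momentum

    def __call__(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        real = embeddings[labels == 0]          # keep only real samples (assumed label 0)
        if real.numel() == 0:
            return embeddings.new_zeros(())
        self.center = self.center.to(embeddings.device)
        with torch.no_grad():                   # update the center without tracking gradients
            self.center = self.momentum * self.center + (1 - self.momentum) * real.mean(dim=0)
        s_ic = real @ self.center               # similarities S_ic to the current center
        # Eq. (3): (1/alpha) * log(1 + sum_i exp(-alpha * (S_ic - lambda)))
        return (1.0 / self.alpha) * torch.log1p(torch.exp(-self.alpha * (s_ic - self.lam)).sum())
```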
3.2. Adaptive Hardness-aware Expander

In the end stage of training, the original samples are already well separable under the action of the Multi-metric Loss, and continuing to train on them cannot further improve the model's feature description ability. To address this limitation, we propose the Adaptive Hardness-aware Expander, as shown in Figure 4.

Figure 4: Adaptive Hardness-aware Expander module. The synthetic samples $\tilde{z}^-_1, \tilde{z}^-_2, \ldots, \tilde{z}^-_n$ are generated from the distances within the tuple $\{z, z^+, z^-\}$ and the CE Loss.

We construct the hardness-aware triplet $\{z, z^+, \tilde{z}^-\}$ in the embedding space, where manipulating the distances among samples directly alters the hardness level of the triplet. Only the distance of the negative pair $\{z, \tilde{z}^-\}$ is manipulated; the other samples $z, z^+$ are left unchanged. Reducing the distance between negative pairs raises the hardness level, so that the measurement process stays at an appropriate level of difficulty throughout the training cycle. As shown in Figure 4, to simplify the notation we use $z$, $z^+$ and $z^-$ to denote the anchor embedding vector $f(x_a; \theta)$, the positive embedding vector $f(x_p; \theta)$ and the negative embedding vector $f(x_n; \theta)$, respectively. First, a toy example that constructs an augmented, harder negative sample $\tilde{z}^-$ by linear interpolation is presented:

$$\tilde{z}^- = z + w \left( z^- - z \right), \quad w \in [0, 1] \qquad (5)$$

However, samples too close to the anchor are likely to cause label confusion. Therefore, we exploit the CE Loss of the previous section to control the hardness of the generated negative samples, since it is a good indicator of training progress. If the CE Loss is small, the generated negative sample will be closer to the anchor point, but will not cross the positive sample. The Adaptive Hardness-aware Expander can be represented as:

$$\tilde{z}^- = \begin{cases} z + \dfrac{w\, d^- + (1 - w)\, d^+}{d^-} \left( z^- - z \right) & \text{if } d^- > d^+ \\[2mm] z^- & \text{if } d^- \leq d^+ \end{cases} \qquad (6)$$

where $w = e^{-\gamma / \mathcal{L}_{\mathrm{CE}}}$ is a balance factor that controls the hardness of the generated negative samples, $\gamma$ is the pulling factor used to balance the scale of $\mathcal{L}_{\mathrm{CE}}$, and $d^+ = \| z^+ - z \|_2$ and $d^- = \| z^- - z \|_2$ are the distances of the positive pair and the negative pair, respectively.

In the early stage of training, the generated hard samples cannot represent related face information, since the embedding space has no accurate semantic structure yet; they may even cause the model to be trained in the wrong direction from the beginning. As training progresses, however, the model grows more tolerant of hard samples, that is, the metric is always challenged with proper difficulty. The Adaptive Hardness-aware Expander can thereby improve the feature description ability of the model.
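Under our reading of Eq. (6), the synthesis of harder negatives can be written in a few lines. The batched formulation and the value of $\gamma$ below are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Eq. (6): synthesize a harder negative z~- by pulling z- towards
# the anchor z, with the hardness controlled by the current CE loss.
import torch

def expand_hard_negative(z, z_pos, z_neg, ce_loss, gamma: float = 1.0):
    """z, z_pos, z_neg: (B, D) anchor/positive/negative embeddings; ce_loss: scalar tensor."""
    d_pos = (z_pos - z).norm(dim=1, keepdim=True)        # d+ = ||z+ - z||_2
    d_neg = (z_neg - z).norm(dim=1, keepdim=True)        # d- = ||z- - z||_2
    w = torch.exp(-gamma / ce_loss.clamp_min(1e-8))      # balance factor w = exp(-gamma / L_CE)
    pulled = z + (w * d_neg + (1 - w) * d_pos) / d_neg.clamp_min(1e-8) * (z_neg - z)
    # only expand "easy" triplets (d- > d+); otherwise keep the original negative
    return torch.where(d_neg > d_pos, pulled, z_neg)
```

Note that when the CE Loss is small, $w$ approaches 0 and the synthesized negative lands at roughly distance $d^+$ from the anchor, matching the behavior described above.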
4. Experiments

In this section, we first explore the optimal settings for our approach and then present extensive experimental results to demonstrate the effectiveness of our method.

4.1. Implementation Details

For all real/fake video frames, we use the face extractor MTCNN to detect faces and save the aligned facial images as inputs of size 256 × 256. $\beta$, $\lambda$ in Eq. (1) and $\alpha$ in Eq. (3) are set to 2.0, 1.0 and 2.0, respectively, to impose different levels of classification constraints. The margin of the Triplet Loss in Eq. (2) is set to 1.0. Optimization is performed with the SGD optimizer with weight decay 5e-4. The initial learning rate is 0.01 and is divided by 10 after every 3000 iterations. We adopt ResNet-34, pre-trained on the ImageNet dataset, as the backbone network. Our model is trained on 4 RTX 2080Ti GPUs with batch size 16, and the total number of iterations is set to 10,000.
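The optimizer and schedule above translate into the following sketch. The SGD momentum value, the stand-in model, and the loop skeleton are assumptions; the learning rate, weight decay, step size and iteration count follow the reported settings.

```python
# Sketch of the optimizer and learning-rate schedule from Section 4.1
# (SGD, weight decay 5e-4, initial lr 0.01, divided by 10 every 3000 iterations).
import torch
import torchvision

model = torchvision.models.resnet34()  # stand-in for the embedding network; Section 4.1 uses ResNet-34
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)  # momentum assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.1)

for iteration in range(10_000):
    optimizer.zero_grad()
    # ... forward pass, Multi-metric Loss of Eq. (1), loss.backward() ...
    optimizer.step()
    scheduler.step()   # stepped per iteration, so the lr drops every 3000 iterations
```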
4.2. Comparison with Previous Methods

In this section, we compare our method with previous DeepFake detection methods on the FaceForensics++ [6], DFDC [18] and Celeb-DF [19] datasets. We adopt ACC (accuracy) and AUC (area under the Receiver Operating Characteristic curve) as the evaluation metrics.

Table 1: Testing ACC and AUC scores of our method and other methods on the FaceForensics++ dataset.

| Methods | FF++/df ACC | FF++/df AUC | FF++/ff ACC | FF++/ff AUC | FF++/fs ACC | FF++/fs AUC | FF++/nt ACC | FF++/nt AUC |
|---|---|---|---|---|---|---|---|---|
| MesoNet [5] | 0.827 | 0.853 | 0.562 | 0.634 | 0.611 | 0.679 | 0.502 | 0.596 |
| XceptionNet [6] | 0.948 | 0.986 | 0.928 | 0.972 | 0.903 | 0.933 | 0.807 | 0.835 |
| Li et al. [3] | 0.969 | 0.995 | 0.972 | 0.987 | 0.963 | 0.990 | 0.890 | 0.913 |
| Capsule [7] | 0.941 | 0.960 | 0.963 | 0.958 | 0.972 | 0.974 | 0.887 | 0.948 |
| Feng et al. [16] | 0.953 | 0.991 | 0.938 | 0.957 | 0.921 | 0.940 | 0.841 | 0.902 |
| Kumar et al. [15] | 0.960 | 0.990 | 0.932 | 0.962 | 0.944 | 0.978 | 0.832 | 0.872 |
| Bonettini et al. [10] | 0.981 | 0.992 | 0.955 | 0.970 | 0.973 | 0.980 | 0.845 | 0.863 |
| Ours | 0.985 | 0.998 | 0.974 | 0.991 | 0.995 | 1.000 | 0.938 | 0.968 |

Table 2: Testing ACC and AUC scores of our method and other methods on the DFDC and Celeb-DF datasets.

| Methods | DFDC ACC | DFDC AUC | Celeb-DF ACC | Celeb-DF AUC |
|---|---|---|---|---|
| MesoNet [5] | 0.746 | 0.818 | 0.482 | 0.536 |
| XceptionNet [6] | 0.845 | 0.909 | 0.788 | 0.832 |
| Li et al. [3] | 0.793 | 0.861 | 0.571 | 0.628 |
| Capsule [7] | 0.861 | 0.933 | 0.791 | 0.879 |
| Feng et al. [16] | 0.883 | 0.963 | 0.814 | 0.867 |
| Kumar et al. [15] | 0.825 | 0.899 | 0.792 | 0.943 |
| Bonettini et al. [10] | 0.944 | 0.967 | 0.903 | 0.959 |
| Ours | 0.962 | 0.979 | 0.927 | 0.968 |

The evaluation results on the individual datasets are shown in Table 1 and Table 2. They indicate that our model trained with Multi-metric Loss and AHE shows a significant improvement over previous metric-learning methods [15, 16], especially on the DFDC and Celeb-DF datasets. The reason is that different levels of classification constraints, based on the observed distribution discrepancy, are imposed to mine the fundamental distinction between spurious videos and genuine videos, so that the method still works on tampered videos without obvious artifacts. At the same time, the generation of adaptive hardness-aware samples forces the network to pay more attention to key features that characterize the genuine and the counterfeit under the constraint of Multi-metric Loss, thereby improving the feature description ability of the model and achieving better classification performance. As a result, our method achieves state-of-the-art performance on the FaceForensics++, DFDC and Celeb-DF datasets.

4.3. Ablation Study

To verify the effectiveness of Multi-metric Loss and the Adaptive Hardness-aware Expander, we conduct ablation studies; the results are shown in Table 3, Figure 5 and Figure 6.

4.3.1. Effectiveness of Multi-metric Loss

To confirm the effectiveness of Multi-metric Loss, we evaluate how different levels of classification constraints affect the detection accuracy. We train the model on FF++ (c23); the other hyperparameters are kept the same as the settings used for Table 1.

Table 3: Ablation study of Multi-metric Loss (ACC on FF++).

| Loss configuration | df | ff | fs | nt |
|---|---|---|---|---|
| Triplet Loss | 0.946 | 0.925 | 0.939 | 0.810 |
| + Cross-Entropy Loss | 0.962 | 0.933 | 0.942 | 0.870 |
| + Weight-Center Loss | 0.985 | 0.974 | 0.995 | 0.938 |

Figure 5: t-SNE plots of the ablation study on Triplet Loss, Cross-Entropy Loss and Weight-Center Loss in the FF++ dataset (rows: Triplet Loss; + Cross-Entropy Loss; + Weight-Center Loss; columns: FF++/df, FF++/ff, FF++/fs, FF++/nt; real versus fake samples).

The t-SNE plots for the four manipulation methods of the FF++ dataset are reported in Figure 5. The separability of the samples is poor when Triplet Loss acts alone, as shown in the first row of Figure 5; the reason is that the data selection within a batch results in an uneven data distribution, which makes it difficult to find a clear interface dividing the two classes. When Cross-Entropy Loss is introduced, the data distribution for the different manipulations in FF++ is shown in the second row of Figure 5: Cross-Entropy Loss encourages the separation of real embedding vectors from fake embedding vectors, and Triplet Loss helps to constrain the intra-class compactness and inter-class separation, thereby improving the separability of the samples. In the third row of Figure 5, Weight-Center Loss is added; it acts only on the real cluster. By mining the features representing authenticity, the real sample clusters become tightly clustered, further extending the distance between the two types of sample clusters in the embedding space. The ACC ablation results for Multi-metric Loss on FF++ are reported in Table 3, which further confirm its effectiveness.

Note that Triplet Loss and Cross-Entropy Loss work during the entire training stage, while Weight-Center Loss only works in the middle and end stages of training. The main reason is that the center point of the real samples is unstable at the beginning of training, which would cause the network to optimize in the wrong direction.

4.3.2. Effectiveness of AHE

To confirm the effectiveness of the Adaptive Hardness-aware Expander, we analyze the class activation maps for the four different manipulation methods, as shown in Figure 6.

Figure 6: Heatmaps generated by Grad-CAM with and without AHE for the four manipulation methods (rows: tampered frame, Grad-CAM without AHE, Grad-CAM with AHE; columns: FF++/df, FF++/ff, FF++/fs, FF++/nt).

The class activation maps corresponding to the operation of the Expander indicate that the synthetic samples with adaptive hardness force the network to pay more attention to key features that characterize authenticity and counterfeit under the constraint of Multi-metric Loss, thereby improving the feature description ability of the model. For example, in Figure 6, NeuralTextures (nt) is a tampering scheme that only modifies the mouth area. Before the Adaptive Hardness-aware Expander is used, the class activation map shows that the nose and mouth regions together provide evidence that the video is tampered; after the Adaptive Hardness-aware Expander is used, the class activation map shows that the network pays more attention to the tampered mouth area, which demonstrates the interpretability of our proposed method.

5. Conclusion

In this work, we propose a DeepFake detection method based on Multi-metric Loss, motivated by the distribution discrepancy that the embedding vectors of genuine faces are tightly distributed in the embedding space while tampered faces are comparatively scattered. Multi-metric Loss improves the separability of genuine and tampered samples by further widening the distance between the two types of sample clusters. In addition, adaptive hardness-aware samples are generated to keep the metric at a proper level of difficulty, so as to improve the feature description ability of the model. Our method achieves good improvements across extensive metrics.

Acknowledgments

This work was supported in part by the National Key Research and Development of China (2018YFC0807306), National NSF of China (U1936212), and the Beijing Fund-Municipal Education Commission Joint Project (KZ202010015023).
References

[1] Y. Li, M. Chang, S. Lyu, In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking, in: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
[2] F. Matern, C. Riess, M. Stamminger, Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), 2019, pp. 83–92.
[3] Y. Li, S. Lyu, Exposing DeepFake Videos By Detecting Face Warping Artifacts, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019, pp. 46–52.
[4] X. Yang, Y. Li, S. Lyu, Exposing Deep Fakes Using Inconsistent Head Poses, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8261–8265.
[5] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, MesoNet: a Compact Facial Video Forgery Detection Network, in: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
[6] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Niessner, FaceForensics++: Learning to Detect Manipulated Facial Images, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1–11.
[7] H. Nguyen, J. Yamagishi, I. Echizen, Use of a Capsule Network to Detect Fake Images and Videos, 2019. arXiv:1910.12467v2.
[8] I. Ganiyusufoglu, L. M. Ngô, N. Savov, S. Karaoglu, T. Gevers, Spatio-temporal Features for Generalized Detection of Deepfake Videos, 2020. arXiv:2010.11844.
[9] P. Zhou, X. Han, V. I. Morariu, L. S. Davis, Two-Stream Neural Networks for Tampered Face Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
[10] N. Bonettini, E. D. Cannas, S. Mandelli, L. Bondi, P. Bestagini, S. Tubaro, Video Face Manipulation Detection Through Ensemble of CNNs, in: 2020 25th International Conference on Pattern Recognition (ICPR), 2020, pp. 5012–5019.
[11] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face X-Ray for More General Face Forgery Detection, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5000–5009.
[12] D. Güera, E. J. Delp, Deepfake Video Detection Using Recurrent Neural Networks, in: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018, pp. 1–6.
[13] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, P. Natarajan, Recurrent Convolutional Strategies for Face Manipulation Detection in Videos, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 80–87.
[14] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not Made for Each Other: Audio-Visual Dissonance-Based Deepfake Detection and Localization, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 439–447.
[15] A. Kumar, A. Bhavsar, R. Verma, Detecting Deepfakes with Metric Learning, in: 2020 8th International Workshop on Biometrics and Forensics (IWBF), 2020, pp. 1–6.
[16] K. Feng, J. Wu, M. Tian, A Detect method for deepfake video based on full face recognition, in: 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), 2020, pp. 1121–1125.
[17] N. Lei, Z. Luo, S. Yau, X. D. Gu, Geometric Understanding of Deep Learning, 2018. arXiv:1805.10451.
[18] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. Canton, The DeepFake Detection Challenge Dataset, 2020. arXiv:2006.07397.
[19] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3204–3213.
[20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative Adversarial Nets, in: 2014 Annual Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[21] X. Wang, X. Han, W. Huang, D. Dong, M. R. Scott, Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5017–5025.
[22] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A Discriminative Feature Learning Approach for Deep Face Recognition, in: Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, 2016, pp. 499–515.