<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Deepfakes with Multi-Metric Loss</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ziwei Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rongrong Ni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yao Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing Jiaotong University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, DeepFake techniques have advanced to generate forged content so realistic that it can jeopardize personal privacy and national security. We observe a distribution discrepancy between genuine faces and faces tampered by DeepFake techniques: the embedding vectors of genuine faces are tightly distributed in the embedding space, while those of tampered faces are comparatively scattered. We therefore propose a novel DeepFake detection method based on Multi-metric Loss. Specifically, real and fake faces are mapped onto an embedding space characterized by intra-class compactness and inter-class separation. Then, by adding Weight-Center Loss to project genuine faces onto a more compact region of the embedding space, the distance between the two types of sample clusters is further expanded, thereby improving the separability of genuine and tampered samples. Moreover, the Adaptive Hardness-aware Expander is designed to further improve the feature description ability of the model, because the metric is always challenged at a proper difficulty. Extensive experiments show that our approach achieves state-of-the-art performance on current datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfakes</kwd>
        <kwd>Multi-metric Loss</kwd>
        <kwd>Adaptive Hardness-aware Expander</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Of various digital media, videos containing digital hu</title>
        <p>man faces, especially the ones involving personal
identiifcation information, are most vulnerable to be attacked.</p>
        <p>These assaults are collectively referred to as DeepFake manipulations. Developing effective methods capable of detecting DeepFake videos therefore carries substantial weight. Since the existing manipulations tamper with specific areas frame by frame, artifacts and noise appear in the spurious videos, so previous researchers have proposed many handcrafted methods [1, 2, 3, 4] and data-driven methods [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] to find manipulation traces.</p>
        <p>Figure 1: DFDC and Celeb-DF dataset distribution visualization by t-SNE. (a) DFDC; (b) Celeb-DF. The projections of real face features are tightly distributed, while the fakes are comparatively scattered.</p>
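The compact-versus-scattered observation in Figure 1 can be checked numerically on embedding vectors. The sketch below is illustrative only: the Gaussian "embeddings" are synthetic stand-ins, not outputs of the actual detector. It compares the mean distance to the cluster centroid for a tight "real" cluster and a dispersed "fake" cluster.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for embedding vectors: "real" faces cluster tightly,
# "fake" faces scatter widely (the discrepancy visualized in Figure 1).
real = rng.normal(loc=0.0, scale=0.1, size=(500, 128))
fake = rng.normal(loc=0.0, scale=1.0, size=(500, 128))

def mean_dist_to_centroid(x):
    """Average Euclidean distance from each vector to the cluster centroid."""
    centroid = x.mean(axis=0)
    return float(np.linalg.norm(x - centroid, axis=1).mean())

spread_real = mean_dist_to_centroid(real)
spread_fake = mean_dist_to_centroid(fake)
assert spread_real < spread_fake  # the real cluster is markedly more compact
```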
        <p>Due to the uncertain counterfeit methods and manipulation quality in DeepFake videos, the spurious data is scattered across the whole feature space. Genuine human faces, by contrast, concentrate close to a non-linear low-dimensional manifold [17] in the feature space. As shown in Figure 1, the vectors of real faces are tightly distributed, while the fakes are comparatively scattered. We therefore consider that this distribution discrepancy also exists in the embedding space obtained by mapping the feature space. Existing detection schemes, however, do not account for the distribution discrepancy between the two types of samples.</p>
        <p>To this end, we propose a DeepFake detection framework with Multi-metric Loss, as shown in Figure 2. Triplet Loss, Cross-Entropy Loss and Weight-Center Loss together constitute the Multi-metric Loss, acting on different levels and on face sample clusters with diverse labels (real/fake). Under the restriction of Triplet Loss and Cross-Entropy Loss, real and fake faces are mapped onto an embedding space characterized by intra-class compactness and inter-class separation. Then, by adding Weight-Center Loss, the real faces are projected to a more compact region. The method thus excavates the fundamental distinction between the two types of samples by extending the distance between the two sample clusters in the embedding space, thereby improving the separability of genuine and spurious videos. In the end-stage of training, to further improve the feature description ability of the model, we design the Adaptive Hardness-aware Expander (AHE). Rigorous experiments on the FaceForensics++ [6], DFDC [18] and Celeb-DF [19] datasets show that the proposed method based on Multi-metric Loss is highly effective and achieves state-of-the-art performance.</p>
        <p>International Workshop on Safety &amp; Security of Deep Learning, August 19th–26th, 2021, Montreal-themed Virtual Reality
rrni@bjtu.edu.cn (R. Ni); yzhao@bjtu.edu.cn (Y. Zhao)</p>
        <p>© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>With the huge risks posed by face forgery technology, there is an urgent need to investigate DeepFake detection methods. Existing detection techniques mainly fall into two categories: handcrafted and data-driven methods.</p>
      <p>Handcrafted Methods. Given the limited face manipulation techniques of the time, early works achieved DeepFake detection through handcrafted features, mainly exploiting eye blinking [1], incomplete details in the eyes and teeth [2], face warping [3] and head poses [4]. With the development of the generative adversarial network (GAN) [20], a variety of tampering technologies have emerged and forged faces have become more realistic, so the effectiveness of the early handcrafted methods has gradually weakened.</p>
      <p>Data-driven Methods. Given the powerful feature representation capabilities of deep neural networks, data-driven methods have received widespread attention. First, classification networks such as MesoNet [5], XceptionNet [6], the Capsule network [7], and R3D and C3D [8] were applied to detect fake faces. Zhou et al. [9] then proposed a two-stream neural network to capture tampering artifacts and local noise residuals. The adaptive face weighting layer [10] was designed to make the network focus on forgery details. The model in [11] was trained to mark the blending boundary of forged images. Considering the inconsistent warping left by manipulation between frames, the methods in [12, 13, 14] were proposed. The methods in [15, 16] introduced Deep Metric Learning to DeepFake detection for the first time. Kumar et al. [15] mainly explored their method's effectiveness for detecting videos with a high compression factor. Feng et al. [16] used the difference of the full face image in videos as the feature for DeepFake detection. Although they also mapped data onto the embedding space based on Deep Metric Learning, they followed the traditional metric strategy and imposed the same constraint on both types of samples. In our work, considering the distribution discrepancy of real and fake data, different levels of classification constraints are imposed on the two kinds of sample clusters. Specifically, we design the Multi-metric Loss to further widen the distance between the real cluster and the fakes by capturing the fundamental distinction between spurious and genuine videos, and the Adaptive Hardness-aware Expander to further improve the feature description ability of the model.</p>
      <p>3. Proposed Approach</p>
      <p>In this section, we give an overview of our framework. As aforementioned, the embedding vectors of real faces are aggregated in the embedding space, while the fakes are relatively scattered. Motivated by this observation, two key components are integrated into the framework: 1) Multi-metric Loss is designed to mine the fundamental distinction between real and fake faces so as to improve separability; 2) the Adaptive Hardness-aware Expander is used to further improve the feature description ability of the model. The framework is depicted in Figure 2.</p>
      <p>3.1. Multi-metric Loss</p>
      <p>Let X denote the data space, from which we sample a set of facial area maps X = [x_1, x_2, ..., x_N]. Each datum x_i has a label y_i ∈ {0, 1} representing real or fake. Let h : X → F be the mapping from the data space to the feature space, where the extracted feature h(x) preserves the semantic characteristics of its corresponding data point x. The feature is then projected onto the embedding space Z with the mapping g : F → Z. Since the projection can be incorporated into the deep network, we can directly learn the mapping f(·; θ) = g ∘ h : X → Z from the data space to the embedding space, where θ denotes the network parameters.</p>
      <p>Based on the data distribution discrepancy, namely that the embedding vectors of real faces are tightly distributed while the fakes are comparatively scattered, we deem that different levels of classification constraints should be imposed, so as to mine the fundamental distinction between spurious and genuine videos, as shown in Figure 3. Multi-metric Loss is formulated as follows:

Loss = ℒ_wh−c + λ ℒ_tri + μ ℒ_ce    (1)</p>
      <p>3.1.1. Triplet Loss</p>
      <p>Under the constraint of Triplet Loss, the mapping from high-dimensional sparse features to low-dimensional dense vectors is learned. Reflected in the embedding space, the distribution of the data is characterized by intra-class compactness and inter-class separation. Let f(x_a; θ) be the anchor embedding vector. The embedding vectors with the same and different labels relative to f(x_a; θ) are defined as f(x_p; θ) and f(x_n; θ), respectively. Triplet Loss is formulated as follows:

ℒ_tri := [S_an − S_ap + γ]_+    (2)

where S_ap = ⟨f(x_a; θ), f(x_p; θ)⟩ indicates the similarity of the positive pair, S_an = ⟨f(x_a; θ), f(x_n; θ)⟩ is the similarity of the negative pair, ⟨·, ·⟩ denotes the dot product, and γ is the metric margin.</p>
      <p>3.1.2. Cross-Entropy Loss</p>
      <p>In our approach, Cross-Entropy (CE) Loss and Triplet Loss act jointly. Specifically, CE Loss encourages the separation of real embedding vectors from the fakes. Simultaneously, Triplet Loss is used to achieve intra-class compactness and inter-class separation, so as to initially separate the two types of sample clusters.</p>
      <p>3.1.3. Weight-Center Loss</p>
      <p>Considering the distribution discrepancy of genuine and tampered data, we hope to further widen the distance between the two categories of sample clusters by capturing the fundamental distinction between real and fake videos. Under the action of Triplet Loss and CE Loss, the network has acquired a preliminary classification capability. On this basis, we design Weight-Center Loss for the real sample cluster to capture the fundamental distinction between the two types of samples.</p>
      <p>Some embedding vectors are far from the center of the real sample cluster, possibly due to interference that has nothing to do with judging real and fake videos. Therefore, Weight-Center Loss is proposed, acting only on the cluster of real samples. We define a sample that is far from the center of the real sample cluster, compared to the surrounding samples, as a deviating sample. The loss adaptively imposes a larger penalty on deviating samples and a smaller penalty on adjacent samples. Simultaneously, the center of the real sample cluster is continuously updated. Through these operations, real faces are projected to a more compact region in the embedding space, so as to broaden the distance between the real sample cluster and the fake sample cluster. Weight-Center Loss is formulated as follows:

ℒ_wh−c = (1/α) log[1 + Σ_{i∈P} e^{−α(S_ic − β)}]    (3)

where P is the collection of real embedding vectors, S_ic is the similarity of the center sample pair {f(x_i; θ), c}, f(x_i; θ) and c are the real embedding vectors and the iterative center, and α, β are fixed hyperparameters. It is worth noting that the center is iterated continuously. Based on [21], we can obtain the generic definition of the penalty weight of a sample pair, and hence the penalty weight of the center sample pair {f(x_i; θ), c}.</p>
      <p>The distances of the negative pairs {x, x̃−} are manipulated, while for the other samples {x, x+} we perform no transformation. The reduction in the distance between negative pairs then raises the hardness level, so that the measurement process is always at an appropriate level of difficulty during the training cycle. As shown in Figure 4, to simplify the representation, we use x, x+ and x− to denote the anchor embedding vector f(x_a; θ), the positive embedding vector f(x_p; θ) and the negative embedding vector f(x_n; θ), respectively.</p>
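A minimal NumPy sketch of Eqs. (1)–(3), assuming dot-product similarity on unit-norm embeddings. The placement of the weights λ (`lam`) and μ (`mu`) in Eq. (1), and the default values of α, β and γ, are assumptions for illustration rather than the paper's exact settings.

```python
import numpy as np

def triplet_loss(a, p, n, gamma=1.0):
    """Eq. (2): [S_an - S_ap + gamma]_+ with dot-product similarities."""
    s_ap = float(np.dot(a, p))  # similarity of the positive pair
    s_an = float(np.dot(a, n))  # similarity of the negative pair
    return max(s_an - s_ap + gamma, 0.0)

def cross_entropy(prob_real, label):
    """Binary CE on the predicted probability that the sample is real."""
    prob = prob_real if label == 1 else 1.0 - prob_real
    return -float(np.log(prob))

def weight_center_loss(reals, center, alpha=2.0, beta=0.5):
    """Eq. (3): (1/alpha) * log(1 + sum_{i in P} exp(-alpha * (S_ic - beta))).
    Acts only on the real-sample cluster; S_ic is the similarity between
    a real embedding f(x_i; theta) and the iterative center c."""
    s_ic = reals @ center
    return float(np.log1p(np.exp(-alpha * (s_ic - beta)).sum()) / alpha)

def multi_metric_loss(a, p, n, reals, center, prob_real, label,
                      lam=2.0, mu=1.0):
    """Eq. (1), assuming lam and mu weight the triplet and CE terms."""
    return (weight_center_loss(reals, center)
            + lam * triplet_loss(a, p, n)
            + mu * cross_entropy(prob_real, label))

# Toy batch of unit-norm embeddings.
rng = np.random.default_rng(1)
unit = lambda v: v / np.linalg.norm(v)
anchor, pos, neg = (unit(rng.normal(size=8)) for _ in range(3))
reals = np.stack([unit(rng.normal(size=8)) for _ in range(4)])
center = unit(reals.mean(axis=0))

loss = multi_metric_loss(anchor, pos, neg, reals, center,
                         prob_real=0.9, label=1)
assert loss > 0.0
```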
      <p>First, a toy example that constructs an augmented harder negative sample x̃− by linear interpolation is presented:

x̃− = x + λ (x− − x),  λ ∈ [0, 1]    (6)</p>
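Equation (6) can be sketched as a one-line interpolation. Moving the synthetic negative x̃− toward the anchor shrinks the negative-pair distance and thus raises the triplet's hardness; note that λ here is the interpolation coefficient of Eq. (6), not the loss weight in Eq. (1).

```python
import numpy as np

def expand_negative(x, x_neg, lam):
    """Eq. (6): build a harder negative by linear interpolation,
    x_tilde = x + lam * (x_neg - x), with lam in [0, 1]."""
    assert 0.0 <= lam <= 1.0
    return x + lam * (x_neg - x)

x = np.array([0.0, 0.0])       # anchor embedding
x_neg = np.array([4.0, 0.0])   # original negative embedding
x_tilde = expand_negative(x, x_neg, lam=0.5)

# The augmented negative lies closer to the anchor than the original one,
# so the triplet {x, x+, x_tilde} is harder to separate.
assert np.linalg.norm(x_tilde - x) < np.linalg.norm(x_neg - x)
```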
      <sec id="sec-2-1">
        <title>4. Experiments</title>
        <p>In this section, we first explore the optimal settings for our approach and then present extensive experimental results to demonstrate the effectiveness of our method.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Adaptive Hardness-aware Expander</title>
        <p>In the end-stage of training, the original samples are already well separable under the action of Multi-metric Loss, and continuing to train on them cannot further improve the model's feature description ability. To address this limitation, we propose the Adaptive Hardness-aware Expander, as shown in Figure 4.</p>
        <p>We construct the hardness-aware triplet {x, x+, x̃−} in the embedding space, where manipulating the distances among samples directly alters the hardness level of the triplet.</p>
        <sec id="sec-2-2-1">
          <title>4.1. Implement Details</title>
          <p>For all real/fake video frames, we use the face extractor MTCNN to detect faces and save the aligned facial images as inputs with the size of 256 × 256. λ and μ in Eq. (1) and α in Eq. (3) are set to 2.0, 1.0 and 2.0 to impose different levels of classification constraints. The margin of Triplet Loss in Eq. (2) is set to 1.0. Optimization is performed using the SGD optimizer with weight decay 5e−4. The initial learning rate is kept at 0.01 and divided by 10 after every 3000 iterations. We adopt ResNet-34, pre-trained on the ImageNet dataset, as the backbone network. Our model is trained on 4 RTX 2080Ti GPUs with batch size 16, and the total number of iterations is set to 10,000.</p>
          <p>The generation of adaptive hardness-aware samples forces the network to pay more attention to key features that characterize truth and counterfeit under the constraint of Multi-metric Loss, thereby improving the feature description ability of the model and achieving better classification performance. Therefore, our method can achieve state-of-the-art performance on the FaceForensics++, DFDC and Celeb-DF datasets.</p>
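The learning-rate schedule stated above (initial rate 0.01, divided by 10 after every 3000 iterations over a 10,000-iteration run) can be written as a small helper. This is a sketch of the stated schedule, not the authors' training code.

```python
def learning_rate(iteration, base_lr=0.01, drop_every=3000, factor=10.0):
    """Step decay: divide the base rate by `factor` every `drop_every` iterations."""
    return base_lr / (factor ** (iteration // drop_every))

# Rates at the start of each decay interval of the 10,000-iteration run.
schedule = [learning_rate(t) for t in (0, 3000, 6000, 9000)]
```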
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. Comparison with Previous Methods</title>
        <p>In this section, we compare our method with previous DeepFake detection methods, showing the performance of various methods on the FaceForensics++ [6], DFDC [18] and Celeb-DF [19] datasets. We adopt ACC (accuracy) and AUC (area under the Receiver Operating Characteristic curve) as the evaluation metrics for the experiments.</p>
        <p>The evaluation results on the individual datasets are shown in Table 1 and Table 2. The results indicate that our model trained with Multi-metric Loss and AHE has a significant improvement over previous metric-learning methods [15, 16], especially on the DFDC and Celeb-DF datasets. The reason is that different levels of classification constraints, based on the observed distribution discrepancy, are imposed to mine the fundamental distinction between spurious and genuine videos, so the method still works on tampered videos without obvious artifacts. At the same time, the generation of adaptive hardness-aware samples forces the network to pay more attention to key features that characterize truth and counterfeit under the constraint of Multi-metric Loss, thereby improving the feature description ability of the model and achieving better classification performance.</p>
        <p>4.3. Ablation Study</p>
        <p>To verify the effectiveness of Multi-metric Loss and the Adaptive Hardness-aware Expander, we conduct ablation studies; the results are shown in Table 3, Figure 5 and Figure 6.</p>
        <p>4.3.1. Effectiveness of Multi-metric Loss</p>
        <p>To confirm the effectiveness of Multi-metric Loss, we evaluate how different levels of classification constraints affect the detection accuracy. We train the model on FF++ (c23); the other hyperparameters are kept the same as the settings in Table 1.</p>
        <table-wrap id="tbl3">
          <label>Table 3</label>
          <caption><p>The ablation study of Multi-metric Loss (ACC on FF++).</p></caption>
          <table>
            <thead>
              <tr><th>Loss</th><th>DF</th><th>F2F</th><th>FS</th><th>NT</th></tr>
            </thead>
            <tbody>
              <tr><td>Triplet Loss</td><td>0.946</td><td>0.925</td><td>0.939</td><td>0.810</td></tr>
              <tr><td>+ Cross-Entropy Loss</td><td>0.962</td><td>0.933</td><td>0.942</td><td>0.870</td></tr>
              <tr><td>+ Weight-Center Loss</td><td>0.985</td><td>0.974</td><td>0.995</td><td>0.938</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The t-SNE plots of the four different manipulation methods in the FF++ dataset are reported in Figure 5. It can be found that the separability of the samples is poor when Triplet Loss acts independently, as shown in the first row of Figure 5. The reason is that the data selection in the batch results in an uneven data distribution, which makes it difficult to divide the interface. When Cross-Entropy Loss is introduced, the data distribution of the different manipulations in FF++ is shown in the second row of Figure 5. Cross-Entropy Loss encourages the separation of real embedding vectors from fake embedding vectors, and Triplet Loss helps constrain intra-class compactness and inter-class separation, thereby improving the separability of the samples. In the third row of Figure 5, Weight-Center Loss is added; it acts only on the real cluster. By mining the features representing authenticity, the real sample clusters are tightly clustered, thereby further extending the distance between the two types of sample clusters in the embedding space. The ACC ablation studies of Multi-metric Loss on FF++ are reported in Table 3, which further confirm the effectiveness of Multi-metric Loss.</p>
        <p>Note that Triplet Loss and Cross-Entropy Loss work during the entire training stage, while Weight-Center Loss works only in the middle and end stages of training. The main reason is that the center point of the real samples is unstable at the beginning of training, which would cause the network to optimize in the wrong direction.</p>
        <p>4.3.2. Effectiveness of AHE</p>
        <p>To confirm the effectiveness of the Adaptive Hardness-aware Expander, we analyze the class activation maps for the four different manipulation methods, as shown in Figure 6. The class activation maps corresponding to the operation of the Expander indicate that synthetic samples with adaptive hardness force the network to pay more attention to key features that characterize authenticity and counterfeit under the constraint of Multi-metric Loss, thereby improving the feature description ability of the model. For example, NeuralTextures (NT) is a tampering scheme that modifies only the mouth area. Before the Adaptive Hardness-aware Expander is used, the class activation map shows that the nose and mouth regions together provide evidence that the video is tampered; afterwards, the network pays more attention to the tampered mouth area, which demonstrates the interpretability of our proposed method.</p>
      </sec>
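Both evaluation metrics can be computed from per-frame scores. The sketch below is an illustrative implementation (not the authors' evaluation code), using a 0.5 threshold for ACC and the rank-based (Mann-Whitney U) formulation of AUC; it assumes no tied scores.

```python
import numpy as np

def accuracy(scores, labels, threshold=0.5):
    """ACC: fraction of samples whose thresholded score matches the label."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

def auc(scores, labels):
    """AUC via the rank (Mann-Whitney U) formulation; assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

# Toy scores: perfectly separated real (1) and fake (0) samples.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 0]
assert accuracy(scores, labels) == 1.0
assert auc(scores, labels) == 1.0
```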
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <sec id="sec-3-1">
        <title/>
        <p>In this work, we propose a DeepFake detection method based on Multi-metric Loss, motivated by the distribution discrepancy whereby the embedding vectors of genuine faces are tightly distributed in the embedding space while those of tampered faces are comparatively scattered. Multi-metric Loss improves the separability of genuine and tampered samples by further widening the distance between the two types of sample clusters. In addition, adaptive hardness-aware samples are generated to keep the metric challenged at a proper difficulty, so as to improve the feature description ability of the model. Our method achieves good improvements on extensive metrics.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title/>
        <p>This work was supported in part by the National Key Research and Development of China (2018YFC0807306), National NSF of China (U1936212), and the Beijing Fund-Municipal Education Commission Joint Project (KZ202010015023).</p>
        <p>References</p>
        <p>[1] Y. Li, M. Chang, S. Lyu, In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking, in: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.</p>
        <p>[2] F. Matern, C. Riess, M. Stamminger, Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), 2019, pp. 83–92.</p>
        <p>[3] Y. Li, S. Lyu, Exposing DeepFake Videos By Detecting Face Warping Artifacts, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019, pp. 46–52.</p>
        <p>[4] X. Yang, Y. Li, S. Lyu, Exposing Deep Fakes Using Inconsistent Head Poses, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8261–8265.</p>
        <p>[5] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, MesoNet: a Compact Facial Video Forgery Detection Network, in: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.</p>
        <p>[6] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Niessner, FaceForensics++: Learning to Detect Manipulated Facial Images, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1–11.</p>
        <p>[7] H. Nguyen, J. Yamagishi, I. Echizen, Use of a Capsule Network to Detect Fake Images and Videos, 2019. arXiv:1910.12467v2.</p>
        <p>[8] I. Ganiyusufoglu, L. M. Ngô, N. Savov, S. Karaoglu, T. Gevers, Spatio-temporal Features for Generalized Detection of Deepfake Videos, 2020. arXiv:2010.11844.</p>
        <p>[9] P. Zhou, X. Han, V. I. Morariu, L. S. Davis, Two-Stream Neural Networks for Tampered Face Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.</p>
        <p>[10] N. Bonettini, E. D. Cannas, S. Mandelli, L. Bondi, P. Bestagini, S. Tubaro, Video Face Manipulation Detection Through Ensemble of CNNs, in: 2020 25th International Conference on Pattern Recognition (ICPR), 2020, pp. 5012–5019.</p>
        <p>[11] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face X-Ray for More General Face Forgery Detection, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5000–5009.</p>
        <p>[12] D. Güera, E. J. Delp, Deepfake Video Detection Using Recurrent Neural Networks, in: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018, pp. 1–6.</p>
        <p>[13] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, P. Natarajan, Recurrent Convolutional Strategies for Face Manipulation Detection in Videos, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 80–87.</p>
        <p>[14] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not Made for Each Other: Audio-Visual Dissonance-Based Deepfake Detection and Localization, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 439–447.</p>
        <p>[15] A. Kumar, A. Bhavsar, R. Verma, Detecting Deepfakes with Metric Learning, in: 2020 8th International Workshop on Biometrics and Forensics (IWBF), 2020, pp. 1–6.</p>
        <p>[16] K. Feng, J. Wu, M. Tian, A Detect method for deepfake video based on full face recognition, in: 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), 2020, pp. 1121–1125.</p>
        <p>[17] N. Lei, Z. Luo, S. Yau, X. D. Gu, Geometric Understanding of Deep Learning, 2018. arXiv:1805.10451.</p>
        <p>[18] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. Canton, The DeepFake Detection Challenge Dataset, 2020. arXiv:2006.07397.</p>
        <p>[19] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3204–3213.</p>
        <p>[20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative Adversarial Nets, in: 2014 Annual Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.</p>
        <p>[21] X. Wang, X. Han, W. Huang, D. Dong, M. R. Scott, Multi-Similarity Loss With General Pair Weighting for Deep Metric Learning, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5017–5025.</p>
        <p>[22] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A Discriminative Feature Learning Approach for Deep Face Recognition, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, 2016, pp. 499–515.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>