<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Virtual</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kai Hong</string-name>
          <email>kai hong@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoyu Du</string-name>
          <email>duxy@njust.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nanjing University Of Science And Technology</institution>
          ,
          <addr-line>Nanjing, 210014</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State Key Laboratory of Communication Content Cognition</institution>
          ,
          <addr-line>Beijing, 100733</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Recent works demonstrate the significance of textures for neural deepfake detection methods, yet the reason is still under exploration. In this paper, we claim that the artifact discrepancies caused by the face manipulation operations are the key difference between pristine videos and deepfakes. To imitate the discrepant situation from pristine videos, we propose an artifact-discrepant data generator that generates negative samples by adjusting the artifacts in the facial regions with conventional processing tools. We then propose the Deepfake Artifact Discrepancy Detector (DADD) to discover the discrepancies. DADD adopts a multi-task architecture, associates each sub-task with a specific artifact set, and assembles all the sub-tasks for the final prediction. We term DADD a self-supervised method since it never meets any deepfakes during the training process. The experimental results on the FaceForensics++ and Celeb-DF datasets demonstrate the effectiveness and generalizability of DADD.</p>
      </abstract>
      <kwd-group>
        <kwd>deepfake</kwd>
        <kwd>self-supervised</kwd>
        <kwd>artifact discrepancies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Videos were a natural and convincing medium to spread information due to their abundant and strongly co-associated details, including appearances, actions, sounds, etc. This situation has changed with the emergence of deepfakes, the model-synthesized media in which the face or voice may be replaced with someone else's.</p>
        <p>The synthetic videos have negative impacts on individuals and society. Moreover, with the rapid development of generative techniques, the procedures for making deepfakes have become substantially simpler, while the products look more realistic. This situation facilitates many domains, e.g., the film industry, but potentially increases the probability of social issues. Therefore, deepfake detection methods have garnered widespread attention.</p>
        <p>Recent deepfake detection methods are mainly devised from two perspectives. The first is taken by the bio-inspired methods, which build on observations and intuitive hypotheses over the datasets: Li et al. [1] focused on abnormal eye blinking; Yang et al. [2] noted the inconsistency between facial expressions and the corresponding head postures; Qi et al. [3] magnified the heart rhythm signal in videos and detected its disruption; Li et al. [4] located the blending boundaries made by facial replacement methods to perform the detection. The second perspective is to capture the forged features via neural networks, including customized deep networks [5], classic neural networks [6], etc.</p>
        <p>The neural methods achieve extremely high performance [6], but the dependence on the training datasets severely limits model generalizability, which is very important in practical applications. For instance, well-trained models may not work across datasets [7, 8], since deepfakes are made by a variety of methods.</p>
        <p>To retain model effectiveness across datasets, traditional measures including data augmentation [9] and transfer learning [7] have been introduced. However, these methods hardly reveal the inherent difference between pristine videos and deepfakes. To address this issue, the self-supervised learning scheme is introduced to produce negative samples as substitutes for true deepfakes, making the model learn specific features [10, 4].</p>
        <p>The negative samples rely on manual hypotheses about the differences between pristine videos and deepfakes, facilitating the construction of interpretable detection methods. Two typical works are FWA [10] and Face X-ray [4]: the former assumes that the artifacts are caused by the resizing and blurring operations on the facial regions, while the latter believes that deepfakes always have unseen blending boundaries. Their results demonstrate that recent neural networks mostly focus on generic visual artifacts rather than the videos themselves. Therefore, the negative samples generated with intuitive and empirical operations can facilitate the detection model and further enhance its generalizability.</p>
        <p>In addition, many works point out that videos and images have inherent signals like fingerprints, which are produced by the devices, the post-processing or the generative models [11, 12]. Inspired by these works, we make a bold hypothesis that the artifact discrepancies caused by the face manipulation operations are the key to detect deepfakes. Intuitively, all the frames in a pristine video go through the same operation flow, thus they should have consistent fingerprints (i.e., artifacts). In contrast, the replaced facial regions in deepfakes inevitably introduce discrepant artifacts. Focusing on the discrepant artifacts, we propose a self-supervised deepfake detection approach which comprises an Artifact-Discrepant Data Generator (ADDG) and a Deepfake Artifact Discrepancy Detector (DADD) to discover the discrepancy from the generated data. ADDG uses only the pristine video frames and perturbs the facial regions with conventional processing tools, e.g., blurring, scaling, rotation, replacement, etc. Although the perturbations do not change the frames in the human sense, we believe that they introduce a discrepancy at the artifact level. Thus the perturbed frames are taken as the negative samples (i.e., substitutes of deepfakes) in our approach. DADD adopts the multi-task learning scheme, associates each sub-task with a type of generated data, and assembles all the sub-tasks for the final prediction. The prediction is constrained by the ℓ2,1 norm [13, 14], a classic regularization for feature selection. The experimental results on the public datasets demonstrate that the model trained on the generated data can achieve a competitive performance, even though it never sees real deepfakes. This verifies the effectiveness and generalizability of our approach, and reveals that our hypothesis is a feasible perspective from which to detect deepfakes.</p>
        <p>The main contributions of our work are as follows:</p>
        <p>• We hypothesize that the artifact discrepancies caused by the face manipulations are the key to detect deepfakes, and thus propose a self-supervised deepfake detection approach to discover the discrepancy. The core is the Artifact-Discrepant Data Generator, which uses the pristine video frames only and perturbs the facial region with conventional processing tools to generate the negative samples.</p>
        <p>• To better address the artifact discrepancies, we propose the Deepfake Artifact Discrepancy Detector, which adopts the multi-task learning scheme, associates each sub-task with a type of generated data, and makes the final prediction by integrating the sub-tasks. To guide the task feature selection, we adopt the ℓ2,1 norm to constrain the learning process.</p>
        <p>• Extensive experiments are conducted to demonstrate the effectiveness and generalizability of our proposed self-supervised approach, though it never sees any real deepfakes during the training process.</p>
        <sec id="sec-1-1-1">
          <title>2. Related Work</title>
          <p>Bio-inspired methods. Some works have found that the actors' physiological characteristics in deepfakes are different from the real world. Li et al. [1] found that the actors in deepfakes have an abnormal blinking frequency and some even don't blink. Yang et al. [2] found that face orientation and head poses are related, but the correlation is destroyed in deepfakes. Due to the development of remote visual photoplethysmography (rPPG) technology, the heart rate of actors in videos can be detected [15]. Based on this technology, Qi et al. [3] found the irregular heart rhythm of actors in deepfakes. Similarly, Ciftci et al. [16] explored the biological signal difference between fake videos and real videos. However, the physiological signal artifacts reflected by different datasets are different, so specific data needs specific analysis.</p>
          <p>Neural methods. Since deep neural networks can automatically extract images' deep features, many DNN-based detection methods have achieved satisfactory results. Zhou et al. [17] divided the image into different patches, and proposed a two-stream network to detect the difference between patches. Afchar et al. [5] proposed a compact network structure, MesoNet, to detect fake videos. Nguyen et al. [18] proposed the use of capsule networks for deepfake detection. These methods indicate that a simple CNN can indeed capture the relevant features of fake videos. In addition to these detection methods based on single-frame images, there are also methods based on multi-frame sequences. Guera et al. [19] extracted features from each frame using a CNN, then made decisions on the feature sequence using an RNN. To better capture the correlation of different frame features, Sabir et al. [20] used a bi-directional RNN. These neural methods can detect specific deepfakes perfectly [6], but on unseen data the detection performance is greatly reduced [8].</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Cross-data methods</title>
        <p>Recently, the generalizability of detection methods has been emphasized. Xuan et al. [21] preprocessed training images to reduce obvious artifacts, forcing models to learn more intrinsic features. Cozzolino et al. [22] introduced an auto-encoder method that enabled real and fake images to be decoupled in latent space. Du et al. [7] believed that the detection model needs to focus on the forgery area rather than irrelevant regions, so they located the modified region and proposed an active learning method. Nirkin et al. [23] believed that the face and the context of a fake image carry inconsistent identity information, so they used a face recognition method to detect deepfakes. However, these methods still require the corresponding fake videos to complete the training, resulting in limited generalizability; different amounts of data are bound to produce different results [24].</p>
        <p>[Figure 1: The pipeline of ADDG. A pristine frame X is perturbed with one of the strategies (1. GaussBlur, 2. Scaling, 3. ISONoise, 4. Rotation, 5. SB-Rand, 6. SB-Sim) to produce the positive sample XP; deformed landmarks drive the mask generation (random selection among candidate masks, optionally reversed) to produce the mask M; the frames are then blended into the negative sample XN.]</p>
        <p>Another novel idea is not to use any fake image during training. FWA [10] expected to simulate face warping artifacts by adjusting the face area to different sizes and blurring it to produce similar texture artifacts. Face X-ray [4] dynamically generated images with boundary information during training. Zhao et al. [25] also used Face X-ray's method of generating training data and proposed a model for learning the consistency of different patches. Therefore, discovering the common steps of generation methods can facilitate the generalizability of the model.</p>
        <sec id="sec-1-2-2">
          <title>3. Method</title>
          <p>The images from different sources have different fingerprints, which are caused by the devices, the post-processing operations and the generative models. The fusion of two different images leads to artifact discrepancies. This would be the key feature of deepfakes, since deepfakes always have manipulated facial regions. Therefore, we propose the Artifact-Discrepant Data Generator (ADDG). In order to better address the artifact discrepancy, we propose the Deepfake Artifact Discrepancy Detector (DADD), which adopts the multi-task learning scheme to learn the features from each type of discrepancy data, respectively, and makes the final prediction by incorporating the sub-tasks. Finally, considering that the proposed perturbations have differing impacts, we introduce the ℓ2,1 regularization for feature selection.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>3.1. Artifact-Discrepant Data Generator</title>
          <p>As shown in Figure 1, ADDG takes in the pristine image and generates the negative sample (i.e., the artifact-discrepant sample) with three modules: frame perturbation, mask generation, and negative sample synthesis. Frame perturbation uses common image processing tools to change the fingerprint of the pristine frame, like data augmentation; mask generation selects the perturbation area; and negative sample synthesis blends the pristine and perturbed frames to produce the discrepant artifacts. We introduce the modules below.</p>
          <p>Frame Perturbation. We utilize one of the conventional image processing methods to change the fingerprint of the pristine frames, while ensuring all the frames are still pristine. In this work, we use GaussBlur, Scaling, ISONoise, Rotation, SB-Rand, and SB-Sim as the perturbations:</p>
          <p>• GaussBlur is a commonly used data augmentation method in deepfake detection [21].</p>
          <p>• Scaling refers to zooming out and then zooming in the image, which will change the texture.</p>
        </sec>
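        <p>As an illustration of the Scaling perturbation above, here is a minimal NumPy sketch (our own illustration, not the authors' code; the nearest-neighbour resampling is an assumption): it zooms the frame out and back in, so the content is visually unchanged while the texture statistics shift.

```python
import numpy as np

def scale_down_up(frame: np.ndarray, factor: float = 0.5) -> np.ndarray:
    """Zoom out, then zoom back in, with nearest-neighbour sampling.

    The round trip keeps the frame size but discards high-frequency
    detail, which perturbs the frame's artifact-level fingerprint.
    """
    h, w = frame.shape[:2]
    small_h = max(1, int(h * factor))
    small_w = max(1, int(w * factor))
    # nearest-neighbour indices of the down-sampled grid
    rows = (np.arange(small_h) / factor).astype(int).clip(0, h - 1)
    cols = (np.arange(small_w) / factor).astype(int).clip(0, w - 1)
    small = frame[np.ix_(rows, cols)]
    # map back up to the original resolution
    up_rows = (np.arange(h) * factor).astype(int).clip(0, small_h - 1)
    up_cols = (np.arange(w) * factor).astype(int).clip(0, small_w - 1)
    return small[np.ix_(up_rows, up_cols)]
```

A GaussBlur variant of the same idea would instead convolve the frame with a Gaussian kernel; both leave the frame pristine in the human sense.</p>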
        <p>[Figure 3: The architecture of DADD. A shared backbone feature is projected into N private features; the sub-tasks ST_1, ST_2, ST_3, ..., ST_N predict from their own private features, and the final task FT predicts from the concatenation of the N features.]</p>
        <p>• ISONoise simulates the inherent noise signal generated when the sensor captures photos; we obtain it via the Albumentations library.</p>
        <p>• Rotation's purpose is to slightly adjust the face to produce artifacts in the form of a boundary.</p>
        <p>Negative Sample Synthesis. This module produces the negative samples by synthesizing the pristine and perturbed frames according to the mask. Let X be the input pristine frame, XP be the perturbed yet pristine frame, M be the mask, and XN be the generated negative sample. The negative sample is generated by

XN = XP ⊙ M + X ⊙ (1 − M),  (1)

where ⊙ indicates the element-wise product.</p>
        <p>Finally, we list our nine categories of negative samples. We term Inner- the synthesis with the common mask, which leaves the perturbation in the foreground. Similarly, we term Outer- the synthesis with the reversed mask, which leaves the perturbation in the background. The categories without these prefixes also use the common mask. Specifically, the categories are Inner-GaussBlur, Outer-GaussBlur, Inner-Scaling, Outer-Scaling, Inner-ISONoise, Outer-ISONoise, Rotation, SB-Rand, and SB-Sim. Some generated examples are shown in Figure 2; some samples show no visible difference, which is due to the small degree of modification.</p>
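        <p>Eq. (1) above is a pixel-wise convex blend. A small NumPy sketch (our own illustration; the hard 0-1 box mask is a hypothetical example) covering both the common mask and the reversed ('Outer-') mask:

```python
import numpy as np

def synthesize_negative(x: np.ndarray, x_p: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Eq. (1): XN = XP * M + X * (1 - M), element-wise.

    m may be a hard 0-1 mask or a soft mask whose values are smooth
    near the boundary; 1 - m plays the role of the reversed mask.
    """
    return x_p * m + x * (1.0 - m)

# toy 4x4 frames and a hard box mask over the lower-right region
x = np.zeros((4, 4))        # pristine frame X
x_p = np.ones((4, 4))       # perturbed yet pristine frame XP
m = np.zeros((4, 4))
m[2:, 2:] = 1.0             # mask M

x_n = synthesize_negative(x, x_p, m)              # 'Inner-' synthesis
x_n_outer = synthesize_negative(x, x_p, 1.0 - m)  # 'Outer-' synthesis
```

With a soft mask, the same function produces the hard samples with smooth boundaries described in the mask generation module.</p>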
        <sec id="sec-1-2-1">
          <title>3.2. Deepfake Artifact Discrepancy Detector</title>
          <p>• SB-Rand and SB-Sim refer to using somebody else's frames as the perturbation. '-Rand' indicates the frame is randomly selected, while '-Sim' indicates the frame has a face similar to the pristine frame, i.e., the landmarks of the faces in the two frames are close. This operation introduces more diverse texture information.</p>
        </sec>
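        <p>The '-Sim' selection can be viewed as a nearest-neighbour search over face landmarks. A minimal sketch (our own illustration; the 68x2 landmark arrays, as produced by Dlib, and the mean-L2 criterion are assumptions, since the text only says the landmarks of the two faces are close):

```python
import numpy as np

def pick_similar_frame(target_landmarks: np.ndarray,
                       candidate_landmarks: list) -> int:
    """Return the index of the candidate frame whose face landmarks
    are closest (mean L2 distance) to the target's, i.e. the SB-Sim
    donor; SB-Rand would instead pick a candidate at random."""
    dists = [np.linalg.norm(target_landmarks - c, axis=1).mean()
             for c in candidate_landmarks]
    return int(np.argmin(dists))

# toy landmarks: candidate 1 is only a small jitter of the target
target = np.random.default_rng(0).uniform(0.0, 100.0, size=(68, 2))
cands = [target + 30.0, target + 1.0, target - 25.0]
donor = pick_similar_frame(target, cands)
```
</p>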
      </sec>
      <sec id="sec-1-3">
        <p>Note that all the operations in this module operate on the whole input, and the generated results are still pristine. Thus we denote the perturbed frame as XP.</p>
        <p>Mask Generation. This module decides the to-be-modified region in the frame. First, we locate the face landmarks in the pristine frame using Dlib to guarantee the mask is associated with the given input. Then, since analyzing the usual operations of deepfakes shows that the modified area usually occurs on the face or the mouth, we empirically select some key points and preset five candidate masks and their reverses. The shapes of the masks are presented in Figure 1. We randomly select a mask from the candidates. Because the keypoint detection may be inaccurate, and considering the generalization performance, we apply a slight random deformation to the mask. Since the key is the discrepancy between regions rather than the perturbed facial region itself, we also use the reversed region to perturb the corresponding background.</p>
        <p>The final mask is denoted as M, a matrix with the same shape as the input. We set two versions of it: the basic M is a 0-1 matrix, which has solid boundaries and may be easily recognized by the model; to generate hard samples, we also generate M with a soft boundary, where the values of M are smooth near the boundaries.</p>
        <p>The simplest yet effective measure to detect the artifact discrepancy is to train a model on the artifact-discrepant data. However, the fused dataset contains too much information, so it is hard to force the model to learn effective features. To address this problem, we propose the Deepfake Artifact Discrepancy Detector (DADD), which adopts the multi-task learning scheme to learn the characteristics of each category, and then summarizes the features to make the final prediction. The structure is shown in Figure 3. In DADD, we first extract a common feature with a CNN (e.g., Xception [26] in this work). Then, to suit the requirements of each sub-task, we project the common feature into private features. The predictions of the sub-tasks are based on these private features. We then devise a final task based on the concatenation of all the private features.</p>
        <p>To train DADD, we first train the sub-tasks in turn. When training sub-task ST_1, the data associated with ST_1 is fed. We train the sub-tasks for k iterations and then the final task for t iterations, alternately. Then the common and private features can both retain the significant features for the prediction. Eventually, in the test process, the prediction is the output of the final task.</p>
        <sec id="sec-1-3-1">
          <title>3.3. Training</title>
          <p>For all sub-tasks and the final task, we adopt the cross-entropy loss as the learning target. Let ℒst be the sub-task loss and ℒft be the final-task loss; they are defined as

ℒst = ℒft = −(1/n) ∑ᵢ₌₁ⁿ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ],  (2)

where yᵢ indicates the ground truth, ŷᵢ indicates the output of the model, and n indicates the number of samples.</p>
          <p>In addition, the purpose of DADD is to use the most suitable features; this is a feature selection task. Therefore, we introduce the feature selection regularization ℓ2,1 norm [27, 28, 29]. Formally, the ℓ2,1 regularization is

ℓ2,1(W) = ‖W‖2,1 = ∑ᵢ₌₁ᵐ √( ∑ⱼ₌₁ⁿ |Wᵢⱼ|² ),  (3)

where W represents the parameter matrix, n represents the number of columns of the matrix, and m represents the number of rows. The function of the ℓ2,1 regularization is to sparsify the rows of the parameter matrix. In our task, each row's parameters are the weights corresponding to the feature vector extracted by one sub-task. We add the ℓ2,1 regularization to the final-task training process, and the overall loss function is defined in Eq. (4).</p>
          <p>Datasets. To evaluate our approach, we leverage two datasets, FaceForensics++ [6] and Celeb-DF [8]. FaceForensics++ [6] comprises a set of pristine videos (P) and four categories of fake videos, including DeepFakes (DF), Face2Face (FF), FaceSwap (FS) and NeuralTextures (NT); each category contains 1,000 videos. The dataset publisher gives an official splitting list: 720, 140, and 140 videos of each category are used for training, validation and test, respectively. In our experiments, we extract 20 frames per video. We then adopt the training set of pristine videos only to train our model, choose the parameters according to the validation set, and evaluate the model on the test set. Celeb-DF [8] is a challenging dataset, mostly used for cross-dataset tests. There are 38 real videos and 62 fake videos in this test set, and we extract all frames from these videos. We select the model via the validation set of FaceForensics++ and evaluate the model on Celeb-DF. Note that the test data never appears in the training data, especially the deepfakes. Moreover, Celeb-DF is an independent and hard dataset. Thus the test results can demonstrate the generalizability of our method across datasets.</p>
        </sec>
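        <p>The losses of Eqs. (2)-(3) and the combined objective of Eq. (4) can be sketched in NumPy as follows (our own illustration; the epsilon clipping is a numerical-safety assumption, not part of the formulas, and W stands for the final-task weights over the concatenated private features):

```python
import numpy as np

def bce(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Eq. (2): binary cross-entropy averaged over n samples."""
    eps = 1e-12  # numerical safety only
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def l21_norm(w: np.ndarray) -> float:
    """Eq. (3): sum over rows of the row-wise L2 norms. Penalizing it
    drives whole rows (one row per sub-task feature block) to zero,
    which acts as a feature selection over the sub-tasks."""
    return float(np.sqrt((w ** 2).sum(axis=1)).sum())

def total_loss(y, y_hat, w, lam=0.1) -> float:
    """Eq. (4): final-task loss plus the weighted l2,1 regularizer."""
    return bce(y, y_hat) + lam * l21_norm(w)
```
</p>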
      </sec>
      <sec id="sec-1-4">
        <title>Algorithm 1: Multi-Task Learning Framework</title>
        <p>Input: training images X;
1 repeat
2   for i = 0 to k do
3     for j = 1 to N do
4       Generate XP(j), XN(j);</p>
      </sec>
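      <p>As we read the recovered fragment of Algorithm 1, training alternates between the sub-tasks and the final task. A schematic Python skeleton of that loop (our own reconstruction; generate_pair, train_subtask and train_final_task are hypothetical placeholders for ADDG and the two kinds of update steps):

```python
def multitask_training(categories, generate_pair, train_subtask,
                       train_final_task, k=1, t=1, epochs=1):
    """Alternate k passes over the N sub-tasks (each fed only with its
    own category of generated data) with t passes of the final task,
    repeating for a fixed number of epochs."""
    log = []
    for _ in range(epochs):                      # repeat
        for _ in range(k):                       # for i = 0 to k
            for j, cat in enumerate(categories): # for j = 1 to N
                x_p, x_n = generate_pair(cat)    # Generate XP(j), XN(j)
                train_subtask(j, x_p, x_n)
                log.append(("sub", j))
        for _ in range(t):
            train_final_task()                   # concatenated features
            log.append(("final", None))
    return log
```
</p>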
      <sec id="sec-1-5">
        <p>ℒ = ℒft + λ · ℓ2,1(W),  (4)

where ℓ2,1(W) indicates the regularization on the concatenated private features, and λ is a hyper-parameter. During training, we perform regular data augmentation on all types of data. The detailed training procedure is listed in Algorithm 1.</p>
        <p>Methods. To make a fair comparison, we introduce two recent self-supervised deepfake detection methods, FWA [10] and Face X-ray [4], which also use real frames to dynamically generate training data during training. FWA believes that GaussBlur can construct warped faces, so it uses different degrees of GaussBlur to construct negative samples. Face X-ray dynamically generates images with boundary information. In our experiments, we use '-FWA' to denote the data generated by FWA, '-BI' to denote the data generated by Face X-ray, and '-ADDG' to denote the data generated by our proposed method. We also use 'Xcep-' to denote the Xception model, 'Xray-' to denote the X-ray model, and 'DADD-' to denote our method.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiment</title>
      <sec id="sec-2-1">
        <p>In this section, we conduct extensive experiments to demonstrate the effectiveness of our approach.</p>
        <sec id="sec-2-1-1">
          <title>4.2. Performances</title>
          <p>Table 1 demonstrates the results on DF and Celeb-DF.</p>
          <p>The results marked with references are taken from the original papers. Two-stream [17] was trained on the SwapMe dataset [17]. Meso4 [5] was trained on an internal DeepFake dataset collected by the authors. HeadPose [2] was trained on the UADFV dataset [2]. For FWA, the dataset was collected from the Internet. The supervised methods perform badly when tested across data.</p>
          <p>In contrast, the self-supervised methods in the second part of Table 1 mostly do well, which reveals the significance of exploring self-supervised methods. Since FWA only considers GaussBlur to simulate the face warping during deepfake generation, its generalizability is limited: as can be seen from Xcep-FWA, only DF and FF perform slightly higher. For Xcep-BI, the results differ because the specific settings of our experiments differ from the original paper. Xray-BI and our method DADD-ADDG (ℓ2,1) perform evenly on DF, FF, FS, and NT. DADD-ADDG (ℓ2,1) has an average improvement of 0.17% on FaceForensics++, but on the more difficult Celeb-DF our method improves by 8.17%. This verifies our hypothesis on the artifact discrepancy. Since our task is to improve generalization performance, i.e., test results on completely unrelated datasets, a slight decrease in performance on FS and NT is acceptable.</p>
          <p>[Figure 4: Visual result of the feature selection implemented by the ℓ2,1 regularization (λ = 0.1). The left shows the weights without ℓ2,1; the right shows the weights with ℓ2,1.]</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>4.3. The Impact of Perturbations</title>
          <p>We present the impact of different perturbations in Figure 5. We finetune Xception on the different categories of perturbed frames, respectively.</p>
          <p>From Figure 5, we have the following observations. For DF, all methods have good performance except Outer-Scaling. For FF, the Rotation we proposed reaches the best response, indicating that FF artifacts show more edge information. For FS, the two methods that use the texture of other images to perturb the original image have the best response, indicating that it is meaningful to introduce various textures. This also explains why the blending boundary constructed by rotating does not perform as well as replacing the image. For NT, Inner-GaussBlur, Inner-Scaling, and Rotation have a high response. Compared with the other types of data, it is difficult for Celeb-DF to get a good response with a single perturbation method.</p>
          <p>The impacts of GaussBlur and Scaling are similar. When the face's interior is disturbed, they respond very well on DF, FF, and NT, while they are deficient on FS. However, when the modified area is the background, the result is the opposite: the effect is better on FS. This verifies that our model does not merely detect specific texture features but captures the difference between internal and external textures. Since the dataset is heavily compressed, a lot of information is lost. The results on the five test sets demonstrate that different perturbations benefit different types of deepfakes.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.4. The Impact of DADD</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4.5. The Impact of ℓ2,1 Regularization</title>
      <p>Table 2 demonstrates the impact of the ℓ2,1 regularization. It improves the final result by 1.51% on the cross-dataset Celeb-DF test. This indicates that the feature selection benefits the model performance. We also test the feature selection hyper-parameter λ, and log its impact in Figure 7. When λ = 0.1, the model achieves the best performance; lower λ improves the performance only by a small ratio.</p>
      <sec id="sec-3-1">
        <p>We also visualize the layer with regularization in Figure 4. The features from ST_3, ST_7, and ST_9 contribute most; the corresponding perturbations are Inner-Scaling, Rotation, and SB-Sim. This means they could be the delegates in the final prediction. Note that this doesn't mean only these three sub-tasks are necessary: their performance is based on the shared features, which are learned from all the sub-tasks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
        <p>In this paper, we made the hypothesis that the discrepant artifacts caused by the face manipulations are the key differences between pristine videos and deepfakes. To address the discrepancy, we proposed a self-supervised deepfake detection approach.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>Special thanks are given to the SSDL2021 organizing committee, and we are also very grateful to the reviewers for their valuable comments on this paper. This research was supported by the Open Funding Project of the State Key Laboratory of Communication Content Cognition (No. 20K03). The completion of this paper could not be separated from the help of the Intelligent Media Analysis Group (IMAG). The authors would also like to thank Cheng Zhuang, Jiangnan Dai and Shaocong Yang in IMAG for their valuable discussions.</p>
        <p>[26] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251-1258.</p>
        <p>[27] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, X. Zhou, ℓ2,1-norm regularized discriminative feature selection for unsupervised learning, in: IJCAI International Joint Conference on Artificial Intelligence, 2011, pp. 1589-1594.</p>
        <p>[28] L. Fu, Z. Li, Q. Ye, H. Yin, Q. Liu, X. Chen, X. Fan, W. Yang, G. Yang, Learning robust discriminant subspace based on joint ℓ2,p- and ℓ2,s-norm distance metrics, IEEE Transactions on Neural Networks and Learning Systems (2020).</p>
        <p>[29] Q. Ye, Z. Li, L. Fu, Z. Zhang, W. Yang, G. Yang, Nonpeaked discriminant analysis for data representation, IEEE Transactions on Neural Networks and Learning Systems 30 (2019) 3818-3832.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>