=Paper=
{{Paper
|id=Vol-3084/paper2
|storemode=property
|title=Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies
|pdfUrl=https://ceur-ws.org/Vol-3084/paper2.pdf
|volume=Vol-3084
|authors=Kai Hong,Xiaoyu Du
}}
==Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies==
Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies Kai Hong1,2 , Xiaoyu Du1,2 1 Nanjing University Of Science And Technology, Nanjing, 210014, China 2 State Key Laboratory of Communication Content Cognition, Beijing, 100733, China Abstract Recent works demonstrate the significance of textures for the neural deepfake detection methods, yet the reason is still in explorations. In this paper, we claim that the artifact discrepancies caused by the face manipulation operations are the key difference between pristine videos and deepfakes. To imitate the discrepant situation from pristine videos, we propose an artifact-discrepant data generator to generate the negative samples by adjusting the artifacts in the facial regions with conventional processing tools. We then propose Deepfake Artifact Discrepancy Detector (DADD) method to discover the discrepancies. DADD adopts the multi-task architecture, associates each sub-task with a specific artifact set, and assembles all the sub-tasks for the final prediction. We term DADD as a self-supervised method since it never meets any deepfakes during the training process. The experimental results on the FaceForensics++ and Celeb-DF datasets demonstrate the effectiveness and generalizability of DADD. Keywords deepfake, self-supervised, artifact discrepancies 1. Introduction detection; the second perspective is to capture the forged features via the neural networks, including customized Videos were a natural and convincing medium to deep networks [5], classic neural networks [6], et al. . spread information due to their abundant and strongly The neural methods achieve an extremely high perfor- co-associated details, including appearances, actions, mance [6]. But the dependence on the training datasets sounds, etc. This situation has changed due to the emer- severely limits the model generalizability, which is very gence of Deepfakes, the model-synthetic media in which important in practical applications. For instance, the the face or voice may be replaced with someone else’s. well-trained models may not work across datasets [7, 8], The synthetic videos are resulting in negative impacts on since the deepfakes are made by a variant of methods. individuals and society. Moreover, with the rapid devel- To retain the model effectiveness across the datasets, opment of generative techniques, the procedures making the traditional measures including data augmentation deepfakes have become substantially simple, while the [9] and transfer learning [7] are introduced. However, products seem more realistic. This situation facilitates these methods hardly reveal the inherent difference be- many domains, i.e., the film industry, but potentially in- tween pristine videos and deepfakes. To address this creases the probability of social issues. Therefore, the issue, self-supervised learning scheme is introduced to deepfake detection methods have garnered widespread produce negative samples as the substitutes of true deep- attention. fakes to make the model learn specific features [10, 4]. Recent deepfake detection methods are mainly devised The negative samples rely on the manual hypothesis of from two perspectives. The first one is used by the bio- the differences between pristine videos and deepfakes, inspired methods based on the observations and intuitive facilitating the construction of interpretable detection hypotheses over the datasets. Li et al. [1] focused on the methods. Two typical works are FWA [10] and Face X- abnormal eye blinking. Yang et al. [2] noted the incon- ray [4], where the former assumes that the artifacts are sistency between the facial expressions and the corre- caused by the resizing and blurring operations on the sponding head postures. Qi et al. [3] magnified the heart facial regions, and the latter believes that deepfakes al- rhythm signal in videos and detected the disrupted heart ways have unseen blending boundaries. Their results rhythm signal. Li et al. [4] located the blending bound- demonstrate that recent neural networks mostly focus aries made by facial replacement methods to make the on generic visual artifacts rather than the videos them- * Xiaoyu Du is the corresponding author. selves. Therefore, the negative samples generated with 2021 International Workshop on Safety & Security of Deep Learning, intuitive and empirical operations can facilitate the de- August 21th, 2021, Virtual tection model and enhance the generalizability further. " kai hong@njust.edu.cn (K. Hong); duxy@njust.edu.cn (X. Du) In addition, many works point out that the videos and 0000-0003-3567-6396 (K. Hong); 0000-0002-4641-1994 (X. Du) © 2021 Copyright for this paper by its authors. Use permitted under Creative images have inherent signals like fingerprints, which CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) are produced by the devices, the post-processing or the generative models [11, 12]. Inspired by these works, we 2. Related Work make a bold hypothesis that the artifact discrepancies caused by the face manipulation operations are the key to Bio-inspired methods. Some works have found that detect deepfakes. Intuitively, all the frames in a pristine the actors’ physiological characteristics in deepfakes are video have the same operation flow, thus they should different from the real world. Li et al. [1] found that the have consistent fingerprints (i.e., artifacts). In contrast, actors in deepfakes have an abnormal blinking frequency the replaced facial regions in deepfakes inevitably in- and some even don’t blink. Yang et al. [2] found that face troduce discrepant artifacts. Focusing on the discrepant orientation and head poses are related, but the correla- artifacts, we propose a self-supervised deepfake detection tion is destroyed in deepfakes. Due to the development of approach which comprises an Artifact-Discrepant Data remote visual photoplethysmography (rppg) technology, Generator (ADDG) and a Deepfake Artifact Discrepancy the heart rate of actors in videos can be detected [15]. Detector (DADD) to discover the discrepancy from the Based on this technology, Qi et al. [3] found the irregular generated data. ADDG just uses the pristine video frames heart rhythm of actors in deepfake. Similarly, Ciftci et al. and perturbs the facial regions with the conventional pro- [16] explored the biological signal difference between cessing tools, e.g., blurring, scaling, rotation, replacement, fake videos and real videos. However, the physiolog- etc. Although the perturbations do not change the frames ical signal artifacts reflected by different data sets are in human sense, we believe that they have introduced different, so specific data needs specific analysis. the discrepancy in the artifact level. Thus the perturbed frames are taken as the negative samples (i.e., substitutesNeural methods. Since deep neural networks can au- of deepfakes) in our approach. DADD adopts the multi- tomatically extract images’ deep features, many DNN- task learning scheme, associates each sub-task with a based detection methods have achieved satisfactory re- type of generated data, and assembles all the sub-tasks sults. Zhou et al. [17] divided the image into different for the final prediction. The prediction is constrained patches, and proposed a Two-stream network to detect by ℓ2,1 norm[13, 14] which is a classic regularization for the difference between patches. Afchar et al. [5] pro- feature selection. The experimental results on the pub- posed a compact network structure MesoNet to detect lic datasets demonstrate that the model trained on the fake videos. Nguyen et al. [18] proposed the use of cap- generated data can achieve a competitive performance, sule networks for deepfake detection. These methods in- even it never sees the real deepfakes. This verifies the dicate that a simple CNN network can indeed capture the effectiveness and generalizability of our approach, and relevant features of fake videos. In addition to these detec- reveals that our hypothesis is a feasible perspective to tion methods based on single-frame images, there are also detect the deepfakes. methods based on multi-frame sequences. Guera et al. The main contributions of our work are as follows: [19] extracted features from each frame by using CNN, then made decisions based on the feature sequence by • We hypothesize that the artifact discrepancies caused using RNN. To capture the correlation of different frame by the face manipulations are the key to detect deep- features better, Sabir et al. [20] used a Bi-directional RNN. fakes, thus propose a self-supervised deepfake detec- These neural methods can detect specific Deepfakes per- tion approach to discover the discrepancy. The core fectly [6], but for unseen data, the detection performance is the Artifact-Discrepant Data Generator, which uses will be greatly reduced [8]. the pristine video frames only and perturbs the facial region with the conventional processing tools to gen- Cross-data methods. Recently, the generalizability erate the negative samples. of detection methods has been emphasized. Xuan et al. • To better address the artifact discrepancies, we propose [21] preprocessed training images to reduce obvious ar- Deepfake Artifact Discrepancy Detector, which adopts tifacts, forcing models to learn more intrinsic features. the multi-task learning scheme, associates each sub- Cozzolino et al. [22] introduced an auto-encoder method task with a type of generated data, and makes the final that enabled real and fake images to be decoupled in la- prediction by integrating the sub-tasks. To guide the tent space. Du et al. [7] believed that the detection model task feature selection, we adopt ℓ2,1 norm to constraint needs to focus on the forgery area, not the irrelevant ones, the learning process. so they located the modified region and proposed an ac- tive learning method. Nirkin et al. [23] believed that the • Extensive experiments are conducted to demonstrate face and content of the fake image have inconsistent iden- the effectiveness and generalizability of our proposed tity information, so they used face recognition method self-supervised approach, though it has never seen any to detect deepfakes. However, these methods still require real deepfakes through the training process. corresponding fake videos to complete the training, re- sulting in limited generalizability. Different amounts of data are bound to produce different results [24]. Another Negative Sample Synthesis Perturbation Strategy 1). GaussBlur 2). Scaling 3). ISONoise 4). Rotation 5). SB-Rand 6). SB-Sim Frame Perturbation XP: Positive Sample XN X: Pristine Frame Random Deformation Select M:Mask Reversed Negative Sample Landmarks Candidate Masks Mask Generation Positive Sample Figure 1: Overview of ADDG. Through the three modules, Frame Perturbation, Mask Generation, and Negative Sample Synthesis, the pristine frame 𝑋 is converted to a negative sample 𝑋𝑛 . The green boundary indicates that the frame should be treated as positive sample, while the red one indicates that the frame is negative, i.e., it has discrepant artifacts. Figure 2: The perturbed examples of ADDG. novel idea is not to use any fake image during training. bations proposed have differing impacts, we introduce FWA[10] expected to simulate face warping artifacts by the ℓ2,1 regularization for feature selection. adjusting the face area to different sizes and blurring it to produce similar texture artifacts. Face X-ray[4] generated 3.1. Artifact-Discrepant Data Generator images with boundary information during training dy- namically. Zhao et al. [25] also used Face X-ray’s method As shown in Figure 1, ADDG takes in the pristine im- of generating training data and proposed a model for age 𝑋, and generate the negative sample (i.e., artifact- learning different patches’ consistency. Therefore, dis- discrepant sample) with three modules, frame perturba- covering generation methods’ common steps can facili- tion, mask generation, and negative sample synthesis. tate the generalizability of the model. Frame perturbation uses common image processing tools to change the fingerprint of the pristine frame like data augmentation, mask generation selects the perturbation 3. Method area, and negative sample synthesis blends the pristine and perturbed frames to produce the discrepant artifacts. The images from different sources have different finger- We will introduce the modules, respectively. prints which is caused by the devices, the post-processing operations and the generative models. The fusion of Frame Perturbation. We utilize one of the conven- two different images leads to artifact discrepancies. This tional image processing methods to change the finger- would be the key features of deepfakes, since deepfakes print of the pristine frames, but ensure all the frames are always have manipulated facial regions.Therefore, we still pristine. In this work, we use GaussBlur, Scaling, propose Artifact-Discrepant Data Generator (ADDG). In ISONoise, Rotation, SB-Rand, SB-Sim, as the order to better address the artifact discrepancy, we pro- pose Deepfake Artifact Discrepancy Detector (DADD), • GaussBlur is a commonly used data augmentation which adopts the multi-task learning scheme to learn method in deepfake detection [21]. the features from each type of discrepancy data, respec- tively, and make the final prediction by incorporating the • Scaling refers to zooming out and then zooming in sub-tasks. Finally, considering that the different pertur- the image, which will change the texture. Sub-Tasks Negative Sample Synthesis. This module produces F ST_1 the negative samples by synthesizing the pristine and F ST_2 perturbed frames according the mask. Let 𝑋 be the input CNN Z F ST_3 pristine frame, 𝑋𝑃 be the perturbed yet pristine frame, ... ... and 𝑋𝑁 be the generated negative sample, the negative F ST_N sample is generated by, Concat Final-Task N*F FT 𝑋𝑁 = 𝑋𝑃 ⊙ 𝑀 + 𝑋 ⊙ (1 − 𝑀 ), (1) where ⊙ indicates the element-wise product. Figure 3: Overview of Deepfake Artifact Discrepancy Detec- Finally, we list our nine categories of negative samples. tor (DADD). Each sub-task is associated with a category of We term Inner- as the synthesis with common mask, that artifact discrepancy. The final task combines all the features leaves the perturbation in the foreground. Similarly, we from sub-tasks and do the final prediction. also term Outer- as the synthesis with reversed mask, that leaves the erturbation in the background. The categories without the previous prefixes only use the common mask, • ISONoise is the inherent noise signal generated when also. Specifically, the categories are Inner-GaussBlur, the sensor captures photos, and we can obtain it by Outer-GaussBlur, Inner-Scaling, Outer-Scaling, Inner- accessing the Albumentations library. ISONoise, Outer-ISONoise, Rotation, SB-Rand, and SB- Sim. Some generated examples are shown in Figure 2. • Rotation’s purpose is to slightly adjust the face to Some samples showed no difference, this is due to the produce artifacts in the form of a boundary. small degree of modification. • SB-Rand and SB-Sim refer to using the frames of somebody else’s as the perturbation. ‘-Rand’ indicates 3.2. Deepfake Artifact Discrepancy the frame is randomly selected, while ‘-Sim’ indicates Detector the frame has a similar face to the pristine frame, i.e., the landmarks of the faces in these two frames are The simplest yet effective measure to detect the artifact close. This operation is to introduce some more diverse discrepancy is to train a model on the artifact-discrepant texture information. data. However, the fused dataset contains too many infor- mation, thus it is hard to force the model learn effective Note that all the operations in this module operate on features. To address this problem, we propose Deepfake the whole input. The generated results are pristine. Thus Artifact Discrepancy Detector (DADD), which adopts the we denote it as 𝑋𝑃 . multi-task learning scheme to learn the characteristics of each category, and then summarize the features to make Mask Generation. This module decides the to-be- the final prediction. The structure is shown in Figure 3. modified region in the frame. First, we locate the face In DADD, we first extract a common feature 𝑍 with a landmarks in the pristine frame by using Dlib to guaran- CNN (e.g., Xception [26] in this work). Then, to suit the tee the mask is associated with the given input. Then, by requirements of each sub-task, we project 𝑍 to private analyzing the usual operations of deepfakes, the modified features of size 𝐹 . The predictions of the sub-tasks are area in the deepfake usually occurs on the face or the based on these private features. Then, we devise a final mouth, we empirically select some key points and preset task based on the concatenation of all the private features. five candidate masks and their reverse. The shape of the To train DADD, we first train the sub-tasks in turn. masks is presented in Figure 1. We randomly select a When training sub-task ST_1, the data associated with mask from the candidates. Because the key point detec- ST_1 would be fed. We train the sub-tasks for 𝑘 iterations tion may be inaccurate, and considering the generaliza- and the train the final-tasks for 𝑡 iterations, iteratively. tion performance, we made a slight random deformation Then the common and private features could both retain of the mask. Since the key is to the discrepancies between the significant features for the prediction. Eventually, in regions but not the perturbed facial region, we use the re- the test process, the prediction is the output of the final versed region to perturb the corresponding background. task. The final mask is denoted as 𝑀 , a matrix with the same shape of the input. We set two version of it. 3.3. Training The basic 𝑀 is a 0-1 matrix, which has solid bound- aries and may be easily recognized by the model. To gen- For all sub-tasks and final task, we adopt the cross entropy erate hard samples, we generate 𝑀 with soft boundary, loss as the learning target. Let ℒ𝑆 be the sub-task loss where the values of 𝑀 are smooth near the boundaries. and ℒ𝐹 be the final task loss, they are defined as, 4.1. Experimental Setting 𝑁 Datasets. To evaluate our approach, we leverage two 1 ∑︁ ℒ𝑆 = ℒ𝐹 = − dataset, FaceForensics++ [6] and Celeb-DF [8]. 𝑦𝑖 log(𝑝𝑖 ) + (1 − 𝑦𝑖 ) log(1 − 𝑝𝑖 ), 𝑁 𝑖=1 FaceForensics++ [6] comprises a set of pristine (2)video (P) and four categories of fake videos, including where 𝑦𝑖 indicates the ground truth, 𝑝𝑖 indicates the out- DeepFakes (DF), Face2Face (FF), FaceSwap (FS) and Neu- put of the model, 𝑁 indicates the number of the samples. ralTextures (NT) . Each category contains 1,000 videos. In addition, the purpose of DADD is to use the most The dataset publisher give an official splitting list, that suitable features. This is a feature selection task. There- 720, 140, and 140 videos of each category are used for fore, we introduce a feature selection regularization ℓ2,1 training, validation and test, respectively. In our experi- norm [27, 28, 29] to perform feature selection. Formally, ments, we extract 20 frames per video. Then we adopt ℓ2,1 regularization is, the training set of pristine videos only to train our model. ⎯ We choose the parameters according to the validation set ∑︁ ⎸ 𝑛 𝑑 ⎸∑︁ and evaluate the model on the test set. ℓ2,1 (𝑊 ) = ‖𝑊 ‖2,1 = ⎷ |𝑊𝑖,𝑗 |2 , (3) Celeb-DF [8] is a challenging data set, which is mostly 𝑖=1 𝑗=1 used for cross-dataset test. There are 38 real videos and 62 fake videos in this test set. We extract all frames from where 𝑊 represents the parameter matrix, 𝑛 represents these videos. We select the model via the validation set the number of columns of the matrix, and 𝑑 represents of Faceforensics++ and evaluate our model Celeb-DF. the number of rows of the matrix. The function of ℓ2,1 Note that the test data never appeared in the training regularization is to sparse our parameter matrix’s rows. datasets, especially the deepfakes. Moreover, Celeb-DF In our task, each row’s parameters represent the weights is an independent and hard dataset. Thus the test results corresponding to the feature vectors extracted by each can demonstrate the effectiveness of our generalizability sub-task. We add ℓ2,1 regularization to the Final-Task across datasets. training process, and the overall loss function is defined as follows: Methods. To make fair comparison, we introduce ℒ = ℒ𝐹 + 𝜆 · ℒ2,1 , (4) two recent self-supervised deepfake detection methods, FWA [10] and Face X-ray [4], which also used real frames where ℒ2,1 indicates the regularization on the concate- to generate training data during training dynamically. nated private features, and 𝜆 is a hyper-parameter. Dur- FWA believes that GaussBlur could construct warped ing training, we perform regular data augmentation on faces, so they used different degrees of GaussBlur to all types of data More detailed training procedure are construct negative samples. Face X-ray dynamically listed in Algorithm 1. generates images with boundary information. In our experiments, we use ‘-FWA’ to denote the data Algorithm 1: Multi-Task learning Framework generated by FWA, ‘-BI’ to denote the data generated by Input: Training images 𝑋; Face X-ray, and ‘-ADDG’ to denote the data generated 1 repeat by our proposed method. We also use ‘Xcep-’ to denote 2 for 𝑖 =0 to k do the Xception model, ‘Xray-’ to denote the X-ray model, 3 for 𝑛 =1 to N do and ‘DADD-’ to denote our method. 4 Generate 𝑋 (𝑛) , 𝑦 (𝑛) ; 5 Minimize ℒ𝑆 (𝑆𝑇𝑛 (𝑋 (𝑛) ), 𝑦 (𝑛) ) 4.2. Performances 6 for 𝑖 =0 to t do Table 1 demonstrates the results on DF and Celeb-DF. The results marked with references indicate that they 7 Generate 𝑋 (1,...,𝑁 ) , 𝑦 (1,...,𝑁 ) ; are from the original. Two-stream [17] was trained on 8 Minimize ℒ(𝐹 𝑇𝑛 (𝑋 (1,...,𝑁 ) ), 𝑦 (1,,,.,𝑁 ) ) the SwapMe dataset [17]. Meso4 [5] was trained on an 9 until convergence; internal DeepFake dataset collected by the authors. Head- Pose [2] was trained on the UADFV dataset [2]. For FWA, the dataset was collected from the Internet. The super- vised methods perform badly when testing cross data. 4. Experiment In contrast, the self-supervised methods in the second part of Table 1 mostly do well. That reveals the signif- In this section, we conduct extensive experiments to icance of the explorations on self-supervised methods. demonstrate the effectiveness of our approach. Table 1 Table 2 Comparison with baselines (AUC (%)). The first part is based Ablation study (AUC (%)). on supervised methods, the second part is based on self- FaceForensics++ supervised methods Method Celeb-DF DF FF FS NT ALL FaceForensics++ Xcep-ADDG 99.99 99.40 98.38 97.47 98.81 77.60 Method Celeb-DF DADD-ADDG 99.94 99.21 98.50 97.50 98.79 81.42 DF FF FS NT ALL Two-stream [17] 70.10 - - - - 55.70 DADD-ADDG (ℓ2,1 ) 99.92 99.21 97.72 97.90 98.69 82.93 Meso4 [5] 84.70 - - - - 53.60 HeadPose [2] 47.30 - - - - 54.80 FWA [10] 79.20 - - - - 53.80 Xcep-BI [4] 98.95 97.86 89.29 97.29 95.85 - Xray-BI [4] 99.17 98.57 98.21 98.13 98.52 74.76 Xcep-FWA 94.09 91.89 62.55 85.78 83.58 53.76 Xcep-BI 99.52 94.76 95.95 90.64 95.22 76.36 DADD-ADDG (ℓ2,1 ) 99.92 99.21 97.72 97.90 98.69 82.93 Since FWA only considers the use of GaussBlur to sim- Figure 4: Visual result of feature selection implemented by ulate the warped face during the deepfake generation ℓ2,1 regularization (𝜆=0.1). The left indicates that no ℓ2,1 , and process, its generalizability is limited. As can be seen the right indicates that ℓ2,1 is used. from Xcep-FWA, only DF and FF perform slightly higher. For Xcep-BI, the results are different because the specific settings of my experiment are different from the original our model does not merely detect specific texture features paper. Xray-BI and our method DADD-ADDG (ℓ2,1 ) per- but captures the difference between internal and external form evenly on DF, FF, FS, and NT. DADD-ADDG (ℓ2,1 ) textures. Since the data set is heavily compressed, a lot have an average improvement of 0.17% on FaceForen- of information is lost. The results on the five test data sics++. But on the more difficult Celeb-DF, our method demonstrate that, different perturbation would benifit improves by 8.17%. This verifies our hypothesis on the different types of deepfakes. artifact discrepancy. Since our task is to improve gener- alization performance, that is, test results on completely unrelated data sets, a slight decrease in performance on 4.4. The Impact of DADD FS and NT is acceptable. Table 2 demonstrates the results of the methods training on the data generated by the ADDG. It is clear that the re- 4.3. The Impact of Perturbations sults on the four categories of FaceForensics++ are close. But the results on Celeb-DF are different. Our proposed We presents the impact of different perturbations in Fig- multi-task learning framework’s performance is 3.82% ure 5. We finetune the Xception on different categories higher than that when using the Xception network only. of perturbed frames, respectively. We also report the test results of each sub-task in the From figure 5, we have the following observations. For Figure 6. Compared with Figure 5, it is obvious that all the DF, all methods have good performance except Outer- sub-tasks achieves a better performance. This means the Scaling. For FF, the Rotation we proposed has reached the multi-task scheme have improved the information in the best response, indicating that FF artifacts show more edge common features. For example, the Rotation perturbation information.For FS, the two methods that the texture of in Figure 5 is about 70% for FS and Celeb-DF, while its other images to perturb the original image have the best corresponding sub-task ST_7 in Figure 6 achieves 99% response, indicating that it is meaningful to introduce var- and 80% respectively. This reveals that DADD introduces ious textures. This also explains that the blending bound- significant improvements. ary constructed by rotating does not perform as well as replacing the image. For NT, Inner-GaussBlur, Inner- Scaling, and Rotation have a high response. Compared 4.5. The Impact of ℓ2,1 Regularization with other types of data sets, it is difficult for Celeb-DF to Table 2 demonstrates the impact of ℓ2,1 regularization. get a good response with a single perturbation method. It improve the final result by 1.51% on cross-data Celeb- The impact of GaussBlur and Scaling are similar. When DF. This indicates that the feature selection benefit the the face’s interior is disturbed, it responds very well to model performances. We also test the feature selection DF, FS, and NT, while it is deficient to FS. However, when hyper-parameter 𝜆, and log the impact of 𝜆 in Figure 7. the modified area is the background, the result is the When 𝜆 = 0.1, the model achieves the best performance. opposite. The effect will be better on FS. It verifies that Lower 𝜆 improve the performance in a small ratio, while Figure 5: The results of Xceptions trained on different perturbations. Figure 6: The predicted results of sub-tasks. higher 𝜆 causes a sharp performance drop. approach composed of the artifact-discrepant data gener- ator and deepfake artifact discrepancy detector, to learn the discrepancy with pristine videos only. We conducted extensive experiments to demonstrate the effectiveness of our proposed approaches. Deepfake detection is a special domain that aims at challenging beyond the common senses. Since the deep- fakes become more realistic, the models have to pay more attention to high-frequency signals and the inher- ent video fingerprints. This work tried to associate the deepfake artifacts with some common noises, as a pow- Figure 7: The average performance of different 𝜆 on Celeb- erful tool to understand the unseen artifacts. In future, DF (AUC%). we plan to leverage this tool to explore the impact of the widely used manipulation methods. Moreover, taking this work as a reference, we are interested in extracting We also visualize the layer with regularization in Fig- the key artifacts from deepfakes directly. ure 4. The features from ST_3, ST_7, and ST_9 contribute most. The corresponding perturbations are Inner-Scaling, Rotation, and Sim-Swap. This means they could be the Acknowledgments delegates in the final prediction. Note that this doesn’t mean only these three sub-tasks are necessary. Their Special thanks are given to the SSDL2021’s organizing performances are based on the shared features, which is committee and I am also very grateful to the reviewers learned from all the sub-tasks. for their valuable comments on this paper. This research was supported by Open Funding Project of the State Key Laboratory of Communication Content Cognition 5. CONCLUSION (No.20K03). The completion of this paper can not be separated from the Intelligent Media Analysis Group In this paper, we made a hypothesis that the discrepant (IMAG) to help. The author would like to also thank artifacts caused by the frame manipulations are the key Cheng Zhuang, Jiangnan Dai and Shaocong Yang in the differences between pristine videos and deepfakes. To IMAG for their valuable discussions. address the discrepancy, we proposed a self-supervised References (2016) 114–129. [14] Q. Ye, H. Zhao, Z. Li, X. Yang, S. Gao, T. Yin, N. Ye, [1] Y. Li, M.-C. Chang, S. Lyu, In ictu oculi: Expos- L1-norm distance minimization-based fast robust ing ai generated fake face videos by detecting eye twin support vector 𝑘-plane clustering, IEEE trans- blinking, arXiv preprint arXiv:1806.02877 (2018). actions on neural networks and learning systems [2] X. Yang, Y. Li, S. Lyu, Exposing deep fakes using 29 (2017) 4494–4503. inconsistent head poses, in: ICASSP, IEEE, 2019, pp. [15] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote 8261–8265. heart rate measurement from highly compressed [3] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng, facial videos: an end-to-end deep learning solution Y. Liu, J. Zhao, Deeprhythm: Exposing deepfakes with video enhancement, in: Proceedings of the with attentional visual heartbeat rhythms, in: Pro- IEEE International Conference on Computer Vision, ceedings of the 28th ACM International Conference 2019, pp. 151–160. on Multimedia, 2020, pp. 4318–4327. [16] U. A. Ciftci, I. Demir, L. Yin, Fakecatcher: Detection [4] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, of synthetic portrait videos using biological signals, B. Guo, Face x-ray for more general face forgery IEEE Transactions on Pattern Analysis and Machine detection, in: Proceedings of the IEEE/CVF Confer- Intelligence (2020). ence on Computer Vision and Pattern Recognition, [17] P. Zhou, X. Han, V. I. Morariu, L. S. Davis, Two- 2020, pp. 5001–5010. stream neural networks for tampered face detection, [5] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, in: 2017 IEEE Conference on Computer Vision and Mesonet: a compact facial video forgery detection Pattern Recognition Workshops (CVPRW), IEEE, network, in: IEEE International Workshop on Infor- 2017, pp. 1831–1839. mation Forensics and Security (WIFS), IEEE, 2018, [18] H. H. Nguyen, J. Yamagishi, I. Echizen, Capsule- pp. 1–7. forensics: Using capsule networks to detect forged [6] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, images and videos, in: IEEE International Confer- J. Thies, M. Nießner, Faceforensics++: Learning to ence on Acoustics, Speech and Signal Processing detect manipulated facial images, in: Proceedings (ICASSP), IEEE, 2019, pp. 2307–2311. of the IEEE International Conference on Computer [19] D. Güera, E. J. Delp, Deepfake video detection using Vision, 2019, pp. 1–11. recurrent neural networks, in: IEEE International [7] M. Du, S. Pentyala, Y. Li, X. Hu, Towards generaliz- Conference on Advanced Video and Signal Based able forgery detection with locality-aware autoen- Surveillance (AVSS), IEEE, 2018, pp. 1–6. coder, arXiv preprint arXiv:1909.05999 (2019). [20] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, [8] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-df: A I. Masi, P. Natarajan, Recurrent convolutional new dataset for deepfake forensics, arXiv preprint strategies for face manipulation detection in videos, arXiv:1909.12962 (2019). Interfaces (GUI) 3 (2019). [9] P. Chen, J. Liu, T. Liang, G. Zhou, H. Gao, J. Dai, [21] X. Xuan, B. Peng, W. Wang, J. Dong, On the gener- J. Han, Fsspotter: Spotting face-swapped video by alization of gan image forensics, in: Chinese Con- spatial and temporal clues, in: 2020 IEEE Interna- ference on Biometric Recognition, Springer, 2019, tional Conference on Multimedia and Expo (ICME), pp. 134–141. IEEE, 2020, pp. 1–6. [22] D. Cozzolino, J. Thies, A. Rössler, C. Riess, [10] Y. Li, S. Lyu, Exposing deepfake videos by detect- M. Nießner, L. Verdoliva, Forensictransfer: Weakly- ing face warping artifacts, in: Proceedings of the supervised domain adaptation for forgery detection, IEEE Conference on Computer Vision and Pattern arXiv preprint arXiv:1812.02510 (2018). Recognition Workshops, 2019, pp. 46–52. [23] Y. Nirkin, L. Wolf, Y. Keller, T. Hassner, Deepfake de- [11] A. Swaminathan, M. Wu, K. R. Liu, Digital image tection based on the discrepancy between the face forensics via intrinsic fingerprints, IEEE transac- and its context, arXiv preprint arXiv:2008.12262 tions on information forensics and security 3 (2008) (2020). 101–117. [24] H. Tang, Z. Li, Z. Peng, J. Tang, Blockmix: meta reg- [12] N. Yu, L. S. Davis, M. Fritz, Attributing fake images ularization and self-calibrated inference for metric- to gans: Learning and analyzing gan fingerprints, based meta-learning, in: Proceedings of the in: Proceedings of the IEEE International Confer- 28th ACM International Conference on Multimedia, ence on Computer Vision, 2019, pp. 7556–7566. 2020, pp. 610–618. [13] Q. Ye, J. Yang, F. Liu, C. Zhao, N. Ye, T. Yin, L1-norm [25] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, distance linear discriminant analysis based on an W. Xia, Learning to recognize patch-wise con- effective iterative algorithm, IEEE Transactions sistency for deepfake detection, arXiv preprint on Circuits and Systems for Video Technology 28 arXiv:2012.09311 (2020). [26] F. Chollet, Xception: Deep learning with depth- wise separable convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258. [27] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, X. Zhou, l2, 1-norm regularized discriminative feature selection for unsupervised learning, in: IJCAI International Joint Conference on Artificial Intelligence, AAAI Press/International Joint Conferences on Artificial Intelligence, 2011, pp. 1589–1594. [28] L. Fu, Z. Li, Q. Ye, H. Yin, Q. Liu, X. Chen, X. Fan, W. Yang, G. Yang, Learning robust discriminant sub- space based on joint l2, p-and l2, s-norm distance metrics, IEEE Transactions on Neural Networks and Learning Systems (2020). [29] Q. Ye, Z. Li, L. Fu, Z. Zhang, W. Yang, G. Yang, Nonpeaked discriminant analysis for data represen- tation, IEEE transactions on neural networks and learning systems 30 (2019) 3818–3832.