=Paper=
{{Paper
|id=Vol-3084/paper2
|storemode=property
|title=Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies
|pdfUrl=https://ceur-ws.org/Vol-3084/paper2.pdf
|volume=Vol-3084
|authors=Kai Hong,Xiaoyu Du
}}
==Self-Supervised Deepfake Detection by Discovering Artifact Discrepancies==
<pdf width="1500px">https://ceur-ws.org/Vol-3084/paper2.pdf</pdf>
<pre>
Self-Supervised Deepfake Detection by Discovering
Artifact Discrepancies
Kai Hong1,2 , Xiaoyu Du1,2
1
    Nanjing University Of Science And Technology, Nanjing, 210014, China
2
    State Key Laboratory of Communication Content Cognition, Beijing, 100733, China


                                             Abstract
                                             Recent works demonstrate the significance of textures for the neural deepfake detection methods, yet the reason is still
                                             in explorations. In this paper, we claim that the artifact discrepancies caused by the face manipulation operations are the
                                             key difference between pristine videos and deepfakes. To imitate the discrepant situation from pristine videos, we propose
                                             an artifact-discrepant data generator to generate the negative samples by adjusting the artifacts in the facial regions with
                                             conventional processing tools. We then propose Deepfake Artifact Discrepancy Detector (DADD) method to discover the
                                             discrepancies. DADD adopts the multi-task architecture, associates each sub-task with a specific artifact set, and assembles all
                                             the sub-tasks for the final prediction. We term DADD as a self-supervised method since it never meets any deepfakes during
                                             the training process. The experimental results on the FaceForensics++ and Celeb-DF datasets demonstrate the effectiveness
                                             and generalizability of DADD.

                                             Keywords
                                             deepfake, self-supervised, artifact discrepancies


1. Introduction                                                                                                       detection; the second perspective is to capture the forged
                                                                                                                      features via the neural networks, including customized
Videos were a natural and convincing medium to                                                                        deep networks [5], classic neural networks [6], et al. .
spread information due to their abundant and strongly                                                                 The neural methods achieve an extremely high perfor-
co-associated details, including appearances, actions,                                                                mance [6]. But the dependence on the training datasets
sounds, etc. This situation has changed due to the emer-                                                              severely limits the model generalizability, which is very
gence of Deepfakes, the model-synthetic media in which                                                                important in practical applications. For instance, the
the face or voice may be replaced with someone else’s.                                                                well-trained models may not work across datasets [7, 8],
The synthetic videos are resulting in negative impacts on                                                             since the deepfakes are made by a variant of methods.
individuals and society. Moreover, with the rapid devel-                                                                 To retain the model effectiveness across the datasets,
opment of generative techniques, the procedures making                                                                the traditional measures including data augmentation
deepfakes have become substantially simple, while the                                                                 [9] and transfer learning [7] are introduced. However,
products seem more realistic. This situation facilitates                                                              these methods hardly reveal the inherent difference be-
many domains, i.e., the film industry, but potentially in-                                                            tween pristine videos and deepfakes. To address this
creases the probability of social issues. Therefore, the                                                              issue, self-supervised learning scheme is introduced to
deepfake detection methods have garnered widespread                                                                   produce negative samples as the substitutes of true deep-
attention.                                                                                                            fakes to make the model learn specific features [10, 4].
   Recent deepfake detection methods are mainly devised                                                               The negative samples rely on the manual hypothesis of
from two perspectives. The first one is used by the bio-                                                              the differences between pristine videos and deepfakes,
inspired methods based on the observations and intuitive                                                              facilitating the construction of interpretable detection
hypotheses over the datasets. Li et al. [1] focused on the                                                            methods. Two typical works are FWA [10] and Face X-
abnormal eye blinking. Yang et al. [2] noted the incon-                                                               ray [4], where the former assumes that the artifacts are
sistency between the facial expressions and the corre-                                                                caused by the resizing and blurring operations on the
sponding head postures. Qi et al. [3] magnified the heart                                                             facial regions, and the latter believes that deepfakes al-
rhythm signal in videos and detected the disrupted heart                                                              ways have unseen blending boundaries. Their results
rhythm signal. Li et al. [4] located the blending bound-                                                              demonstrate that recent neural networks mostly focus
aries made by facial replacement methods to make the                                                                  on generic visual artifacts rather than the videos them-
 * Xiaoyu Du is the corresponding author.                                                                             selves. Therefore, the negative samples generated with
2021 International Workshop on Safety & Security of Deep Learning,                                                    intuitive and empirical operations can facilitate the de-
August 21th, 2021, Virtual                                                                                            tection model and enhance the generalizability further.
" kai hong@njust.edu.cn (K. Hong); duxy@njust.edu.cn (X. Du)                                                             In addition, many works point out that the videos and
 0000-0003-3567-6396 (K. Hong); 0000-0002-4641-1994 (X. Du)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative   images have inherent signals like fingerprints, which
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       Commons License Attribution 4.0 International (CC BY 4.0).
                                       CEUR Workshop Proceedings (CEUR-WS.org)                                        are produced by the devices, the post-processing or the
generative models [11, 12]. Inspired by these works, we          2. Related Work
make a bold hypothesis that the artifact discrepancies
caused by the face manipulation operations are the key to        Bio-inspired methods. Some works have found that
detect deepfakes. Intuitively, all the frames in a pristine      the actors’ physiological characteristics in deepfakes are
video have the same operation flow, thus they should             different from the real world. Li et al. [1] found that the
have consistent fingerprints (i.e., artifacts). In contrast,     actors in deepfakes have an abnormal blinking frequency
the replaced facial regions in deepfakes inevitably in-          and some even don’t blink. Yang et al. [2] found that face
troduce discrepant artifacts. Focusing on the discrepant         orientation and head poses are related, but the correla-
artifacts, we propose a self-supervised deepfake detection       tion is destroyed in deepfakes. Due to the development of
approach which comprises an Artifact-Discrepant Data             remote visual photoplethysmography (rppg) technology,
Generator (ADDG) and a Deepfake Artifact Discrepancy             the heart rate of actors in videos can be detected [15].
Detector (DADD) to discover the discrepancy from the             Based on this technology, Qi et al. [3] found the irregular
generated data. ADDG just uses the pristine video frames         heart rhythm of actors in deepfake. Similarly, Ciftci et al.
and perturbs the facial regions with the conventional pro-       [16] explored the biological signal difference between
cessing tools, e.g., blurring, scaling, rotation, replacement,   fake videos and real videos. However, the physiolog-
etc. Although the perturbations do not change the frames         ical signal artifacts reflected by different data sets are
in human sense, we believe that they have introduced             different, so specific data needs specific analysis.
the discrepancy in the artifact level. Thus the perturbed
frames are taken as the negative samples (i.e., substitutesNeural methods. Since deep neural networks can au-
of deepfakes) in our approach. DADD adopts the multi-      tomatically extract images’ deep features, many DNN-
task learning scheme, associates each sub-task with a      based detection methods have achieved satisfactory re-
type of generated data, and assembles all the sub-tasks    sults. Zhou et al. [17] divided the image into different
for the final prediction. The prediction is constrained    patches, and proposed a Two-stream network to detect
by ℓ2,1 norm[13, 14] which is a classic regularization for the difference between patches. Afchar et al. [5] pro-
feature selection. The experimental results on the pub-    posed a compact network structure MesoNet to detect
lic datasets demonstrate that the model trained on the     fake videos. Nguyen et al. [18] proposed the use of cap-
generated data can achieve a competitive performance,      sule networks for deepfake detection. These methods in-
even it never sees the real deepfakes. This verifies the   dicate that a simple CNN network can indeed capture the
effectiveness and generalizability of our approach, and    relevant features of fake videos. In addition to these detec-
reveals that our hypothesis is a feasible perspective to   tion methods based on single-frame images, there are also
detect the deepfakes.                                      methods based on multi-frame sequences. Guera et al.
   The main contributions of our work are as follows:      [19] extracted features from each frame by using CNN,
                                                           then made decisions based on the feature sequence by
• We hypothesize that the artifact discrepancies caused using RNN. To capture the correlation of different frame
  by the face manipulations are the key to detect deep- features better, Sabir et al. [20] used a Bi-directional RNN.
  fakes, thus propose a self-supervised deepfake detec- These neural methods can detect specific Deepfakes per-
  tion approach to discover the discrepancy. The core fectly [6], but for unseen data, the detection performance
  is the Artifact-Discrepant Data Generator, which uses will be greatly reduced [8].
  the pristine video frames only and perturbs the facial
  region with the conventional processing tools to gen-
                                                           Cross-data methods. Recently, the generalizability
  erate the negative samples.
                                                           of detection methods has been emphasized. Xuan et al.
• To better address the artifact discrepancies, we propose [21] preprocessed training images to reduce obvious ar-
  Deepfake Artifact Discrepancy Detector, which adopts tifacts, forcing models to learn more intrinsic features.
  the multi-task learning scheme, associates each sub- Cozzolino et al. [22] introduced an auto-encoder method
  task with a type of generated data, and makes the final that enabled real and fake images to be decoupled in la-
  prediction by integrating the sub-tasks. To guide the tent space. Du et al. [7] believed that the detection model
  task feature selection, we adopt ℓ2,1 norm to constraint needs to focus on the forgery area, not the irrelevant ones,
  the learning process.                                    so they located the modified region and proposed an ac-
                                                           tive learning method. Nirkin et al. [23] believed that the
• Extensive experiments are conducted to demonstrate face and content of the fake image have inconsistent iden-
  the effectiveness and generalizability of our proposed tity information, so they used face recognition method
  self-supervised approach, though it has never seen any to detect deepfakes. However, these methods still require
  real deepfakes through the training process.             corresponding fake videos to complete the training, re-
                                                           sulting in limited generalizability. Different amounts of
                                                           data are bound to produce different results [24]. Another
                                                                                          Negative Sample Synthesis
                              Perturbation Strategy
                            1). GaussBlur   2). Scaling
                            3). ISONoise    4). Rotation
                            5). SB-Rand     6). SB-Sim
                                            Frame Perturbation      XP: Positive Sample


                                                                                                                      XN
      X: Pristine Frame                       Random         Deformation
                                               Select
                                                                           M:Mask          Reversed                   Negative Sample
                          Landmarks
                                  Candidate Masks
                                             Mask Generation
                                                                                                                      Positive Sample


Figure 1: Overview of ADDG. Through the three modules, Frame Perturbation, Mask Generation, and Negative Sample
Synthesis, the pristine frame 𝑋 is converted to a negative sample 𝑋𝑛 . The green boundary indicates that the frame should
be treated as positive sample, while the red one indicates that the frame is negative, i.e., it has discrepant artifacts.


                                                    Figure 2: The perturbed examples of ADDG.

novel idea is not to use any fake image during training.                       bations proposed have differing impacts, we introduce
FWA[10] expected to simulate face warping artifacts by                         the ℓ2,1 regularization for feature selection.
adjusting the face area to different sizes and blurring it to
produce similar texture artifacts. Face X-ray[4] generated                     3.1. Artifact-Discrepant Data Generator
images with boundary information during training dy-
namically. Zhao et al. [25] also used Face X-ray’s method  As shown in Figure 1, ADDG takes in the pristine im-
of generating training data and proposed a model for       age 𝑋, and generate the negative sample (i.e., artifact-
learning different patches’ consistency. Therefore, dis-   discrepant sample) with three modules, frame perturba-
covering generation methods’ common steps can facili-      tion, mask generation, and negative sample synthesis.
tate the generalizability of the model.                    Frame perturbation uses common image processing tools
                                                           to change the fingerprint of the pristine frame like data
                                                           augmentation, mask generation selects the perturbation
3. Method                                                  area, and negative sample synthesis blends the pristine
                                                           and perturbed frames to produce the discrepant artifacts.
The images from different sources have different finger- We will introduce the modules, respectively.
prints which is caused by the devices, the post-processing
operations and the generative models. The fusion of
                                                           Frame Perturbation. We utilize one of the conven-
two different images leads to artifact discrepancies. This
                                                           tional image processing methods to change the finger-
would be the key features of deepfakes, since deepfakes
                                                           print of the pristine frames, but ensure all the frames are
always have manipulated facial regions.Therefore, we
                                                           still pristine. In this work, we use GaussBlur, Scaling,
propose Artifact-Discrepant Data Generator (ADDG). In
                                                           ISONoise, Rotation, SB-Rand, SB-Sim, as the
order to better address the artifact discrepancy, we pro-
pose Deepfake Artifact Discrepancy Detector (DADD), • GaussBlur is a commonly used data augmentation
which adopts the multi-task learning scheme to learn          method in deepfake detection [21].
the features from each type of discrepancy data, respec-
tively, and make the final prediction by incorporating the • Scaling refers to zooming out and then zooming in
sub-tasks. Finally, considering that the different pertur-    the image, which will change the texture.
                                               Sub-Tasks
                                                              Negative Sample Synthesis. This module produces
                                  F                    ST_1   the negative samples by synthesizing the pristine and
                                  F                    ST_2   perturbed frames according the mask. Let 𝑋 be the input
                 CNN        Z     F                    ST_3   pristine frame, 𝑋𝑃 be the perturbed yet pristine frame,
                                 ...             ...          and 𝑋𝑁 be the generated negative sample, the negative
                                  F                    ST_N   sample is generated by,
                                      Concat   Final-Task

                                 N*F                    FT
                                                                         𝑋𝑁 = 𝑋𝑃 ⊙ 𝑀 + 𝑋 ⊙ (1 − 𝑀 ),                   (1)

                                                              where ⊙ indicates the element-wise product.
Figure 3: Overview of Deepfake Artifact Discrepancy Detec-        Finally, we list our nine categories of negative samples.
tor (DADD). Each sub-task is associated with a category of We term Inner- as the synthesis with common mask, that
artifact discrepancy. The final task combines all the features leaves the perturbation in the foreground. Similarly, we
from sub-tasks and do the final prediction.                    also term Outer- as the synthesis with reversed mask, that
                                                              leaves the erturbation in the background. The categories
                                                              without the previous prefixes only use the common mask,
• ISONoise is the inherent noise signal generated when        also. Specifically, the categories are Inner-GaussBlur,
  the sensor captures photos, and we can obtain it by         Outer-GaussBlur, Inner-Scaling, Outer-Scaling, Inner-
  accessing the Albumentations library.                       ISONoise, Outer-ISONoise, Rotation, SB-Rand, and SB-
                                                              Sim. Some generated examples are shown in Figure 2.
• Rotation’s purpose is to slightly adjust the face to        Some samples showed no difference, this is due to the
  produce artifacts in the form of a boundary.                small degree of modification.
• SB-Rand and SB-Sim refer to using the frames of
   somebody else’s as the perturbation. ‘-Rand’ indicates 3.2. Deepfake Artifact Discrepancy
   the frame is randomly selected, while ‘-Sim’ indicates          Detector
   the frame has a similar face to the pristine frame, i.e.,
   the landmarks of the faces in these two frames are The simplest yet effective measure to detect the artifact
   close. This operation is to introduce some more diverse discrepancy is to train a model on the artifact-discrepant
   texture information.                                      data. However, the fused dataset contains too many infor-
                                                             mation, thus it is hard to force the model learn effective
Note that all the operations in this module operate on features. To address this problem, we propose Deepfake
the whole input. The generated results are pristine. Thus Artifact Discrepancy Detector (DADD), which adopts the
we denote it as 𝑋𝑃 .                                         multi-task learning scheme to learn the characteristics of
                                                             each category, and then summarize the features to make
Mask Generation. This module decides the to-be- the final prediction. The structure is shown in Figure 3.
modified region in the frame. First, we locate the face        In DADD, we first extract a common feature 𝑍 with a
landmarks in the pristine frame by using Dlib to guaran- CNN (e.g., Xception [26] in this work). Then, to suit the
tee the mask is associated with the given input. Then, by requirements of each sub-task, we project 𝑍 to private
analyzing the usual operations of deepfakes, the modified features of size 𝐹 . The predictions of the sub-tasks are
area in the deepfake usually occurs on the face or the based on these private features. Then, we devise a final
mouth, we empirically select some key points and preset task based on the concatenation of all the private features.
five candidate masks and their reverse. The shape of the       To train DADD, we first train the sub-tasks in turn.
masks is presented in Figure 1. We randomly select a When training sub-task ST_1, the data associated with
mask from the candidates. Because the key point detec- ST_1 would be fed. We train the sub-tasks for 𝑘 iterations
tion may be inaccurate, and considering the generaliza- and the train the final-tasks for 𝑡 iterations, iteratively.
tion performance, we made a slight random deformation Then the common and private features could both retain
of the mask. Since the key is to the discrepancies between the significant features for the prediction. Eventually, in
regions but not the perturbed facial region, we use the re- the test process, the prediction is the output of the final
versed region to perturb the corresponding background. task.
The final mask is denoted as 𝑀 , a matrix with the same
shape of the input. We set two version of it.                3.3. Training
   The basic 𝑀 is a 0-1 matrix, which has solid bound-
aries and may be easily recognized by the model. To gen- For all sub-tasks and final task, we adopt the cross entropy
erate hard samples, we generate 𝑀 with soft boundary, loss as the learning target. Let ℒ𝑆 be the sub-task loss
where the values of 𝑀 are smooth near the boundaries.
and ℒ𝐹 be the final task loss, they are defined as,             4.1. Experimental Setting
                      𝑁                                     Datasets. To evaluate our approach, we leverage two
                  1 ∑︁
ℒ𝑆 = ℒ𝐹 = −                                                 dataset, FaceForensics++ [6] and Celeb-DF [8].
                        𝑦𝑖 log(𝑝𝑖 ) + (1 − 𝑦𝑖 ) log(1 − 𝑝𝑖 ),
                 𝑁 𝑖=1                                         FaceForensics++ [6] comprises a set of pristine
                                                         (2)video (P) and four categories of fake videos, including
where 𝑦𝑖 indicates the ground truth, 𝑝𝑖 indicates the out-  DeepFakes (DF), Face2Face (FF), FaceSwap (FS) and Neu-
put of the model, 𝑁 indicates the number of the samples.    ralTextures (NT) . Each category contains 1,000 videos.
   In addition, the purpose of DADD is to use the most      The dataset publisher give an official splitting list, that
suitable features. This is a feature selection task. There- 720, 140, and 140 videos of each category are used for
fore, we introduce a feature selection regularization ℓ2,1  training, validation and test, respectively. In our experi-
norm [27, 28, 29] to perform feature selection. Formally,   ments, we extract 20 frames per video. Then we adopt
ℓ2,1 regularization is,                                     the training set of pristine videos only to train our model.
                                     ⎯                      We choose the parameters according to the validation set
                                ∑︁   ⎸ 𝑛
                                  𝑑 ⎸∑︁
                                                            and evaluate the model on the test set.
        ℓ2,1 (𝑊 ) = ‖𝑊 ‖2,1 =        ⎷      |𝑊𝑖,𝑗 |2 ,   (3)   Celeb-DF [8] is a challenging data set, which is mostly
                               𝑖=1     𝑗=1
                                                            used for cross-dataset test. There are 38 real videos and
                                                            62 fake videos in this test set. We extract all frames from
where 𝑊 represents the parameter matrix, 𝑛 represents
                                                            these videos. We select the model via the validation set
the number of columns of the matrix, and 𝑑 represents
                                                            of Faceforensics++ and evaluate our model Celeb-DF.
the number of rows of the matrix. The function of ℓ2,1
                                                               Note that the test data never appeared in the training
regularization is to sparse our parameter matrix’s rows.
                                                            datasets, especially the deepfakes. Moreover, Celeb-DF
In our task, each row’s parameters represent the weights
                                                            is an independent and hard dataset. Thus the test results
corresponding to the feature vectors extracted by each
                                                            can demonstrate the effectiveness of our generalizability
sub-task. We add ℓ2,1 regularization to the Final-Task
                                                            across datasets.
training process, and the overall loss function is defined
as follows:
                                                            Methods. To make fair comparison, we introduce
                    ℒ = ℒ𝐹 + 𝜆 · ℒ2,1 ,                 (4) two recent self-supervised deepfake detection methods,
                                                            FWA [10] and Face X-ray [4], which also used real frames
where ℒ2,1 indicates the regularization on the concate- to generate training data during training dynamically.
nated private features, and 𝜆 is a hyper-parameter. Dur- FWA believes that GaussBlur could construct warped
ing training, we perform regular data augmentation on faces, so they used different degrees of GaussBlur to
all types of data More detailed training procedure are construct negative samples. Face X-ray dynamically
listed in Algorithm 1.                                      generates images with boundary information.
                                                               In our experiments, we use ‘-FWA’ to denote the data
  Algorithm 1: Multi-Task learning Framework                generated by FWA, ‘-BI’ to denote the data generated by
   Input: Training images 𝑋;                                Face X-ray, and ‘-ADDG’ to denote the data generated
 1 repeat
                                                            by our proposed method. We also use ‘Xcep-’ to denote
 2      for 𝑖 =0 to k do                                    the Xception model, ‘Xray-’ to denote the X-ray model,
 3          for 𝑛 =1 to N do                                and ‘DADD-’ to denote our method.
4              Generate 𝑋 (𝑛) , 𝑦 (𝑛) ;
5              Minimize ℒ𝑆 (𝑆𝑇𝑛 (𝑋 (𝑛) ), 𝑦 (𝑛) )               4.2. Performances
6      for 𝑖 =0 to t do                                         Table 1 demonstrates the results on DF and Celeb-DF.
                                                                The results marked with references indicate that they
7          Generate 𝑋 (1,...,𝑁 ) , 𝑦 (1,...,𝑁 ) ;
                                                                are from the original. Two-stream [17] was trained on
8          Minimize ℒ(𝐹 𝑇𝑛 (𝑋 (1,...,𝑁 ) ), 𝑦 (1,,,.,𝑁 ) )      the SwapMe dataset [17]. Meso4 [5] was trained on an
9   until convergence;                                          internal DeepFake dataset collected by the authors. Head-
                                                                Pose [2] was trained on the UADFV dataset [2]. For FWA,
                                                                the dataset was collected from the Internet. The super-
                                                                vised methods perform badly when testing cross data.
4. Experiment                                                   In contrast, the self-supervised methods in the second
                                                                part of Table 1 mostly do well. That reveals the signif-
In this section, we conduct extensive experiments to            icance of the explorations on self-supervised methods.
demonstrate the effectiveness of our approach.
Table 1                                                          Table 2
Comparison with baselines (AUC (%)). The first part is based     Ablation study (AUC (%)).
on supervised methods, the second part is based on self-                                   FaceForensics++
supervised methods                                                Method                                          Celeb-DF
                                                                                     DF    FF    FS    NT ALL
                             FaceForensics++                      Xcep-ADDG         99.99 99.40 98.38 97.47 98.81 77.60
 Method                                              Celeb-DF     DADD-ADDG         99.94 99.21 98.50 97.50 98.79  81.42
                      DF     FF    FS     NT   ALL
 Two-stream [17]     70.10    -     -      -    -     55.70       DADD-ADDG (ℓ2,1 ) 99.92 99.21 97.72 97.90 98.69 82.93
 Meso4 [5]           84.70    -     -      -    -     53.60
 HeadPose [2]        47.30    -     -      -    -     54.80
 FWA [10]          79.20   -     -     -     -        53.80
 Xcep-BI [4]       98.95 97.86 89.29 97.29 95.85        -
 Xray-BI [4]       99.17 98.57 98.21 98.13 98.52      74.76
 Xcep-FWA          94.09 91.89 62.55 85.78 83.58      53.76
 Xcep-BI           99.52 94.76 95.95 90.64 95.22      76.36
 DADD-ADDG (ℓ2,1 ) 99.92 99.21 97.72 97.90 98.69      82.93


Since FWA only considers the use of GaussBlur to sim-            Figure 4: Visual result of feature selection implemented by
ulate the warped face during the deepfake generation             ℓ2,1 regularization (𝜆=0.1). The left indicates that no ℓ2,1 , and
process, its generalizability is limited. As can be seen         the right indicates that ℓ2,1 is used.
from Xcep-FWA, only DF and FF perform slightly higher.
For Xcep-BI, the results are different because the specific
settings of my experiment are different from the original        our model does not merely detect specific texture features
paper. Xray-BI and our method DADD-ADDG (ℓ2,1 ) per-             but captures the difference between internal and external
form evenly on DF, FF, FS, and NT. DADD-ADDG (ℓ2,1 )             textures. Since the data set is heavily compressed, a lot
have an average improvement of 0.17% on FaceForen-               of information is lost. The results on the five test data
sics++. But on the more difficult Celeb-DF, our method           demonstrate that, different perturbation would benifit
improves by 8.17%. This verifies our hypothesis on the           different types of deepfakes.
artifact discrepancy. Since our task is to improve gener-
alization performance, that is, test results on completely
unrelated data sets, a slight decrease in performance on         4.4. The Impact of DADD
FS and NT is acceptable.                                         Table 2 demonstrates the results of the methods training
                                                                 on the data generated by the ADDG. It is clear that the re-
4.3. The Impact of Perturbations                                 sults on the four categories of FaceForensics++ are close.
                                                                 But the results on Celeb-DF are different. Our proposed
We presents the impact of different perturbations in Fig-        multi-task learning framework’s performance is 3.82%
ure 5. We finetune the Xception on different categories          higher than that when using the Xception network only.
of perturbed frames, respectively.                                  We also report the test results of each sub-task in the
   From figure 5, we have the following observations. For        Figure 6. Compared with Figure 5, it is obvious that all the
DF, all methods have good performance except Outer-              sub-tasks achieves a better performance. This means the
Scaling. For FF, the Rotation we proposed has reached the        multi-task scheme have improved the information in the
best response, indicating that FF artifacts show more edge       common features. For example, the Rotation perturbation
information.For FS, the two methods that the texture of          in Figure 5 is about 70% for FS and Celeb-DF, while its
other images to perturb the original image have the best         corresponding sub-task ST_7 in Figure 6 achieves 99%
response, indicating that it is meaningful to introduce var-     and 80% respectively. This reveals that DADD introduces
ious textures. This also explains that the blending bound-       significant improvements.
ary constructed by rotating does not perform as well
as replacing the image. For NT, Inner-GaussBlur, Inner-
Scaling, and Rotation have a high response. Compared             4.5. The Impact of ℓ2,1 Regularization
with other types of data sets, it is difficult for Celeb-DF to   Table 2 demonstrates the impact of ℓ2,1 regularization.
get a good response with a single perturbation method.           It improve the final result by 1.51% on cross-data Celeb-
   The impact of GaussBlur and Scaling are similar. When         DF. This indicates that the feature selection benefit the
the face’s interior is disturbed, it responds very well to       model performances. We also test the feature selection
DF, FS, and NT, while it is deficient to FS. However, when       hyper-parameter 𝜆, and log the impact of 𝜆 in Figure 7.
the modified area is the background, the result is the           When 𝜆 = 0.1, the model achieves the best performance.
opposite. The effect will be better on FS. It verifies that      Lower 𝜆 improve the performance in a small ratio, while
                         Figure 5: The results of Xceptions trained on different perturbations.


                                     Figure 6: The predicted results of sub-tasks.


higher 𝜆 causes a sharp performance drop.                  approach composed of the artifact-discrepant data gener-
                                                           ator and deepfake artifact discrepancy detector, to learn
                                                           the discrepancy with pristine videos only. We conducted
                                                           extensive experiments to demonstrate the effectiveness
                                                           of our proposed approaches.
                                                              Deepfake detection is a special domain that aims at
                                                           challenging beyond the common senses. Since the deep-
                                                           fakes become more realistic, the models have to pay
                                                           more attention to high-frequency signals and the inher-
                                                           ent video fingerprints. This work tried to associate the
                                                           deepfake artifacts with some common noises, as a pow-
Figure 7: The average performance of different 𝜆 on Celeb- erful tool to understand the unseen artifacts. In future,
DF (AUC%).                                                 we plan to leverage this tool to explore the impact of the
                                                           widely used manipulation methods. Moreover, taking
                                                           this work as a reference, we are interested in extracting
   We also visualize the layer with regularization in Fig-
                                                           the key artifacts from deepfakes directly.
ure 4. The features from ST_3, ST_7, and ST_9 contribute
most. The corresponding perturbations are Inner-Scaling,
Rotation, and Sim-Swap. This means they could be the Acknowledgments
delegates in the final prediction. Note that this doesn’t
mean only these three sub-tasks are necessary. Their Special thanks are given to the SSDL2021’s organizing
performances are based on the shared features, which is committee and I am also very grateful to the reviewers
learned from all the sub-tasks.                            for their valuable comments on this paper. This research
                                                           was supported by Open Funding Project of the State
                                                           Key Laboratory of Communication Content Cognition
5. CONCLUSION                                              (No.20K03). The completion of this paper can not be
                                                           separated from the Intelligent Media Analysis Group
In this paper, we made a hypothesis that the discrepant
                                                           (IMAG) to help. The author would like to also thank
artifacts caused by the frame manipulations are the key
                                                           Cheng Zhuang, Jiangnan Dai and Shaocong Yang in the
differences between pristine videos and deepfakes. To
                                                           IMAG for their valuable discussions.
address the discrepancy, we proposed a self-supervised
References                                                          (2016) 114–129.
                                                               [14] Q. Ye, H. Zhao, Z. Li, X. Yang, S. Gao, T. Yin, N. Ye,
 [1] Y. Li, M.-C. Chang, S. Lyu, In ictu oculi: Expos-              L1-norm distance minimization-based fast robust
     ing ai generated fake face videos by detecting eye             twin support vector 𝑘-plane clustering, IEEE trans-
     blinking, arXiv preprint arXiv:1806.02877 (2018).              actions on neural networks and learning systems
 [2] X. Yang, Y. Li, S. Lyu, Exposing deep fakes using              29 (2017) 4494–4503.
     inconsistent head poses, in: ICASSP, IEEE, 2019, pp.      [15] Z. Yu, W. Peng, X. Li, X. Hong, G. Zhao, Remote
     8261–8265.                                                     heart rate measurement from highly compressed
 [3] H. Qi, Q. Guo, F. Juefei-Xu, X. Xie, L. Ma, W. Feng,           facial videos: an end-to-end deep learning solution
     Y. Liu, J. Zhao, Deeprhythm: Exposing deepfakes                with video enhancement, in: Proceedings of the
     with attentional visual heartbeat rhythms, in: Pro-            IEEE International Conference on Computer Vision,
     ceedings of the 28th ACM International Conference              2019, pp. 151–160.
     on Multimedia, 2020, pp. 4318–4327.                       [16] U. A. Ciftci, I. Demir, L. Yin, Fakecatcher: Detection
 [4] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen,             of synthetic portrait videos using biological signals,
     B. Guo, Face x-ray for more general face forgery               IEEE Transactions on Pattern Analysis and Machine
     detection, in: Proceedings of the IEEE/CVF Confer-             Intelligence (2020).
     ence on Computer Vision and Pattern Recognition,          [17] P. Zhou, X. Han, V. I. Morariu, L. S. Davis, Two-
     2020, pp. 5001–5010.                                           stream neural networks for tampered face detection,
 [5] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen,                in: 2017 IEEE Conference on Computer Vision and
     Mesonet: a compact facial video forgery detection              Pattern Recognition Workshops (CVPRW), IEEE,
     network, in: IEEE International Workshop on Infor-             2017, pp. 1831–1839.
     mation Forensics and Security (WIFS), IEEE, 2018,         [18] H. H. Nguyen, J. Yamagishi, I. Echizen, Capsule-
     pp. 1–7.                                                       forensics: Using capsule networks to detect forged
 [6] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess,              images and videos, in: IEEE International Confer-
     J. Thies, M. Nießner, Faceforensics++: Learning to             ence on Acoustics, Speech and Signal Processing
     detect manipulated facial images, in: Proceedings              (ICASSP), IEEE, 2019, pp. 2307–2311.
     of the IEEE International Conference on Computer          [19] D. Güera, E. J. Delp, Deepfake video detection using
     Vision, 2019, pp. 1–11.                                        recurrent neural networks, in: IEEE International
 [7] M. Du, S. Pentyala, Y. Li, X. Hu, Towards generaliz-           Conference on Advanced Video and Signal Based
     able forgery detection with locality-aware autoen-             Surveillance (AVSS), IEEE, 2018, pp. 1–6.
     coder, arXiv preprint arXiv:1909.05999 (2019).            [20] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed,
 [8] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-df: A             I. Masi, P. Natarajan, Recurrent convolutional
     new dataset for deepfake forensics, arXiv preprint             strategies for face manipulation detection in videos,
     arXiv:1909.12962 (2019).                                       Interfaces (GUI) 3 (2019).
 [9] P. Chen, J. Liu, T. Liang, G. Zhou, H. Gao, J. Dai,       [21] X. Xuan, B. Peng, W. Wang, J. Dong, On the gener-
     J. Han, Fsspotter: Spotting face-swapped video by              alization of gan image forensics, in: Chinese Con-
     spatial and temporal clues, in: 2020 IEEE Interna-             ference on Biometric Recognition, Springer, 2019,
     tional Conference on Multimedia and Expo (ICME),               pp. 134–141.
     IEEE, 2020, pp. 1–6.                                      [22] D. Cozzolino, J. Thies, A. Rössler, C. Riess,
[10] Y. Li, S. Lyu, Exposing deepfake videos by detect-             M. Nießner, L. Verdoliva, Forensictransfer: Weakly-
     ing face warping artifacts, in: Proceedings of the             supervised domain adaptation for forgery detection,
     IEEE Conference on Computer Vision and Pattern                 arXiv preprint arXiv:1812.02510 (2018).
     Recognition Workshops, 2019, pp. 46–52.                   [23] Y. Nirkin, L. Wolf, Y. Keller, T. Hassner, Deepfake de-
[11] A. Swaminathan, M. Wu, K. R. Liu, Digital image                tection based on the discrepancy between the face
     forensics via intrinsic fingerprints, IEEE transac-            and its context, arXiv preprint arXiv:2008.12262
     tions on information forensics and security 3 (2008)           (2020).
     101–117.                                                  [24] H. Tang, Z. Li, Z. Peng, J. Tang, Blockmix: meta reg-
[12] N. Yu, L. S. Davis, M. Fritz, Attributing fake images          ularization and self-calibrated inference for metric-
     to gans: Learning and analyzing gan fingerprints,              based meta-learning, in: Proceedings of the
     in: Proceedings of the IEEE International Confer-              28th ACM International Conference on Multimedia,
     ence on Computer Vision, 2019, pp. 7556–7566.                  2020, pp. 610–618.
[13] Q. Ye, J. Yang, F. Liu, C. Zhao, N. Ye, T. Yin, L1-norm   [25] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong,
     distance linear discriminant analysis based on an              W. Xia, Learning to recognize patch-wise con-
     effective iterative algorithm, IEEE Transactions               sistency for deepfake detection, arXiv preprint
     on Circuits and Systems for Video Technology 28                arXiv:2012.09311 (2020).
[26] F. Chollet, Xception: Deep learning with depth-
     wise separable convolutions, in: Proceedings of the
     IEEE conference on computer vision and pattern
     recognition, 2017, pp. 1251–1258.
[27] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, X. Zhou, l2,
     1-norm regularized discriminative feature selection
     for unsupervised learning, in: IJCAI International
     Joint Conference on Artificial Intelligence, AAAI
     Press/International Joint Conferences on Artificial
     Intelligence, 2011, pp. 1589–1594.
[28] L. Fu, Z. Li, Q. Ye, H. Yin, Q. Liu, X. Chen, X. Fan,
     W. Yang, G. Yang, Learning robust discriminant sub-
     space based on joint l2, p-and l2, s-norm distance
     metrics, IEEE Transactions on Neural Networks
     and Learning Systems (2020).
[29] Q. Ye, Z. Li, L. Fu, Z. Zhang, W. Yang, G. Yang,
     Nonpeaked discriminant analysis for data represen-
     tation, IEEE transactions on neural networks and
     learning systems 30 (2019) 3818–3832.

</pre>