=Paper=
{{Paper
|id=Vol-3084/paper5
|storemode=property
|title=Detecting Deepfakes with Multi-Metric Loss
|pdfUrl=https://ceur-ws.org/Vol-3084/paper5.pdf
|volume=Vol-3084
|authors=Ziwei Zhang,Xin Li,Rongrong Ni,Yao Zhao
}}
==Detecting Deepfakes with Multi-Metric Loss==
<pdf width="1500px">https://ceur-ws.org/Vol-3084/paper5.pdf</pdf>
<pre>
Detecting Deepfakes with Multi-Metric Loss
Ziwei Zhang1 , Xin Li1 , Rongrong Ni1 and Yao Zhao1
1
    Beijing Jiaotong University


                                             Abstract
                                             In recent years, DeepFake techniques have advanced to generate so realistic forged content that it could jeopardize personal
                                             privacy and national security. We observe the distribution discrepancy between genuine faces and tampered faces manip-
                                             ulated by DeepFake techniques. It can be described that embedding vectors of genuine faces are tightly distributed in the
                                             embedding space, while tampered faces are comparatively scattered. We, therefore, propose a novel DeepFake detection
                                             method based on Multi-metric Loss. Specifically, real and fake faces are mapped onto the embedding space, which is of
                                             intra-class compactness and inter-class separation. Then by adding Weight-Center Loss to project genuine faces onto a
                                             more compact region in the embedding space, the distance between the two types of sample clusters is further expanded,
                                             thereby improving the separability of genuine and tampered samples. Moreover, the Adaptive Hardness-aware Expander is
                                             designed to further improve feature description ability of the model because the metric is always challenged with proper
                                             difficulty. Extensive experiments show that our approach can achieve state-of-the-art performance on present datasets.

                                             Keywords
                                             Deepfakes, Multi-metric Loss, Adaptive Hardness-aware Expander


1. Introduction
Of various digital media, videos containing digital hu-
man faces, especially the ones involving personal identi-
fication information, are most vulnerable to be attacked.
These assaults are collectively referred to as DeepFake
manipulations. Therefore, to develop effective methods
capable of detecting DeepFake videos carries substant
weight. Since the existing manipulations tamper with
specific areas frame by frame, the artifacts and noises                                                                          (a) DFDC                     (b) Celeb-DF
appear in the spurious videos. So previous researchers
have proposed many handcrafted methods [1, 2, 3, 4] and                                                               Figure 1: DFDC and Celeb-DF dataset distribution visualiza-
                                                                                                                      tion by t-SNE. The projections of real face features are tightly
data-driven methods [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
                                                                                                                      distributed, while the fakes are comparatively scattered.
to find manipulation traces.
   Due to uncertain counterfeit methods and manipu-
lation quality in DeepFake videos, the spurious data is                                                               ent levels and face sample cluster with diverse labels
scattered in the whole feature space. Relatively, gen-                                                                (real/fake). Under the restriction of Triplet Loss and
uine human faces concentrate close to a non-linear low-                                                               Cross-Entropy Loss, the real faces and fake faces are
dimensional manifold [17] in the feature space. As shown                                                              mapped onto the embedding space, which is intra-class
in Figure 1, the vectors of real faces are tightly distributed,                                                       compactness and inter-class separation. Then through
while the fakes are comparatively scattered. Therefore,                                                               adding Weight-Center Loss, the real faces are projected
we consider that this distribution discrepancy also exists                                                            to a more compact region. The method of excavating
in the embedding space obtained by the feature space                                                                  fundamental distinction between the two types of sam-
mapping. The existing detection schemes, however, do                                                                  ples is, therefore, to extend the distance between the
not consider the distribution discrepancy between the                                                                 two types of sample clusters in the embedding space,
two types of samples.                                                                                                 thereby improving the separability of genuine and spu-
   To this end, we propose the DeepFake detection frame-                                                              rious videos. In the end-stage of training, in order to
work with Multi-metric Loss, as shown in Figure 2.                                                                    further improve the feature description ability of the
Triplet Loss, Cross-Entropy Loss and Weight-Center Loss                                                               model, we designed the Adaptive Hardness-aware Ex-
together constitute Multi-metric Loss acting on differ-                                                               pander (AHE). The rigorous experiments on FaceForen-
                                                                                                                      sics++ [6], DFDC [18] and Celeb-DF [19] datasets show
International Workshop on Safety & Security of Deep Learning,                                                         that the proposed method based on Multi-metric Loss
August 19th -26th, 2021, Montreal-themed Virtual Reality                                                              is highly effective and achieves state-of-the-art perfor-
" rrni@bjtu.edu.cn (R. Ni); yzhao@bjtu.edu.cn (Y. Zhao)                                                               mance.
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative
                                       Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
Figure 2: The framework of our method. Use MTCNN to crop the video frames into facial area maps and send them to the
backbone network to get the embedding vectors. The Cross-Entropy Loss and Triple Loss of all original embedding vectors and
the Weight-Center Loss of genuine embedding vectors, are calculated. In the end-stage of training, Adaptive Hardness-aware
Expander continuously synthesizes samples with adaptive hardness.


2. Related works                                                though they also mapped data onto the embedding space
With huge risks posed by face forgery technology, there         based on Deep Metric Learning, they just followed the tra-
is currently an urge to investigate DeepFake detection          ditional metric strategy and imposed the same constraint
methods. Existing detection techniques mainly fall into         on two types of samples. In our work, considering the
two categories: handcrafted and data-driven methods.            distribution discrepancy of real and fake data, different
Handcrafted Methods. For the limited face manipu-               levels of classification constraints are imposed on these
lation techniques at that time, early works achieved the        two kinds of sample clusters. Specifically, we design the
DeepFake detection through handcraft features. This             Multi-metric Loss to further widen the distance between
methods mainly include eye blinking [1], incomplete de-         the real cluster and the fakes by capturing fundamental
tails in the eye and teeth [2], face warping [3] and head       distinction between spurious videos and genuine videos,
poses [4]. With the development of generative adver-            and the Adaptive Hardness-aware Expander to further
sarial network (GAN) [20],a variety of tampering tech-          improve the feature description ability of the model.
nologies have emerged and forgery faces have become
more realistic. Therefore, the effectiveness of formerly
                                                                3. Proposed Approach
handcrafted methods has gradually been weakened.                In this section, we give an overview of our framework.
Data-driven Methods. Given the powerful feature                 As aforementioned, the embedding vectors of real faces
representation capabilities of deep neural network, the         are aggregative in the embedding space, while the fakes
data-driven methods have received widespread attention.         are relatively scattered. Motivated by this observation,
Firstly, some classification networks were applied to de-       two key components are integrated into the framework:
tect fake faces like MesoNet [5], XceptionNet [6], Capsule      1) Multi-metric Loss is designed to mine fundamental
network [7], R3D and C3D [8] etc. Then Zhou et al. [9]          distinction between real and fake faces so as to improve
proposed to use the Two-stream neural network to cap-           separability; 2) Adaptive Hardness-aware Expander can
ture tampering artifacts and local noise residuals. The         be used to further improve the feature description ability
adaptive face weighting layer [10] was designed with            of the model. The framework is depicted in Figure 2.
it focus forgery details. The model [11] was trained to         3.1. Multi-metric Loss
mark the blending boundary for forged images. Consid-
ering inconsistent warping left by manipulation in the          Let 𝒳 denote the data space where we sample a set of
inter-frame, the methods [12, 13, 14] were proposed. The        facial area maps X = [𝑥1 , 𝑥2 , · · · , 𝑥𝑁 ]. Each data 𝑥𝑖
methods [15, 16] introduced the Deep Metric Learning            has a label 𝑙𝑖 ∈ {0, 1} representing real or fake. Let
                                                                        ℎ
to DeepFake detection for the first time.                       ℎ : 𝒳 −→ 𝒴 be the mapping from the data space to the
   Kumar et al. [15] mainly explored the method’s ef-           feature space, where the extracted feature 𝑦𝑖 preserves
fectiveness for detecting videos with high compression          semantic characteristics of its corresponding data point
factor. Feng et al. [16] used the difference of the full face   𝑥𝑖 . Then the feature is projected onto the embedding
                                                                                                       𝑔
image in videos as the feature for DeepFake detection. Al-      space 𝒵 with the mapping 𝑔 : 𝒴 −→ 𝒵. Since the
Figure 3: The proposed Multi-metric Loss with Triplet Loss, Cross-Entropy Loss and Weight-Center Loss. Weight-Center
Loss, which only acts on the cluster of real samples, imposes larger penalty on samples that deviate from the center, and
imposes smaller penalty on adjacent samples, while continuously updating the center of the real sample cluster.


projection can be incorporated into the deep network, we        3.1.3. Weight-Center Loss
                                                       𝑓
can directly learn the mapping 𝑓 (·; 𝜃) = ℎ∘𝑔 : 𝒳 −→ 𝒵 Considering the distribution discrepancy of genuine and
from the data space to the embedding space, where 𝜃 is tampered data, we hope to further widen the distance
network parameters.                                              between two categories of sample clusters by capturing
   Based on the data distribution discrepancy, namely, the fundamental distinction between real videos and fake
embedding vectors of real faces are tightly distributed, videos. Under the action of Triplet Loss and CE Loss, the
while the fakes are comparatively scattered. We deem network has acquired preliminary classification capabil-
that various levels of classification constraints should be ity. On this basis, we design Weight-Center Loss for real
imposed, so as to mine fundamental distinction between sample cluster to capture the fundamental distinction
spurious videos and genuine videos, as shown in the between two types of samples.
Figure 3. Multi-metric Loss is formulated as follows:               Some embedding vectors are far from the center of
 Loss = ℒ𝑊 𝑒𝑖𝑔ℎ𝑡−𝐶𝑒𝑛𝑡𝑒𝑟 + 𝛽ℒ𝑇 𝑟𝑖𝑝𝑙𝑒𝑡 + 𝜒ℒ𝐶𝐸                  (1) the real sample cluster, it may be due to certain interfer-
                                                                 ence, which has nothing to do with judging real and fake
3.1.1. Triplet Loss                                              videos. Therefore, Weight-Center Loss is proposed which
Under the constraint of Triplet Loss, the mapping from only acts on the cluster of real samples. We define the
high-dimensional sparse features into low-dimensional sample that is far from the center of the real sample clus-
dense vectors is learned. Reflected in the embedding ter compared to the surrounding samples as the deviating
space, the distribution of data is characterized by intra- sample. It adaptively imposes larger penalty on deviat-
class compactness and inter-class separation.                    ing samples, and imposes smaller penalty on adjacent
   Let 𝑓 (𝑥𝑎 ; 𝜃) be the anchor embedding vector. The samples. Simultaneously, the center of the real sample
embedding vector with the same and different label rel- cluster is continuously updated. Based on the above oper-
ative to 𝑓 (𝑥𝑎 ; 𝜃), are defined as 𝑓 (𝑥𝑝 ; 𝜃) and 𝑓 (𝑥𝑛 ; 𝜃), ations, real faces are projected to a more compact region
respectively. Triplet Loss is formulated as follows:             in the embedding space, so as to broaden the distance be-
                                                                 tween the real sample cluster and the fake sample cluster.
              ℒ𝑇 𝑟𝑖𝑝𝑙𝑒𝑡 := [S𝑎𝑛 − S𝑎𝑝 + 𝜅]+                  (2)
                                                                 Weight-Center Loss is formulated       as follows:
 where S𝑎𝑝 = ⟨𝑓 (𝑥𝑎 ; 𝜃) , 𝑓 (𝑥𝑝 ; 𝜃)⟩ indicates the simi-
                                                                                               [︃                      ]︃
                                                                                         1            ∑︁ −𝛼(𝑆 −𝜆)
larity of positive pair, S𝑎𝑛 = ⟨𝑓 (𝑥𝑎 ; 𝜃) , 𝑓 (𝑥𝑛 ; 𝜃)⟩ is the ℒ𝑊 𝑒𝑖𝑔ℎ𝑡−𝐶𝑒𝑛𝑡𝑒𝑟 = log 1 +                  𝑒     𝑘𝑐
                                                                                                                           (3)
                                                                                         𝛼
similarity of negative pair, ⟨·, ·⟩ denotes dot product, and                                          𝑘∈𝑃

𝜅 is metric margin.                                                 where 𝑃 is the collection of real embedding vec-
                                                                 tors, 𝑆𝑘𝑐 is the similarity of the center sample pair
3.1.2. Cross-Entropy Loss
                                                                 {𝑓 (𝑥𝑘 ; 𝜃) , 𝑓 (𝑥𝑐 ; 𝜃)}, 𝑓 (𝑥𝑘 ; 𝜃) and 𝑓 (𝑥𝑐 ; 𝜃) are real
In our approach, Cross-Entropy (CE) Loss and the Triplet embedding vectors and the iterative center and 𝛼, 𝜆 are
Loss act jointly. Specifically, CE Loss encourages the sep- fixed hyperparameters. It is worth noting that the center
aration of real embedding vectors from the fakes. Simul- is iterated continuously.
taneously, the Triplet Loss is used to achieve intra-class          Based on [21], we can get the generic definition about
compactness and inter-class separation, so as to initially the penalty weight of sample pair. Then the penalty
separate the two types of sample clusters.                       weight of the center sample {𝑓 (𝑥𝑘 ; 𝜃) , 𝑓 (𝑥𝑐 ; 𝜃)} in
                                                                manipulated, and for other samples 𝑧, 𝑧 + , we perform
                                                                                                        {︀       }︀

                                                                no transformation. Then the reduction in the distance be-
                                                                tween negative pairs will create rise of the hard level, so
                                                                that the measurement process is always at an appropriate
                                                                level of difficulty during the training cycle. As shown in
                                                                the Figure 4, in order to simplify the representation, we
                                                                use 𝑧, 𝑧 + , 𝑧 − to represent the anchor embedding vector
                                                                𝑓 (𝑥𝑎 ; 𝜃), the positive embedding vector 𝑓 (𝑥𝑝 ; 𝜃), and
                                                                the negative embedding vector 𝑓 (𝑥𝑛 ; 𝜃), respectively.
                                                                   Firstly, a toy example that constructs an augmented
                                                                harder negative sample 𝑧˜− by linear interpolation, is
                                                                presented:
Figure 4: Adaptive{︀ Hardness-aware       Expander module. The            𝑧˜− = 𝑧 + 𝜔 𝑧 − − 𝑧 , 𝜔 ∈ [0, 1]                    (5)
                                                                                        (︀        )︀
                        − −         −  }︀
synthetic samples 𝑧˜1 ,𝑧˜2 ,. . .,𝑧˜𝑖 , which are generated by
                                                                   However,     samples too close  to the  anchor   are likely to
the distance of tuple 𝑧, 𝑧 + , 𝑧 − and CE Loss.
                       {︀           }︀
                                                                cause confusion in the label. Therefore, we exploit the CE
                                                                Loss in the previous section to control the hardness of the
ℒ𝑊 𝑒𝑖𝑔ℎ𝑡−𝐶𝑒𝑛𝑡𝑒𝑟 is calculated as follows:                       generated negative samples, since it is a good indicator
                                 1
  𝑤𝑘𝑐 = −𝛼(𝜆−𝑆 )           ∑︀              −𝛼(𝑆𝑖𝑐 −𝑆𝑘𝑐 )
                                                            (4) of training process. If the CE Loss is small, the generated
           𝑒          𝑘𝑐 +
                               𝑖∈𝑃,𝑖̸=𝑘 𝑒                       negative sample will be closer to the anchor point, but
   where 𝑓 (𝑥𝑘 ; 𝜃) is one embedding vector of set, will not cross the positive sample. Adaptive Hardness-
𝑓 (𝑥𝑖 ; 𝜃) is the other in the set except sample 𝑓 (𝑥𝑘 ; 𝜃). aware Expander can be represented as:
𝑆𝑘𝑐 and 𝑆𝑖𝑐 indicate the similarity of sample pair
                                                                       {︃
                                                                                                 ]︀ (𝑧−−𝑧)
                                                                          𝑧 + 𝜂𝑑− +(1−𝜂) 𝑑+ 𝑑−                if 𝑑− > 𝑑+
                                                                              [︀
                                                                  −
{𝑓 (𝑥𝑘 ; 𝜃) , 𝑓 (𝑥𝑐 ; 𝜃)} and {𝑓 (𝑥𝑖 ; 𝜃) , 𝑓 (𝑥𝑐 ; 𝜃)}.        𝑧
                                                                ˜   =                                                           (6)
   Eq.4 shows that the penalty weight for sample pair is                  𝑧 −
                                                                                                              if 𝑑− ≤ 𝑑+
                                                                                     − 𝛾
determined by its relative similarity, measured by com-            where 𝜂 = 𝑒 ℒ𝐶𝐸 is a balance factor to control
paring it with the distance from surrounding samples the hardness of the generated negative samples, 𝛾 is
with the center, which is fundamentally different from the pulling              factor
                                                                                   ⃦ used to balance        the
                                                                                                            ⃦ scale of ℒ𝐶𝐸 ,
Center Loss [22]. According to the relative position rela- 𝑑+ = ⃦𝑧 − − 𝑧 ⃦2 and 𝑑− = ⃦𝑧 + − 𝑧 ⃦2 are the distance
                                                                        ⃦                       ⃦

tionship in the set, there are two different situations, as between positive pair and negative pair, respectively.
shown in Figure 3. Firstly, the embedding vector 𝑓 (𝑥𝑘 ; 𝜃)        In the early stage of training, the generated hard sam-
is far from the center of set relative to other samples ples can not represent related face information, consider-
𝑓 (𝑥𝑖 ; 𝜃), as described by the deviating sample in Fig- ing that the embedding space has no accurate semantic
ure 3, and the formula is expressed as 𝑆𝑖𝑐 > 𝑆𝑘𝑐 . We structure. It may even cause the model to be trained in
consider that the current embedding vector extracted con- the wrong direction from the beginning. As the training
tains certain interference, which has nothing to do with progresses, however, the model is growing more tolerant
judging real and fake videos, so the larger penalty weight of hard samples, that is, the metric is always challenged
is imposed. When the embedding vector is closer to the with proper difficulty. Thereby Adaptive Hardness-aware
center of set relative to other samples, as described by Expander can improve the feature description ability of
the adjacent sample in Figure 3, the formula is expressed the model.
as 𝑆𝑖𝑐 ≤ 𝑆𝑘𝑐 . Smaller penalty weight is imposed and the
network parameters are fine-tuned to find the features 4. Experiments
that could best represent the fundamental distinction In this section, we first explore the optimal settings for
between spurious videos and authentic videos.                   our approach and then present extensive experimental
                                                                results to demonstrate the effectiveness of our method.
3.2. Adaptive Hardness-aware Expander
In the end-stage of training, considering that original           4.1. Implement Details
samples are already well separable under the action of            For all real/fake video frames, we use face extractor
Multi-metric Loss. Continuing to train original samples           MTCNN to detect faces and save the aligned facial images
cannot further improve the model’s feature description            as inputs with the size of 256 × 256. 𝛽, 𝜒 in Eq.1 and 𝛼
ability. To address this limitations, we propose the Adap-        in Eq.3 is set to 2.0, 1.0, 2.0 to impose different levels of
tive Hardness-aware Expander, as shown in{︀Figure 4. }︀           classification constraints. The margin of Triplet Loss in
   We construct the hardness-aware triplet 𝑧, 𝑧 + , 𝑧˜− 𝑖         Eq.2 is set to 1.0. Optimization is performed using SGD
in the embedding space, where manipulation of the dis-            optimizer with weight decay 5𝑒−4 . The initial learning
tances among samples will directly alter the {︀hard level         rate is kept at 0.01 and divided by 10 after every 3000
of the triple. The distances of negative pairs 𝑧, 𝑧˜−     is      iterations. We adopt ResNet-34, which is pre-trained on
                                                       }︀
                                                     𝑖
Table 1
Testing ACC(%) and AUC(%) score of our method and other methods on FaceForensics++ dataset.
                                    FF++/df                FF++/ff                FF++/fs               FF++/nt
          Methods
                               ACC        AUC         ACC        AUC         ACC        AUC         ACC       AUC
        MesoNet [5]            0.827      0.853       0.562       0.634      0.611       0.679      0.502     0.596
      XceptionNet [6]          0.948      0.986       0.928       0.972      0.903       0.933      0.807     0.835
         Li et al. [3]         0.969      0.995       0.972       0.987      0.963       0.990      0.890     0.913
         Capsule[7]            0.941      0.960       0.963       0.958      0.972       0.974      0.887     0.948
       Feng et al. [16]        0.953      0.991       0.938       0.957      0.921       0.940      0.841     0.902
      Kumar et al. [15]        0.960      0.990       0.932       0.962      0.944       0.978      0.832     0.872
     Bonettini et al.[10]      0.981      0.992       0.955       0.970      0.973       0.980      0.845     0.863
            Ours               0.985      0.998       0.974      0.991       0.995      1.000       0.938     0.968

Table 2                                                          hardness-aware samples forces the network to pay more
Testing ACC(%) and AUC(%) score of our method and                attention to some key features that characterize the truth
other methods on DFDC and Celeb-DF dataset.                      and counterfeit under the constraint of Multi-metric Loss,
                                DFDC            Celeb-DF         thereby improving the feature description ability of the
                            ACC    AUC       ACC      AUC        model and achieving better classification performance.
     MesoNet [5]            0.746  0.818     0.482    0.536      Therefore, our method can achieve state-of-the-art perfor-
   XceptionNet [6]          0.845  0.909     0.788    0.832      mance on FaceForensics++, DFDC and CelebDF datasets.
      Li et al. [3]         0.793  0.861     0.571    0.628
     Capsule [7]            0.861  0.933     0.791    0.879      4.3. Ablation Study
    Feng et al. [16]        0.883  0.963     0.814    0.867      To verify the effectiveness of Multi-metric Loss and Adap-
   Kumar et al. [15]        0.825  0.899     0.792    0.943      tive Hardness-aware Expander, we conduct ablation stud-
  Bonettini et al.[10]      0.944  0.967     0.903    0.959      ies and results are shown in Table 3, Figure 5, Figure 6.
         Ours               0.962  0.979     0.927    0.968
                                                                 4.3.1. Effectiveness of Multi-metric Loss
Table 3                                                          To confirm the effectiveness of Multi-metric Loss, we
The ablation study about Multi-metric Loss.                      evaluate how different levels classification constraints
                               df      ff       fs      nt       affect the detection accuracy. We train the model on
      Triplet Loss           0.946   0.925    0.939   0.810      FF++ (c23), other hyperparameters are kept the same as
  + Cross-Entropy Loss       0.962   0.933    0.942   0.870      settings in Table 1.
  + Weight-center Loss       0.985   0.974    0.995   0.938         The t-SNE plots of four different manipulation meth-
                                                                 ods in FF++ datasets are reported in Figure 5. It can be
the ImageNet dataset, as the backbone network. Our               found that the separability of the sample is poor when
model is trained on 4 RTX 2080Ti GPUs with batch                 Triple Loss acts independently, as shown in the first row
size 16 and the total number of iterations is set to 10, 000.    of Figure 5. The reason is that the data selection in the
                                                                 batch results in uneven data distribution, which makes
4.2. Comparsion with Previous Methods
                                                                 it difficult to divide the interface. When Cross-entropy
In this section, we compare our method with previous             Loss is introduced, the data distribution of different ma-
DeepFake detection methods. The performance of var-              nipulation in FF++ datasets is shown in the second row
ious methods on FaceForensics++ [6], DFDC [18] and               of Figure 5. Among them, Cross-Entropy Loss encour-
Celeb-DF [19] dataset is shown. We adopt ACC (accu-              ages the separation of real embedding vectors from fake
racy) and AUC (area under Receiver Operating Charac-             embedding vectors, and the Triple Loss helps constrain
teristic Curve) as the evaluation metrics for experiments.       the intra-class compactness and inter-class separation,
   The evaluation results of the individual datasets are         thereby improving the separability of samples. In the
shown in Table 1 and Table 2. The results indicate that          third row of Figure 5, Weight-Center Loss is added and
our model trained with Multi-metric Loss and AHE have            it only acts on the real cluster. By mining the features
significant improvement over previous methods with               representing authenticity, the real sample clusters are
metric learning [15, 16], especially in DFDC and Celeb-          tightly clustered, thereby further extending distance be-
DF dataset. The reason is that different levels of classifica-   tween two types of sample clusters in the embedding
tion constraints based on the phenomena of distribution          space. The ACC ablation studys about Multi-metric Loss
discrepancy is imposed to mine the fundamental distinc-          on FF++ are reported in Table 3, which further confirm
tion between spurious videos and genuine videos, so that         the effectiveness of Multi-metric Loss.
it can still work on tampered videos without obvious                Note that Triplet Loss and Cross-Entropy Loss work
artifacts. At the same time, the generation of adaptive          during the entire training stage, while Weight-Center
                             FF++/df                  FF++/ff                   FF++/fs                   FF++/nt


    Triplet Loss


  + Cross-Entropy
       Loss


 + Weight-Center
      Loss


                                       Real Samples                                        Fake Samples

Figure 5: t-SNE plots of ablation study about Triplet Loss, Cross-Entropy Loss and Weight-Center Loss in FF++ dataset.

                   FF++/df   FF++/ff   FF++/fs   FF++/nt        tive hardness force the network paying more attention to
                                                                some key features that characterize the authenticity and
  Tampering
                                                                the counterfeit under the constraint of Multi-metric Loss,
    figure
                                                                thereby improving the feature description ability of the
                                                                model. For example, in Figure 6, NeuralTextures (nt), a
  Grad-CAM                                                      tampering scheme only modifies the mouth area. Before
   without                                                      the Adaptive Hardness-aware Expander is used, class
     AHE                                                        activation map shows that the nose and mouth regions
                                                                together provide evidence that the video is tampered. Af-
  Grad-CAM                                                      ter the Adaptive Hardness-aware Expander is used, class
  with AHE                                                      activation map shows that the network will pay more at-
                                                                tention to the mouth area tampered, which demonstrates
                                                                the interpretability of our proposed method.
Figure 6: Heatmaps generated by Grad-CAM about with or
without AHE in four manipulation methods.
                                                                5. Conclusion
                                                                In this work, we propose the DeepFake detection method
                                                                based on Multi-metric Loss, considering the distribution
Loss only works in the middle and end stages of training.
                                                                discrepancy that the embedding vectors of genuine faces
The main reason is that the center point of the real sam-
                                                                are tightly distributed in the embedding space, while tam-
ples is unstable at the beginning of the training, which
                                                                pered faces are comparatively scattered. Multi-metric
will cause the network to optimize in the wrong direction.
                                                                Loss improves the separability of genuine and tampered
4.3.2. Effectiveness of AHE                                     samples through further widening distance between the
To confirm the effectiveness of Adaptive Hardness-aware         two types of sample clusters. Besides, adaptive hardness-
Expander, we analyze the class activation maps for four         aware samples is generated to make the metric be always
different manipulation methods, as shown in Figure 6.           in the proper difficulty, so as to improve the feature de-
   Class activation maps corresponding to the operation         scription ability of the model. Our method achieves good
of Expander indicate that synthetic samples with adap-          improvements in extensive metrics.
Acknowledgments                                                 Detection, in: 2020 IEEE/CVF Conference on Com-
                                                                puter Vision and Pattern Recognition (CVPR), 2020,
This work was supported in part by the National Key Re-         pp. 5000–5009.
search and Development of China (2018YFC0807306), Na-      [12] D. Güera, E. J. Delp, Deepfake Video Detection
tional NSF of China (U1936212), Beijing Fund-Municipal          Using Recurrent Neural Networks, in: 2018 15th
Education Commission Joint Project (KZ202010015023).            IEEE International Conference on Advanced Video
                                                                and Signal Based Surveillance (AVSS), 2018, pp. 1–6.
                                                           [13] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed,
References                                                      I. Masi, P. Natarajan, Recurrent Convolutional
                                                                Strategies for Face Manipulation Detection in
 [1] Y. Li, M. Chang, S. Lyu, In Ictu Oculi: Exposing AI        Videos, in: 2019 IEEE/CVF Conference on Com-
     Created Fake Videos by Detecting Eye Blinking, in:         puter Vision and Pattern Recognition Workshops
     2018 IEEE International Workshop on Information            (CVPRW), 2019, pp. 80–87.
     Forensics and Security (WIFS), 2018, pp. 1–7.         [14] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not
 [2] F. Matern, C. Riess, M. Stamminger, Exploiting             Made for Each Other- Audio-Visual Dissonance-
     Visual Artifacts to Expose Deepfakes and Face Ma-          Based Deepfake Detection and Localization, in:
     nipulations, in: 2019 IEEE Winter Applications of          Proceedings of the 28th ACM International Confer-
     Computer Vision Workshops (WACVW), 2019, pp.               ence on Multimedia, 2020, p. 439–447.
     83–92.                                                [15] A. Kumar, A. Bhavsar, R. Verma, Detecting Deep-
 [3] Y. Li, S. Lyu, Exposing DeepFake Videos By Detect-         fakes with Metric Learning, in: 2020 8th Inter-
     ing Face Warping Artifacts, in: IEEE/CVF Confer-           national Workshop on Biometrics and Forensics
     ence on Computer Vision and Pattern Recognition            (IWBF), 2020, pp. 1–6.
     (CVPR) Workshops, 2019, pp. 46–52.                    [16] K. Feng, J. Wu, M. Tian, A Detect method for
 [4] X. Yang, Y. Li, S. Lyu, Exposing Deep Fakes Using          deepfake video based on full face recognition, in:
     Inconsistent Head Poses, in: IEEE International            2020 IEEE International Conference on Informa-
     Conference on Acoustics, Speech and Signal Pro-            tion Technology,Big Data and Artificial Intelligence
     cessing, ICASSP, 2019, pp. 8261–8265.                      (ICIBA), 2020, pp. 1121–1125.
 [5] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen,       [17] N. Lei, Z. Luo, S. Yau, X. D. Gu,           Geomet-
     MesoNet: a Compact Facial Video Forgery Detec-             ric Understanding of Deep Learning,             2018.
     tion Network, in: 2018 IEEE International Work-            arXiv:1805.10451.
     shop on Information Forensics and Security (WIFS),    [18] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes,
     2018, pp. 1–7.                                             M. Wang, C. Canton, The DeepFake Detection
 [6] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess,          Challenge Dataset, 2020. arXiv:2006.07397.
     J. Thies, M. Niessner, FaceForensics++: Learn-        [19] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF:
     ing to Detect Manipulated Facial Images, in: 2019          A Large-Scale Challenging Dataset for DeepFake
     IEEE/CVF International Conference on Computer              Forensics, in: 2020 IEEE/CVF Conference on Com-
     Vision (ICCV), 2019, pp. 1–11.                             puter Vision and Pattern Recognition, CVPR, 2020,
 [7] H. Nguyen, J. Yamagishi, I. Echizen, Use of a Cap-         pp. 3204–3213.
     sule Network to Detect Fake Images and Videos,        [20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
     2019. arXiv:1910.12467v2.                                  D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio,
 [8] I. Ganiyusufoglu, L. M. Ngô, N. Savov, S. Karaoglu,        Generative Adversarial Nets, in: 2014 Annual Con-
     T. Gevers, Spatio-temporal Features for Gen-               ference on Neural Information Processing Systems
     eralized Detection of Deepfake Videos, 2020.               (NIPS), 2014, pp. 2672—-2680.
     arXiv:2010.11844.                                     [21] X. Wang, X. Han, W. Huang, D. Dong, M. R. Scott,
 [9] P. Zhou, X. Han, V. I. Morariu, L. S. Davis, Two-          Multi-Similarity Loss With General Pair Weight-
     Stream Neural Networks for Tampered Face Detec-            ing for Deep Metric Learning, in: 2019 IEEE/CVF
     tion, in: IEEE/CVF Conference on Computer Vision           Conference on Computer Vision and Pattern Recog-
     and Pattern Recognition (CVPR) Workshops, 2017.            nition (CVPR), 2019, pp. 5017–5025.
[10] N. Bonettini, E. D. Cannas, S. Mandelli, L. Bondi,    [22] Y. Wen, K. Zhang, Z. Li, Y. Qiao, A Discriminative
     P. Bestagini, S. Tubaro, Video Face Manipulation           Feature Learning Approach for Deep Face Recog-
     Detection Through Ensemble of CNNs, in: 2020               nition, in: Computer Vision - ECCV 2016 - 14th
     25th International Conference on Pattern Recogni-          European Conference, Amsterdam, 2016, pp. 499–
     tion (ICPR), 2020, pp. 5012–5019.                          515.
[11] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen,
     B. Guo, Face X-Ray for More General Face Forgery

</pre>