Lip Forgery Video Detection via Multi-Phoneme Selection

Jiaying Lin¹, Wenbo Zhou¹*, Honggu Liu¹, Hang Zhou², Weiming Zhang¹* and Nenghai Yu¹
¹ University of Science and Technology of China
² Simon Fraser University

Abstract
Deepfake techniques can produce realistic manipulated videos, including full-face synthesis and local region forgery. General detection methods work well on the former but usually fail to capture local artifacts, especially in lip forgery. In this paper, we focus on the lip forgery detection task. We first establish a robust mapping from audio to lip shapes. We then classify the lip shapes of each video frame according to the spoken phonemes, which enables the network to capture the dissonance between lip shapes and phonemes in fake videos and improves interpretability. Each lip-shape/phoneme set is used to train a sub-model, and the sets with the best discrimination are selected to form an ensemble classification model. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on both the public DFDC dataset and a self-organized lip forgery dataset.

Keywords
Lip Forgery, Deepfake Detection, Phoneme and Viseme

Woodstock’21: Symposium on the irreproducible science, June 07–11, 2021, Woodstock, NY
* Corresponding Author.
Email: vivian19@mail.ustc.edu.cn (J. Lin); welbeckz@ustc.edu.cn (W. Zhou); lhg9754@mail.ustc.edu.cn (H. Liu); zhouhang2991@gmail.com (H. Zhou); zhangwm@ustc.edu.cn (W. Zhang); ynh@ustc.edu.cn (N. Yu)
ORCID: 0000-0001-5553-9482 (J. Lin); 0000-0002-4703-4641 (W. Zhou); 0000-0001-9294-9624 (H. Liu); 0000-0001-7860-8452 (H. Zhou); 0000-0001-5576-6108 (W. Zhang); 0000-0003-4417-9316 (N. Yu)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Thanks to the tremendous success of deep generative models, face forgery has become an emerging research topic in recent years, and various methods have been proposed [1, 2]. Depending on the manipulated region, they can be roughly categorized into two types: full-face synthesis [3, 4], which usually swaps a whole synthesized source face onto a target face, and local face region forgery [5, 6], which only modifies a partial face region, e.g., changing the lip shape to match the audio content. In particular, when the lips of politicians are tampered with to make them deliver inappropriate speeches, serious political crises can follow.

[Figure 1: The lip shapes when speaking the word “apple” in a real (top) and a fake (bottom) video. In the real video, the lips open more widely with clear teeth texture, while the opposite holds in the fake.]

To alleviate the risks brought by malicious uses of face forgery, many detection methods have been proposed [7, 8, 9]. These methods usually approach forgery detection from different aspects and extract visual features from the whole face region, achieving impressive results on the public datasets FF++ and DFDC, in which most fake videos are tampered in a full-face synthesized manner. But this type of detection method struggles to handle local region forgery cases such as lip-sync [5]. Recently, [10] attempted to detect lip-sync forgery videos with a single phoneme-viseme matching for specific targets, and [11, 12] employ features such as audio and expression to detect the synchronization between different modalities.

To address the problem of local region forgery detection, in this paper we propose a complete multi-phoneme selection-based framework. To take full advantage of the particularity of lip forgery videos, which contain audio, we need to establish a robust mapping between lip shapes and audio contents. Prior studies in Audio-Visual Speech Recognition have demonstrated that the phoneme is the smallest identifiable unit correlated with a particular lip shape. Motivated by [13], we divide audio contents into 12 phoneme classes and classify all video frames accordingly. For each phoneme-lip set, we measure the deviation in open-close amplitude between real and fake lip shapes and train a sub-model for real/fake classification.

Usually, a large deviation represents an obvious discrepancy between real and fake lip shapes, which also indicates the great difficulty of synthesizing the lip shape for the corresponding phoneme. Simultaneously, it shows the robustness of the correlated phoneme-lip mapping against physical changes across videos, e.g., volume and face angle. This precisely provides a distinguishing feature for forgery detection. By selecting the phonemes with the top-5 deviations, we integrate the corresponding 5 well-trained sub-models into an ensemble model to maximize the discriminability of real and fake videos.

To verify the effectiveness, we have conducted extensive experiments on both the public DFDC dataset and a self-organized lip forgery video dataset that contains four sub-datasets. The results demonstrate that our method outperforms current state-of-the-art detection methods on cross-dataset evaluation and multi-class classification, and is also competitive on single-dataset classification. Our contributions are as follows:

• We propose a multi-phoneme selection-based framework for the lip forgery detection task, which takes full advantage of the visual and aural information in lip forgery videos.
• We establish 12 categories of phoneme-lip mapping relationships and exploit the robustness of the open-close amplitudes of each pair for real/fake classification. We also organize a new lip forgery dataset to facilitate the development of lip forgery detection methods.
• Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for lip forgery detection on both the public DFDC dataset and a self-organized lip forgery dataset.

2. Related work

2.1. Deep Face Forgery

According to the forged region, existing methods can be divided into two categories: full-face synthesis and local region forgery. Full-face synthesis usually synthesizes a whole source face and swaps it onto the target; typical works are [4, 14]. Local region forgery is a more common type, focusing on slight manipulation of partial facial regions, e.g., eyebrow locations and lip shapes. Lip-sync [5] is able to modify the lip shapes in Obama’s talking videos to accurately synchronize with a given audio sequence. [15] leverages 3D modeling of specific face videos to make the control of lip shapes more flexible. First Order Motion [16] uses a driving video to animate a single source portrait image into a talking video. The detection of local region forgery is more challenging due to its subtle and local nature.

2.2. Face Forgery Detection

Early works explored visual artifacts, e.g., abnormalities of eye blinking and teeth. Learning-based detection methods have become mainstream in recent years. [7] uses XceptionNet [17] to extract features from the spatial domain, and F3-Net [9] achieves state-of-the-art results using frequency-aware decomposition. However, since audio is lacking in most public deepfake datasets, these methods are designed in a universal manner with no consideration of audio matching. They perform well on full-face synthesis detection but are not adequate for recognizing the subtle artifacts of local region forgery. Recently, [11, 12] utilized Siamese networks to calculate feature distances across modalities. If the manipulation is conducted on only a small segment of the video, the inconsistency among these modalities is weakened at the video level, leading to a decrease in detection performance. [10] establishes one single phoneme-viseme mapping for a specific person, which severely restricts the application scenario. To address the above limitations, we propose a multi-phoneme selection-based framework for lip forgery video detection.

3. Method

In this section we elaborate the multi-phoneme selection-based framework. Before that, we first introduce an important observation about lip forgery.

3.1. Motivation

Lip forgery modifies a specific person’s lip shape to match arbitrary audio contents, thus establishing a close relationship between them. However, due to imperfections in the manipulation, uncontrollable artifacts may be generated that hinder the matching. As shown in Figure 1, when the word “apple” is spoken, the lips in the forged video are more blurred and do not open well. Although this nuance is hard for human eyes to perceive, a well-designed detector can capture it. Nevertheless, the lip shape itself fluctuates within a certain range under different expressions, and a large fluctuation indicates poor robustness.

Based on this observation, it is necessary to establish a robust mapping from audio to lip shapes. Inspired by recent works in Audio-Visual Speech Recognition [18], we divide all audio contents into 12 phoneme categories as the smallest identifiable units. Each phoneme set consists of various vowels, consonants, or the silence mark, and can be used to independently train a sub-model that distinguishes real from fake lips. Eventually, we select several sub-models to integrate into the final classifier, considering the trade-off between efficiency and performance. The framework is depicted in Figure 2.

[Figure 2: The framework of our method. Audio is transcribed and force-aligned with P2FA, the 48 IPA phonetic symbols are merged into 12 phoneme categories with an LDA classifier, lip frames are mapped to these categories, and multi-phoneme selection by amplitude deviation yields the final ensemble detection model.]

3.2. Correlations Establishment from Phonemes to Lip Shapes

For a given talking video, we use OpenFace [19] to align each frame and crop the lip area to 128×128. These lip images are categorized into different phoneme sets and used as training/testing data for real/fake classification.

To establish the mapping from phonemes to lip shapes, we first process all the real videos. According to the International Phonetic Alphabet (IPA), we divide the lip shapes into 48 classes. For a given lip shape, we calculate the Mahalanobis distance d_c of the open-close amplitude between the current lip shape x and the mean x̄_c of each class:

\[ d_c(\mathbf{x}) = \sqrt{(\mathbf{x} - \bar{\mathbf{x}}_c)^{T}\, \Sigma_c^{-1}\, (\mathbf{x} - \bar{\mathbf{x}}_c)} \tag{1} \]

Next, we estimate the probability of x belonging to each class and assign the sample to the class with the highest normalized probability P_c:

\[ P_c(\mathbf{x}) = \frac{p(c \mid \mathbf{x})}{\sum_{c=1}^{C} p(c \mid \mathbf{x})} \tag{2} \]

Here, p(c | x) is the probability that x belongs to class c, computed as the ratio between the in-class and out-of-class distributions of the distance d_c, which follow Gaussian distributions with means μ_c, μ̃_c and variances σ_c, σ̃_c, respectively:

\[ p(c \mid \mathbf{x}) = \frac{1 - \Phi\!\left(\dfrac{d_c(\mathbf{x}) - \mu_c}{\sigma_c}\right)}{\Phi\!\left(\dfrac{d_c(\mathbf{x}) - \tilde{\mu}_c}{\tilde{\sigma}_c}\right)} \tag{3} \]

After obtaining the mapping, a multi-class LDA classifier pre-trained on [20] is utilized for classification. However, different classes may share the same lip shape appearance, e.g., m, b, p. By iteratively merging similar phonetic symbol classes, we obtain 12 distinguishable and robust real lip shape classes named “phonemes” (W1 to W12). A visual example is given in Figure 3.

[Figure 3: Illustration of the robust phoneme categories. We exhibit the basic lip patterns with similar phonetics and visually compare the real and fake lip shapes and the average open-close amplitude curves. The categories group IPA symbols with similar lip patterns, e.g., W1: m b p; W2: t d n s z l r; W3: k g ŋ; W4: f v; W5: ʃ ʒ tʃ dʒ; W6: θ ð; W7: i: e ɪ eɪ j æ; W9: ɑ: ɑ ʌ ə h; W10: u: ɔ: ɒ w ɔɪ; W11: ɜː; W12: # (silence).]

In fake videos, the lip shapes have been manipulated. As illustrated in Figure 1, the opening amplitudes of fake lips are quite different from real ones, so directly applying the phoneme classifier trained on real lips may lead to misclassification. Since the audio contents of fake videos are not modified, we use them as the guidance for fake lip classification. First, Google’s Speech-to-Text API is used to obtain the transcribed texts from the audio. Both the texts and the audio are then fed into the P2FA toolkit [21]. By conducting forced alignment on phonemes and words, we get the start and end time of each phoneme, and the lip images during this period are categorized into the current phoneme. The P2FA section of Figure 2 shows the alignment procedure.

3.3. Multiple Phonemes Selection

Although the lip shapes in one phoneme set are similar, the open-close amplitudes differ considerably among phonemes. We use the dlib 68-face-landmark detector [22] to compute the vertical-axis difference between the 63th and 67th landmarks: D = (y63 − y67). Here D represents the open-close amplitude of the current lip shape.

Table 1
Amplitude deviation values for the 12 phonemes in the self-organized dataset. The top-5 phonemes with the largest amplitude deviation for each sub-dataset are in bold.

Forgery Methods           W1     W2     W3     W4     W5     W6      W7     W8     W9     W10    W11    W12
Obama Lip-sync [5]        33.00  31.13  21.63  33.12  34.87  27.625  37.50  24.37  26.87  24.00  22.38  25.25
Audio Driven [15]         15.00  23.62  18.50  26.62  28.00  25.50   29.50  20.63  17.37  18.25  17.00  12.50
First Order Motion [16]   25.13  23.75  34.67  37.12  34.87  22.50   23.38  25.125 33.50  29.50  21.75  20.88
Wav2lip [6]               35.51  34.71  26.71  28.01  25.12  25.43   35.12  28.76  27.32  33.84  29.96  33.60
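To make the class-assignment rule of Eqs. (1)–(3) concrete, the following is a minimal pure-Python sketch for a scalar open-close amplitude feature. The function names, the `stats` dictionary layout, and the per-class Gaussian parameters are illustrative assumptions, not the authors' implementation:

```python
import math

def phi(z):
    # Standard normal CDF, expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mahalanobis_1d(x, mean_c, var_c):
    # Eq. (1) specialized to a scalar amplitude feature:
    # sqrt((x - mean)^2 / var) = |x - mean| / std.
    return abs(x - mean_c) / math.sqrt(var_c)

def class_probability(d_c, mu_c, sigma_c, mu_out, sigma_out):
    # Eq. (3): in-class tail probability of the distance d_c divided
    # by the out-of-class CDF, each under its own Gaussian.
    return (1.0 - phi((d_c - mu_c) / sigma_c)) / phi((d_c - mu_out) / sigma_out)

def assign_class(x, stats):
    # Eq. (2): normalize p(c|x) over all classes and return the argmax.
    probs = {}
    for c, s in stats.items():
        d_c = mahalanobis_1d(x, s["mean"], s["var"])
        probs[c] = class_probability(d_c, s["mu"], s["sigma"],
                                     s["mu_out"], s["sigma_out"])
    total = sum(probs.values())
    return max(probs, key=lambda c: probs[c] / total)
```

A lip shape close to a class mean gets a small d_c, hence a large in-class tail probability and a small out-of-class CDF, so p(c | x) peaks at the best-matching class.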
Using the frame index as the horizontal axis, we calculate D for each frame during the period of the phoneme. In Figure 3, we plot two average amplitude curves for each set; the red curves represent the real videos and the blue curves the fake ones. In W1 and W2, the real and fake curves are widely separated with almost no overlap, while in W3 and W6 there are partially stacked areas. This observation indicates that real and fake lips are more discriminative in certain phoneme sets. To select the most distinguishable phonemes W for classification, we calculate the maximum and minimum differences D_Wmax and D_Wmin between the real and fake curves, and define the amplitude deviation D_W to represent the discrepancy between real and fake for each phoneme W: D_W = ½(D_Wmax + D_Wmin).

Considering the differences among forgery methods, the amplitude deviations of a single phoneme are not identical across datasets. As listed in Table 1, the phonemes with the top-5 amplitude deviations are in bold; we introduce the self-organized dataset in Section 4.

Table 2
The composition of our self-organized dataset, including the numbers of videos and frames. The whole dataset consists of four sub-datasets.

Dataset                   Real/Fake   Total   Frames
Obama Lip-sync [5]        28          56      62534
Audio Driven [15]         24          48      54416
First Order Motion [16]   24          48      53614
Wav2lip [6]               28          56      63736

3.4. Sub-classification Model Training and Ensemble

After selecting the phoneme-lip sets for each forgery method, we train sub-classification models based on them.
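As a concrete illustration of this selection step, the sketch below computes the per-frame open-close amplitude D, the amplitude deviation D_W between the average real and fake curves, and the top-5 phoneme choice. It assumes dlib's 0-based landmark arrays (so the paper's 63rd and 67th points become indices 62 and 66) and uses hypothetical helper names; it is not the authors' code:

```python
def open_close_amplitude(landmarks):
    # D = y63 - y67 in the paper's 1-based 68-landmark numbering
    # (inner upper and lower lip); indices 62 and 66 in 0-based arrays.
    return landmarks[62][1] - landmarks[66][1]

def amplitude_deviation(real_curve, fake_curve):
    # D_W = (D_Wmax + D_Wmin) / 2, where D_Wmax and D_Wmin are the
    # largest and smallest per-frame gaps between the average real
    # and fake amplitude curves of a phoneme set.
    gaps = [abs(r - f) for r, f in zip(real_curve, fake_curve)]
    return 0.5 * (max(gaps) + min(gaps))

def select_top_phonemes(deviations, n=5):
    # Keep the n phoneme sets with the largest amplitude deviation.
    return sorted(deviations, key=deviations.get, reverse=True)[:n]
```

Applied to the Obama Lip-sync row of Table 1, select_top_phonemes returns the sets W1, W2, W4, W5 and W7, matching the bold top-5 entries used for that sub-dataset.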
Each sub-model can be used independently for real/fake lip discrimination. We adopt XceptionNet [17] as the backbone and transfer it to our task by resizing the input to 128×128 and replacing the final fully connected layer with two outputs. To obtain stronger detection performance, we integrate the sub-models into an ensemble, with equal average weights so that each contribution is maximized. Furthermore, a phoneme unit in the video lasts for some duration and contains several lip frames. Both the number of lip frames f and the number of sub-models N influence the detection accuracy of the final ensemble model, so we experiment on each. The results in Section 4 demonstrate that with f = 4 and N = 5, the ensemble model achieves excellent performance without introducing extra complexity.

4. Experiments

In this section, we first introduce the new lip forgery video dataset organized in this paper. Parameter studies verify the optimality of our settings. Further experiments demonstrate the effectiveness of our proposed framework on DFDC and the self-organized dataset, as well as the transferability between them.

4.1. Public Dataset and New Lip Forgery Dataset

Many datasets [7, 23] have been released for the deepfake detection task. Although they are large-scale and cover various forgery methods, most of their fake videos do not contain audio and are still tampered in a full-face synthesized manner. So far, no dedicated dataset has been released for lip forgery detection. In this paper, we use one public audio-visual deepfake dataset and organize a new dataset targeting the lip forgery detection task.

Public DFDC Dataset. DFDC [24] was published in the Deepfake Detection Challenge; it uses multiple manipulation techniques and adds audio to make the video scenarios more natural. To make a fair comparison, we align with the settings of [11] and use 18,000 videos in the experiments.

New Lip Forgery Dataset. To build the new lip forgery dataset, we adopt four state-of-the-art methods [5, 15, 16, 6] to generate fake videos. The composition of the organized dataset is elaborated in Table 2.

4.2. Experimental Settings

As mentioned before, XceptionNet is the baseline. According to the particularities of the public DFDC dataset and the self-organized dataset, we adopt different training strategies. On the large DFDC dataset, we train our model with a batch size of 128 for 500 epochs. Due to the distinctly smaller size of the self-organized dataset, we train with a batch size of 16 for 100 epochs on each sub-dataset. For both datasets, we uniformly use the Adam optimizer with a learning rate of 0.001 and employ ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.

4.3. Parameter Study

Frame Selection. As shown in Figure 2, a single phoneme unit includes several lip frames. We use f to represent the number of lip frames; the value of f affects the competence of the model. Too few lip frames miss lip features of the current phoneme, while extra frames may overlap with other phonemes. In order not to introduce disturbances from other factors, we experiment on the Obama Lip-sync dataset. We integrate all 12 phoneme sub-models into one and take the beginning time of each phoneme as the center to select the surrounding f frames. Table 3 displays the accuracy for f from 3 to 8. The accuracy reaches 97.73% when f = 4, 7 and 8. Considering the trade-off between accuracy and complexity, we finally choose f = 4.

Table 3
Parameter study of frame selection. f = 4 guarantees the best performance while avoiding overlap with other phonemes.

Frame Numbers   f = 3   f = 4   f = 5   f = 6   f = 7   f = 8
ACC (%)         96.21   97.73   96.21   96.97   97.73   97.73
AUC (%)         97.45   98.89   97.45   97.83   98.89   98.89

Phoneme Selection. Still on the Obama Lip-sync dataset, we use N to denote the number of selected phonemes. Referring to the amplitude deviation ranking in Table 1, we integrate the sub-models for N from 2 to 12; the highest accuracy is achieved when N = 5. Thus we choose the phoneme sets with the top-5 amplitude deviations to train sub-models.

4.4. Evaluation on DFDC Dataset

In this section, we compare our method with previous deepfake detection methods on DFDC. The ratio of training to testing sets is 85:15. Even though we only crop the lip region of the face, we still achieve competitive performance. As shown in Table 4, our method achieves 91.6% AUC, outperforming not only the vision-based full-face methods but also the audio-visual multi-modal methods. Among them, Syncnet [12] detects the synchronization between audio and video frames and achieves 89.50% AUC, but ignores the content matching between them. Our improvement mainly benefits from the establishment of the phoneme-lip mapping: the selected phonemes W2, W5, W7, W10 and W11 are robust to various external disturbances in DFDC such as face angle, illumination, and video compression, boosting the detection capability of the ensemble model.

Table 4
Comparison of our method (Xception) with other techniques on the DFDC dataset using the AUC metric. We select the sub-models of W2, W5, W7, W10, and W11 for integration, and our result is competitive against Syncnet and Siamese-based methods.

Methods              AUC (%)   Modality
Xception-c23 [17]    72.20     Video
Meso4 [25]           75.30     Video
DSP-FWA [26]         75.50     Video
MBP [10]             80.34     Audio & Video
Siamese-based [11]   84.40     Audio & Video
Syncnet [12]         89.50     Audio & Video
Ours (Xception)      91.60     Audio & Video

Moreover, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) [28] for the baseline and for our method, as shown in Figure 4. Our method clearly includes the surrounding regions such as the upper and lower lips, which helps the network focus on the open-close amplitudes and is in line with our motivation. In contrast, the baseline model mainly attends to the internal teeth regions, losing the edge information.

[Figure 4: The Grad-CAM of the baseline Xception and ours on the DFDC dataset and two forgery methods in the self-organized dataset. Ours captures more of the lip regions.]

4.5. Evaluation on Self-organized Dataset

In this section, we conduct experiments on the self-organized dataset to verify the performance of real/fake classification and multiple classification.

4.5.1. Evaluation of Real/Fake Classification

For each sub-dataset, we use different phonemes to integrate the final classification model; the selections are listed in Table 5. The baseline model (Xception) is directly trained on all continuous frames of the real/fake videos. Further, to verify that our method is not restricted by the backbone, we adopt another network architecture, ResNet-50 [29], which performs well on image classification tasks. The results in Table 5 demonstrate that our method outperforms the previous methods; note that MBP is designed specifically for Obama lip forgery, and the Audio Driven dataset is challenging due to its low video resolution and the occlusion of microphones or arms.

Table 5
Evaluation of real/fake classification (ACC % / AUC %). For each dataset, our approach surpasses the baselines (Xception/ResNet-50) and existing state-of-the-art detection methods.

Methods              Obama Lip-sync [5]   Audio Driven [15]   First Order [16]    Wav2lip [6]
                     (W1-W2-W4-W5-W7)     (W2-W4-W5-W6-W7)    (W3-W4-W5-W9-W10)   (W1-W2-W7-W10-W12)
MBP [10]             93.54 / 96.03        - / -               - / -               - / -
Siamese-based [11]   90.53 / 93.01        87.47 / 89.86       92.03 / 95.21       84.77 / 88.64
Syncnet [12]         92.18 / 95.21        90.83 / 92.89       92.18 / 95.56       86.08 / 90.16
ResNet-50            79.38 / 85.72        68.65 / 72.62       86.97 / 89.40       75.23 / 78.96
Xception [17]        84.82 / 89.19        70.18 / 78.43       88.83 / 93.71       78.54 / 80.78
Ours (ResNet-50)     96.35 / 97.67        94.67 / 96.40       96.25 / 97.62       95.12 / 96.74
Ours (Xception)      97.73 / 98.89        95.84 / 97.61       97.59 / 98.60       96.43 / 97.89

4.5.2. Evaluation of Multiple Classification

To further distinguish different forgery methods, we label all real lips with 0 and the fake lips of the four sub-datasets with 1–4 individually. W2, W3, W4, W7 and W8 are chosen to train the classification model. Table 6 verifies that the ensemble model can be applied to multiple classification scenarios. We also visualize the t-SNE [30] feature distributions from the Siamese-based method to ours. As shown in Figure 5, our method is superior at finding latent dissimilarity in high-dimensional space, with fewer outliers.

Table 6
Evaluation of multiple classification. Except for the average AUC (%) in the last column, all values are ACC (%). Our method integrates the sub-models of W2, W3, W4, W7 and W8 into the ensemble one, which largely outperforms the advanced methods.

Methods              Real    Obama Lip-sync [5]   Audio Driven [15]   First Order [16]   Wav2lip [6]   Average ACC   Average AUC
Siamese-based [11]   92.91   77.63                70.86               85.14              79.44         81.20         88.45
Syncnet [12]         94.89   78.79                74.33               88.62              81.54         83.46         90.53
Xception [17]        92.13   73.44                55.13               78.01              77.27         75.37         83.12
Ours (Xception)      96.21   95.96                87.50               96.97              94.88         94.29         96.84

[Figure 5: Feature distribution visualizations for (a) Siamese-based, (b) Syncnet, (c) Xception, and (d) Ours (Xception) on multiple classification. Among the four methods, ours contains fewer outliers and widely separates the real and fake classes.]

4.6. Evaluation on Cross-dataset

Transferability is evaluated by training on DFDC and testing on the self-organized dataset, where all lips are labeled as real/fake. Table 7 shows the better transferability of our method in detecting universal artifacts across datasets.

Table 7
Evaluation on cross-dataset. The test set is the self-organized dataset. Ours (W2, W5, W7, W10, W11) achieves better results.

Methods              ACC     AUC
MBP [10]             57.94   59.12
Siamese-based [11]   59.51   60.68
Syncnet [12]         60.11   61.79
ResNet-50 [27]       54.74   57.67
Xception [17]        56.80   58.89
Ours (ResNet-50)     62.38   63.51
Ours (Xception)      63.67   64.05

5. Conclusion

Lip forgery detection is an extremely challenging task within deepfake detection due to the subtle and local modifications. In this paper, we present a multi-phoneme selection-based framework. Unlike existing deepfake detection, it takes full advantage of the particularity of lip forgery videos, establishing a robust mapping from audio to lip shapes. Twelve categories of phonemes are determined as the smallest identifiable units for various lip shapes, and the phonemes with top-5 distinguishability are selected to train sub-classification models. In addition, we organize a new dataset consisting of four sub-datasets, which is the first one organized for the lip forgery detection task. Extensive experiments demonstrate the effectiveness of our framework, including on the challenging task of cross-dataset evaluation.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grants U20B2047, U1636201 and 62002334, by the Anhui Science Foundation of China under Grant 2008085QF296, by the Exploration Fund Project of the University of Science and Technology of China under Grant YD3480002001, and by the Fundamental Research Funds for the Central Universities under Grant WK2100000011.

References

[1] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, M. Nießner, Face2Face: Real-time face capture and reenactment of RGB videos, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.
[2] Y. Nirkin, Y. Keller, T. Hassner, FSGAN: Subject agnostic face swapping and reenactment, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7183–7192.
[3] DeepFakes, Deepfakes github, http://github.com/deepfakes/faceswap, 2017. Accessed 2020-08-18.
[4] FaceSwap, Faceswap github, http://https://github.com/MarekKowalski/FaceSwap, 2016. Accessed 2020-08-18.
[5] S. Suwajanakorn, S. M. Seitz, I. Kemelmacher-Shlizerman, Synthesizing Obama: Learning lip sync from audio, ACM Transactions on Graphics (SIGGRAPH) 36 (2017) 95.
[6] K. R. Prajwal, R. Mukhopadhyay, V. Namboodiri, C. Jawahar, A lip sync expert is all you need for speech to lip generation in the wild, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[7] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, FaceForensics++: Learning to detect manipulated facial images, arXiv preprint arXiv:1901.08971 (2019).
[8] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face X-ray for more general face forgery detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5001–5010.
[9] Y. Qian, G. Yin, L. Sheng, Z. Chen, J. Shao, Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: ECCV, 2020.
[10] S. Agarwal, H. Farid, O. Fried, M. Agrawala, Detecting deep-fake videos from phoneme-viseme mismatches, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 2814–2822.
[11] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, Emotions don’t lie: A deepfake detection method using audio-visual affective cues, arXiv abs/2003.06711 (2020).
[12] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not made for each other: Audio-visual dissonance-based deepfake detection and localization, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[13] H. L. Bear, R. Harvey, Phoneme-to-viseme mappings: the good, the bad, and the ugly, arXiv abs/1805.02934 (2017).
[14] L. Li, J. Bao, H. Yang, D. Chen, F. Wen, FaceShifter: Towards high fidelity and occlusion aware face swapping, arXiv preprint arXiv:1912.13457 (2019).
[15] R. Yi, Z. Ye, J. Zhang, H. Bao, Y. Liu, Audio-driven talking face video generation with natural head pose, arXiv abs/2002.10137 (2020).
[16] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, arXiv abs/2003.00196 (2019).
[17] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807.
[18] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, M. Pantic, Audio-visual speech recognition with a hybrid CTC/attention architecture, in: 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 513–520.
[19] T. Baltrusaitis, P. Robinson, L.-P. Morency, OpenFace: An open source facial behavior analysis toolkit, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–10.
[20] A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, E. Zacur, AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition, in: LREC, 2004.
[21] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, M. Agrawala, Content-based tools for editing audio stories, in: UIST ’13, 2013.
[22] D. King, Dlib-ml: A machine learning toolkit, Journal of Machine Learning Research 10 (2009) 1755–1758.
[23] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF: A large-scale challenging dataset for deepfake forensics, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.
[24] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The DeepFake Detection Challenge dataset, arXiv preprint arXiv:2006.07397 (2020).
[25] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, MesoNet: a compact facial video forgery detection network, in: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
[26] Y. Li, S. Lyu, Exposing deepfake videos by detecting face warping artifacts, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[30] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.