=Paper=
{{Paper
|id=Vol-3084/paper1
|storemode=property
|title=Lip Forgery Video Detection via Multi-Phoneme Selection
|pdfUrl=https://ceur-ws.org/Vol-3084/paper1.pdf
|volume=Vol-3084
|authors=Jiaying Lin,Wenbo Zhou,Honggu Liu,Hang Zhou,Weiming Zhang,Nenghai Yu
}}
==Lip Forgery Video Detection via Multi-Phoneme Selection==
Jiaying Lin¹, Wenbo Zhou¹*, Honggu Liu¹, Hang Zhou², Weiming Zhang¹* and Nenghai Yu¹
¹ University of Science and Technology of China
² Simon Fraser University
* Corresponding author.
Abstract
Deepfake techniques can produce realistic manipulated videos, including full-face synthesis and local region forgery. General methods work well in detecting the former but usually struggle to capture local artifacts, especially for lip forgery detection. In this paper, we focus on the lip forgery detection task. We first establish a robust mapping from audio to lip shapes. Then we classify the lip shapes of each video frame according to the spoken phonemes, enabling the network to capture the dissonances between lip shapes and phonemes in fake videos and increasing interpretability. Each lip shape-phoneme set is used to train a sub-model, and those with better discrimination are selected to form an ensemble classification model. Extensive experimental results demonstrate that our method outperforms state-of-the-art methods on both the public DFDC dataset and a self-organized lip forgery dataset.
Keywords
Lip Forgery, Deepfake Detection, Phoneme and Viseme
1. Introduction

Thanks to the tremendous success of deep generative models, face forgery has become an emerging research topic in recent years and various methods have been proposed [1, 2]. Depending on the manipulated region, they can be roughly categorized into two types: full-face synthesis [3, 4], which usually swaps a whole synthesized source face onto a target face, and local face region forgery [5, 6], which only modifies a partial face region, e.g., modifying the lip shape to match the audio content. Especially when the lips of politicians are tampered with to make inappropriate speeches, it can lead to a serious political crisis.

Figure 1: The lip shapes of speaking the word "apple" in a real (top) and a fake (bottom) video. In the real video, the lips are more widely opened with clear teeth texture, while the opposite holds in the fake one.
To alleviate the risks brought by malicious uses of face forgery, many detection methods have been proposed [7, 8, 9]. These methods usually consider forgery detection from different aspects and extract visual features from the whole face region, achieving impressive detection results on the public datasets FF++ and DFDC, in which most of the fake videos are tampered in a full-face synthesized manner. However, this type of detection method struggles to handle local region forgery cases such as lip-sync [5]. Recently, [10] attempted to detect lip-sync forgery videos with single phoneme-viseme matching for specific targets. [11, 12] employ features such as audio and expression to detect synchronization between different modalities.

To address the problem of local region forgery detection, in this paper we propose a complete multi-phoneme selection based framework. To take full advantage of the particularity of lip forgery videos, which contain audio, we need to establish a robust mapping relationship between the lip shapes and the audio contents. Prior studies in the realm of Audio-Visual Speech Recognition have demonstrated that the phoneme is the smallest identifiable unit correlated with a particular lip shape. Motivated by [13], we divide audio contents into 12 phoneme classes and classify all the video frames. For each phoneme-lip set, we measure the deviation in open-close amplitude between real and fake lip shapes, and train a sub-model for real/fake classification.

Usually, a large deviation represents an obvious discrepancy between the real and fake lip shapes, which also indicates the great difficulty of synthesizing the lip shape for the corresponding phoneme. Simultaneously, it shows the robustness of the correlated phoneme-lip mapping against physical changes in different videos, e.g., volume and face angle. This precisely provides a distinguishing feature for forgery detection. By selecting the phonemes with the top-5 deviations, we integrate the corresponding 5 well-trained sub-models into an ensemble model to maximize the discriminability of real and fake videos.

To verify the effectiveness, we have conducted extensive experiments on both the public DFDC dataset and a self-organized lip forgery video dataset which contains four sub-datasets. The experimental results demonstrate that our method outperforms the current state-of-the-art detection methods on cross-dataset evaluation and multiple-class classification. In addition, our method is also competitive on single-dataset classification. Our contributions are summarized as follows:

• We propose a multi-phoneme selection based framework for the lip forgery detection task, which takes full advantage of the visual and aural information in lip forgery videos.

• We establish 12 categories of phoneme-lip mapping relationships and explore the robustness of the open-close amplitudes of each pair for real/fake classification. We also organize a new lip forgery dataset, which is helpful to facilitate the development of lip forgery detection methods.

• Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for lip forgery detection on both the public DFDC dataset and a self-organized lip forgery dataset.

2. Related work

2.1. Deep Face Forgery

According to different forgery regions, existing methods can be divided into two categories: full-face synthesis and local region forgery. Full-face synthesis usually synthesizes a whole source face and swaps it onto the target. Typical works are [4, 14].

Local region forgery is a more common type, focusing on slight manipulation of partial facial regions, e.g., eyebrow locations and lip shapes. Lip-sync [5] is able to modify the lip shapes in Obama's talking videos to accurately synchronize with a given audio sequence. [15] leverages 3D modeling of specific face videos to make the control of lip shapes more flexible. First Order Motion [16] uses a video to drive a single source portrait image to generate a talking video. The detection of local region forgery is more challenging due to its subtle and local nature.

2.2. Face Forgery Detection

Early works explored visual artifacts, e.g., abnormalities of eye blinking and teeth. Learning-based detection methods have become mainstream in very recent years. [7] uses XceptionNet [17] to extract features from the spatial domain. F3-Net [9] achieves state-of-the-art performance using frequency-aware decomposition. However, since audio is lacking in most public deepfake datasets, these methods are designed in a universal manner with no consideration of audio matching. They perform well in full-face synthesis detection but are not adequate to recognize the subtle artifacts in local region forgery.

Recently, [11, 12] utilize Siamese networks to calculate the feature distances between multiple modalities. If manipulation is conducted on only a small segment of the video, this weakens the inconsistency among these modalities at the video level, leading to a decrease in detection performance. [10] establishes a single phoneme-viseme mapping for a specific person, which severely restricts the application scenario. To address the above limitations, we propose a multi-phoneme selection based framework for lip forgery video detection.

3. Method

In this section we elaborate the multi-phoneme selection based framework. Before that, an important observation about lip forgery is introduced first.

3.1. Motivation

Lip forgery modifies a specific person's lip shape to match arbitrary audio contents, thus establishing a close relationship between them. However, due to imperfections in the manipulation, uncontrollable artifacts may be generated that hinder the matching.

As shown in Figure 1, when saying the word "apple", the lips in the forged video are more blurred and do not open as widely. Although this nuance is not easy to perceive by human eyes, a well-designed detector can capture it. Nevertheless, the lip shape itself fluctuates in a certain range under different expressions, and a large fluctuation indicates poor robustness.

Based on this observation, it is necessary to establish a robust mapping from audio to lip shapes. Inspired by recent works in Audio-Visual Speech Recognition [18], we divide all audio contents into 12 phoneme categories as the smallest identifiable units. Each phoneme set consists of various vowels, consonants and the quiet soundmark, and can be used to train a sub-model independently to distinguish real/fake lips. Eventually, we select several sub-models to integrate into the final classifier, considering the trade-off between efficiency and performance. The framework is depicted in Figure 2.
Figure 2: The framework of ours (pipeline panels: P2FA forced alignment on audio in real and fake videos, audio dividing, 48 IPA phonetic symbols merged into 12 phoneme categories, 12 phoneme-lip mapping with an LDA classifier, lip frames, amplitude deviation, multi-phoneme selection). Through 12 phoneme-lip shape mapping and multi-phoneme selection, we obtain the final ensemble detection model.
3.2. Correlation Establishment from Phonemes to Lip Shapes

For a given talking video, we use OpenFace [19] to align each frame and crop the lip area to 128×128. These lip images are categorized into different phoneme sets and used as training/testing data for real/fake classification.

To establish the mapping from phonemes to lip shapes, we first process all the real videos. According to the International Phonetic Alphabet (IPA), we divide the lip shapes into 48 classes. For a given lip shape, we calculate the Mahalanobis distance $d_c$ of the open-close amplitude between the current lip shape $x$ and the mean $\bar{x}_c$ of each class:

$d_c(x) = \sqrt{(x - \bar{x}_c)^T \, \Sigma_c^{-1} \, (x - \bar{x}_c)}$   (1)

Next, we estimate the probabilities of it belonging to each class, and assign the sample to the class with the highest normalized probability $P_c$:

$P_c(x) = \frac{p(c \mid x)}{\sum_{c=1}^{C} p(c \mid x)}$   (2)

Here, $p(c \mid x)$ is the probability of $x$ belonging to class $c$, which is computed as the ratio between the in-class and the out-of-class distributions of the distance $d_c$, each following a Gaussian distribution with means $\mu_c$, $\tilde{\mu}_c$ and variances $\sigma_c$, $\tilde{\sigma}_c$, respectively:

$p(c \mid x) = \frac{1 - \Phi\left((d_c(x) - \mu_c)/\sigma_c\right)}{\Phi\left((d_c(x) - \tilde{\mu}_c)/\tilde{\sigma}_c\right)}$   (3)
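To make the assignment rule concrete, the following sketch scores one open-close amplitude vector against pre-computed per-class statistics using Eqs. (1)-(3) and returns the most probable class. It is a simplified illustration under assumptions, not the authors' code: the per-class statistics, the amplitude feature extraction, and all variable names are placeholders assumed to be estimated beforehand from real videos.

<pre>
import numpy as np
from scipy.stats import norm

def mahalanobis(x, mean, cov_inv):
    """Eq. (1): Mahalanobis distance between an amplitude vector and a class mean."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def assign_phoneme_class(x, class_stats):
    """Assign x to the class with the highest normalized probability, Eqs. (2)-(3).

    class_stats: list of dicts with keys
      'mean', 'cov_inv'       -- per-class amplitude statistics
      'mu', 'sigma'           -- Gaussian fit of in-class distances
      'mu_out', 'sigma_out'   -- Gaussian fit of out-of-class distances
    (all assumed to be pre-computed from the real videos).
    """
    probs = []
    for s in class_stats:
        d = mahalanobis(x, s['mean'], s['cov_inv'])
        in_class = 1.0 - norm.cdf((d - s['mu']) / s['sigma'])       # numerator of Eq. (3)
        out_class = norm.cdf((d - s['mu_out']) / s['sigma_out'])    # denominator of Eq. (3)
        probs.append(in_class / max(out_class, 1e-12))              # guard against division by zero
    probs = np.asarray(probs)
    probs = probs / probs.sum()                                     # Eq. (2): normalize over classes
    return int(np.argmax(probs)), probs
</pre>

In the paper's pipeline the resulting 48 IPA classes are subsequently merged into the 12 phoneme categories W1-W12, as described next.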
After obtaining the mapping, a multi-class LDA classifier pre-trained on [20] is utilized for classification. However, different classes may share the same lip shape appearance, e.g., m, b, p. By iteratively merging similar phonetic symbol classes, we obtain 12 distinguishable real lip shape categories named "phonemes" (W1 to W12) with good robustness. A visual example is given in Figure 3.

Figure 3: Illustration of the robust phoneme categories (e.g., W1: m, b, p; W2: t, d, n, s, z, l, r; W3: k, g, ŋ; W4: f, v; W5: ʃ, ʒ, tʃ, dʒ; W6: θ, ð; W7: i:, e, ɪ, eɪ, j, æ; W9: ɑ:, ɑ, ʌ, ə, h; W10: u:, ɔ:, ɒ, w, ɔɪ; W11: ɜː; W12: #). We exhibit the basic lip patterns with similar phonetics, visually comparing the real and fake lip shapes and the average open-close amplitude curves.

In fake videos, the lip shapes have been manipulated. As illustrated in Figure 1, the opening amplitudes of fake lips are quite different from real ones, thus directly using the phoneme classifier trained on real lips may lead to misclassification. Since the audio contents in fake videos are not modified, we use them as the guidance for fake lip classification. First, Google's Speech-to-Text API is used to obtain the corresponding transcribed texts from the audio. Both the texts and audio are then fed into the P2FA toolkit [21]. By conducting forced alignment on phonemes and words, we obtain the start and end time of each phoneme, and the lip images during this period are categorized into the current phoneme. In Figure 2, the P2FA section shows the alignment procedure.

3.3. Multiple Phoneme Selection

Although the lip shapes in one phoneme set are similar, the open-close amplitudes among phonemes are quite different. We use the dlib 68 face landmark detector [22] to compute the vertical distance between the 63rd and 67th landmarks: $D = y_{63} - y_{67}$. Here $D$ represents the
open-close amplitude of the current lip shape. Using the number of frames as the horizontal axis, we calculate $D$ for each frame during the period of the phoneme. In Figure 3, we plot two average amplitude curves for each set; the red curves represent the real videos while the blue ones represent the fake.

In W1 and W2, the real and fake curves are widely separated with almost no overlap, while in W3 and W6 there are partially overlapping areas. This observation indicates that the real and fake lips are more discriminative in certain phoneme sets. To select the most distinguishable phonemes $W$ for classification, we calculate the differences $D_{W_{max}}$ and $D_{W_{min}}$ between the maximum and the minimum values of the real and fake curves, respectively. We define the amplitude deviation $D_W$ to represent the discrepancy between real and fake in each phoneme $W$: $D_W = \frac{1}{2}(D_{W_{max}} + D_{W_{min}})$.
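As an illustration of this measurement, the sketch below computes the per-frame open-close amplitude $D$ from detected landmarks and ranks phonemes by amplitude deviation. It is a simplified example with assumed variable names, not the authors' released code; in particular, the landmark rows assume dlib's 0-indexed output, so the paper's 1-based points 63 and 67 become indices 62 and 66.

<pre>
import numpy as np

def open_close_amplitude(landmarks):
    """Vertical gap between the inner upper and lower lip midpoints.
    The paper writes D = y_63 - y_67 with 1-based landmark numbers; with a
    dlib (68, 2) landmark array those are rows 62 and 66 (assumption), and we
    take the magnitude so the amplitude is non-negative in image coordinates."""
    return float(abs(landmarks[62, 1] - landmarks[66, 1]))

def amplitude_deviation(real_curve, fake_curve):
    """D_W = 1/2 (D_Wmax + D_Wmin) for one phoneme, where D_Wmax / D_Wmin are
    the differences between the maximum and the minimum values of the average
    real and fake amplitude curves (as read from Sec. 3.3)."""
    d_max = abs(np.max(real_curve) - np.max(fake_curve))
    d_min = abs(np.min(real_curve) - np.min(fake_curve))
    return 0.5 * (d_max + d_min)

def top_k_phonemes(real_curves, fake_curves, k=5):
    """Rank the 12 phoneme sets (dicts keyed by 'W1'..'W12') and keep the top-k."""
    devs = {w: amplitude_deviation(curve, fake_curves[w]) for w, curve in real_curves.items()}
    return sorted(devs, key=devs.get, reverse=True)[:k]
</pre>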
Considering the potential differences among forgery methods, the amplitude deviations of a single phoneme are not identical. As listed in Table 1, the phonemes with the top-5 amplitude deviations for each sub-dataset are in bold; we will introduce the self-organized dataset in Section 4.

Table 1: Amplitude deviation values for the 12 phonemes in the self-organized dataset. The top-5 phonemes with the largest amplitude deviation for each sub-dataset are in bold.
Forgery Methods | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | W11 | W12
Obama Lip-sync [5] | 33.00 | 31.13 | 21.63 | 33.12 | 34.87 | 27.625 | 37.50 | 24.37 | 26.87 | 24.00 | 22.38 | 25.25
Audio Driven [15] | 15.00 | 23.62 | 18.50 | 26.62 | 28.00 | 25.50 | 29.50 | 20.63 | 17.37 | 18.25 | 17.00 | 12.50
First Order Motion [16] | 25.13 | 23.75 | 34.67 | 37.12 | 34.87 | 22.50 | 23.38 | 25.125 | 33.50 | 29.50 | 21.75 | 20.88
Wav2lip [6] | 35.51 | 34.71 | 26.71 | 28.01 | 25.12 | 25.43 | 35.12 | 28.76 | 27.32 | 33.84 | 29.96 | 33.60

3.4. Sub-classification Model Training and Ensemble

After selecting the phoneme-lip sets for each forgery method, we train sub-classification models based on them. Each sub-model can be used independently for real/fake lip discrimination. Here we adopt XceptionNet [17] as the backbone and transfer it to our task by resizing the input to 128×128 and replacing the final fully connected layer with a two-way output.

To obtain stronger detection performance, we integrate the sub-models into an ensemble model. The weight of each sub-model is equal, so that every sub-model contributes fully. Furthermore, a phoneme unit in the video lasts for some duration, which contains several lip frames. Both the number of lip frames $f$ and the number of sub-models $N$ influence the detection accuracy of the final ensemble model, hence we experiment on them respectively. The results in Section 4 demonstrate that when $f = 4$ and $N = 5$, the ensemble model achieves excellent performance without introducing extra complexity.
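A minimal sketch of the ensemble step is given below. It is illustrative only, not the authors' implementation: model loading, the frame sampler, and all names are assumptions. Each selected sub-model scores the $f$ lip frames of a phoneme unit, and the ensemble averages the sub-model probabilities with equal weights.

<pre>
import torch

@torch.no_grad()
def ensemble_score(sub_models, lip_frames_by_phoneme):
    """Equal-weight ensemble of the selected sub-models (Sec. 3.4).

    sub_models: dict mapping a phoneme id (e.g. 'W2') to a trained binary
                classifier with a 2-way output head.
    lip_frames_by_phoneme: dict mapping the same phoneme ids to a tensor of
                shape (f, 3, 128, 128) holding the frames of one phoneme unit.
    Returns the averaged fake probability for the segment.
    """
    probs = []
    for w, model in sub_models.items():
        frames = lip_frames_by_phoneme.get(w)
        if frames is None:          # this phoneme does not occur in the segment
            continue
        model.eval()
        logits = model(frames)                       # (f, 2)
        p_fake = torch.softmax(logits, dim=1)[:, 1]  # per-frame fake probability
        probs.append(p_fake.mean())                  # average over the f frames
    # equal weights across sub-models, as described above
    return torch.stack(probs).mean().item() if probs else 0.5
</pre>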
4. Experiments

In this section, we first introduce a new lip forgery video dataset organized for this paper. Several parameter studies verify the optimality of our settings. Further experiments demonstrate the effectiveness of our proposed framework on the DFDC and self-organized datasets, as well as the transferability between them.

4.1. Public Dataset and New Lip Forgery Dataset

Many datasets [7, 23] have been made public for the deepfake detection task. Although they are large in scale and cover various forgery methods, most of their fake videos do not contain audio and are still tampered in a full-face synthesized manner. So far, there is no dedicated dataset released for lip forgery detection. In this paper, we use one public audio-visual deepfake dataset and organize a new dataset targeting the lip forgery detection task.

Public DFDC Dataset. The DFDC dataset [24] was published for the Deepfake Detection Challenge, using multiple manipulation techniques and adding audio to make the video scenarios more natural. To make a fair comparison, we align with the settings of [11], using 18,000 videos in the experiments.

New Lip Forgery Dataset. To build the new lip forgery dataset, we adopt four state-of-the-art methods [5, 15, 16, 6] to generate fake videos. The composition of the organized dataset is elaborated in Table 2.

Table 2: The composition of our self-organized dataset, including the numbers of videos and frames. The whole dataset consists of four sub-datasets.
Dataset | Real/Fake | Total Frames
Obama Lip-sync [5] | 28 / 56 | 62534
Audio Driven [15] | 24 / 48 | 54416
First Order Motion [16] | 24 / 48 | 53614
Wav2lip [6] | 28 / 56 | 63736

4.2. Experimental Settings

As mentioned before, XceptionNet is the baseline. According to the particularities of the public DFDC dataset and the self-organized dataset, we adopt different training strategies. On the large DFDC dataset, we train our model with a batch size of 128 for 500 epochs. Due to the distinctly smaller size of the self-organized dataset, we train with a batch size of 16 for 100 epochs on each sub-dataset. For both datasets, we uniformly use the Adam optimizer with a learning rate of 0.001, and employ ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.
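For concreteness, a minimal training-loop sketch matching these settings is shown below. It is an illustration under assumptions, not the authors' released code: only the self-organized-dataset configuration is shown, the data loader is expected to yield 128×128 lip crops with real/fake labels for one selected phoneme set, and ResNet-50 is used here for brevity although the paper's main backbone is XceptionNet with its final layer replaced by a two-way output.

<pre>
import torch
from torch import nn, optim
from torchvision.models import resnet50

def train_sub_model(loader, epochs=100, lr=1e-3, device="cuda"):
    """Train one real/fake sub-model on the lip crops of a single phoneme set.
    `loader` is a torch DataLoader of (frames, labels) batches; its batch size
    is set when it is built (16 for the self-organized dataset, 128 for DFDC)."""
    model = resnet50(num_classes=2).to(device)       # 2-way output head
    optimizer = optim.Adam(model.parameters(), lr=lr)  # Adam, lr = 0.001
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for frames, labels in loader:
            frames, labels = frames.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model
</pre>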
4.3. Parameter Study

Frame Selection. As shown in Figure 2, a single phoneme unit includes several lip frames. We use $f$ to represent the number of lip frames; the value of $f$ has an impact on the competence of the model. Too few lip frames result in missing lip features of the current phoneme, while extra frames may overlap with other phonemes. In order not to introduce disturbances from other factors, we experiment on the Obama Lip-sync dataset. We integrate all 12 phoneme sub-models into one and take the beginning time of each phoneme as the center to select the surrounding $f$ frames. Table 3 displays the accuracy for $f$ from 3 to 8. The accuracy reaches 97.73% when $f$ = 4, 7, and 8. Considering the trade-off between accuracy and complexity, we finally choose $f$ = 4.

Table 3: Parameter study of frame selection. $f$ = 4 can guarantee the best performance and avoid the overlap with other phonemes.
Frame Numbers | f = 3 | f = 4 | f = 5 | f = 6 | f = 7 | f = 8
ACC (%) | 96.21 | 97.73 | 96.21 | 96.97 | 97.73 | 97.73
AUC (%) | 97.45 | 98.89 | 97.45 | 97.83 | 98.89 | 98.89

Phoneme Selection. Still working on the Obama Lip-sync dataset, we use $N$ to denote the number of selected phonemes. Referring to the amplitude deviation ranking listed in Table 1, we integrate the sub-models for $N$ from 2 to 12; the highest accuracy is achieved when $N$ = 5. Thus we choose the phoneme sets with the top-5 amplitude deviations to train sub-models.

4.4. Evaluation on DFDC Dataset

In this section, we compare our method with previous deepfake detection methods on DFDC. The ratio of training to testing sets is 85:15. Even though we only crop the lip region of the face, we still achieve competitive performance. In Table 4, our method achieves 91.6% AUC, which outperforms not only the vision-based full-face methods but also the audio-visual multi-modal methods. Among them, Syncnet [12], which detects the synchronization from audio to video frames, achieves 89.50% AUC while ignoring the content matching between them. The improvement of ours mainly benefits from the establishment of the phoneme-lip mapping, where the selected phonemes W2, W5, W7, W10 and W11 are robust to various external disturbances in DFDC such as face angle, illumination, and video compression, boosting the detection capability of the ensemble model.

Table 4: Comparison of our method (Xception) with other techniques on the DFDC dataset using the AUC metric. We select sub-models of W2, W5, W7, W10, and W11 for integration, and our result is competitive against Syncnet and Siamese-based methods.
Methods | DFDC | Modality
Xception-c23 [17] | 72.20 | Video
Meso4 [25] | 75.30 | Video
DSP-FWA [26] | 75.50 | Video
MBP [10] | 80.34 | Audio & Video
Siamese-based [11] | 84.40 | Audio & Video
Syncnet [12] | 89.50 | Audio & Video
Ours (Xception) | 91.60 | Audio & Video

Moreover, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) [28] for the baseline and for our method, as shown in Figure 4. Our method clearly attends to the surrounding regions such as the upper and lower lips, which encourages the network to focus on the open-close amplitudes and is in line with our motivation. In contrast, the baseline model mainly concerns the internal teeth regions, losing the edge information.

4.5. Evaluation on Self-organized Dataset

In this section, we conduct experiments on the self-organized dataset to verify the performance of real/fake classification and multiple classification.

4.5.1. Evaluation of Real/Fake Classification

For each sub-dataset, we use different phonemes to integrate the final classification model; the selections are listed in Table 5. The baseline model (Xception) is directly trained on all continuous frames of the real/fake videos. Further, to verify that our method is not restricted by the backbone, we adopt another network architecture, ResNet-50 [29], which performs well in image classification tasks. The results in Table 5 demonstrate that our method outperforms the previous methods; note that MBP is designed specifically for Obama lip forgery, and that the Audio Driven dataset is challenging due to its low video resolution and the occlusion of microphones or arms.
Table 5: Evaluation of real/fake classification. For each dataset, the performance of our approach surpasses the baselines (Xception/ResNet-50) and existing state-of-the-art detection methods. Each cell reports ACC (%) / AUC (%); the selected phonemes for each sub-dataset are given in parentheses.
Methods | Obama Lip-sync [5] (W1-W2-W4-W5-W7) | Audio Driven [15] (W2-W4-W5-W6-W7) | First Order [16] (W3-W4-W5-W9-W10) | Wav2lip [6] (W1-W2-W7-W10-W12)
MBP [10] | 93.54 / 96.03 | - | - | -
Siamese-based [11] | 90.53 / 93.01 | 87.47 / 89.86 | 92.03 / 95.21 | 84.77 / 88.64
Syncnet [12] | 92.18 / 95.21 | 90.83 / 92.89 | 92.18 / 95.56 | 86.08 / 90.16
ResNet-50 | 79.38 / 85.72 | 68.65 / 72.62 | 86.97 / 89.40 | 75.23 / 78.96
Xception [17] | 84.82 / 89.19 | 70.18 / 78.43 | 88.83 / 93.71 | 78.54 / 80.78
Ours (ResNet-50) | 96.35 / 97.67 | 94.67 / 96.40 | 96.25 / 97.62 | 95.12 / 96.74
Ours (Xception) | 97.73 / 98.89 | 95.84 / 97.61 | 97.59 / 98.60 | 96.43 / 97.89

4.5.2. Evaluation of Multiple Classification

To further distinguish the different forgery methods, we label all real lips in the 4 sub-datasets with 0 and the fake lips of each forgery method with 1 to 4, respectively. W2, W3, W4, W7 and W8 are chosen to train the classification model.

Table 6: Evaluation of multiple classification. Except for the average AUC (%) in the last column, all values are ACC (%). Our method integrates the sub-models of W2, W3, W4, W7 and W8 into the ensemble one, which largely outperforms the advanced methods.
Methods | Real | Obama Lip-sync [5] | Audio Driven [15] | First Order [16] | Wav2lip [6] | Average ACC | Average AUC
Siamese-based [11] | 92.91 | 77.63 | 70.86 | 85.14 | 79.44 | 81.20 | 88.45
Syncnet [12] | 94.89 | 78.79 | 74.33 | 88.62 | 81.54 | 83.46 | 90.53
Xception [17] | 92.13 | 73.44 | 55.13 | 78.01 | 77.27 | 75.37 | 83.12
Ours (Xception) | 96.21 | 95.96 | 87.50 | 96.97 | 94.88 | 94.29 | 96.84

Table 6 verifies that the ensemble model can be applied to multiple classification scenarios. We also visualize the t-SNE [30] feature distributions of the compared methods and ours. As shown in Figure 5, our method better separates the classes in the high-dimensional feature space, with fewer outliers.

Figure 4: The Grad-CAM of the baseline Xception and ours on the DFDC dataset and two forgery methods of the self-organized dataset (Obama Lip-sync, Audio Driven), for real and fake samples. Ours can easily capture more lip regions.

Figure 5: Visualization of feature distributions for (a) Siamese-based, (b) Syncnet, (c) Xception and (d) Ours (Xception) on multiple classification. Among the four methods, ours contains fewer outliers and widely separates the real and fake classes.
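A feature-space visualization of this kind can be reproduced with a few lines of scikit-learn. The sketch below is a generic illustration, not the authors' code; the penultimate-layer features, label array, and class names are assumed inputs.

<pre>
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, class_names, out_path="tsne.png"):
    """Project (N, d) features to 2-D with t-SNE and scatter-plot them by class
    (0 = real, 1-4 = the four forgery methods). `features` and `labels` are
    NumPy arrays collected from the trained classifier."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for c, name in enumerate(class_names):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=4, label=name)
    plt.legend()
    plt.savefig(out_path, dpi=200)
</pre>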
4.6. Evaluation on Cross-dataset

Transferability is evaluated by training on DFDC and testing on the self-organized dataset, where all lips are labeled as real/fake. Table 7 shows the better transferability of our method in detecting universal artifacts across datasets.

Table 7: Evaluation on cross-dataset. The test set is the self-organized dataset. Ours (W2, W5, W7, W10, W11) achieves better results.
Methods | ACC | AUC
MBP [10] | 57.94 | 59.12
Siamese-based [11] | 59.51 | 60.68
Syncnet [12] | 60.11 | 61.79
ResNet-50 [27] | 54.74 | 57.67
Xception [17] | 56.80 | 58.89
Ours (ResNet-50) | 62.38 | 63.51
Ours (Xception) | 63.67 | 64.05

5. Conclusion

Lip forgery detection is an extremely challenging task in deepfake detection due to the subtle and local modifications. In this paper, we present a multi-phoneme selection based framework. Differing from existing deepfake detection, it takes full advantage of the particularity of lip forgery videos, establishing a robust mapping from audio to lip shapes. 12 categories of phonemes are determined as the smallest identifiable units for various lip shapes, and the phonemes with the top-5 distinguishability are selected to train sub-classification models. In addition, we organize a new dataset consisting of four sub-datasets, which is the first one organized for the lip forgery detection task. Extensive experiments demonstrate the effectiveness of our framework, including on the challenging task of cross-dataset evaluation.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grants U20B2047, U1636201 and 62002334, by the Anhui Science Foundation of China under Grant 2008085QF296, by the Exploration Fund Project of the University of Science and Technology of China under Grant YD3480002001, and by the Fundamental Research Funds for the Central Universities under Grant WK2100000011.

References

[1] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, M. Nießner, Face2face: Real-time face capture and reenactment of rgb videos, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 2387–2395.
[2] Y. Nirkin, Y. Keller, T. Hassner, Fsgan: Subject agnostic face swapping and reenactment, 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 7183–7192.
[3] DeepFakes, Deepfakes github, http://github.com/deepfakes/faceswap, 2017. Accessed 2020-08-18.
[4] FaceSwap, Faceswap github, https://github.com/MarekKowalski/FaceSwap, 2016. Accessed 2020-08-18.
[5] S. Suwajanakorn, S. M. Seitz, I. Kemelmacher-Shlizerman, Synthesizing obama: Learning lip sync from audio, SIGGRAPH 36 (2017) 95.
[6] K. R. Prajwal, R. Mukhopadhyay, V. Namboodiri, C. Jawahar, A lip sync expert is all you need for speech to lip generation in the wild, Proceedings of the 28th ACM International Conference on Multimedia (2020).
[7] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, Faceforensics++: Learning to detect manipulated facial images, arXiv preprint arXiv:1901.08971 (2019).
[8] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face x-ray for more general face forgery detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5001–5010.
[9] Y. Qian, G. Yin, L. Sheng, Z. Chen, J. Shao, Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: ECCV, 2020.
[10] S. Agarwal, H. Farid, O. Fried, M. Agrawala, Detecting deep-fake videos from phoneme-viseme mismatches, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2020) 2814–2822.
[11] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, Emotions don't lie: A deepfake detection method using audio-visual affective cues, ArXiv abs/2003.06711 (2020).
[12] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not made for each other - audio-visual dissonance-based deepfake detection and localization, Proceedings of the 28th ACM International Conference on Multimedia (2020).
[13] H. L. Bear, R. Harvey, Phoneme-to-viseme mappings: the good, the bad, and the ugly, ArXiv abs/1805.02934 (2017).
[14] L. Li, J. Bao, H. Yang, D. Chen, F. Wen, Faceshifter: Towards high fidelity and occlusion aware face swapping, arXiv preprint arXiv:1912.13457 (2019).
[15] R. Yi, Z. Ye, J. Zhang, H. Bao, Y. Liu, Audio-driven talking face video generation with natural head pose, ArXiv abs/2002.10137 (2020).
[16] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, ArXiv abs/2003.00196 (2019).
[17] F. Chollet, Xception: Deep learning with depthwise separable convolutions, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 1800–1807.
[18] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, M. Pantic, Audio-visual speech recognition with a hybrid ctc/attention architecture, 2018 IEEE Spoken Language Technology Workshop (SLT) (2018) 513–520.
[19] T. Baltrusaitis, P. Robinson, L.-P. Morency, Openface: An open source facial behavior analysis toolkit, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) (2016) 1–10.
[20] A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, E. Zacur, Av@car: A spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition, in: LREC, 2004.
[21] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, M. Agrawala, Content-based tools for editing audio stories, in: UIST '13, 2013.
[22] D. King, Dlib-ml: A machine learning toolkit, J. Mach. Learn. Res. 10 (2009) 1755–1758.
[23] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-df: A large-scale challenging dataset for deepfake forensics, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.
[24] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The deepfake detection challenge dataset, arXiv preprint arXiv:2006.07397 (2020).
[25] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, Mesonet: a compact facial video forgery detection network, 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (2018) 1–7.
[26] Y. Li, S. Lyu, Exposing deepfake videos by detecting face warping artifacts, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 770–778.
[28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[30] L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (2008) 2579–2605.