<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lip Forgery Video Detection via Multi-Phoneme Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiaying Lin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Zhou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Honggu Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiming Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nenghai Yu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Simon Fraser University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science and Technology of China</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deepfake techniques can produce realistic manipulated videos, including full-face synthesis and local region forgery. General methods work well in detecting the former but usually struggle to capture local artifacts, especially for lip forgery detection. In this paper, we focus on the lip forgery detection task. We first establish a robust mapping from audio to lip shapes. Then we classify the lip shapes of each video frame according to the different spoken phonemes, enabling the network to capture the dissonances between lip shapes and phonemes in fake videos and increasing interpretability. Each lip shape-phoneme set is used to train a sub-model, and those with better discrimination are selected to obtain an ensemble classification model. Extensive experimental results demonstrate that our method outperforms state-of-the-art methods on both the public DFDC dataset and a self-organized lip forgery dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Lip Forgery</kwd>
        <kwd>Deepfake Detection</kwd>
        <kwd>Phoneme and Viseme</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Thanks to the tremendous success of deep generative models, face forgery has become an emerging research topic in recent years and various methods have been proposed [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Depending on the manipulated region, they Fake
can be roughly categorized into two types: full-face
synthesis [3, 4] that usually swaps the whole synthesized
source face to a target face, and local face region forgery Figure 1: The lip shapes of speaking the word “apple” in real
[5, 6] that only modifies partial face region, e.g., modify- (top) and fake (bottom) video. In the real video, the lips are
ing the lip shape to match the audio content. Especially more widely opened with clear teeth texture, while opposite
when the lips of politicians have been tampered with in the fake.
to make inappropriate speeches, it can lead to serious
political crisis.
      </p>
      <p>
        To alleviate the risks brought by malicious uses of face forgery, many detection methods have been proposed [7, 8, 9]. These methods usually consider forgery detection from different aspects and extract visual features from the whole face region, achieving impressive detection results on the public datasets FF++ and DFDC, in which most of the fake videos are tampered in a full-face synthesized manner. However, this type of detection method struggles to handle local region forgery cases such as lip-sync [5]. Recently, [10] attempted to detect lip-sync forgery videos with single phoneme-viseme matching for specific targets. [
        <xref ref-type="bibr" rid="ref3">11, 12</xref>
        ] employ features such as audio and expression to detect the synchronization between different modalities.
      </p>
      <p>To address the problem of local region forgery detection, in this paper we propose a complete multi-phoneme selection-based framework. To take full advantage of the particularity of lip forgery videos, which contain audio, we need to establish a robust mapping relationship between the lip shapes and the audio contents. Prior studies in the realm of Audio-Visual Speech Recognition have demonstrated that the phoneme is the smallest identifiable unit correlated with a particular lip shape. Motivated by [13], we divide the audio contents into 12 phoneme classes and classify all the video frames accordingly. For each phoneme-lip set, we measure the deviation in open-close amplitude between real and fake lip shapes, and train a sub-model for real/fake classification.</p>
      <p>Usually, a large deviation represents an obvious discrepancy between the real and fake lip shapes, which also indicates the great difficulty of synthesizing the lip shape for the corresponding phoneme. Simultaneously, it shows the robustness of the correlated phoneme-lip mapping against physical changes across different videos, e.g., volume and face angle. This precisely provides a distinguishing feature for forgery detection. By selecting the phonemes with the top-5 deviations, we integrate the corresponding 5 well-trained sub-models into an ensemble model to maximize the discriminability between real and fake videos.</p>
      <p>To verify the effectiveness, we have conducted extensive experiments on both the public DFDC dataset and a self-organized lip forgery video dataset which contains four sub-datasets. The experimental results demonstrate that our method outperforms the current state-of-the-art detection methods on cross-dataset evaluation and multiple-class classification. In addition, our method is also competitive on single-dataset classification.</p>
      <p>In summary, our contributions are as follows:
• We propose a multi-phoneme selection based framework for the lip forgery detection task, which takes full advantage of the visual and aural information in lip forgery videos.
• We establish 12 categories of phoneme-lip mapping relationships and explore the robustness of the open-close amplitudes of each pair for real/fake classification. We also organize a new lip forgery dataset, which is helpful to facilitate the development of lip forgery detection methods.
• Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for lip forgery detection on both the public DFDC dataset and a self-organized lip forgery dataset.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related work</title>
      <sec id="sec-related-1">
        <title>2.1. Deep Face Forgery</title>
        <p>According to different forgery regions, existing methods can be divided into two categories: full-face synthesis and local region forgery. Full-face synthesis usually synthesizes a whole source face and swaps it onto the target. Typical works are [4, 14].</p>
        <p>Local region forgery is a more common type, focusing on slight manipulation of partial facial regions, e.g., eyebrow locations and lip shapes. Lip-sync [5] is able to modify the lip shapes in Obama’s talking videos to accurately synchronize with a given audio sequence. [15] leverages 3D modeling of specific face videos to make the control of lip shapes more flexible. First Order Motion [16] uses a driving video and a single source portrait image to generate a talking video. The detection of local region forgery is more challenging due to its subtle and local nature.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. Face Forgery Detection</title>
        <p>Early works explored visual artifacts, e.g., the abnormality of eye blinking and teeth. Learning-based detection methods have become mainstream in very recent years. [7] uses XceptionNet [17] to extract features from the spatial domain. F3-Net [9] achieves state-of-the-art performance using frequency-aware decomposition. However, since audio is lacking in most public deepfake datasets, these methods are designed in a universal manner with no consideration of audio matching. They perform well in full-face synthesis detection but are not adequate for recognizing the subtle artifacts in local region forgery.</p>
        <p>
          Recently, [
          <xref ref-type="bibr" rid="ref3">11, 12</xref>
          ] utilize Siamese networks to calculate the feature distances between modalities. If manipulation is conducted on a small segment of the video, the inconsistency among these modalities is weakened at the video level, leading to a decrease in detection performance. [10] establishes a single phoneme-viseme mapping for a specific person, which severely restricts the application scenario. To address the above limitations, we propose a multi-phoneme selection based framework for lip forgery video detection.
        </p>
      </sec>
    </sec>
    <sec id="sec-method">
      <title>3. Method</title>
      <p>In this section, we elaborate the multi-phoneme selection based framework. Before that, an important observation about lip forgery is introduced first.</p>
      <sec id="sec-method-1">
        <title>3.1. Motivation</title>
        <p>Lip forgery modifies a specific person’s lip shape to match arbitrary audio contents, thus establishing a close relationship between them. However, due to imperfections in the manipulation, uncontrollable artifacts may be generated that hinder the matching.</p>
        <p>As shown in Figure 1, when saying the word “apple”, the lips in the forgery videos are more blurred and do not open well. Although this nuance is not easy to perceive with human eyes, a well-designed detector can capture it. Nevertheless, the lip shape itself fluctuates within a certain range under different expressions, and a large fluctuation indicates poor robustness.</p>
        <p>Based on this observation, it is necessary to establish a robust mapping from audio to lip shapes. Inspired by recent works in Audio-Visual Speech Recognition [18], we divide all audio contents into 12 phoneme categories as the smallest identifiable units. Each phoneme set consists of various vowels, consonants and the quiet soundmark, and can be used to train a sub-model independently to distinguish real/fake lips. Eventually, we select several sub-models to integrate into the final classifier, considering the trade-off between efficiency and performance. The framework is depicted in Figure 2.</p>
        <p>Figure 2: Overview of the proposed framework, whose main components are audio dividing (48 IPA phonetic symbols grouped into 12 phoneme categories), LDA, the 12 phoneme-lip mappings between lip frames and phoneme categories, amplitude deviation, multi-phoneme selection, and the final real/fake classifier.</p>
      </sec>
      <sec id="sec-method-2">
        <title>3.2. Phoneme-Lip Mapping</title>
        <p>Here, P(c | x) is the probability that x belongs to class c, which is computed as the ratio between the in-class and the out-of-class distributions of the previously defined distance d(x); these follow Gaussian distributions with the corresponding in-class and out-of-class means and variances:</p>
        <p>P(c | x) = [1 − Φ((d(x) − μ) / σ)] / Φ((d(x) − μ̃) / σ̃)   (3)</p>
        <p>Next, we estimate the probabilities of a sample belonging to each class, and assign the sample to the class with the highest normalized probability:</p>
        <p>P̂(c | x) = P(c | x) / Σₖ P(k | x)</p>
      </sec>
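      <p>For concreteness, the class-assignment step above can be sketched as follows (a minimal Python/NumPy illustration under our own assumptions: the per-class projection distances d(x) and the in-class/out-of-class Gaussian parameters are taken as already estimated, and all function and variable names are ours, not from the paper):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

def class_probability(d, mu, sigma, mu_out, sigma_out):
    """Ratio of in-class to out-of-class evidence for one phoneme class,
    following Eq. (3): (1 - Phi((d - mu)/sigma)) / Phi((d - mu_out)/sigma_out)."""
    in_class = 1.0 - norm.cdf((d - mu) / sigma)
    out_class = norm.cdf((d - mu_out) / sigma_out)
    return in_class / max(out_class, 1e-12)   # guard against division by zero

def assign_phoneme(distances, params):
    """distances[c]: projection distance of the sample for class c.
    params[c]: (mu, sigma, mu_out, sigma_out) estimated on training data.
    Returns the class with the highest normalized probability."""
    probs = np.array([class_probability(distances[c], *params[c])
                      for c in range(len(params))])
    probs = probs / probs.sum()               # normalize over all classes
    return int(np.argmax(probs)), probs
      </preformat>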
      <sec id="sec-1-2">
        <title>3.3. Multiple Phonemes Selection</title>
        <p>Although the lip shapes within one phoneme set are similar, the open-close amplitudes among phonemes are quite different. We use the dlib 68-landmark face detector [22] to compute the vertical distance between the 63rd and 67th landmarks (the inner upper and lower lip points); this distance represents the open-close amplitude of the current lip shape. Using the frame number as the horizontal axis, we calculate the amplitude for each frame during the period of the phoneme. In Figure 3, we plot two average amplitude curves for each set: the red curves represent the real videos and the blue curves the fake ones.</p>
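        <p>As an illustration only, the per-frame amplitude measurement could look like the following sketch (dlib landmark indices are 0-based, so the paper’s 63rd and 67th points correspond to indices 62 and 66; the predictor file name and helper names are our assumptions):</p>
        <preformat>
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_amplitude(frame_bgr):
    """Vertical distance between the inner upper lip (landmark 63, index 62)
    and the inner lower lip (landmark 67, index 66) of the first detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    upper = shape.part(62).y
    lower = shape.part(66).y
    return abs(lower - upper)   # open-close amplitude in pixels

def average_curve(frame_lists):
    """Average amplitude curve over several occurrences of the same phoneme.
    frame_lists: list of per-occurrence lists of frames."""
    curves = []
    for frames in frame_lists:
        amps = [lip_amplitude(f) for f in frames]
        curves.append([a for a in amps if a is not None])
    length = min(len(c) for c in curves)
    return np.mean([c[:length] for c in curves], axis=0)
        </preformat>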
        <p>In W1 and W2, the real and fake curves are widely separated with almost no overlap, while in W3 and W6 there are partially stacked areas. This observation indicates that the real and fake lips are more discriminative in certain phoneme sets. To select the most distinguishable phonemes for classification, we calculate, for the real and the fake average curve of each phoneme, the difference between its maximum and minimum values; the amplitude deviation of the phoneme is then defined as half the sum of these two differences and represents the discrepancy between real and fake for that phoneme.</p>
        <p>Considering the potential differences among forgery methods, the amplitude deviations of a single phoneme are not identical across them. As listed in Table 1, the phonemes with the top-5 amplitude deviations are shown in bold; we will introduce the self-organized dataset in Section 4.</p>
        <p>Table 1: Amplitude deviations of the 12 phonemes for each forgery method (Obama Lip-sync [5], Audio Driven [15], First Order Motion [16]).</p>
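        <p>The deviation computation and top-5 selection can then be sketched as below (hypothetical names; real_curves[w] and fake_curves[w] denote the average amplitude curves of phoneme w obtained as above):</p>
        <preformat>
def amplitude_deviation(real_curve, fake_curve):
    """Half the sum of the max-min ranges of the real and fake average curves,
    used as the discrepancy measure for one phoneme."""
    d_real = max(real_curve) - min(real_curve)
    d_fake = max(fake_curve) - min(fake_curve)
    return 0.5 * (d_real + d_fake)

def select_phonemes(real_curves, fake_curves, k=5):
    """Rank the 12 phoneme sets by amplitude deviation and keep the top-k."""
    scores = {w: amplitude_deviation(real_curves[w], fake_curves[w])
              for w in real_curves}
    return sorted(scores, key=scores.get, reverse=True)[:k]
        </preformat>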
      </sec>
      <sec id="sec-1-3">
        <title>3.4. Sub-classification Models Training and Ensemble</title>
        <p>After selecting the phoneme-lip sets for each forgery method, we train sub-classification models on them. Each sub-model can be used independently for real/fake lip discrimination. Here we adopt XceptionNet [17] as the backbone and transfer it to our task by resizing the input to 128×128 and replacing the final fully connected layer with two outputs.</p>
        <p>To obtain stronger detection performance, we integrate the sub-models into an ensemble. The weight of each sub-model is equal, so that every sub-model contributes fully. Furthermore, a phoneme unit in the video lasts for some duration and therefore contains several lip frames. Both the number of lip frames and the number of sub-models influence the detection accuracy of the final ensemble model, hence we experiment on each of them respectively.</p>
        <p>The results in Section 4 demonstrate that with 4 lip frames per phoneme and 5 selected sub-models, the ensemble model achieves excellent performance.</p>
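        <p>A minimal sketch of the equal-weight ensemble over the selected sub-models is given below (PyTorch, with a timm Xception model standing in for XceptionNet; this is our illustration, not the authors’ released code):</p>
        <preformat>
import timm
import torch

def build_sub_model():
    """Binary real/fake classifier; 128x128 crops are accepted because the
    backbone uses global pooling before the final two-way classification layer."""
    return timm.create_model("xception", pretrained=True, num_classes=2)

@torch.no_grad()
def ensemble_predict(sub_models, clips):
    """sub_models: one trained model per selected phoneme.
    clips: dict mapping each selected phoneme to a tensor of shape (N, 3, 128, 128)
    holding the N lip frames of that phoneme in the test video."""
    votes = []
    for phoneme, model in sub_models.items():
        model.eval()
        probs = torch.softmax(model(clips[phoneme]), dim=1)  # (N, 2)
        votes.append(probs.mean(dim=0))                      # average over lip frames
    fused = torch.stack(votes).mean(dim=0)                   # equal weight per sub-model
    return fused.argmax().item()                             # 0 = real, 1 = fake
        </preformat>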
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <p>In this section, we first introduce the new lip forgery video dataset organized for this paper. Several parameter studies verify the optimality of our settings. Further experiments demonstrate the effectiveness of our proposed framework on the DFDC and the self-organized dataset, as well as the transferability between them.</p>
      <sec id="sec-2-1">
        <title>4.1. Public Dataset and New Lip Forgery</title>
      </sec>
      <sec id="sec-2-2">
        <title>Dataset</title>
        <p>16, 6] to generate fake videos. The composition of the
organized dataset is elaborated in Table 2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. Experimental Settings</title>
        <p>As mentioned before, XceptionNet is the baseline. According to the particularities of the public DFDC dataset and the self-organized dataset, we adopt different training strategies. On the large DFDC dataset, we train our model with a batch size of 128 for 500 epochs. Due to the distinctly smaller size of the self-organized dataset, we train with a batch size of 16 for 100 epochs on each sub-dataset. For both datasets, we uniformly use the Adam optimizer with a learning rate of 0.001 and employ ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.</p>
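        <p>A minimal training-loop sketch matching these settings is given below (PyTorch; the dataset object and any loader parameters beyond those stated above are our placeholders):</p>
        <preformat>
import torch
from torch.utils.data import DataLoader

def train_sub_model(model, dataset, epochs=100, batch_size=16, lr=0.001, device="cuda"):
    """Settings used for the self-organized dataset; for DFDC the paper uses
    batch_size=128 and epochs=500 with the same Adam optimizer."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
        </preformat>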
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Parameter Study</title>
        <p>Frame Selection. As shown in Figure 2, a single phoneme unit includes several lip frames. The number of selected lip frames has an impact on the competence of the model: too few lip frames result in missing lip features of the current phoneme, while extra frames may overlap with other phonemes.</p>
        <p>In order not to introduce disturbances from other factors, we experiment on the Obama Lip-sync dataset. We integrate all 12 phoneme sub-models into one and take the beginning time of each phoneme as the center to select the surrounding frames. Table 3 displays the accuracy for 3 to 8 frames. The accuracy reaches 97.73% with 4, 7 and 8 frames. Considering the trade-off between accuracy and complexity, we finally choose 4 frames per phoneme.</p>
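        <p>A possible frame-selection helper is sketched below (our assumption: the phoneme onset time and the video frame rate are available from the audio alignment):</p>
        <preformat>
def select_frames(frames, onset_sec, fps, n=4):
    """Pick n frames centered on the phoneme onset.
    frames: list of decoded video frames; onset_sec: phoneme start time in seconds."""
    center = int(round(onset_sec * fps))
    half = n // 2
    start = max(0, center - half)
    end = min(len(frames), start + n)
    start = max(0, end - n)          # shift back if we hit the end of the video
    return frames[start:end]
        </preformat>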
        <p>Phoneme Selection. Still on the Obama Lip-sync dataset, we vary the number of selected phonemes. Referring to the amplitude deviation ranking listed in Table 1, we integrate the sub-models of the top 2 to 12 phonemes; the highest accuracy is achieved with 5 phonemes. Thus we choose the phoneme sets with the top-5 amplitude deviations to train the sub-models.</p>
      </sec>
      <sec id="sec-2-5">
        <title>4.4. Evaluation on DFDC Dataset</title>
        <p>In this section, we compare our method with previous deepfake detection methods on DFDC. The ratio of training to testing sets is 85:15. Even though we only crop the lip region of the face, we still achieve a competitive performance. In Table 4, our method achieves 91.6% AUC, which outperforms not only the vision-based full-face methods but also the audio-visual multi-modal methods. Among them, Syncnet [12] detects the synchronization between audio and video frames and achieves 89.50% AUC, while ignoring the content matching between them. The improvement of our method mainly benefits from the establishment of the phoneme-lip mapping, where the selected phonemes W2, W5, W7, W10 and W11 are robust to various external disturbances in DFDC such as face angle, illumination and video compression, boosting the detection capability of the ensemble model.</p>
        <p>Moreover, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) [28] for the baseline and for our method, as shown in Figure 4. Our method clearly includes the surrounding regions such as the upper and lower lips, which helps the network focus on the open-close amplitudes and is in line with our motivation. In contrast, the baseline model mainly attends to the internal teeth regions, losing the edge information.</p>
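        <p>For reference, a self-contained Grad-CAM sketch in the spirit of [28] is given below (plain PyTorch hooks on the last convolutional layer; this is our illustration, not the exact code behind Figure 4):</p>
        <preformat>
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    """image: tensor of shape (1, 3, H, W); target_layer: the last conv module.
    Returns a heatmap in [0, 1] with the same spatial size as the input."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    score = model(image)[0, class_idx]   # logit of the "fake" class by default
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)          # GAP over gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))  # weighted activations
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
    return cam[0, 0]
        </preformat>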
      </sec>
      <sec id="sec-2-6">
        <title>4.5. Evaluation on Self-organized Dataset</title>
        <p>In this section, we conduct experiments on the self-organized dataset to verify the performance of real/fake classification and multiple classification.</p>
        <sec id="sec-2-6-1">
          <title>4.5.1. Evaluation of Real/Fake Classification</title>
          <p>For each sub-dataset, we use different phonemes to integrate the final classification model; the selections are listed in Table 5. The baseline model (Xception) is directly trained on all continuous frames of the real/fake videos.</p>
          <p>Further, to verify that our method is not restricted by the backbone, we adopt another network architecture, ResNet-50 [29], which performs well in image classification tasks. The results in Table 5 demonstrate that our method outperforms the previous methods, where MBP is designed for Obama lip forgery and the Audio Driven dataset is challenging due to its low video resolution and the blocking of microphones or arms.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>4.5.2. Evaluation of Multiple Classification</title>
          <p>To further distinguish different forgery methods, in the 4 sub-datasets we label all real lips with 0 and fake lips with 1 ∼ 4 individually. W2, W3, W4, W7 and W8 are chosen to train the classification model.</p>
          <p>Table 6 verifies that the ensemble model can be applied to multiple classification scenarios. We also intuitively visualize the t-SNE [30] feature distributions of the Siamese-based method and ours. As shown in Figure 5, our method is superior at finding latent dissimilarity in the high-dimensional space, with fewer outliers.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Natural Science Foundation of China under Grants U20B2047, U1636201 and 62002334, by the Anhui Science Foundation of China under Grant 2008085QF296, and by the Exploration Fund Project of the University of Science and Technology of China.</p>
    </ack>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, M. Nießner, Face2face: Real-time face capture and reenactment of rgb videos, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 2387–2395.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Y. Nirkin, Y. Keller, T. Hassner, Fsgan: Subject agnostic face swapping and reenactment, 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 7183–7192.</mixed-citation>
      </ref>
      <ref id="bib3">
        <mixed-citation>[3] DeepFakes, Deepfakes github, http://github.com/deepfakes/faceswap, 2017. Accessed 2020-08-18.</mixed-citation>
      </ref>
      <ref id="bib4">
        <mixed-citation>[4] FaceSwap, Faceswap github, https://github.com/MarekKowalski/FaceSwap, 2016. Accessed 2020-08-18.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[11] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, Emotions don't lie: A deepfake detection method using audio-visual affective cues, ArXiv abs/2003.06711 (2020).</mixed-citation>
      </ref>
      <ref id="bib12">
        <mixed-citation>[12] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not made for each other - audio-visual dissonance-based deepfake detection and localization, Proceedings of the 28th ACM International Conference on Multimedia (2020).</mixed-citation>
      </ref>
      <ref id="bib13">
        <mixed-citation>[13] H. L. Bear, R. Harvey, Phoneme-to-viseme mappings: the good, the bad, and the ugly, ArXiv abs/1805.02934 (2017).</mixed-citation>
      </ref>
      <ref id="bib14">
        <mixed-citation>[14] L. Li, J. Bao, H. Yang, D. Chen, F. Wen, Faceshifter: Towards high fidelity and occlusion aware face swapping, arXiv preprint arXiv:1912.13457 (2019).</mixed-citation>
      </ref>
      <ref id="bib15">
        <mixed-citation>[15] R. Yi, Z. Ye, J. Zhang, H. Bao, Y. Liu, Audio-driven talking face video generation with natural head pose, ArXiv abs/2002.10137 (2020).</mixed-citation>
      </ref>
      <ref id="bib16">
        <mixed-citation>[16] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, ArXiv abs/2003.00196 (2019).</mixed-citation>
      </ref>
      <ref id="bib17">
        <mixed-citation>[17] F. Chollet, Xception: Deep learning with depthwise separable convolutions, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 1800–1807.</mixed-citation>
      </ref>
      <ref id="bib18">
        <mixed-citation>[18] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, M. Pantic, Audio-visual speech recognition with a hybrid ctc/attention architecture, 2018 IEEE Spoken Language Technology Workshop (SLT) (2018) 513–520.</mixed-citation>
      </ref>
      <ref id="bib19">
        <mixed-citation>[19] T. Baltrusaitis, P. Robinson, L.-P. Morency, Openface: An open source facial behavior analysis toolkit, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) (2016) 1–10.</mixed-citation>
      </ref>
      <ref id="bib20">
        <mixed-citation>[20] A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, E. Zacur, Av@car: A spanish multichannel multimodal corpus for in-vehicle automatic audiovisual speech recognition, in: LREC, 2004.</mixed-citation>
      </ref>
      <ref id="bib21">
        <mixed-citation>[21] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, M. Agrawala, Content-based tools for editing audio stories, in: UIST '13, 2013.</mixed-citation>
      </ref>
      <ref id="bib22">
        <mixed-citation>[22] D. King, Dlib-ml: A machine learning toolkit, J. Mach. Learn. Res. 10 (2009) 1755–1758.</mixed-citation>
      </ref>
      <ref id="bib23">
        <mixed-citation>[23] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-df: A large-scale challenging dataset for deepfake forensics, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.</mixed-citation>
      </ref>
      <ref id="bib24">
        <mixed-citation>[24] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The deepfake detection challenge dataset, arXiv preprint arXiv:2006.07397 (2020).</mixed-citation>
      </ref>
      <ref id="bib25">
        <mixed-citation>[25] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, Mesonet: a compact facial video forgery detection network, 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (2018) 1–7.</mixed-citation>
      </ref>
      <ref id="bib26">
        <mixed-citation>[26] Y. Li, S. Lyu, Exposing deepfake videos by detecting face warping artifacts, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.</mixed-citation>
      </ref>
      <ref id="bib27">
        <mixed-citation>[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 770–778.</mixed-citation>
      </ref>
      <ref id="bib28">
        <mixed-citation>[28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.</mixed-citation>
      </ref>
      <ref id="bib29">
        <mixed-citation>[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</mixed-citation>
      </ref>
      <ref id="bib30">
        <mixed-citation>[30] L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (2008) 2579–2605.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>