Lip Forgery Video Detection via Multi-Phoneme Selection

Jiaying Lin¹, Wenbo Zhou¹*, Honggu Liu¹, Hang Zhou², Weiming Zhang¹* and Nenghai Yu¹
¹ University of Science and Technology of China
² Simon Fraser University

Abstract
Deepfake techniques can produce realistic manipulated videos, including full-face synthesis and local region forgery. General detection methods work well on the former but usually fail to capture local artifacts, especially in lip forgery. In this paper, we focus on the lip forgery detection task. We first establish a robust mapping from audio to lip shapes. We then classify the lip shapes of each video frame according to the spoken phonemes, which enables the network to capture the dissonance between lip shapes and phonemes in fake videos and improves interpretability. Each lip-shape/phoneme set is used to train a sub-model, and the sets with the best discrimination are selected to form an ensemble classification model. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on both the public DFDC dataset and a self-organized lip forgery dataset.

Keywords
Lip Forgery, Deepfake Detection, Phoneme and Viseme

Woodstock’21: Symposium on the irreproducible science, June 07–11, 2021, Woodstock, NY
* Corresponding Author.
Email: vivian19@mail.ustc.edu.cn (J. Lin); welbeckz@ustc.edu.cn (W. Zhou); lhg9754@mail.ustc.edu.cn (H. Liu); zhouhang2991@gmail.com (H. Zhou); zhangwm@ustc.edu.cn (W. Zhang); ynh@ustc.edu.cn (N. Yu)
ORCID: 0000-0001-5553-9482 (J. Lin); 0000-0002-4703-4641 (W. Zhou); 0000-0001-9294-9624 (H. Liu); 0000-0001-7860-8452 (H. Zhou); 0000-0001-5576-6108 (W. Zhang); 0000-0003-4417-9316 (N. Yu)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Thanks to the tremendous success of deep generative models, face forgery has become an emerging research topic in recent years, and various methods have been proposed [1, 2]. Depending on the manipulated region, they can be roughly categorized into two types: full-face synthesis [3, 4], which usually swaps a whole synthesized source face onto a target face, and local face region forgery [5, 6], which only modifies a partial face region, e.g., changing the lip shape to match the audio content. In particular, when the lips of politicians are tampered with to make them deliver inappropriate speeches, serious political crises can follow.

[Figure 1: The lip shapes when speaking the word “apple” in a real (top) and a fake (bottom) video. In the real video, the lips open more widely with clear teeth texture, while the opposite holds in the fake.]

To alleviate the risks brought by malicious uses of face forgery, many detection methods have been proposed [7, 8, 9]. These methods usually approach forgery detection from different aspects and extract visual features from the whole face region, achieving impressive results on the public datasets FF++ and DFDC, in which most fake videos are tampered in a full-face synthesized manner. But this type of detection method struggles to handle local region forgery cases such as lip-sync [5]. Recently, [10] attempted to detect lip-sync forgery videos with a single phoneme-viseme matching for specific targets, and [11, 12] employ features such as audio and expression to detect the synchronization between different modalities.

To address the problem of local region forgery detection, in this paper we propose a complete multi-phoneme selection-based framework. To take full advantage of the particularity of lip forgery videos, which contain audio, we need to establish a robust mapping between lip shapes and audio contents. Prior studies in Audio-Visual Speech Recognition have demonstrated that the phoneme is the smallest identifiable unit correlated with a particular lip shape. Motivated by [13], we divide audio contents into 12 phoneme classes and classify all video frames accordingly. For each phoneme-lip set, we measure the deviation in open-close amplitude between real and fake lip shapes and train a sub-model for real/fake classification.

Usually, a large deviation represents an obvious discrepancy between real and fake lip shapes, which also indicates the great difficulty of synthesizing the lip shape for the corresponding phoneme. Simultaneously, it shows the robustness of the correlated phoneme-lip mapping against physical changes across videos, e.g., volume and face angle. This precisely provides a distinguishing feature for forgery detection. By selecting the phonemes with the top-5 deviations, we integrate the corresponding 5 well-trained sub-models into an ensemble model to maximize the discriminability of real and fake videos.

To verify the effectiveness, we have conducted extensive experiments on both the public DFDC dataset and a self-organized lip forgery video dataset that contains four sub-datasets. The results demonstrate that our method outperforms current state-of-the-art detection methods on cross-dataset evaluation and multi-class classification, and is also competitive on single-dataset classification. Our contributions are as follows:

• We propose a multi-phoneme selection-based framework for the lip forgery detection task, which takes full advantage of the visual and aural information in lip forgery videos.
• We establish 12 categories of phoneme-lip mapping relationships and exploit the robustness of the open-close amplitudes of each pair for real/fake classification. We also organize a new lip forgery dataset to facilitate the development of lip forgery detection methods.
• Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for lip forgery detection on both the public DFDC dataset and a self-organized lip forgery dataset.

2. Related work

2.1. Deep Face Forgery

According to the forged region, existing methods can be divided into two categories: full-face synthesis and local region forgery. Full-face synthesis usually synthesizes a whole source face and swaps it onto the target; typical works are [4, 14]. Local region forgery is a more common type, focusing on slight manipulation of partial facial regions, e.g., eyebrow locations and lip shapes. Lip-sync [5] is able to modify the lip shapes in Obama’s talking videos to accurately synchronize with a given audio sequence. [15] leverages 3D modeling of specific face videos to make the control of lip shapes more flexible. First Order Motion [16] uses a driving video to animate a single source portrait image into a talking video. The detection of local region forgery is more challenging due to its subtle and local nature.

2.2. Face Forgery Detection

Early works explored visual artifacts, e.g., abnormalities of eye blinking and teeth. Learning-based detection methods have become mainstream in recent years. [7] uses XceptionNet [17] to extract features from the spatial domain, and F3-Net [9] achieves state-of-the-art results using frequency-aware decomposition. However, since audio is lacking in most public deepfake datasets, these methods are designed in a universal manner with no consideration of audio matching. They perform well on full-face synthesis detection but are not adequate for recognizing the subtle artifacts of local region forgery. Recently, [11, 12] utilized Siamese networks to calculate feature distances across modalities. If the manipulation is conducted on only a small segment of the video, the inconsistency among these modalities is weakened at the video level, leading to a decrease in detection performance. [10] establishes one single phoneme-viseme mapping for a specific person, which severely restricts the application scenario. To address the above limitations, we propose a multi-phoneme selection-based framework for lip forgery video detection.

3. Method

In this section we elaborate the multi-phoneme selection-based framework. Before that, we first introduce an important observation about lip forgery.

3.1. Motivation

Lip forgery modifies a specific person’s lip shape to match arbitrary audio contents, thus establishing a close relationship between them. However, due to imperfections in the manipulation, uncontrollable artifacts may be generated that hinder the matching. As shown in Figure 1, when the word “apple” is spoken, the lips in the forged video are more blurred and do not open well. Although this nuance is hard for human eyes to perceive, a well-designed detector can capture it. Nevertheless, the lip shape itself fluctuates within a certain range under different expressions, and a large fluctuation indicates poor robustness.

Based on this observation, it is necessary to establish a robust mapping from audio to lip shapes. Inspired by recent works in Audio-Visual Speech Recognition [18], we divide all audio contents into 12 phoneme categories as the smallest identifiable units. Each phoneme set consists of various vowels, consonants, or the silence mark, and can be used to independently train a sub-model that distinguishes real from fake lips. Eventually, we select several sub-models to integrate into the final classifier, considering the trade-off between efficiency and performance. The framework is depicted in Figure 2.

[Figure 2: The framework of our method. Audio is transcribed and force-aligned with P2FA, the 48 IPA phonetic symbols are merged into 12 phoneme categories with an LDA classifier, lip frames are mapped to these categories, and multi-phoneme selection by amplitude deviation yields the final ensemble detection model.]

3.2. Correlations Establishment from Phonemes to Lip Shapes

For a given talking video, we use OpenFace [19] to align each frame and crop the lip area to 128×128. These lip images are categorized into different phoneme sets and used as training/testing data for real/fake classification.

To establish the mapping from phonemes to lip shapes, we first process all the real videos. According to the International Phonetic Alphabet (IPA), we divide the lip shapes into 48 classes. For a given lip shape, we calculate the Mahalanobis distance d_c of the open-close amplitude between the current lip shape x and the mean x̄_c of each class:

\[ d_c(\mathbf{x}) = \sqrt{(\mathbf{x} - \bar{\mathbf{x}}_c)^{T}\, \Sigma_c^{-1}\, (\mathbf{x} - \bar{\mathbf{x}}_c)} \tag{1} \]

Next, we estimate the probability of x belonging to each class and assign the sample to the class with the highest normalized probability P_c:

\[ P_c(\mathbf{x}) = \frac{p(c \mid \mathbf{x})}{\sum_{c=1}^{C} p(c \mid \mathbf{x})} \tag{2} \]

Here, p(c | x) is the probability that x belongs to class c, computed as the ratio between the in-class and out-of-class distributions of the distance d_c, which follow Gaussian distributions with means μ_c, μ̃_c and variances σ_c, σ̃_c, respectively:

\[ p(c \mid \mathbf{x}) = \frac{1 - \Phi\!\left(\dfrac{d_c(\mathbf{x}) - \mu_c}{\sigma_c}\right)}{\Phi\!\left(\dfrac{d_c(\mathbf{x}) - \tilde{\mu}_c}{\tilde{\sigma}_c}\right)} \tag{3} \]

After obtaining the mapping, a multi-class LDA classifier pre-trained on [20] is utilized for classification. However, different classes may share the same lip shape appearance, e.g., m, b, p. By iteratively merging similar phonetic symbol classes, we obtain 12 distinguishable and robust real lip shape classes named “phonemes” (W1 to W12). A visual example is given in Figure 3.

[Figure 3: Illustration of the robust phoneme categories. We exhibit the basic lip patterns with similar phonetics and visually compare the real and fake lip shapes and the average open-close amplitude curves. The categories group IPA symbols with similar lip patterns, e.g., W1: m b p; W2: t d n s z l r; W3: k g ŋ; W4: f v; W5: ʃ ʒ tʃ dʒ; W6: θ ð; W7: i: e ɪ eɪ j æ; W9: ɑ: ɑ ʌ ə h; W10: u: ɔ: ɒ w ɔɪ; W11: ɜː; W12: # (silence).]

In fake videos, the lip shapes have been manipulated. As illustrated in Figure 1, the opening amplitudes of fake lips are quite different from real ones, so directly applying the phoneme classifier trained on real lips may lead to misclassification. Since the audio contents of fake videos are not modified, we use them as the guidance for fake lip classification. First, Google’s Speech-to-Text API is used to obtain the transcribed texts from the audio. Both the texts and the audio are then fed into the P2FA toolkit [21]. By conducting forced alignment on phonemes and words, we get the start and end time of each phoneme, and the lip images during this period are categorized into the current phoneme. The P2FA section of Figure 2 shows the alignment procedure.

3.3. Multiple Phonemes Selection

Although the lip shapes in one phoneme set are similar, the open-close amplitudes differ considerably among phonemes. We use the dlib 68-face-landmark detector [22] to compute the vertical-axis difference between the 63th and 67th landmarks: D = (y63 − y67). Here D represents the open-close amplitude of the current lip shape.

Table 1
Amplitude deviation values for the 12 phonemes in the self-organized dataset. The top-5 phonemes with the largest amplitude deviation for each sub-dataset are in bold.

Forgery Methods           W1     W2     W3     W4     W5     W6      W7     W8     W9     W10    W11    W12
Obama Lip-sync [5]        33.00  31.13  21.63  33.12  34.87  27.625  37.50  24.37  26.87  24.00  22.38  25.25
Audio Driven [15]         15.00  23.62  18.50  26.62  28.00  25.50   29.50  20.63  17.37  18.25  17.00  12.50
First Order Motion [16]   25.13  23.75  34.67  37.12  34.87  22.50   23.38  25.125 33.50  29.50  21.75  20.88
Wav2lip [6]               35.51  34.71  26.71  28.01  25.12  25.43   35.12  28.76  27.32  33.84  29.96  33.60
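To make the class-assignment rule of Eqs. (1)–(3) concrete, the following is a minimal pure-Python sketch for a scalar open-close amplitude feature. The function names, the `stats` dictionary layout, and the per-class Gaussian parameters are illustrative assumptions, not the authors' implementation:

```python
import math

def phi(z):
    # Standard normal CDF, expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mahalanobis_1d(x, mean_c, var_c):
    # Eq. (1) specialized to a scalar amplitude feature:
    # sqrt((x - mean)^2 / var) = |x - mean| / std.
    return abs(x - mean_c) / math.sqrt(var_c)

def class_probability(d_c, mu_c, sigma_c, mu_out, sigma_out):
    # Eq. (3): in-class tail probability of the distance d_c divided
    # by the out-of-class CDF, each under its own Gaussian.
    return (1.0 - phi((d_c - mu_c) / sigma_c)) / phi((d_c - mu_out) / sigma_out)

def assign_class(x, stats):
    # Eq. (2): normalize p(c|x) over all classes and return the argmax.
    probs = {}
    for c, s in stats.items():
        d_c = mahalanobis_1d(x, s["mean"], s["var"])
        probs[c] = class_probability(d_c, s["mu"], s["sigma"],
                                     s["mu_out"], s["sigma_out"])
    total = sum(probs.values())
    return max(probs, key=lambda c: probs[c] / total)
```

A lip shape close to a class mean gets a small d_c, hence a large in-class tail probability and a small out-of-class CDF, so p(c | x) peaks at the best-matching class.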
Using the frame index as the horizontal axis, we calculate D for each frame during the period of the phoneme. In Figure 3, we plot two average amplitude curves for each set; the red curves represent the real videos and the blue curves the fake ones. In W1 and W2, the real and fake curves are widely separated with almost no overlap, while in W3 and W6 there are partially stacked areas. This observation indicates that real and fake lips are more discriminative in certain phoneme sets. To select the most distinguishable phonemes W for classification, we calculate the maximum and minimum differences D_Wmax and D_Wmin between the real and fake curves, and define the amplitude deviation D_W to represent the discrepancy between real and fake for each phoneme W: D_W = ½(D_Wmax + D_Wmin).

Considering the differences among forgery methods, the amplitude deviations of a single phoneme are not identical across datasets. As listed in Table 1, the phonemes with the top-5 amplitude deviations are in bold; we introduce the self-organized dataset in Section 4.

Table 2
The composition of our self-organized dataset, including the numbers of videos and frames. The whole dataset consists of four sub-datasets.

Dataset                   Real/Fake   Total   Frames
Obama Lip-sync [5]        28          56      62534
Audio Driven [15]         24          48      54416
First Order Motion [16]   24          48      53614
Wav2lip [6]               28          56      63736

3.4. Sub-classification Model Training and Ensemble

After selecting the phoneme-lip sets for each forgery method, we train sub-classification models based on them.
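As a concrete illustration of this selection step, the sketch below computes the per-frame open-close amplitude D, the amplitude deviation D_W between the average real and fake curves, and the top-5 phoneme choice. It assumes dlib's 0-based landmark arrays (so the paper's 63rd and 67th points become indices 62 and 66) and uses hypothetical helper names; it is not the authors' code:

```python
def open_close_amplitude(landmarks):
    # D = y63 - y67 in the paper's 1-based 68-landmark numbering
    # (inner upper and lower lip); indices 62 and 66 in 0-based arrays.
    return landmarks[62][1] - landmarks[66][1]

def amplitude_deviation(real_curve, fake_curve):
    # D_W = (D_Wmax + D_Wmin) / 2, where D_Wmax and D_Wmin are the
    # largest and smallest per-frame gaps between the average real
    # and fake amplitude curves of a phoneme set.
    gaps = [abs(r - f) for r, f in zip(real_curve, fake_curve)]
    return 0.5 * (max(gaps) + min(gaps))

def select_top_phonemes(deviations, n=5):
    # Keep the n phoneme sets with the largest amplitude deviation.
    return sorted(deviations, key=deviations.get, reverse=True)[:n]
```

Applied to the Obama Lip-sync row of Table 1, select_top_phonemes returns the sets W1, W2, W4, W5 and W7, matching the bold top-5 entries used for that sub-dataset.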
Each sub-model can be used independently for real/fake lip discrimination. We adopt XceptionNet [17] as the backbone and transfer it to our task by resizing the input to 128×128 and replacing the final fully connected layer with two outputs. To obtain stronger detection performance, we integrate the sub-models into an ensemble, with equal average weights so that each contribution is maximized. Furthermore, a phoneme unit in the video lasts for some duration and contains several lip frames. Both the number of lip frames f and the number of sub-models N influence the detection accuracy of the final ensemble model, so we experiment on each. The results in Section 4 demonstrate that with f = 4 and N = 5, the ensemble model achieves excellent performance without introducing extra complexity.

4. Experiments

In this section, we first introduce the new lip forgery video dataset organized in this paper. Parameter studies verify the optimality of our settings. Further experiments demonstrate the effectiveness of our proposed framework on DFDC and the self-organized dataset, as well as the transferability between them.

4.1. Public Dataset and New Lip Forgery Dataset

Many datasets [7, 23] have been released for the deepfake detection task. Although they are large-scale and cover various forgery methods, most of their fake videos do not contain audio and are still tampered in a full-face synthesized manner. So far, no dedicated dataset has been released for lip forgery detection. In this paper, we use one public audio-visual deepfake dataset and organize a new dataset targeting the lip forgery detection task.

Public DFDC Dataset. DFDC [24] was published in the Deepfake Detection Challenge; it uses multiple manipulation techniques and adds audio to make the video scenarios more natural. To make a fair comparison, we align with the settings of [11] and use 18,000 videos in the experiments.

New Lip Forgery Dataset. To build the new lip forgery dataset, we adopt four state-of-the-art methods [5, 15, 16, 6] to generate fake videos. The composition of the organized dataset is elaborated in Table 2.

4.2. Experimental Settings

As mentioned before, XceptionNet is the baseline. According to the particularities of the public DFDC dataset and the self-organized dataset, we adopt different training strategies. On the large DFDC dataset, we train our model with a batch size of 128 for 500 epochs. Due to the distinctly smaller size of the self-organized dataset, we train with a batch size of 16 for 100 epochs on each sub-dataset. For both datasets, we uniformly use the Adam optimizer with a learning rate of 0.001 and employ ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.

4.3. Parameter Study

Frame Selection. As shown in Figure 2, a single phoneme unit includes several lip frames. We use f to represent the number of lip frames; the value of f affects the competence of the model. Too few lip frames miss lip features of the current phoneme, while extra frames may overlap with other phonemes. In order not to introduce disturbances from other factors, we experiment on the Obama Lip-sync dataset. We integrate all 12 phoneme sub-models into one and take the beginning time of each phoneme as the center to select the surrounding f frames. Table 3 displays the accuracy for f from 3 to 8. The accuracy reaches 97.73% when f = 4, 7 and 8. Considering the trade-off between accuracy and complexity, we finally choose f = 4.

Table 3
Parameter study of frame selection. f = 4 guarantees the best performance while avoiding overlap with other phonemes.

Frame Numbers   f = 3   f = 4   f = 5   f = 6   f = 7   f = 8
ACC (%)         96.21   97.73   96.21   96.97   97.73   97.73
AUC (%)         97.45   98.89   97.45   97.83   98.89   98.89

Phoneme Selection. Still on the Obama Lip-sync dataset, we use N to denote the number of selected phonemes. Referring to the amplitude deviation ranking in Table 1, we integrate the sub-models for N from 2 to 12; the highest accuracy is achieved when N = 5. Thus we choose the phoneme sets with the top-5 amplitude deviations to train sub-models.

4.4. Evaluation on DFDC Dataset

In this section, we compare our method with previous deepfake detection methods on DFDC. The ratio of training to testing sets is 85:15. Even though we only crop the lip region of the face, we still achieve competitive performance. As shown in Table 4, our method achieves 91.6% AUC, outperforming not only the vision-based full-face methods but also the audio-visual multi-modal methods. Among them, Syncnet [12] detects the synchronization between audio and video frames and achieves 89.50% AUC, but ignores the content matching between them. Our improvement mainly benefits from the establishment of the phoneme-lip mapping: the selected phonemes W2, W5, W7, W10 and W11 are robust to various external disturbances in DFDC such as face angle, illumination, and video compression, boosting the detection capability of the ensemble model.

Table 4
Comparison of our method (Xception) with other techniques on the DFDC dataset using the AUC metric. We select the sub-models of W2, W5, W7, W10, and W11 for integration, and our result is competitive against Syncnet and Siamese-based methods.

Methods              AUC (%)   Modality
Xception-c23 [17]    72.20     Video
Meso4 [25]           75.30     Video
DSP-FWA [26]         75.50     Video
MBP [10]             80.34     Audio & Video
Siamese-based [11]   84.40     Audio & Video
Syncnet [12]         89.50     Audio & Video
Ours (Xception)      91.60     Audio & Video

Moreover, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) [28] for the baseline and for our method, as shown in Figure 4. Our method clearly includes the surrounding regions such as the upper and lower lips, which helps the network focus on the open-close amplitudes and is in line with our motivation. In contrast, the baseline model mainly attends to the internal teeth regions, losing the edge information.

[Figure 4: The Grad-CAM of the baseline Xception and ours on the DFDC dataset and two forgery methods in the self-organized dataset. Ours captures more of the lip regions.]

4.5. Evaluation on Self-organized Dataset

In this section, we conduct experiments on the self-organized dataset to verify the performance of real/fake classification and multiple classification.

4.5.1. Evaluation of Real/Fake Classification

For each sub-dataset, we use different phonemes to integrate the final classification model; the selections are listed in Table 5. The baseline model (Xception) is directly trained on all continuous frames of the real/fake videos. Further, to verify that our method is not restricted by the backbone, we adopt another network architecture, ResNet-50 [29], which performs well on image classification tasks. The results in Table 5 demonstrate that our method outperforms the previous methods; note that MBP is designed specifically for Obama lip forgery, and the Audio Driven dataset is challenging due to its low video resolution and the occlusion of microphones or arms.

Table 5
Evaluation of real/fake classification (ACC % / AUC %). For each dataset, our approach surpasses the baselines (Xception/ResNet-50) and existing state-of-the-art detection methods.

Methods              Obama Lip-sync [5]   Audio Driven [15]   First Order [16]    Wav2lip [6]
                     (W1-W2-W4-W5-W7)     (W2-W4-W5-W6-W7)    (W3-W4-W5-W9-W10)   (W1-W2-W7-W10-W12)
MBP [10]             93.54 / 96.03        - / -               - / -               - / -
Siamese-based [11]   90.53 / 93.01        87.47 / 89.86       92.03 / 95.21       84.77 / 88.64
Syncnet [12]         92.18 / 95.21        90.83 / 92.89       92.18 / 95.56       86.08 / 90.16
ResNet-50            79.38 / 85.72        68.65 / 72.62       86.97 / 89.40       75.23 / 78.96
Xception [17]        84.82 / 89.19        70.18 / 78.43       88.83 / 93.71       78.54 / 80.78
Ours (ResNet-50)     96.35 / 97.67        94.67 / 96.40       96.25 / 97.62       95.12 / 96.74
Ours (Xception)      97.73 / 98.89        95.84 / 97.61       97.59 / 98.60       96.43 / 97.89

4.5.2. Evaluation of Multiple Classification

To further distinguish different forgery methods, we label all real lips with 0 and the fake lips of the four sub-datasets with 1–4 individually. W2, W3, W4, W7 and W8 are chosen to train the classification model. Table 6 verifies that the ensemble model can be applied to multiple classification scenarios. We also visualize the t-SNE [30] feature distributions from the Siamese-based method to ours. As shown in Figure 5, our method is superior at finding latent dissimilarity in high-dimensional space, with fewer outliers.

Table 6
Evaluation of multiple classification. Except for the average AUC (%) in the last column, all values are ACC (%). Our method integrates the sub-models of W2, W3, W4, W7 and W8 into the ensemble one, which largely outperforms the advanced methods.

Methods              Real    Obama Lip-sync [5]   Audio Driven [15]   First Order [16]   Wav2lip [6]   Average ACC   Average AUC
Siamese-based [11]   92.91   77.63                70.86               85.14              79.44         81.20         88.45
Syncnet [12]         94.89   78.79                74.33               88.62              81.54         83.46         90.53
Xception [17]        92.13   73.44                55.13               78.01              77.27         75.37         83.12
Ours (Xception)      96.21   95.96                87.50               96.97              94.88         94.29         96.84

[Figure 5: Feature distribution visualizations for (a) Siamese-based, (b) Syncnet, (c) Xception, and (d) Ours (Xception) on multiple classification. Among the four methods, ours contains fewer outliers and widely separates the real and fake classes.]

4.6. Evaluation on Cross-dataset

Transferability is evaluated by training on DFDC and testing on the self-organized dataset, where all lips are labeled as real/fake. Table 7 shows the better transferability of our method in detecting universal artifacts across datasets.

Table 7
Evaluation on cross-dataset. The test set is the self-organized dataset. Ours (W2, W5, W7, W10, W11) achieves better results.

Methods              ACC     AUC
MBP [10]             57.94   59.12
Siamese-based [11]   59.51   60.68
Syncnet [12]         60.11   61.79
ResNet-50 [27]       54.74   57.67
Xception [17]        56.80   58.89
Ours (ResNet-50)     62.38   63.51
Ours (Xception)      63.67   64.05

5. Conclusion

Lip forgery detection is an extremely challenging task within deepfake detection due to the subtle and local modifications. In this paper, we present a multi-phoneme selection-based framework. Unlike existing deepfake detection, it takes full advantage of the particularity of lip forgery videos, establishing a robust mapping from audio to lip shapes. Twelve categories of phonemes are determined as the smallest identifiable units for various lip shapes, and the phonemes with top-5 distinguishability are selected to train sub-classification models. In addition, we organize a new dataset consisting of four sub-datasets, which is the first one organized for the lip forgery detection task. Extensive experiments demonstrate the effectiveness of our framework, including on the challenging task of cross-dataset evaluation.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grants U20B2047, U1636201 and 62002334, by the Anhui Science Foundation of China under Grant 2008085QF296, by the Exploration Fund Project of the University of Science and Technology of China under Grant YD3480002001, and by the Fundamental Research Funds for the Central Universities under Grant WK2100000011.

References

[1] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, M. Nießner, Face2Face: Real-time face capture and reenactment of RGB videos, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.
[2] Y. Nirkin, Y. Keller, T. Hassner, FSGAN: Subject agnostic face swapping and reenactment, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7183–7192.
[3] DeepFakes, Deepfakes github, http://github.com/deepfakes/faceswap, 2017. Accessed 2020-08-18.
[4] FaceSwap, Faceswap github, http://https://github.com/MarekKowalski/FaceSwap, 2016. Accessed 2020-08-18.
[5] S. Suwajanakorn, S. M. Seitz, I. Kemelmacher-Shlizerman, Synthesizing Obama: Learning lip sync from audio, ACM Transactions on Graphics (SIGGRAPH) 36 (2017) 95.
[6] K. R. Prajwal, R. Mukhopadhyay, V. Namboodiri, C. Jawahar, A lip sync expert is all you need for speech to lip generation in the wild, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[7] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, FaceForensics++: Learning to detect manipulated facial images, arXiv preprint arXiv:1901.08971 (2019).
[8] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face X-ray for more general face forgery detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5001–5010.
[9] Y. Qian, G. Yin, L. Sheng, Z. Chen, J. Shao, Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: ECCV, 2020.
[10] S. Agarwal, H. Farid, O. Fried, M. Agrawala, Detecting deep-fake videos from phoneme-viseme mismatches, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, pp. 2814–2822.
[11] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, Emotions don’t lie: A deepfake detection method using audio-visual affective cues, arXiv abs/2003.06711 (2020).
[12] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not made for each other: Audio-visual dissonance-based deepfake detection and localization, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[13] H. L. Bear, R. Harvey, Phoneme-to-viseme mappings: the good, the bad, and the ugly, arXiv abs/1805.02934 (2017).
[14] L. Li, J. Bao, H. Yang, D. Chen, F. Wen, FaceShifter: Towards high fidelity and occlusion aware face swapping, arXiv preprint arXiv:1912.13457 (2019).
[15] R. Yi, Z. Ye, J. Zhang, H. Bao, Y. Liu, Audio-driven talking face video generation with natural head pose, arXiv abs/2002.10137 (2020).
[16] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, arXiv abs/2003.00196 (2019).
[17] F. Chollet, Xception: Deep learning with depthwise separable convolutions, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807.
[18] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, M. Pantic, Audio-visual speech recognition with a hybrid CTC/attention architecture, in: 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 513–520.
[19] T. Baltrusaitis, P. Robinson, L.-P. Morency, OpenFace: An open source facial behavior analysis toolkit, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–10.
[20] A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, E. Zacur, AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition, in: LREC, 2004.
[21] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, M. Agrawala, Content-based tools for editing audio stories, in: UIST ’13, 2013.
[22] D. King, Dlib-ml: A machine learning toolkit, Journal of Machine Learning Research 10 (2009) 1755–1758.
[23] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF: A large-scale challenging dataset for deepfake forensics, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.
[24] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The DeepFake Detection Challenge dataset, arXiv preprint arXiv:2006.07397 (2020).
[25] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, MesoNet: a compact facial video forgery detection network, in: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), 2018, pp. 1–7.
[26] Y. Li, S. Lyu, Exposing deepfake videos by detecting face warping artifacts, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[30] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.