<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Lip Forgery Video Detection via Multi-Phoneme Selection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jiaying</forename><surname>Lin</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wenbo</forename><surname>Zhou</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Honggu</forename><surname>Liu</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hang</forename><surname>Zhou</surname></persName>
							<email>zhouhang2991@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="institution">Simon Fraser University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Weiming</forename><surname>Zhang</surname></persName>
							<email>zhangwm@ustc.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nenghai</forename><surname>Yu</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Science and Technology</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Lip Forgery Video Detection via Multi-Phoneme Selection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F7935B4CDDFBC220F33FDA6D9D6A5C82</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Lip Forgery, Deepfake Detection, Phoneme and Viseme</term>
					<term>0000-0001-5553-9482 (J. Lin)</term>
					<term>0000-0002-4703-4641 (W. Zhou)</term>
					<term>0000-0001-9294-9624 (H. Liu)</term>
					<term>0000-0001-7860-8452 (H. Zhou)</term>
					<term>0000-0001-5576-6108 (W. Zhang)</term>
					<term>0000-0003-4417-9316 (N. Yu)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Deepfake techniques can produce realistic manipulated videos, including full-face synthesis and local region forgery. General methods work well in detecting the former but usually struggle to capture local artifacts, especially for lip forgery detection. In this paper, we focus on the lip forgery detection task. We first establish a robust mapping from audio to lip shapes. Then we classify the lip shapes of each video frame according to the spoken phonemes, which enables the network to capture the dissonances between lip shapes and phonemes in fake videos and increases interpretability. Each lip shape-phoneme set is used to train a sub-model, and those with better discrimination are selected to form an ensemble classification model. Extensive experimental results demonstrate that our method outperforms state-of-the-art methods on both the public DFDC dataset and a self-organized lip forgery dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Thanks to the tremendous success of deep generative models, face forgery has become an emerging research topic in recent years and various methods have been proposed <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Depending on the manipulated region, they can be roughly categorized into two types: full-face synthesis <ref type="bibr" target="#b3">[3,</ref><ref type="bibr" target="#b4">4]</ref>, which usually swaps a whole synthesized source face onto a target face, and local face region forgery <ref type="bibr" target="#b5">[5,</ref><ref type="bibr" target="#b6">6]</ref>, which only modifies a partial face region, e.g., changing the lip shape to match the audio content. In particular, when the lips of politicians are tampered with to fabricate inappropriate speeches, it can lead to a serious political crisis.</p><p>To alleviate the risks brought by malicious uses of face forgery, many detection methods have been proposed <ref type="bibr" target="#b7">[7,</ref><ref type="bibr" target="#b8">8,</ref><ref type="bibr" target="#b9">9]</ref>. These methods usually consider forgery detection from different aspects and extract visual features from the whole face region, achieving impressive detection results on the public datasets FF++ and DFDC, in which most fake videos are tampered in a full-face synthesized manner. However, this type of detection method struggles to handle local region forgery cases such as lip-sync <ref type="bibr" target="#b5">[5]</ref>. Recently, <ref type="bibr" target="#b10">[10]</ref> attempted to detect lip-sync forgery videos with single phoneme-viseme matching for specific targets. <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12]</ref> employ features such as audio and expression to detect synchronization between different modalities.</p><p>To address the problem of local region forgery detection, in this paper we propose a complete multi-phoneme selection-based framework. To take full advantage of the particularity of lip forgery videos, which contain audio, we need to establish a robust mapping relationship between lip shapes and audio contents. Prior studies in the realm of Audio-Visual Speech Recognition have demonstrated that the phoneme is the smallest identifiable unit correlated with a particular lip shape. Motivated by <ref type="bibr" target="#b13">[13]</ref>, we divide audio contents into 12 phoneme classes and classify all the video frames accordingly. For each phoneme-lip set, we measure the deviation in open-close amplitude between real and fake lip shapes, and train a sub-model for real/fake classification.</p><p>Usually, a large deviation represents an obvious discrepancy between the real and fake lip shapes, which also indicates the great difficulty in synthesizing the lip shape for the corresponding phoneme. Simultaneously, it shows the robustness of the correlated phoneme-lip mapping against physical changes in different videos, e.g., volume and face angle. This precisely provides a distinguishing feature for forgery detection.
By selecting the phonemes with the top-5 deviations, we integrate the corresponding 5 well-trained sub-models into an ensemble model to maximize the discriminability of real and fake videos.</p><p>To verify the effectiveness, we have conducted extensive experiments on both the public DFDC dataset and a self-organized lip forgery video dataset which contains four sub-datasets. The experimental results demonstrate that our method outperforms current state-of-the-art detection methods on cross-dataset evaluation and multi-class classification. In addition, our method is also competitive on single-dataset classification. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work 2.1. Deep Face Forgery</head><p>According to the forgery region, existing methods can be divided into two categories: full-face synthesis and local region forgery. Full-face synthesis usually synthesizes a whole source face and swaps it onto the target. Typical works are <ref type="bibr" target="#b4">[4,</ref><ref type="bibr" target="#b14">14]</ref>.</p><p>Local region forgery is a more common type, focusing on slight manipulation of partial facial regions, e.g., eyebrow locations and lip shapes. Lip-sync <ref type="bibr" target="#b5">[5]</ref> is able to modify the lip shapes in Obama's talking videos to accurately synchronize with a given audio sequence. <ref type="bibr" target="#b15">[15]</ref> leverages 3D modeling of specific face videos to make the control of lip shapes more flexible. First Order Motion <ref type="bibr" target="#b16">[16]</ref> uses a video to drive a single source portrait image to generate a talking video. The detection of local region forgery is more challenging due to its subtle and local nature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Face Forgery Detection</head><p>Early works explored visual artifacts, e.g., abnormalities of eye blinking and teeth. Learning-based detection methods have become mainstream in recent years. <ref type="bibr" target="#b7">[7]</ref> uses XceptionNet <ref type="bibr" target="#b17">[17]</ref> to extract features from the spatial domain. F³-Net <ref type="bibr" target="#b9">[9]</ref> achieves state-of-the-art performance using frequency-aware decomposition. However, since audio is lacking in most public deepfake datasets, these methods are designed in a universal manner with no consideration of audio matching. They perform well in full-face synthesis detection but are not adequate to recognize the subtle artifacts in local region forgery.</p><p>Recently, <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12]</ref> utilize Siamese networks to calculate the feature distances across modalities. If manipulation is conducted on only a small segment of the video, the inconsistency among these modalities at the video level is weakened, leading to a decrease in detection performance. <ref type="bibr" target="#b10">[10]</ref> establishes one single phoneme-viseme mapping for a specific person, which severely restricts the application scenario. To address the above limitations, we propose a multi-phoneme selection based framework for lip forgery video detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>In this section we elaborate on the multi-phoneme selection based framework. Before that, an important observation about lip forgery is introduced.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Motivation</head><p>Lip forgery modifies a specific person's lip shape to match arbitrary audio contents, thus establishing a close relationship between them. However, due to imperfections in the manipulation, uncontrollable artifacts may be generated that hinder the matching.</p><p>As shown in Figure <ref type="figure" target="#fig_0">1</ref>, when saying the word "apple", the lips in the forgery video are more blurred and fail to open fully. Although this nuance is not easy for human eyes to perceive, a well-designed detector can capture it. Nevertheless, the lip shape itself fluctuates within a certain range under different expressions; a large fluctuation indicates poor robustness.</p><p>Based on this observation, it is necessary to establish a robust mapping from audio to lip shapes. Inspired by recent works in Audio-Visual Speech Recognition <ref type="bibr" target="#b18">[18]</ref>, we divide all audio contents into 12 phoneme categories as the smallest identifiable units. Each phoneme set consists of various vowels, consonants and silence marks, and can be used to train a sub-model independently to distinguish real/fake lips. Eventually, we select several sub-models to integrate into the final classifier, considering the trade-off between efficiency and performance. The framework is depicted in Figure <ref type="figure">2</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Correlations Establishment from Phonemes to Lip shapes</head><p>For a given talking video, we use OpenFace <ref type="bibr" target="#b19">[19]</ref> to align each frame and crop the lip area to 128×128. These lip images are categorized into different phoneme sets and used as training/testing data for real/fake classification.</p><p>To establish the mapping from phonemes to lip shapes, we first process all the real videos. According to the International Phonetic Alphabet (IPA), we divide the lip shapes into 48 classes. For a given lip shape, we calculate the Mahalanobis distance d_c of the open-close amplitude between the current lip shape x and the mean x̄_c of each class:</p><formula xml:id="formula_0">d_c(\mathbf{x}) = \sqrt{(\mathbf{x} - \bar{\mathbf{x}}_c)^{T} \, \Sigma_c^{-1} \, (\mathbf{x} - \bar{\mathbf{x}}_c)}<label>(1)</label></formula><p>Next, we estimate the probability of it belonging to each class, and assign the sample to the class with the highest normalized probability P_c:</p><formula xml:id="formula_1">P_c(\mathbf{x}) = \frac{p(c \mid \mathbf{x})}{\sum_{c=1}^{C} p(c \mid \mathbf{x})}<label>(2)</label></formula><p>Here, p(c | x) is the probability that x belongs to class c, computed as the ratio between the in-class and the out-of-class distributions of the distance d_c, which follow Gaussian distributions with means μ_c, μ̃_c and variances σ_c, σ̃_c, respectively:</p><formula xml:id="formula_2">p(c \mid \mathbf{x}) = \frac{1 - \Phi\left(\frac{d_c(\mathbf{x}) - \mu_c}{\sigma_c}\right)}{\Phi\left(\frac{d_c(\mathbf{x}) - \tilde{\mu}_c}{\tilde{\sigma}_c}\right)}<label>(3)</label></formula><p>After obtaining the mapping, a multi-class LDA classifier pre-trained on <ref type="bibr" target="#b20">[20]</ref> is utilized for classification. However, different classes may share the same lip shape appearance, e.g., m, b, p. By iteratively merging similar phonetic symbol classes, we obtain 12 robust and distinguishable real lip shapes named "phonemes" (W1 to W12). A visual example is given in Figure <ref type="figure">3</ref>.</p><p>In fake videos, the lip shapes have been manipulated. As illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, the opening amplitudes of fake lips are quite different from real ones, so directly using the phoneme classifier trained on real lips may lead to misclassification. Since the audio contents in fake videos are not modified, we use them as the guidance for fake lip classification. First, Google's Speech-to-Text API is used to obtain the corresponding transcribed texts from the audio. Both the texts and audio are then fed into the P2FA toolkit <ref type="bibr" target="#b21">[21]</ref>. By conducting forced alignment on phonemes and words, we obtain the start and end time of each phoneme, and the lip images during this period are categorized into the current phoneme. In Figure <ref type="figure">2</ref>, the P2FA section clearly shows the alignment procedure.</p></div>
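<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal illustrative sketch (not the authors' released code), the class assignment of Eqs. (1)-(3) can be written as follows in Python; the per-class statistics (means, inverse covariances, and the in-class/out-of-class distance parameters) are assumed to be pre-computed from the real videos:</p><p>
# Sketch of the phoneme-class assignment in Eqs. (1)-(3); all statistics are assumed given.
import numpy as np
from scipy.stats import norm

def mahalanobis(x, mean_c, cov_inv_c):
    # Eq. (1): Mahalanobis distance between the lip feature x and the class mean.
    d = x - mean_c
    return np.sqrt(d @ cov_inv_c @ d)

def assign_phoneme_class(x, means, cov_invs, mu_in, sigma_in, mu_out, sigma_out):
    # x: open-close amplitude feature vector of the current lip shape.
    # mu_in/sigma_in and mu_out/sigma_out: Gaussian parameters of the in-class and
    # out-of-class distance distributions used in Eq. (3).
    p = np.empty(len(means))
    for c in range(len(means)):
        d_c = mahalanobis(x, means[c], cov_invs[c])
        p[c] = (1.0 - norm.cdf((d_c - mu_in[c]) / sigma_in[c])) / norm.cdf((d_c - mu_out[c]) / sigma_out[c])
    P = p / p.sum()              # Eq. (2): normalized probability over the C classes
    return int(np.argmax(P))     # assign the sample to the most probable class
</p></div>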
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Multiple Phonemes Selection</head><p>Although the lip shapes in one phoneme set are similar, the open-close amplitudes differ considerably among phonemes. We use the dlib 68-point face landmark detector <ref type="bibr" target="#b22">[22]</ref> to compute the vertical distance between the 63rd and 67th landmarks: D = (y_63 - y_67). Here D represents the open-close amplitude of the current lip shape. Using the frame number as the horizontal axis, we calculate D for each frame during the period of the phoneme. In Figure <ref type="figure">3</ref>, we plot two average amplitude curves for each set, where the red curves represent the real videos and the blue ones the fake.</p><p>In W1 and W2, the real and fake curves are widely separated with almost no overlap, while in W3 and W6 there are partially overlapping areas. This observation indicates that the real and fake lips are more discriminative in certain phoneme sets. To select the most distinguishable phonemes W for classification, we calculate the differences between the real and fake curves at their maximum and minimum values, denoted D_Wmax and D_Wmin, respectively. We define the amplitude deviation D_W to represent the discrepancy between real and fake in each phoneme W: D_W = (D_Wmax + D_Wmin) / 2. Considering the differences among forgery methods, the amplitude deviations of a single phoneme are not identical across methods. As listed in Table <ref type="table" target="#tab_1">1</ref>, the phonemes with the top-5 amplitude deviations are in bold; we introduce the self-organized dataset in Section 4.</p></div>
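<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the open-close amplitude computation, assuming the standard dlib 68-landmark predictor file is available locally (the file name below is an assumption):</p><p>
# Sketch of computing the lip open-close amplitude D from dlib landmarks.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model path

def open_close_amplitude(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # The paper's 63rd/67th landmarks (1-indexed) are dlib indices 62 and 66,
    # i.e. the inner upper-lip and inner lower-lip centers; the magnitude of the
    # vertical gap is the open-close amplitude D.
    return abs(shape.part(62).y - shape.part(66).y)
</p></div>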
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Sub-classification Models Training and Ensemble</head><p>After selecting the phoneme-lip sets for each forgery method, we train sub-classification models on them. Each sub-model can be used independently for real/fake lip discrimination. Here we adopt XceptionNet <ref type="bibr" target="#b17">[17]</ref> as the backbone and transfer it to our task by resizing the input to 128×128 and replacing the final fully-connected layer with two outputs.</p><p>To obtain a stronger detection performance, we integrate the sub-models into an ensemble. Each sub-model is given an equal average weight so that every selected phoneme contributes to the final decision. Furthermore, each phoneme unit in the video lasts for some duration and thus contains several lip frames. Both the number of lip frames 𝑓 and the number of sub-models 𝑁 influence the detection accuracy of the final ensemble model, hence we experiment on them respectively. The results in Section 4 demonstrate that with 𝑓 = 4 and 𝑁 = 5, the ensemble model achieves excellent performance.</p></div>
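<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the equal-weight ensemble over the selected phoneme sub-models (PyTorch and the timm package are assumed; the backbone name and the data layout are assumptions, not the authors' code):</p><p>
# Sketch of the sub-model construction and equal-weight ensembling.
import torch
import torch.nn as nn
import timm

def build_sub_model():
    # Xception backbone with the final fully-connected layer replaced by two outputs;
    # "xception" is the assumed timm model name ("legacy_xception" in newer versions).
    return timm.create_model("xception", pretrained=True, num_classes=2)

class PhonemeEnsemble(nn.Module):
    def __init__(self, num_sub_models=5):
        super().__init__()
        self.sub_models = nn.ModuleList(build_sub_model() for _ in range(num_sub_models))

    def forward(self, lips_per_phoneme):
        # lips_per_phoneme: list of N tensors, each of shape (batch, 3, 128, 128), holding
        # the lip crops that fall into the corresponding selected phoneme set.
        logits = [m(x) for m, x in zip(self.sub_models, lips_per_phoneme)]
        # Equal weights: average the sub-model predictions for the final decision.
        return torch.stack(logits, dim=0).mean(dim=0)
</p></div>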
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The composition of our self-organized dataset, including the numbers of videos and frames. The whole dataset consists of four sub-datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table 2 columns: Dataset, Real/Fake, Total Frames, with one row per sub-dataset (Obama Lip-sync <ref type="bibr" target="#b5">[5]</ref> and the other three forgery methods).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>In this section, we first introduce a new lip forgery video dataset organized for this paper. Several parameter studies verify the optimality of our settings. Further experiments demonstrate the effectiveness of the proposed framework on the DFDC and self-organized datasets, as well as the transferability between them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Public Dataset and New Lip Forgery Dataset</head><p>Many datasets <ref type="bibr" target="#b7">[7,</ref><ref type="bibr" target="#b23">23]</ref> have been made public for the deepfake detection task. Although they are large in scale and cover various forgery methods, most fake videos do not contain audio and are still tampered in a full-face synthesized manner. So far, there is no dedicated dataset released for lip forgery detection. In this paper, we use one public audio-visual deepfake dataset and organize a new dataset targeting the lip forgery detection task. Public DFDC Dataset <ref type="bibr" target="#b24">[24]</ref> was published for the Deepfake Detection Challenge, using multiple manipulation techniques and adding audio to make the video scenarios more natural. To make a fair comparison, we align with the settings of <ref type="bibr" target="#b11">[11]</ref>, using 18,000 videos in the experiments.</p><p>New Lip Forgery Dataset To build the new lip forgery dataset, we adopt four state-of-the-art methods <ref type="bibr">[5, 15, 16, 6]</ref> to generate fake videos. The composition of the organized dataset is elaborated in Table <ref type="table">2</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Settings</head><p>As mentioned before, XceptionNet is the baseline. According to the particularities of the public DFDC dataset and the self-organized dataset, we adopt different training strategies. On the large DFDC dataset, we train our model with a batch size of 128 for 500 epochs. Due to the distinctly smaller size of the self-organized dataset, we train with a batch size of 16 for 100 epochs on each sub-dataset.</p><p>For both datasets, we uniformly use the Adam optimizer with a learning rate of 0.001 and employ ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.</p></div>
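<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal training-loop sketch of the settings above (PyTorch assumed; the data loader is a placeholder): batch size 128 for 500 epochs on DFDC, batch size 16 for 100 epochs per self-organized sub-dataset, Adam with a learning rate of 0.001:</p><p>
# Sketch of the sub-model training loop with the stated hyper-parameters.
import torch

def train_sub_model(model, train_loader, epochs, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, lr = 0.001
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                                      # 500 (DFDC) or 100 (sub-dataset)
        for lips, labels in train_loader:                        # lip crops and real/fake labels
            lips, labels = lips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lips), labels)
            loss.backward()
            optimizer.step()
    return model
</p></div>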
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Parameter Study</head><p>Frame Selection. As shown in Figure <ref type="figure">2</ref>, a single phoneme unit includes several lip frames. We use 𝑓 to represent the number of lip frames; the value of 𝑓 has an impact on the performance of the model. Too few lip frames miss lip features of the current phoneme, while extra frames may overlap with other phonemes.</p><p>In order not to introduce disturbances from other factors, we experiment on the Obama Lip-sync dataset. We integrate all 12 phoneme sub-models into one and take the beginning time of each phoneme as the center to select the surrounding 𝑓 frames. Table <ref type="table" target="#tab_3">3</ref> displays the accuracy for 𝑓 from 3 to 8. The accuracy reaches 97.73% when 𝑓 = 4, 7 and 8. Considering the trade-off between accuracy and complexity, we finally choose 𝑓 = 4.</p><p>Phoneme Selection. Still on the Obama Lip-sync dataset, we use 𝑁 to denote the number of selected phonemes. Referring to the amplitude deviation ranking listed in Table <ref type="table" target="#tab_1">1</ref>, we integrate from 2 to 12 sub-models; the highest accuracy is achieved when 𝑁 = 5. Thus we choose the phoneme sets with the top-5 amplitude deviations to train sub-models.</p></div>
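<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the frame selection described above, assuming P2FA phoneme start times in seconds and a fixed video frame rate (25 fps is an assumption):</p><p>
# Sketch of selecting the f lip frames centered at the beginning time of a phoneme.
def select_phoneme_frames(phoneme_start_sec, num_frames_total, f=4, fps=25):
    center = int(round(phoneme_start_sec * fps))   # frame index of the phoneme start
    start = max(0, center - f // 2)                # take f frames around that center
    end = min(num_frames_total, start + f)
    return list(range(start, end))
</p></div>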
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Evaluation on DFDC Dataset</head><p>In this section, we compare our method with previous deepfake detection methods on DFDC. The ratio of the training and testing sets is 85:15.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 4</head><label>4</label><figDesc>Comparison of our method (Xception) with other techniques on the DFDC dataset using the AUC metric. We select sub-models of W2, W5, W7, W10, and W11 for integration, and our result is competitive against Syncnet and Siamese-based methods.</figDesc><table><row><cell>Methods</cell><cell>DFDC</cell><cell>Modality</cell></row><row><cell>Xception-c23 <ref type="bibr" target="#b17">[17]</ref></cell><cell>72.20</cell><cell>Video</cell></row><row><cell>Meso4 <ref type="bibr" target="#b25">[25]</ref></cell><cell>75.30</cell><cell>Video</cell></row><row><cell>DSP-FWA <ref type="bibr" target="#b26">[26]</ref></cell><cell>75.50</cell><cell>Video</cell></row><row><cell>MBP <ref type="bibr" target="#b10">[10]</ref></cell><cell>80.34</cell><cell>Audio &amp; Video</cell></row><row><cell>Siamese-based <ref type="bibr" target="#b11">[11]</ref></cell><cell>84.40</cell><cell>Audio &amp; Video</cell></row><row><cell>Syncnet <ref type="bibr" target="#b12">[12]</ref></cell><cell>89.50</cell><cell>Audio &amp; Video</cell></row><row><cell>Ours (Xception)</cell><cell>91.60</cell><cell>Audio &amp; Video</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Even though we only crop the lip region of the face, we still achieve a competitive performance. In Table <ref type="table">4</ref>, our method achieves 91.6% on AUC, which outperforms not only the vision-based full-face methods but also the audio-visual multi-modal methods. Among them, Syncnet <ref type="bibr" target="#b12">[12]</ref> detects the synchronization from audio to video frames and achieves 89.50% on AUC, while ignoring the content matching between them. The improvement of ours mainly benefits from the establishment of the phoneme-lip mapping, where the selected phonemes W2, W5, W7, W10 and W11 are robust to various external disturbances in DFDC such as face angle, illumination, and video compression, boosting the detection capability of the ensemble model. Moreover, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) <ref type="bibr" target="#b28">[28]</ref> for the baseline and ours, as shown in Figure <ref type="figure" target="#fig_2">4</ref>. It shows that our method can significantly include the surrounding regions such as the upper and lower lips, which helps the network focus on the open-close amplitudes and is in line with our motivation. In contrast, the baseline model mainly attends to the internal teeth regions, losing the edge information.</p></div>
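<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal hook-based Grad-CAM sketch (PyTorch assumed, not the authors' visualization code) for inspecting which lip regions a trained sub-model attends to; target_layer is the last convolutional layer of the backbone:</p><p>
# Sketch of Grad-CAM for a 128x128 lip crop; class_idx = 1 is assumed to be "fake".
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    model.eval()
    logits = model(image.unsqueeze(0))            # image: tensor of shape (3, 128, 128)
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    acts, grads = activations[0], gradients[0]    # both of shape (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts).sum(dim=1))     # (1, H, W) importance map
    cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:], mode="bilinear")[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heat map
</p></div>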
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Evaluation on Self-organized Dataset</head><p>In this section, we conduct experiments on the self-organized dataset to verify the performance of real/fake classification and multi-class classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.1.">Evaluation of Real/Fake Classification</head><p>For each sub-dataset, we use different phonemes to integrate the final classification model; the selections are listed in Table <ref type="table">5</ref>. The baseline model (Xception) is directly trained on all continuous frames of real/fake videos. Further, to verify that our method is not restricted by the backbone, we adopt another network architecture, ResNet-50 <ref type="bibr" target="#b29">[29]</ref>, which performs well in image classification tasks. The results in Table <ref type="table">5</ref> demonstrate that our method outperforms the previous methods, where MBP is designed for Obama lip forgery and the Audio Driven dataset is challenging with low video resolution and the blocking of microphones or arms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Evaluation of Real/Fake Classification. For each dataset, the performance of our approach surpasses baselines (Xception/ResNet-50) and existing state-of-the-art detection methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Selected phoneme sets per sub-dataset in Table 5: Obama Lip-sync <ref type="bibr" target="#b5">[5]</ref> (W1-W2-W4-W5-W7), Audio Driven <ref type="bibr" target="#b15">[15]</ref> (W2-W4-W5-W6-W7), First Order <ref type="bibr" target="#b16">[16]</ref> (W3-W4-W5-W9-W10), and Wav2lip <ref type="bibr" target="#b6">[6]</ref> (W1-W2-W7-W10-W12), each evaluated with ACC (%) and AUC (%).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.2.">Evaluation of Multi-class Classification</head><p>To further distinguish different forgery methods, in the four sub-datasets we label all real lips with 0 and fake lips with 1 ∼ 4 individually. W2, W3, W4, W7 and W8 are chosen to train the classification model.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 7</head><label>7</label><figDesc>Evaluation on cross-dataset. The test set is the self-organized dataset. Ours (W2, W5, W7, W10, W11) achieves better results.</figDesc><table><row><cell>Methods</cell><cell>ACC</cell><cell>AUC</cell></row><row><cell>MBP <ref type="bibr" target="#b10">[10]</ref></cell><cell>57.94</cell><cell>59.12</cell></row><row><cell>Siamese-based <ref type="bibr" target="#b11">[11]</ref></cell><cell>59.51</cell><cell>60.68</cell></row><row><cell>Syncnet <ref type="bibr" target="#b12">[12]</ref></cell><cell>60.11</cell><cell>61.79</cell></row><row><cell>ResNet-50 <ref type="bibr" target="#b27">[27]</ref></cell><cell>54.74</cell><cell>57.67</cell></row><row><cell>Xception <ref type="bibr" target="#b17">[17]</ref></cell><cell>56</cell><cell/></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table <ref type="table" target="#tab_4">6</ref> verifies that the ensemble model can be applied to multi-class classification scenarios. We also intuitively visualize the t-SNE <ref type="bibr" target="#b30">[30]</ref> feature distributions, from the Siamese-based method to ours. As shown in Figure <ref type="figure" target="#fig_4">5</ref>, our method is better at finding latent dissimilarity in high-dimensional space, with fewer outliers.</p></div>
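<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the t-SNE visualization (scikit-learn and matplotlib assumed); features holds penultimate-layer embeddings of the lip crops and labels holds the five classes (real plus the four forgery methods):</p><p>
# Sketch of projecting lip features to 2D with t-SNE and plotting them by class.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.title("t-SNE of lip features (real vs. four forgery methods)")
    plt.show()
</p></div>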
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Evaluation on cross-dataset</head><p>Transferability is evaluated by training on DFDC and testing on the self-organized dataset, where all lips are labeled as real/fake. Table <ref type="table">7</ref> shows the better transferability of our method in detecting universal artifacts across datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Lip forgery detection is an extremely challenging task in deepfake detection due to the subtle and local modifications. In this paper, we present a multi-phoneme selection based framework. Different from existing deepfake detection, it takes full advantage of the particularity of lip forgery videos, establishing a robust mapping from audio to lip shapes. 12 categories of phonemes are determined as the smallest identifiable units for various lip shapes, and the phonemes with the top-5 distinguishability are selected to train sub-classification models. In addition, we organize a new dataset consisting of four sub-datasets, which is the first one organized for the lip forgery detection task. Extensive experiments demonstrate the effectiveness of our framework, including the challenging task of cross-dataset evaluation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: The lip shapes of speaking the word "apple" in real (top) and fake (bottom) video. In the real video, the lips are more widely opened with clear teeth texture, while the opposite holds in the fake.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :Figure 3 :</head><label>23</label><figDesc>Figure 2: The framework of ours. Through 12 phoneme-lip shape mapping and multi-phonemes selection, we obtain the final ensemble detection model.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The Grad-CAM of the baseline Xception and ours, including the DFDC dataset and two forgery methods in the self-organized dataset. Ours can easily capture more lip regions.</figDesc><graphic coords="6,91.57,565.47,97.92,73.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Feature distribution visualization from Siamese-based (a) to ours (d) on multi-class classification. Among the four methods, ours contains fewer outliers and widely separates the real and fake classes.</figDesc><graphic coords="6,195.63,565.47,97.92,73.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Amplitude Deviation Values for 12 phonemes in self-organized dataset. The Top-5 phonemes with the largest amplitude deviation for each sub-dataset are in bold.</figDesc><table><row><cell>Forgery Methods</cell><cell>W1</cell><cell>W2</cell><cell>W3</cell><cell>W4</cell><cell>W5</cell><cell>W6</cell><cell>W7</cell><cell>W8</cell><cell>W9</cell><cell>W10 W11 W12</cell></row><row><cell>Obama Lip-sync[5]</cell><cell cols="10">33.00 31.13 21.63 33.12 34.87 27.625 37.50 24.37 26.87 24.00 22.38 25.25</cell></row><row><cell>Audio Driven[15]</cell><cell cols="10">15.00 23.62 18.50 26.62 28.00 25.50 29.50 20.63 17.37 18.25 17.00 12.50</cell></row><row><cell cols="11">First Order Motion[16] 25.13 23.75 34.67 37.12 34.87 22.50 23.38 25.125 33.50 29.50 21.75 20.88</cell></row><row><cell>Wav2lip[6]</cell><cell cols="10">35.51 34.71 26.71 28.01 25.12 25.43 35.12 28.76 27.32 33.84 29.96 33.60</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Parameter study of frame selection. 𝑓 = 4 can guarantee the best performance and avoid the overlap with other phonemes.</figDesc><table><row><cell>Frame Numbers</cell><cell>𝑓 = 3</cell><cell>𝑓 = 4</cell><cell>𝑓 = 5</cell><cell>𝑓 = 6</cell><cell>𝑓 = 7</cell><cell>𝑓 = 8</cell></row><row><cell>ACC (%)</cell><cell>96.21</cell><cell>97.73</cell><cell>96.21</cell><cell>96.97</cell><cell>97.73</cell><cell>97.73</cell></row><row><cell>AUC (%)</cell><cell>97.45</cell><cell>98.89</cell><cell>97.45</cell><cell>97.83</cell><cell>98.89</cell><cell>98.89</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6</head><label>6</label><figDesc>Evaluation of multi-class classification. In the table, except for the average AUC (%) in the last column, the other entries represent the ACC (%). Here, our method integrates the sub-models of W2, W3, W4, W7 and W8 into the ensemble one, which largely outperforms the advanced methods.</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was supported in part by the Natural Science Foundation of China under Grant U20B2047, U1636201, 62002334, by the Anhui Science Foundation of China under Grant 2008085QF296, by the Exploration Fund Project of the University of Science and Technology of China under Grant YD3480002001 and the Fundamental Research Funds for the Central Universities under Grant WK2100000011.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Face2face: Real-time face capture and reenactment of rgb videos</title>
		<author>
			<persName><forename type="first">J</forename><surname>Thies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zollhöfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stamminger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Theobalt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nießner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="2387" to="2395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Nirkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Keller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hassner</surname></persName>
		</author>
		<title level="m">Fsgan: Subject agnostic face swapping and reenactment</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m">IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="7183" to="7192" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><surname>Deepfakes</surname></persName>
		</author>
		<ptr target="http://github.com/deepfakes/faceswap" />
		<title level="m">Deepfakes github</title>
				<imprint>
			<date type="published" when="2017">2017. 2020-08-18</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><surname>Faceswap</surname></persName>
		</author>
		<ptr target="https://github.com/MarekKowalski/FaceSwap" />
		<title level="m">Faceswap github</title>
				<imprint>
			<date type="published" when="2016">2016. 2020-08-18</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Synthesizing obama: Learning lip sync from audio</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">K</forename><surname>-S. Supasorn Suwajanakorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steven</forename><surname>Seitz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIG-GRAPH</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page">95</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A lip sync expert is all you need for speech to lip generation in the wild</title>
		<author>
			<persName><forename type="first">R</forename><surname>Prajwalk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mukhopadhyay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Namboodiri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jawahar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM International Conference on Multimedia</title>
				<meeting>the 28th ACM International Conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Rössler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cozzolino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Verdoliva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Riess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nießner</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.08971</idno>
		<title level="m">Faceforensics++: Learning to detect manipulated facial images</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Face x-ray for more general face forgery detection</title>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Guo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5001" to="5010" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Thinking in frequency: Face forgery detection by mining frequency-aware clues</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Detecting deep-fake videos from phoneme-viseme mismatches</title>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Farid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Fried</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Agrawala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</title>
				<imprint>
			<date type="published" when="2020">2020. 2020</date>
			<biblScope unit="page" from="2814" to="2822" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Emotions don&apos;t lie: A deepfake detection method using audio-visual affective cues</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Manocha</surname></persName>
		</author>
		<idno>ArXiv abs/2003.06711</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Not made for each other-audio-visual dissonance-based deepfake detection and localization</title>
		<author>
			<persName><forename type="first">K</forename><surname>Chugh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dhall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM International Conference on Multimedia</title>
				<meeting>the 28th ACM International Conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Bear</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Harvey</surname></persName>
		</author>
		<idno>ArXiv abs/1805.02934</idno>
		<title level="m">Phoneme-to-viseme mappings: the good, the bad, and the ugly</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.13457</idno>
		<title level="m">Faceshifter: Towards high fidelity and occlusion aware face swapping</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<idno>ArXiv abs/2002.10137</idno>
		<title level="m">Audio-driven talking face video generation with natural head pose</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">First order motion model for image animation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Siarohin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lathuilière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tulyakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ricci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sebe</surname></persName>
		</author>
		<idno>ArXiv abs/2003.00196</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Xception: Deep learning with depthwise separable convolutions</title>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="page" from="1800" to="1807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Audio-visual speech recognition with a hybrid ctc/attention architecture</title>
		<author>
			<persName><forename type="first">S</forename><surname>Petridis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Stafylakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tzimiropoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pantic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Spoken Language Technology Workshop (SLT)</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
			<biblScope unit="page" from="513" to="520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Openface: An open source facial behavior analysis toolkit</title>
		<author>
			<persName><forename type="first">T</forename><surname>Baltrusaitis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Morency</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Winter Conference on Applications of Computer Vision (WACV)</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Av@car: A spanish multichannel multimodal corpus for in-vehicle automatic audiovisual speech recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ortega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sukno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lleida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Frangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Miguel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Buera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zacur</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Content-based tools for editing audio stories</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rubin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Berthouzoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Mysore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Agrawala</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">UIST &apos;13</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Dlib-ml: A machine learning toolkit</title>
		<author>
			<persName><forename type="first">D</forename><surname>King</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1755" to="1758" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Celeb-df: A large-scale challenging dataset for deepfake forensics</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lyu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3207" to="3216" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Dolhansky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bitton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pflaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Howes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Ferrer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.07397</idno>
		<title level="m">The deepfake detection challenge dataset</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Mesonet: a compact facial video forgery detection network</title>
		<author>
			<persName><forename type="first">D</forename><surname>Afchar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nozick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yamagishi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Echizen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Workshop on Information Forensics and Security (WIFS)</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Exposing deepfake videos by detecting face warping artifacts</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lyu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Grad-cam: Visual explanations from deep networks via gradient-based localization</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Selvaraju</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cogswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="618" to="626" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Visualizing data using t-sne</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">V D</forename><surname>Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of machine learning research</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="2579" to="2605" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
