<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lip Forgery Video Detection via Multi-Phoneme Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiaying Lin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Zhou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Honggu Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiming Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nenghai Yu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Simon Fraser University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science and Technology of China</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deepfake techniques can produce realistic manipulated videos, including full-face synthesis and local region forgery. General methods work well in detecting the former but usually struggle to capture local artifacts, especially for lip forgery detection. In this paper, we focus on the lip forgery detection task. We first establish a robust mapping from audio to lip shapes. Then we classify the lip shapes of each video frame according to the different spoken phonemes, enabling the network to capture the dissonances between lip shapes and phonemes in fake videos and increasing interpretability. Each lip shape-phoneme set is used to train a sub-model, and those with better discrimination are selected to obtain an ensemble classification model. Extensive experimental results demonstrate that our method outperforms state-of-the-art methods on both the public DFDC dataset and a self-organized lip forgery dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Lip Forgery</kwd>
        <kwd>Deepfake Detection</kwd>
        <kwd>Phoneme and Viseme</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Thanks to the tremendous success of deep generative models, face forgery has become an emerging research topic in recent years and various methods have been proposed [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Depending on the manipulated region, they Fake
can be roughly categorized into two types: full-face
synthesis [3, 4] that usually swaps the whole synthesized
source face to a target face, and local face region forgery Figure 1: The lip shapes of speaking the word “apple” in real
[5, 6] that only modifies partial face region, e.g., modify- (top) and fake (bottom) video. In the real video, the lips are
ing the lip shape to match the audio content. Especially more widely opened with clear teeth texture, while opposite
when the lips of politicians have been tampered with in the fake.
to make inappropriate speeches, it can lead to serious
political crisis.
      </p>
      <p>
        To alleviate the risks brought by malicious uses of face forgery, many detection methods have been proposed [7, 8, 9]. These methods usually consider forgery detection from different aspects and extract visual features from the whole face region, achieving impressive detection results on the public datasets FF++ and DFDC, in which most of the fake videos are tampered in a full-face synthesized manner. However, this type of detection method struggles to handle local region forgery cases such as lip-sync [5]. Recently, [10] attempted to detect lip-sync forgery videos with single phoneme-viseme matching for specific targets. [
        <xref ref-type="bibr" rid="ref3">11, 12</xref>
        ] employ features such as audio and expression to detect the synchronization between different modalities.
      </p>
      <p>To address the problem of local region forgery detection, in this paper we propose a complete multi-phoneme selection-based framework. To take full advantage of the particularity of lip forgery videos, which contain audio, we need to establish a robust mapping relationship between the lip shapes and the audio contents. Prior studies in the realm of Audio-Visual Speech Recognition have demonstrated that the phoneme is the smallest identifiable unit correlated with a particular lip shape. Motivated by [13], we divide the audio contents into 12 phoneme classes and classify all the video frames accordingly. For each phoneme-lip set, we measure the deviation in open-close amplitude between real and fake lip shapes, and train a sub-model for real/fake classification.</p>
      <p>Usually, a large deviation represents an obvious discrepancy between the real and fake lip shapes, which also indicates the great difficulty of synthesizing the lip shape for the corresponding phoneme. Simultaneously, it shows the robustness of the correlated phoneme-lip mapping against physical changes across different videos, e.g., volume and face angle. This precisely provides a distinguishing feature for forgery detection. By selecting the phonemes with the top-5 deviations, we integrate the corresponding 5 well-trained sub-models into an ensemble model to maximize the discriminability between real and fake videos.</p>
      <p>To verify the effectiveness, we have conducted extensive experiments on both the public DFDC dataset and a self-organized lip forgery video dataset which contains four sub-datasets. The experimental results demonstrate that our method outperforms the current state-of-the-art detection methods on cross-dataset evaluation and multiple-class classification. In addition, our method is also competitive on single-dataset classification.</p>
      <p>In summary, our contributions are as follows:
• We propose a multi-phoneme selection based framework for the lip forgery detection task, which takes full advantage of the visual and aural information in lip forgery videos.
• We establish 12 categories of phoneme-lip mapping relationships and explore the robustness of the open-close amplitudes of each pair for real/fake classification. We also organize a new lip forgery dataset, which is helpful to facilitate the development of lip forgery detection methods.
• Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for lip forgery detection on both the public DFDC dataset and a self-organized lip forgery dataset.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related work</title>
      <sec id="sec-related-1">
        <title>2.1. Deep Face Forgery</title>
        <p>According to different forgery regions, existing methods can be divided into two categories: full-face synthesis and local region forgery. Full-face synthesis usually synthesizes a whole source face and swaps it onto the target. Typical works are [4, 14].</p>
        <p>Local region forgery is a more common type, focusing on slight manipulation of partial facial regions, e.g., eyebrow locations and lip shapes. Lip-sync [5] is able to modify the lip shapes in Obama’s talking videos to accurately synchronize with a given audio sequence. [15] leverages 3D modeling of specific face videos to make the control of lip shapes more flexible. First Order Motion [16] uses a driving video and a single source portrait image to generate a talking video. The detection of local region forgery is more challenging due to its subtle and local nature.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. Face Forgery Detection</title>
        <p>Early works explored visual artifacts, e.g., the abnormality of eye blinking and teeth. Learning-based detection methods have become mainstream in very recent years. [7] uses XceptionNet [17] to extract features from the spatial domain. F3-Net [9] achieves state-of-the-art performance using frequency-aware decomposition. However, since audio is lacking in most public deepfake datasets, these methods are designed in a universal manner with no consideration of audio matching. They perform well in full-face synthesis detection but are not adequate for recognizing the subtle artifacts in local region forgery.</p>
        <p>
          Recently, [
          <xref ref-type="bibr" rid="ref3">11, 12</xref>
          ] utilize Siamese networks to calculate the feature distances between modalities. If manipulation is conducted on a small segment of the video, the inconsistency among these modalities is weakened at the video level, leading to a decrease in detection performance. [10] establishes a single phoneme-viseme mapping for a specific person, which severely restricts the application scenario. To address the above limitations, we propose a multi-phoneme selection based framework for lip forgery video detection.
        </p>
      </sec>
    </sec>
    <sec id="sec-method">
      <title>3. Method</title>
      <p>In this section, we elaborate the multi-phoneme selection based framework. Before that, an important observation about lip forgery is introduced first.</p>
      <sec id="sec-method-1">
        <title>3.1. Motivation</title>
        <p>Lip forgery modifies a specific person’s lip shape to match arbitrary audio contents, thus establishing a close relationship between them. However, due to imperfections in the manipulation, uncontrollable artifacts may be generated that hinder the matching.</p>
        <p>As shown in Figure 1, when saying the word “apple”, the lips in the forgery videos are more blurred and do not open well. Although this nuance is not easy to perceive with human eyes, a well-designed detector can capture it. Nevertheless, the lip shape itself fluctuates within a certain range under different expressions, and a large fluctuation indicates poor robustness.</p>
        <p>Based on this observation, it is necessary to establish a robust mapping from audio to lip shapes. Inspired by recent works in Audio-Visual Speech Recognition [18], we divide all audio contents into 12 phoneme categories as the smallest identifiable units. Each phoneme set consists of various vowels, consonants and the quiet soundmark, and can be used to train a sub-model independently to distinguish real/fake lips. Eventually, we select several sub-models to integrate into the final classifier, considering the trade-off between efficiency and performance. The framework is depicted in Figure 2.</p>
        <p>Figure 2: Overview of the proposed framework, whose main components are audio dividing (48 IPA phonetic symbols grouped into 12 phoneme categories), LDA, the 12 phoneme-lip mappings between lip frames and phoneme categories, amplitude deviation, multi-phoneme selection, and the final real/fake classifier.</p>
      </sec>
      <sec id="sec-method-2">
        <title>3.2. Phoneme-Lip Mapping</title>
        <p>Here, P(c | x) is the probability that x belongs to class c, which is computed as the ratio between the in-class and the out-of-class distributions of the previously defined distance d(x); these follow Gaussian distributions with the corresponding in-class and out-of-class means and variances:</p>
        <p>P(c | x) = [1 − Φ((d(x) − μ) / σ)] / Φ((d(x) − μ̃) / σ̃)   (3)</p>
        <p>Next, we estimate the probabilities of a sample belonging to each class, and assign the sample to the class with the highest normalized probability:</p>
        <p>P̂(c | x) = P(c | x) / Σₖ P(k | x)</p>
      </sec>
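      <p>For concreteness, the class-assignment step above can be sketched as follows (a minimal Python/NumPy illustration under our own assumptions: the per-class projection distances d(x) and the in-class/out-of-class Gaussian parameters are taken as already estimated, and all function and variable names are ours, not from the paper):</p>
      <preformat>
import numpy as np
from scipy.stats import norm

def class_probability(d, mu, sigma, mu_out, sigma_out):
    """Ratio of in-class to out-of-class evidence for one phoneme class,
    following Eq. (3): (1 - Phi((d - mu)/sigma)) / Phi((d - mu_out)/sigma_out)."""
    in_class = 1.0 - norm.cdf((d - mu) / sigma)
    out_class = norm.cdf((d - mu_out) / sigma_out)
    return in_class / max(out_class, 1e-12)   # guard against division by zero

def assign_phoneme(distances, params):
    """distances[c]: projection distance of the sample for class c.
    params[c]: (mu, sigma, mu_out, sigma_out) estimated on training data.
    Returns the class with the highest normalized probability."""
    probs = np.array([class_probability(distances[c], *params[c])
                      for c in range(len(params))])
    probs = probs / probs.sum()               # normalize over all classes
    return int(np.argmax(probs)), probs
      </preformat>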
      <sec id="sec-1-2">
        <title>3.3. Multiple Phonemes Selection</title>
        <p>Although the lip shapes within one phoneme set are similar, the open-close amplitudes among phonemes are quite different. We use the dlib 68-landmark face detector [22] to compute the vertical distance between the 63rd and 67th landmarks (the inner upper and lower lip points); this distance represents the open-close amplitude of the current lip shape. Using the frame number as the horizontal axis, we calculate the amplitude for each frame during the period of the phoneme. In Figure 3, we plot two average amplitude curves for each set: the red curves represent the real videos and the blue curves the fake ones.</p>
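        <p>As an illustration only, the per-frame amplitude measurement could look like the following sketch (dlib landmark indices are 0-based, so the paper’s 63rd and 67th points correspond to indices 62 and 66; the predictor file name and helper names are our assumptions):</p>
        <preformat>
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_amplitude(frame_bgr):
    """Vertical distance between the inner upper lip (landmark 63, index 62)
    and the inner lower lip (landmark 67, index 66) of the first detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None
    shape = predictor(gray, faces[0])
    upper = shape.part(62).y
    lower = shape.part(66).y
    return abs(lower - upper)   # open-close amplitude in pixels

def average_curve(frame_lists):
    """Average amplitude curve over several occurrences of the same phoneme.
    frame_lists: list of per-occurrence lists of frames."""
    curves = []
    for frames in frame_lists:
        amps = [lip_amplitude(f) for f in frames]
        curves.append([a for a in amps if a is not None])
    length = min(len(c) for c in curves)
    return np.mean([c[:length] for c in curves], axis=0)
        </preformat>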
        <p>In W1 and W2, the real and fake curves are widely separated with almost no overlap, while in W3 and W6 there are partially stacked areas. This observation indicates that the real and fake lips are more discriminative in certain phoneme sets. To select the most distinguishable phonemes for classification, we calculate, for the real and the fake average curve of each phoneme, the difference between its maximum and minimum values; the amplitude deviation of the phoneme is then defined as half the sum of these two differences and represents the discrepancy between real and fake for that phoneme.</p>
        <p>Considering the potential differences among forgery methods, the amplitude deviations of a single phoneme are not identical across them. As listed in Table 1, the phonemes with the top-5 amplitude deviations are shown in bold; we will introduce the self-organized dataset in Section 4.</p>
        <p>Table 1: Amplitude deviations of the 12 phonemes for each forgery method (Obama Lip-sync [5], Audio Driven [15], First Order Motion [16]).</p>
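        <p>The deviation computation and top-5 selection can then be sketched as below (hypothetical names; real_curves[w] and fake_curves[w] denote the average amplitude curves of phoneme w obtained as above):</p>
        <preformat>
def amplitude_deviation(real_curve, fake_curve):
    """Half the sum of the max-min ranges of the real and fake average curves,
    used as the discrepancy measure for one phoneme."""
    d_real = max(real_curve) - min(real_curve)
    d_fake = max(fake_curve) - min(fake_curve)
    return 0.5 * (d_real + d_fake)

def select_phonemes(real_curves, fake_curves, k=5):
    """Rank the 12 phoneme sets by amplitude deviation and keep the top-k."""
    scores = {w: amplitude_deviation(real_curves[w], fake_curves[w])
              for w in real_curves}
    return sorted(scores, key=scores.get, reverse=True)[:k]
        </preformat>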
      </sec>
      <sec id="sec-1-3">
        <title>3.4. Sub-classification Models Training and Ensemble</title>
        <p>After selecting the phoneme-lip sets for each forgery method, we train sub-classification models on them. Each sub-model can be used independently for real/fake lip discrimination. Here we adopt XceptionNet [17] as the backbone and transfer it to our task by resizing the input to 128×128 and replacing the final fully connected layer with two outputs.</p>
        <p>To obtain stronger detection performance, we integrate the sub-models into an ensemble. The weight of each sub-model is equal, so that every sub-model contributes fully. Furthermore, a phoneme unit in the video lasts for some duration and therefore contains several lip frames. Both the number of lip frames and the number of sub-models influence the detection accuracy of the final ensemble model, hence we experiment on each of them respectively.</p>
        <p>The results in Section 4 demonstrate that with 4 lip frames per phoneme and 5 selected sub-models, the ensemble model achieves excellent performance.</p>
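        <p>A minimal sketch of the equal-weight ensemble over the selected sub-models is given below (PyTorch, with a timm Xception model standing in for XceptionNet; this is our illustration, not the authors’ released code):</p>
        <preformat>
import timm
import torch

def build_sub_model():
    """Binary real/fake classifier; 128x128 crops are accepted because the
    backbone uses global pooling before the final two-way classification layer."""
    return timm.create_model("xception", pretrained=True, num_classes=2)

@torch.no_grad()
def ensemble_predict(sub_models, clips):
    """sub_models: one trained model per selected phoneme.
    clips: dict mapping each selected phoneme to a tensor of shape (N, 3, 128, 128)
    holding the N lip frames of that phoneme in the test video."""
    votes = []
    for phoneme, model in sub_models.items():
        model.eval()
        probs = torch.softmax(model(clips[phoneme]), dim=1)  # (N, 2)
        votes.append(probs.mean(dim=0))                      # average over lip frames
    fused = torch.stack(votes).mean(dim=0)                   # equal weight per sub-model
    return fused.argmax().item()                             # 0 = real, 1 = fake
        </preformat>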
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <p>In this section, we first introduce the new lip forgery video dataset organized for this paper. Several parameter studies verify the optimality of our settings. Further experiments demonstrate the effectiveness of our proposed framework on the DFDC and the self-organized dataset, as well as the transferability between them.</p>
      <sec id="sec-2-1">
        <title>4.1. Public Dataset and New Lip Forgery</title>
      </sec>
      <sec id="sec-2-2">
        <title>Dataset</title>
        <p>16, 6] to generate fake videos. The composition of the
organized dataset is elaborated in Table 2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. Experimental Settings</title>
        <p>As mentioned before, XceptionNet is the baseline. According to the particularities of the public DFDC dataset and the self-organized dataset, we adopt different training strategies. On the large DFDC dataset, we train our model with a batch size of 128 for 500 epochs. Due to the distinctly smaller size of the self-organized dataset, we train with a batch size of 16 for 100 epochs on each sub-dataset. For both datasets, we uniformly use the Adam optimizer with a learning rate of 0.001 and employ ACC (accuracy) and AUC (area under the ROC curve) as evaluation metrics.</p>
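        <p>A minimal training-loop sketch matching these settings is given below (PyTorch; the dataset object and any loader parameters beyond those stated above are our placeholders):</p>
        <preformat>
import torch
from torch.utils.data import DataLoader

def train_sub_model(model, dataset, epochs=100, batch_size=16, lr=0.001, device="cuda"):
    """Settings used for the self-organized dataset; for DFDC the paper uses
    batch_size=128 and epochs=500 with the same Adam optimizer."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
        </preformat>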
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Parameter Study</title>
        <p>Frame Selection. As shown in Figure 2, a single phoneme unit includes several lip frames. The number of selected lip frames has an impact on the competence of the model: too few lip frames result in missing lip features of the current phoneme, while extra frames may overlap with other phonemes.</p>
        <p>In order not to introduce disturbances from other factors, we experiment on the Obama Lip-sync dataset. We integrate all 12 phoneme sub-models into one and take the beginning time of each phoneme as the center to select the surrounding frames. Table 3 displays the accuracy for 3 to 8 frames. The accuracy reaches 97.73% with 4, 7 and 8 frames. Considering the trade-off between accuracy and complexity, we finally choose 4 frames per phoneme.</p>
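        <p>A possible frame-selection helper is sketched below (our assumption: the phoneme onset time and the video frame rate are available from the audio alignment):</p>
        <preformat>
def select_frames(frames, onset_sec, fps, n=4):
    """Pick n frames centered on the phoneme onset.
    frames: list of decoded video frames; onset_sec: phoneme start time in seconds."""
    center = int(round(onset_sec * fps))
    half = n // 2
    start = max(0, center - half)
    end = min(len(frames), start + n)
    start = max(0, end - n)          # shift back if we hit the end of the video
    return frames[start:end]
        </preformat>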
        <p>Phoneme Selection. Still on the Obama Lip-sync dataset, we vary the number of selected phonemes. Referring to the amplitude deviation ranking listed in Table 1, we integrate the sub-models of the top 2 to 12 phonemes; the highest accuracy is achieved with 5 phonemes. Thus we choose the phoneme sets with the top-5 amplitude deviations to train the sub-models.</p>
      </sec>
      <sec id="sec-2-5">
        <title>4.4. Evaluation on DFDC Dataset</title>
        <p>In this section, we compare our method with previous deepfake detection methods on DFDC. The ratio of training to testing sets is 85:15. Even though we only crop the lip region of the face, we still achieve a competitive performance. In Table 4, our method achieves 91.6% AUC, which outperforms not only the vision-based full-face methods but also the audio-visual multi-modal methods. Among them, Syncnet [12] detects the synchronization between audio and video frames and achieves 89.50% AUC, while ignoring the content matching between them. The improvement of our method mainly benefits from the establishment of the phoneme-lip mapping, where the selected phonemes W2, W5, W7, W10 and W11 are robust to various external disturbances in DFDC such as face angle, illumination and video compression, boosting the detection capability of the ensemble model.</p>
        <p>Moreover, we visualize the Gradient-weighted Class Activation Mapping (Grad-CAM) [28] for the baseline and for our method, as shown in Figure 4. Our method clearly includes the surrounding regions such as the upper and lower lips, which helps the network focus on the open-close amplitudes and is in line with our motivation. In contrast, the baseline model mainly attends to the internal teeth regions, losing the edge information.</p>
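        <p>For reference, a self-contained Grad-CAM sketch in the spirit of [28] is given below (plain PyTorch hooks on the last convolutional layer; this is our illustration, not the exact code behind Figure 4):</p>
        <preformat>
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    """image: tensor of shape (1, 3, H, W); target_layer: the last conv module.
    Returns a heatmap in [0, 1] with the same spatial size as the input."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    score = model(image)[0, class_idx]   # logit of the "fake" class by default
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)          # GAP over gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))  # weighted activations
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
    return cam[0, 0]
        </preformat>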
      </sec>
      <sec id="sec-2-6">
        <title>4.5. Evaluation on Self-organized Dataset</title>
        <p>In this section, we conduct experiments on the self-organized dataset to verify the performance of real/fake classification and multiple classification.</p>
        <sec id="sec-2-6-1">
          <title>4.5.1. Evaluation of Real/Fake Classification</title>
          <p>For each sub-dataset, we use different phonemes to integrate the final classification model; the selections are listed in Table 5. The baseline model (Xception) is directly trained on all continuous frames of the real/fake videos.</p>
          <p>Further, to verify that our method is not restricted by the backbone, we adopt another network architecture, ResNet-50 [29], which performs well in image classification tasks. The results in Table 5 demonstrate that our method outperforms the previous methods, where MBP is designed for Obama lip forgery and the Audio Driven dataset is challenging due to its low video resolution and the blocking of microphones or arms.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>4.5.2. Evaluation of Multiple Classification</title>
          <p>To further distinguish different forgery methods, in the 4 sub-datasets we label all real lips with 0 and fake lips with 1 ∼ 4 individually. W2, W3, W4, W7 and W8 are chosen to train the classification model.</p>
          <p>Table 6 verifies that the ensemble model can be applied to multiple classification scenarios. We also intuitively visualize the t-SNE [30] feature distributions of the Siamese-based method and ours. As shown in Figure 5, our method is superior at finding latent dissimilarity in the high-dimensional space, with fewer outliers.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Natural Science Foundation of China under Grants U20B2047, U1636201 and 62002334, by the Anhui Science Foundation of China under Grant 2008085QF296, and by the Exploration Fund Project of the University of Science and Technology of China.</p>
    </ack>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, M. Nießner, Face2face: Real-time face capture and reenactment of rgb videos, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 2387–2395.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Y. Nirkin, Y. Keller, T. Hassner, Fsgan: Subject agnostic face swapping and reenactment, 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019) 7183–7192.</mixed-citation>
      </ref>
      <ref id="bib3">
        <mixed-citation>[3] DeepFakes, Deepfakes github, http://github.com/deepfakes/faceswap, 2017. Accessed 2020-08-18.</mixed-citation>
      </ref>
      <ref id="bib4">
        <mixed-citation>[4] FaceSwap, Faceswap github, https://github.com/MarekKowalski/FaceSwap, 2016. Accessed 2020-08-18.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[11] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, Emotions don't lie: A deepfake detection method using audio-visual affective cues, ArXiv abs/2003.06711 (2020).</mixed-citation>
      </ref>
      <ref id="bib12">
        <mixed-citation>[12] K. Chugh, P. Gupta, A. Dhall, R. Subramanian, Not made for each other - audio-visual dissonance-based deepfake detection and localization, Proceedings of the 28th ACM International Conference on Multimedia (2020).</mixed-citation>
      </ref>
      <ref id="bib13">
        <mixed-citation>[13] H. L. Bear, R. Harvey, Phoneme-to-viseme mappings: the good, the bad, and the ugly, ArXiv abs/1805.02934 (2017).</mixed-citation>
      </ref>
      <ref id="bib14">
        <mixed-citation>[14] L. Li, J. Bao, H. Yang, D. Chen, F. Wen, Faceshifter: Towards high fidelity and occlusion aware face swapping, arXiv preprint arXiv:1912.13457 (2019).</mixed-citation>
      </ref>
      <ref id="bib15">
        <mixed-citation>[15] R. Yi, Z. Ye, J. Zhang, H. Bao, Y. Liu, Audio-driven talking face video generation with natural head pose, ArXiv abs/2002.10137 (2020).</mixed-citation>
      </ref>
      <ref id="bib16">
        <mixed-citation>[16] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, ArXiv abs/2003.00196 (2019).</mixed-citation>
      </ref>
      <ref id="bib17">
        <mixed-citation>[17] F. Chollet, Xception: Deep learning with depthwise separable convolutions, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 1800–1807.</mixed-citation>
      </ref>
      <ref id="bib18">
        <mixed-citation>[18] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, M. Pantic, Audio-visual speech recognition with a hybrid ctc/attention architecture, 2018 IEEE Spoken Language Technology Workshop (SLT) (2018) 513–520.</mixed-citation>
      </ref>
      <ref id="bib19">
        <mixed-citation>[19] T. Baltrusaitis, P. Robinson, L.-P. Morency, Openface: An open source facial behavior analysis toolkit, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) (2016) 1–10.</mixed-citation>
      </ref>
      <ref id="bib20">
        <mixed-citation>[20] A. Ortega, F. Sukno, E. Lleida, A. Frangi, A. Miguel, L. Buera, E. Zacur, Av@car: A spanish multichannel multimodal corpus for in-vehicle automatic audiovisual speech recognition, in: LREC, 2004.</mixed-citation>
      </ref>
      <ref id="bib21">
        <mixed-citation>[21] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, M. Agrawala, Content-based tools for editing audio stories, in: UIST '13, 2013.</mixed-citation>
      </ref>
      <ref id="bib22">
        <mixed-citation>[22] D. King, Dlib-ml: A machine learning toolkit, J. Mach. Learn. Res. 10 (2009) 1755–1758.</mixed-citation>
      </ref>
      <ref id="bib23">
        <mixed-citation>[23] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-df: A large-scale challenging dataset for deepfake forensics, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3207–3216.</mixed-citation>
      </ref>
      <ref id="bib24">
        <mixed-citation>[24] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, C. C. Ferrer, The deepfake detection challenge dataset, arXiv preprint arXiv:2006.07397 (2020).</mixed-citation>
      </ref>
      <ref id="bib25">
        <mixed-citation>[25] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, Mesonet: a compact facial video forgery detection network, 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (2018) 1–7.</mixed-citation>
      </ref>
      <ref id="bib26">
        <mixed-citation>[26] Y. Li, S. Lyu, Exposing deepfake videos by detecting face warping artifacts, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.</mixed-citation>
      </ref>
      <ref id="bib27">
        <mixed-citation>[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 770–778.</mixed-citation>
      </ref>
      <ref id="bib28">
        <mixed-citation>[28] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.</mixed-citation>
      </ref>
      <ref id="bib29">
        <mixed-citation>[29] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.</mixed-citation>
      </ref>
      <ref id="bib30">
        <mixed-citation>[30] L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (2008) 2579–2605.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>