<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A novel multimodal two-stream CNN deepfakes detector</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Mongelli</string-name>
          <email>mongelli.2020024@studenti.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Maiano</string-name>
          <email>maiano@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Amerini</string-name>
          <email>amerini@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>25</institution>
          ,
          <addr-line>Rome, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza University of Rome, Department of Computer, Control and Management Engineering A. Ruberti</institution>
          ,
          <addr-line>Via Ariosto</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Researchers commonly model deepfake detection as a binary classification problem, using a unimodal network for each type of manipulated modality (such as auditory and visual) and a final ensemble of their predictions. In this paper, we focus our attention on the simultaneous detection of relationships between audio and visual cues, leading to the extraction of more comprehensive information to expose deepfakes. We propose the Convolutional Multimodal deepfake detection model (CMDD), a novel multimodal model that relies on the power of two Convolutional Neural Networks (CNNs) to concurrently extract and process spatial and temporal features. We compare it with two baseline models: DeepFakeCVT, which uses two CNNs and a final Vision Transformer, and DeepMerge, which employs a score fusion of each unimodal CNN model. The multimodal FakeAVCeleb dataset was used to train and test our model, resulting in an accuracy of 98.9% that places our model in the top 3 ranking of models evaluated on FakeAVCeleb.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake detection</kwd>
        <kwd>Multimodal deepfake</kwd>
        <kwd>Misinformation</kwd>
        <kwd>Multimedia Forensics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The generation of synthetic data offers several benefits in the industry. Movie companies
utilize manipulation tools to alter the appearance of characters during hazardous scenes and to
manipulate their voices, while e-commerce companies use them to boost customer purchase
speed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Unfortunately, the advancements in AI-generated data also raise serious security
issues. The widespread use of social media makes it effortless to gather personal data such
as photos, videos, and information regarding one’s habits and preferences. This data can
be maliciously modified with AI technologies, leading to trustworthiness problems.
Malicious individuals can utilize a person’s face and retina to bypass biometric authentication,
and their personal habits and preferences to pass security questions. Stolen or false identities
can be used to open or access bank accounts and commit financial fraud. A recent example dates
back to 2020, when scammers stole $35 million from a Japanese company using a sophisticated
voice clone that tricked the branch manager into thinking he was talking to the company
director and convinced him to initiate several fund transfers to new accounts1. Similarly, AI
tools may potentially influence public opinion through harmful misinformation campaigns.
Examples can be found in several deepfake videos of the Ukrainian President Zelensky2, or the
one of Mark Zuckerberg used to criticize politicians’ unwillingness to impose restrictions on
major technology firms such as Facebook3.
      </p>
      <p>
        Most deepfake detection methods are posed as binary classification problems, where
the input is classified as either “real” or “fake” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The most common approach in deepfake detection is
the unimodal one, which refers to techniques that concentrate solely on one modality
(i.e., audio or video). This work focuses on the audio and visual modalities, which are the
most informative when dealing with deepfake recordings. Unimodal detectors suffer from
significant limitations; for instance, a detector that only processes video data cannot identify acoustic
manipulation, while a detector that only processes audio data can be readily tricked by manipulated
images [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Many unimodal models can be combined using an ensemble method. However,
in such cases, the model does not analyze the relationship between audio and video cues but
considers them independently. Due to the various forms of manipulation used in audio-visual
deepfakes, learning features from a single modality can lead to an inaccurate assessment of
media authenticity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Introducing a multimodal approach for the simultaneous detection of
manipulated modalities allows extracting more comprehensive information to expose deepfakes.
However, this approach does not consistently guarantee higher accuracy compared to using
a single modality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], for instance, in cases where one of the two modalities is unavailable.
In such situations, two distinct unimodal approaches are generally more practical than using
multimodal detection methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>This paper presents the CMDD model, a novel multimodal architecture for deepfake detection
based on the concurrent use of two Convolutional Neural Networks for audio-visual feature
extraction and a final shared convolutional block for further processing and multi-class
classification. We utilize an effective strategy to extract relevant features from the input data,
using the mel-spectrogram information for the audio part and spatiotemporal characteristics
between frames for the visual component. We also develop two multimodal baselines named
DeepFakeCVT and DeepMerge: the former is similar to the CMDD model but utilizes a shared
Vision Transformer block instead of another convolutional step for multimodal classification,
while the latter is an ensemble of two unimodal audio and video models. We evaluate the
proposed model using a perfectly balanced version of the FakeAVCeleb deepfake dataset, such
that the number of videos, gender, age, ethnic backgrounds, and manipulation techniques are
balanced in each class. In our experiments, the proposed solution achieves good performance,
reaching an accuracy of 98.9%, which places our model in first position in the
state-of-the-art (SOTA) ranking when considering methods trained and tested on FakeAVCeleb with our
balancing technique, and in third position when considering all possible versions of the dataset
with any balancing technique. Our main contributions are summarized below.</p>
      <p>• We propose a novel multimodal deepfake architecture named the CMDD model, which
comprises two branches for audio and visual feature extraction and a shared convolutional
block for further processing and final multi-class classification;
• We train and test our model using a balanced version of the FakeAVCeleb dataset, a
benchmark dataset that contains both auditory and visual manipulations;
• To the best of our knowledge, our CMDD model achieves SOTA performance with an
accuracy of 98.9%, placing it in the top 3 ranking of models evaluated
on the FakeAVCeleb dataset.</p>
      <p>1https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions/
2https://www.reuters.com/world/europe/deepfake-footage-purports-show-ukrainian-president-capitulating-2022-03-16/
3https://www.independent.co.uk/tech/mark-zuckerberg-deepfake-ai-meta-b2236388.html</p>
      <p>The rest of this paper is organized as follows. Section 2 provides a detailed analysis of
existing unimodal and multimodal state-of-the-art models. Section 3 delves into our proposed
model and the pre-processing stage. Section 4 reports our experimental setup, describing the
proposed baselines and implementation details. Section 5 presents comparative results between
our models and state-of-the-art solutions. Finally, Section 6 draws the conclusions of this work
and gives an overview of possible future improvements.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>This section presents various state-of-the-art deepfake detection techniques based on two
approaches: unimodal and multimodal detectors.</p>
      <sec id="sec-3-1">
        <title>2.1. Unimodal deepfake detection</title>
        <p>
          A first approach to classifying audio deepfakes is using machine learning (ML) models. Singh
et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] use a Quadratic Support Vector Machine (Q-SVM) model to differentiate artificial
speech from human speech, taking advantage of Cepstral and Bi-spectral analysis. The Mel
Cepstral analysis of the audio allows the detection of significant power components present in
natural speech but absent from AI-synthesised speech, while the Bi-spectral analysis is used for
the opposite purpose. A remarkable comparison between ML and deep learning (DL) models for audio
deepfake detection is done in Liu et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Specifically, the authors utilize an SVM with
Mel-frequency cepstral coefficient (MFCC) features and a standard Convolutional Neural
Network (CNN) with five convolutional layers. This analysis shows that the CNN is more reliable
than the SVM despite their similar results; researchers have thus turned their attention to the
use of CNN architectures. Wani et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] propose a CNN with four convolutional blocks and
two fully connected layers, and apply transfer learning on two pre-trained models (i.e., VGG16
and MobileNet), to detect inconsistencies in the frequency domain of mel-spectrogram input
images. Instead, in Wijethunga et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the authors combine the CNN architecture with a
Recurrent Neural Network (RNN) to leverage their complementary qualities: the RNN identifies
long-term dependencies in temporal variations, while the CNN provides strong feature extraction
capabilities. Arif et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] introduce ELTP-LFCC, a novel audio feature descriptor,
combining linear frequency cepstral coefficients (LFCC) with the enhanced local ternary pattern
(ELTP). Differently from the previous unimodal audio approaches, we implement the C4N model,
a CNN-based architecture with 4 convolutional layers that extracts the audio track from the
video, converts it into a digital signal, generates the corresponding mel-spectrogram image, and
analyzes possible inconsistencies in the frequency spectrum. The architecture is inspired by
Wani et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] with substantial modifications at the layer level.
        </p>
        <p>
          For what concerns video deepfake detection methods, a large group of works focuses on
detecting local texture inconsistencies caused by applying manipulation techniques. For instance,
Hu et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] propose a novel frame inference framework named FInfer that uses an encoder to
acquire facial representations for current and future frames and an auto-regressive model to
predict future faces using the encoder knowledge. Instead of analyzing facial inconsistencies,
Nirkin et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] also focus on the region around the face. Specifically, the authors notice
that the deepfake generator modifies the face region during manipulation, leaving the context
untouched. Therefore, they compare the resulting two identity vectors to detect discrepancies
using context recognition and face identification networks. A different approach is taken in
Maiano et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], where the authors present a DepthFake method that uses depth maps to
classify deepfake videos. They use the FaceDepth detector to estimate the depth of human faces
and a pre-trained Xception network to classify each frame of the recordings. A method named
ID-Reveal is presented in Cozzolino et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The proposed model uses metric learning in
conjunction with an adversarial training strategy to learn temporal facial features peculiar to a
person’s movements during speech. They utilize a CNN architecture composed of a facial feature
extractor, a temporal ID network to detect bio-metric anomalies, and a generative adversarial
network to predict each person’s peculiar motions. The use of Vision Transformers in video
deepfake detection is proposed by Khan et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], where the authors leverage incremental
learning. Their model employs an XceptionNet as an image feature extractor, a Single Stage
Detector (SSD) to extract and crop faces frame after frame, and a 3D Dense Face Alignment
(3DDFA) model to generate UV texture maps from face images. They use face pictures and UV
texture maps to extract the image characteristics since they present complementary information.
Instead, in our video detection approach we utilize a ResNet18-3D model that extracts the
frames from the input video, concatenates them into a single tensor, and extracts the information
necessary to detect inconsistencies between spatiotemporal features.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Multimodal deepfake detection</title>
        <p>
          Numerous studies demonstrate how merging several modalities can yield better conclusions
and complementary information. Many modalities, such as facial signals, speech cues,
background context, hand gestures, and body position and orientation, can be extracted from a
video and combined to determine the authenticity of
a particular video [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In Asha et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], the authors propose the “D-Fence” layer, an
ensemble deepfake detection method of two unimodal networks: one detects artificial
face information using a VGG16 model and an optical flow estimator, while the other
extracts the audio information using the Mel Frequency Cepstral Coefficients (MFCC) temporal
feature vector, which is then fed into a VGG16 model. Dissonant audio-video cues are
efficiently detected by the two cross-modal networks. Zhang et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] also follow the ensemble
approach. The authors separately extract the audio and visual information from the deepfake
videos and then compare the results. They employ an RNN network for the audio portion and
a 3D-ResNet [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] for the visual portion, while the classification is performed using an adaptive
modality dissonance score (MDS) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] criterion. Instead, authors in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] extend the applicability
of [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] to cases where one modality is missing.
        </p>
        <p>
          Instead of building an ensemble of Convolutional Neural Networks, Hashmi
et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) architecture
that takes into account both visual and audio manipulations. Specifically, AVTENet combines
three exclusively transformer-based networks, integrated with pre-trained models through
supervised and self-supervised learning, to identify significant cues in audio, video, and
audiovisual modalities. In Cozzolino et al. [20], the authors present a Person of Interest (POI) deepfake
detector that relies on biometric features. This results in an extremely accurate detector
because each individual possesses unique biometric features that are unlikely to be replicated by a
synthetic generator.
        </p>
        <p>
          Instead of focusing on biometric characteristics, the authors of Mittal et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] rely on emotional
behaviors and features. They propose a deep learning network inspired by the triplet loss and the
Siamese network architecture. To determine whether an input video is genuine, they extract and
compare affective signals from the auditory and visual modalities within the video that correspond
to the perceived emotion. Another study focuses on the lip movements of speakers.
Specifically, in Shahzad et al. [21], a lip-reading-based multimodal deepfake
detection technique is proposed. The idea is that synthetic lip sequences are frequently out of
sync with their audio stream, so the disparity in lip movements is a good indicator of whether the
video information has been altered. A Multi-modal Multi-scale TRansformer
(M2TR) is presented in Wang et al. [22], where the authors utilize a two-stream approach in
which the frequency stream uses learnable frequency filters to filter out forgery features in the
frequency domain, and the RGB stream uses several scales to catch inconsistencies between
different regions within an image.
        </p>
      <p>In contrast to previous works, our multimodal approach extracts and processes
each video’s audio and visual features using two branches. The audio branch detects
inconsistencies in the frequency spectrum of the mel-spectrogram input image, while
the video branch detects spatiotemporal irregularities on a frame-by-frame
basis. Both branches use a CNN-based model for feature extraction. The
features from each branch are then concatenated and fed into a final shared convolutional block
that jointly processes the information and performs the final multi-class classification.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Method</title>
      <p>This section describes our proposed Convolutional Multimodal deepfake detection model (CMDD).
In our approach, we simultaneously consider the information related to the audio and video
modalities. The proposed architecture, depicted in Figure 1, comprises a different stream for each
modality. Each stream extracts a feature vector from the input data, using the power offered
by CNNs to investigate temporal and spatial locality characteristics. The extracted
vectors are then concatenated, and the resulting tensor is given to another CNN block with a final
fully connected (FC) layer for the classification.</p>
      <sec id="sec-4-1">
        <title>3.1. Preprocessing</title>
        <p>Due to its multimodal nature, the proposed model requires a pre-processing phase to detect
and extract the necessary information from videos. One network branch is structured to deal
with audio information, while the other requires visual features. Therefore, the input has to be
adapted before being fed into the two audiovisual blocks. Specifically, for the auditory part, we
extract the audio track from the input video using the FFmpeg4 media converter and represent
it as a digital waveform signal. This signal is converted from the time to the frequency domain
using the Fast Fourier Transform (FFT), which is applied to signal segments using shifting
windows. The result is a sort of heat map named spectrogram, a visual representation of how
the signal’s amplitude varies over time at different frequencies. The obtained spectrogram is
finally converted to the mel scale using Equation 1, where m indicates
the mel value and f refers to the frequency in Hz.</p>
        <p>m = 2595 · log10(1 + f/700) (1)</p>
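As a quick illustration of Equation 1, the mel mapping and its inverse can be written directly in code (a minimal sketch; the function names are ours):

```python
import math

def hz_to_mel(f: float) -> float:
    """Equation (1): map a frequency f in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, useful when placing mel filter-bank centers."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

For example, 1000 Hz maps to roughly 1000 mel, which is how the scale is anchored.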
        <p>Regarding the visual part, we divide the input video into frames, again using the FFmpeg
converter, and stack all the obtained frames into a single tensor. All mel spectrograms
and frames are resized to 224x224 pixels, converted into tensors, and finally normalized.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Architecture</title>
        <p>Our proposed architecture consists of an audio and a video pipeline. The former uses the
C4N model, a standard CNN architecture with 4 convolutional blocks. Each block
comprises a 2D convolutional layer with a rectified linear unit (ReLU) activation function and a
BatchNorm2D normalization. No adaptive pooling layers or linear classifiers are used. The audio
network takes as input tensors containing the information of each mel spectrogram, given in
RGBA format. The latter instead uses a pre-trained ResNet18 [23] adapted for 3D input data
to handle sequences of images. Like the audio network, it has no fully connected layers for
classification; the resulting output is therefore a feature vector. The two feature vectors obtained
from the audio and video pipelines are then concatenated, and the resulting tensor is finally fed
into another CNN block with a 3D convolutional layer, a ReLU function, a BatchNorm3D layer,
and a last FC layer for the final 4-class classification: label 0 (AfVf) for videos with
both fake audio and fake video, label 1 (ArVf) for real audio and fake video, label 2 (AfVr) for
fake audio and real video, and label 3 (ArVr) for both real audio and real video.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental settings</title>
      <p>This section describes the dataset and implementation details, compares results with
state-of-the-art architectures, and highlights differences.</p>
      <sec id="sec-5-1">
        <title>4.1. Dataset</title>
        <p>We train and evaluate our architecture using the FakeAVCeleb [24] dataset. Most available
datasets are not structured to deal with multimodal deepfake detection. For instance, datasets
such as FaceForensics++ [25], DeepfakeTIMIT [26] or KoDF [27] contain manipulated content
exclusively in the visual modality. Instead, FakeAVCeleb is created specifically for multimodal
audio-visual studies. It is a multimodal, age- and gender-balanced dataset of YouTube videos
taken from the VoxCeleb2 [28] dataset by selecting videos of 500 celebrities with five racial
backgrounds (Caucasian (American), Caucasian (European), African, Asian (East), and Asian
(South)). All the speeches are in English, but the different ethnic backgrounds introduce diversity
in phonemes and accents, partially removing racial bias issues. Each video is a clean recording
of a single person’s frontal face without occlusion. Additionally, it covers several deepfake
generation techniques, allowing for better generalization across detection methods. In
particular, the dataset contains 500 real videos and 19500 deepfake ones, a ratio between
real and fake videos of 1:39. The provided dataset is therefore extremely unbalanced, so
we remove deepfake videos until we reach a perfectly balanced ratio, maintaining an equal
number of forged videos per manipulation technique, an equal number of videos per class,
and equality of gender, age, and ethnic backgrounds. The resulting dataset
contains videos of different lengths and frame rates, a difference that may cause
issues during training. As a result, we tested different unimodal models on both the entire video
and the first second of the recording. The results were fairly similar, but using the entire video
requires significantly more computational power; we therefore decided to consider just
the first second of each video. Accordingly, in the pre-processing phase explained in Section 3.1,
we extract the first second of audio and the first 25 frames from each video. We choose 25 frames
because it is the dataset’s most common frame rate and thus approximately the number of
frames in one second of video.</p>
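The first-second extraction described above can be scripted around FFmpeg. The helper below only builds the command lines; the exact flags are our assumption, since the paper states only that FFmpeg is used:

```python
def ffmpeg_commands(video_path: str, audio_out: str, frames_pattern: str,
                    fps: int = 25):
    """Build FFmpeg invocations for the preprocessing described above:
    the first second of audio and the first 25 frames of a video."""
    # First second of the audio track, video stream dropped (-vn)
    audio_cmd = ["ffmpeg", "-y", "-i", video_path, "-t", "1", "-vn", audio_out]
    # First `fps` frames, i.e. roughly one second at the dataset's
    # most common frame rate of 25 fps
    frames_cmd = ["ffmpeg", "-y", "-i", video_path,
                  "-vf", f"fps={fps}", "-frames:v", str(fps), frames_pattern]
    return audio_cmd, frames_cmd
```

Each command can then be run with `subprocess.run(cmd, check=True)`.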
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Baselines</title>
        <p>We develop the CMDD model using the knowledge gained from implementing DeepFakeCVT
and DeepMerge baselines.</p>
        <p>• DeepFakeCVT. The DeepFakeCVT has a structure similar to the CMDD model; the only
difference is in the classification block. DeepFakeCVT uses a CCT-3D transformer instead
of another convolutional block, to improve the computational efficiency of the model.
• DeepMerge. The DeepMerge model is instead an ensemble method that directly performs
a fusion of the scores of each branch of the CMDD model, where the two unimodal networks
keep their last FC layers for classification.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Implementation details</title>
        <p>We use the proposed baseline architectures to find the best settings for our multimodal
architecture. We train our models using the AdamW optimizer, the cross-entropy loss
function, and a learning rate of 1e-5. The number of training epochs necessary without
introducing overfitting is around 12. We train all models on an Nvidia
1080 Ti with 11GB of memory. Due to the limited available computational power, we set the
batch size to 8; specifically, 4 samples carry audio information, and the other 4 represent visual
features.</p>
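The training configuration above can be summarized in a short loop (a sketch; `model` and `loader` stand for the CMDD network and the balanced FakeAVCeleb data loader, which are not shown here):

```python
import torch

def train(model, loader, epochs: int = 12, lr: float = 1e-5):
    # AdamW optimizer and cross-entropy loss, as reported above
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for mel, frames, labels in loader:  # batches of 8 samples
            optimizer.zero_grad()
            loss = criterion(model(mel, frames), labels)
            loss.backward()
            optimizer.step()
```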
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Evaluation metrics</title>
        <p>We evaluate our results based on the following metrics.</p>
        <p>• Accuracy. It represents the ratio of correct predictions over all predicted samples.
Calling TP, TN, FP, and FN the true positive, true negative, false positive, and false
negative examples respectively, we can calculate the accuracy as (TP + TN)/(TP + TN + FP + FN).
• Precision. It represents the ratio of correct positive predictions over all positive predictions
and can be calculated as TP/(TP + FP).
• Recall. It is the ratio of correct predictions for a class to the total number of cases in
which it occurs, and is computed as TP/(TP + FN).
• F1-score. It is the harmonic mean of Precision and Recall and is obtained as
2 · (Precision · Recall)/(Precision + Recall).
• Confusion matrix. It is a matrix that visually represents the distribution of predictions
among the classes.</p>
        <p>Some metrics are then adapted to deal with more than 2 classes by introducing the concepts
of macro and micro metrics. Precision-Macro is computed by averaging the precision of
each predicted class, Recall-Macro by averaging the recall of each actual class, and
Precision-Micro and Recall-Micro by considering all the samples independently of their
class. Finally, the F1-Macro and F1-Micro scores are respectively computed using the Macro
and Micro versions of Precision and Recall [29].</p>
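The macro/micro metrics above can be computed directly; the helper below (our own sketch, not code from the paper) does so for the 4-class setting:

```python
import numpy as np

def macro_micro(y_true, y_pred, n_classes: int = 4):
    """Compute macro-averaged precision/recall/F1 and the micro metric."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p_macro, r_macro = np.mean(precisions), np.mean(recalls)
    f1_macro = 2 * p_macro * r_macro / (p_macro + r_macro)
    # In single-label multi-class classification, micro precision,
    # micro recall and micro F1 all reduce to plain accuracy.
    micro = np.mean(y_true == y_pred)
    return {"precision_macro": p_macro, "recall_macro": r_macro,
            "f1_macro": f1_macro, "micro": micro}
```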
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Results</title>
      <p>In this section, we analyze the performance of our proposed approach, which reaches
promising results compared to our baselines and to the accuracy values of other state-of-the-art
(SOTA) methods. Table 1 shows the performance of the proposed CMDD model and the two
provided baselines, DeepFakeCVT and DeepMerge. The CMDD model is clearly the
architecture with the best results on all the considered metrics: it achieves 98.9% accuracy,
outperforming both baselines. Figure 2 shows the confusion matrix of the
CMDD model. It contains just 7 misclassified samples over 668 total examples. Specifically, 3
samples are predicted as FakeVideo-FakeAudio (label 0) instead of FakeVideo-RealAudio (label
1), and another 3 as RealVideo-FakeAudio (label 2) instead of RealVideo-RealAudio (label 3).
Just one recording belonging to the actual FakeVideo-RealAudio (label 1) class is incorrectly
predicted as RealVideo-RealAudio (label 3). Figure 3 reports the trend of the CMDD’s
training loss, which decreases rapidly and smoothly until convergence without showing relevant
issues.</p>
      <p>
        The high quality of our results is demonstrated by comparing against the SOTA models
trained and tested on the FakeAVCeleb dataset, reported in Table 2. Our proposed CMDD
model takes third place in the SOTA leaderboard. Only DLC [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and S-Capsule Forensics [
        <xref ref-type="bibr" rid="ref2">2</xref>
          ] beat
our CMDD and DeepMerge models, while DeepFakeCVT is also beaten by the MIS-AVoiDD [30]
model. Differently from our proposed method, the authors of DLC [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ] apply a different input
processing that consists of extracting the Short Time Fourier Transform (STFT) signal from
the input audio track of each video. The STFT gives information about the frequency trend of the
extracted signal over time. Instead, we deal with the same frequency information by displaying
it in a compact colored image, the mel-spectrogram. Furthermore, we obtain similar results using
just the first second of the audio track instead of the full audio track. In addition,
we represent the frequency information in a single image, while [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] considers 4 concatenated
images for each second of video, increasing the memory and GPU requirements. In their paper,
the authors use a pre-training step to enhance the capabilities of their architecture, while our
model is trained from scratch. We also reach comparable results using fewer training epochs
(12 instead of over 100) and a smaller batch size (8 instead of 12). We achieve quite similar
performance despite our limited computing power, so our model might outperform DLC [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with equal
computational capacity. Instead, S-Capsule Forensics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] uses mel spectrograms as audio input,
referring to the first 4 seconds of the recording instead of the first second as we do. Furthermore,
a different visual feature processing is applied: they utilize all the frames of each video, while
in our project we set the number of extracted frames to 25 per video. In this way, all videos
contribute the same number of extracted images, and the network does not consider the length of the
recordings as a relevant feature for classification. As in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we achieve similar accuracy with fewer training epochs and a smaller
batch size: they use 50 epochs and a batch size of 10. The MIS-AVoiDD [30] method also uses another type of auditory
input. Instead of the Short Time Fourier Transform (STFT) signal or the mel-spectrogram
as we do, its architecture handles Mel Frequency Cepstrum
Coefficients (MFCC). For the visual part, the authors use 300 frames per video in the training step and
180 in the testing step. Instead, we obtain higher performance by using 26 frames for each
video, specifically 1 image for the audio and 25 for the visual information. Our networks are also
shallower than their model, allowing us to quickly extract and use the necessary information
for efficient training, without extracting too many features that may introduce some bias.
      </p>
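To make the audio preprocessing concrete, the following is a minimal NumPy sketch of a log mel-spectrogram computed from one second of audio, in the spirit of the pipeline described above; the window size, hop length, and 64 mel bands are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank, shape (n_mels, n_fft // 2 + 1)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(y, sr, n_fft=512, hop=256, n_mels=64):
    """Log mel-spectrogram of a mono signal, shape (n_mels, n_frames)."""
    win = np.hanning(n_fft)
    frames = np.stack([y[s:s + n_fft] * win
                       for s in range(0, len(y) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return 10.0 * np.log10(mel + 1e-10).T

# first second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr)  # shape (64, 61)
```

The resulting 2-D array can then be rendered as a color image and fed to the audio stream of the network.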
      <p>Further differences in the type of input also appear in PVASS-MDD [31] and in
Multimodaltrace [33], where the authors utilize a spectrogram and a Short Time Fourier Transform (STFT)
signal, respectively. These two papers use more input information than necessary in the audio and visual features
and apply deeper networks than ours. Nevertheless, we outperform both models, and even
AVFakeNet [32], using less informative content and shallow networks with lower batch size and
learning rate values, but the same optimizer and loss function.</p>
      <p>
        In Table 2, all the reported models use the same dataset, but some differences appear in the
balancing phase. Some authors correct the imbalance of the dataset by adding
videos from another small dataset, as done in Shahzad et al. [21], or by using augmentation
techniques like horizontal and vertical flipping, blurring, and salt-and-pepper noise, as done in Ilyas
et al. [32]. Both preprocessing methods are used together by Raza et al. [33].
In other papers like Yu et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors split the dataset according to the manipulation
technique used to forge videos, test on each group separately, and finally
average the resulting evaluation metrics. In a few cases, the authors reduce
the imbalance of the dataset by removing the excess deepfakes, as done by Shahzad
et al. [21] and Hashmi et al. [34]. For this reason, we decided to explore this open branch by
applying the same balancing technique; the resulting ranking of state-of-the-art models
is reported in Table 3, where our CMDD model takes first place, with our two baselines
following as the second and third models in the leaderboard.
      </p>
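The undersampling strategy described above can be sketched as follows; the file names and class counts are hypothetical, and the real pipeline operates on FakeAVCeleb videos rather than strings.

```python
import random

def undersample(items, labels, seed=0):
    """Balance a dataset by discarding excess samples of the larger classes,
    mirroring the 'remove the excess deepfakes' strategy."""
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = [(item, label)
                for label, group in by_class.items()
                for item in rng.sample(group, n_min)]  # keep n_min per class
    rng.shuffle(balanced)
    return balanced

# hypothetical FakeAVCeleb-style imbalance: many fakes, few real videos
videos = [f"fake_{i}.mp4" for i in range(500)] + [f"real_{i}.mp4" for i in range(100)]
labels = ["fake"] * 500 + ["real"] * 100
balanced = undersample(videos, labels)  # 100 fakes + 100 real videos
```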
      <p>The model closest to ours in terms of accuracy is the AV-Lip-Sync model
introduced by Shahzad et al. [21]. In their paper, the authors use a different pipeline for the audio
information. They use the Wav2Lip tool to generate synthetic lip sequences based on the audio
track of each video. They then compare the two lip sequences, the generated and the real one, and
detect manipulated videos by looking at the potential mismatch between them. Their
selected hyper-parameters also differ from ours: they use the Adam optimizer, a learning rate of
0.0002, and a batch size of 32, instead of AdamW, 1e-5, and 8, respectively, as done in our proposed
networks. Instead, Hashmi et al. [34] propose an ensemble model that uses different CNN-based
networks for audio and visual feature extraction. Their choice of hyper-parameters also
differs: they use the Adam optimizer, a learning rate of 0.001, cross-entropy loss, and a
different batch size for each network: 512 for the audio and 64 for the video network. Their
accuracy is considerably lower than that obtained by our models.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions and Future Works</title>
      <p>This paper proposes a novel multimodal deepfake detection method called CMDD. It allows
for the simultaneous detection of forgeries in both auditory and visual channels using two
CNNs for feature extraction and a shared convolutional block for the final multi-class
classification. We also implement two different baselines, named DeepFakeCVT and DeepMerge,
to highlight the capabilities of the CMDD model. The CMDD model outperforms
both architectures using a shallower network than DeepFakeCVT, which uses a deeper Vision
Transformer, and a single classification block instead of two as in DeepMerge. CMDD also
outperforms several state-of-the-art (SOTA) models, achieving an accuracy score of 98.9%,
which places it first in the SOTA ranking when considering methods trained and tested on
FakeAVCeleb using our same balancing technique, and third when considering all possible
versions of the dataset.</p>
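For contrast with CMDD's shared classification block, a DeepMerge-style score (late) fusion of the two unimodal streams can be sketched as follows; the logits and the four-class ordering are hypothetical assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def score_fusion(audio_logits, visual_logits):
    """Average the per-modality class probabilities (late fusion),
    then pick the most likely class for each video."""
    probs = 0.5 * (softmax(audio_logits) + softmax(visual_logits))
    return probs.argmax(axis=-1), probs

# hypothetical logits for 2 videos over 4 audio/visual forgery categories
audio_logits = np.array([[2.0, 0.1, 0.1, 0.1],
                         [0.0, 0.0, 3.0, 0.0]])
visual_logits = np.array([[1.5, 0.2, 0.0, 0.3],
                          [0.1, 0.2, 2.5, 0.1]])
pred, probs = score_fusion(audio_logits, visual_logits)  # pred -> [0, 2]
```

In CMDD, by contrast, the two feature streams are merged before classification, so a single convolutional block sees both modalities at once rather than averaging two independent decisions.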
      <p>
        In the future, we could enhance our architecture with more powerful training and
testing resources. For example, increasing the batch size, learning rate, or number of epochs would help us evaluate whether
our proposed model outperforms the DLC [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] network. Furthermore, the proposed architecture
showed a rapid tendency to overfit due to its high number of layers and the dataset type.
We plan to extend this study to larger datasets for a more robust evaluation. It may also be
interesting to better explore the potential offered by our DeepFakeCVT model, by removing
the two initial CNNs and thus reducing the depth of the network by several convolutional
steps, but also by testing other concatenation and fusion techniques. Future work will, therefore,
be oriented towards studying different fusion techniques between the two modalities.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This study has been supported by SERICS (PE00000014) under the MUR National Recovery and
Resilience Plan funded by the European Union - NextGenerationEU.</p>
      <p>[20] D. Cozzolino, A. Pianese, M. Nießner, L. Verdoliva, Audio-visual person-of-interest
deepfake detection, 2023. arXiv:2204.03083.
[21] S. A. Shahzad, A. Hashmi, S. Khan, Y.-T. Peng, Y. Tsao, H.-M. Wang, Lip sync matters: A
novel multimodal forgery detector, in: 2022 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC), 2022, pp. 1885–1892.
doi:10.23919/APSIPAASC55919.2022.9980296.
[22] J. Wang, Z. Wu, W. Ouyang, X. Han, J. Chen, S.-N. Lim, Y.-G. Jiang, M2tr: Multi-modal
multi-scale transformers for deepfake detection, 2022. arXiv:2104.09770.
[23] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal
convolutions for action recognition, CoRR abs/1711.11248 (2017). URL: http://arxiv.org/abs/1711.11248. arXiv:1711.11248.
[24] H. Khalid, S. Tariq, M. Kim, S. S. Woo, Fakeavceleb: A novel audio-video multimodal
deepfake dataset, 2022. arXiv:2108.05080.
[25] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, Faceforensics++:
Learning to detect manipulated facial images, 2019. arXiv:1901.08971.
[26] P. Korshunov, S. Marcel, Vulnerability assessment and detection of deepfake videos, in:
2019 International Conference on Biometrics (ICB), 2019, pp. 1–6. doi:10.1109/ICB45273.2019.8987375.
[27] P. Kwon, J. You, G. Nam, S. Park, G. Chae, Kodf: A large-scale korean deepfake detection
dataset, 2021. arXiv:2103.10094.
[28] J. S. Chung, A. Nagrani, A. Zisserman, Voxceleb2: Deep speaker recognition, in:
Interspeech 2018, ISCA, 2018, pp. 1086–1090. URL: http://dx.doi.org/10.21437/Interspeech.2018-1929. doi:10.21437/interspeech.2018-1929.
[29] M. Grandini, E. Bagli, G. Visani, Metrics for multi-class classification: an overview, 2020.
arXiv:2008.05756.
[30] V. S. Katamneni, A. Rattani, Mis-avoidd: Modality invariant and specific representation
for audio-visual deepfake detection, 2023. arXiv:2310.02234.
[31] Y. Yu, X. Liu, R. Ni, S. Yang, Y. Zhao, A. C. Kot, Pvass-mdd: Predictive visual-audio alignment
self-supervision for multimodal deepfake detection, IEEE Transactions on Circuits and
Systems for Video Technology (2023) 1–1. doi:10.1109/TCSVT.2023.3309899.
[32] H. Ilyas, A. Javed, K. M. Malik, Avfakenet: A unified end-to-end dense swin transformer
deep learning model for audio–visual deepfakes detection, Applied Soft Computing 136
(2023) 110124. URL: https://www.sciencedirect.com/science/article/pii/S1568494623001424.
doi:10.1016/j.asoc.2023.110124.
[33] M. Anas Raza, K. Mahmood Malik, Multimodaltrace: Deepfake detection using audiovisual
representation learning, in: 2023 IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), 2023, pp. 993–1000. doi:10.1109/CVPRW59228.2023.00106.
[34] A. Hashmi, S. A. Shahzad, W. Ahmad, C. W. Lin, Y. Tsao, H.-M. Wang, Multimodal
forgery detection using ensemble learning, in: 2022 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA ASC), 2022, pp. 1524–1532.
doi:10.23919/APSIPAASC55919.2022.9980255.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          , E. Sonuç,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Duru</surname>
          </string-name>
          ,
          <article-title>Analysis survey on deepfake detection and recognition with convolutional neural networks</article-title>
          ,
          <source>in: 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:10.1109/HORA55278.2022.9799858.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muppalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <article-title>Integrating audio-visual features for multimodal deepfake detection</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.03827.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hashmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Shahzad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsao</surname>
          </string-name>
          , H.-M. Wang,
          <article-title>Avtenet: Audio-visual transformer-based ensemble network exploiting multiple experts for video deepfake detection</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.13103.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tian</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          , J. Han,
          <article-title>A unified framework for modality-agnostic deepfakes detection</article-title>
          ,
          <year>2023</year>
          . arXiv:2307.14491.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Detection of ai-synthesized speech using cepstral and bispectral statistics</article-title>
          ,
          <year>2021</year>
          . arXiv:2009.01934.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yan</surname>
          </string-name>
          , G. Chen,
          <article-title>Identification of fake stereo audio using svm and cnn</article-title>
          ,
          <source>Information</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>263</fpage>
          . doi:10.3390/info12070263.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Wani</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Amerini</surname>
          </string-name>
          ,
          <article-title>Deepfakes audio detection leveraging audio spectrogram and convolutional neural networks</article-title>
          , in: G. L. Foresti, A. Fusiello, E. Hancock (Eds.),
          <source>Image Analysis and Processing - ICIAP 2023</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wijethunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Matheesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Noman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tissera</surname>
          </string-name>
          , L. Rupasinghe,
          <article-title>Deepfake audio detection: A deep learning based solution for group conversations</article-title>
          ,
          <source>2020 2nd International Conference on Advancements in Computing (ICAC) 1</source>
          (
          <year>2020</year>
          )
          <fpage>192</fpage>
          -
          <lpage>197</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:232071547.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alhameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jeribi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tahir</surname>
          </string-name>
          ,
          <article-title>Voice spoofing countermeasure for logical access attacks detection</article-title>
          ,
          <source>IEEE Access</source>
          PP (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi:10.1109/ACCESS.2021.3133134.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <article-title>Finfer: Frame inference-based deepfake detection for high-visual-quality videos</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>36</volume>
          (
          <year>2022</year>
          )
          <fpage>951</fpage>
          -
          <lpage>959</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/19978. doi:10.1609/aaai.v36i1.19978.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nirkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Y. Keller, T. Hassner,
          <article-title>Deepfake detection based on the discrepancy between the face and its context</article-title>
          ,
          <year>2020</year>
          . arXiv:2008.12262.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Maiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Papa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vocaj</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Amerini</surname>
          </string-name>
          ,
          <article-title>Depthfake: a depth-based strategy for detecting deepfake videos</article-title>
          ,
          <source>ArXiv abs/2208.11074</source>
          (
          <year>2022</year>
          ). URL: https://api.semanticscholar.org/CorpusID:251741027.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cozzolino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rössler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nießner</surname>
          </string-name>
          , L. Verdoliva,
          <article-title>Id-reveal: Identity-aware deepfake video detection</article-title>
          ,
          <year>2021</year>
          . arXiv:2012.02512.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Video transformer for deepfake detection with incremental learning</article-title>
          ,
          <year>2021</year>
          . arXiv:2108.05307.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mittal</surname>
          </string-name>
          , U. Bhattacharya,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Manocha</surname>
          </string-name>
          ,
          <article-title>Emotions don't lie: An audio-visual deepfake detection method using affective cues</article-title>
          ,
          <year>2020</year>
          . arXiv:2003.06711.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>A. S</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vinod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Amerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <article-title>A novel deepfake detection framework using audiovideo-textual features</article-title>
          ,
          <year>2022</year>
          . URL: https://doi.org/10.21203/rs.3.rs-2390408/v1. doi:10.21203/rs.3.rs-2390408/v1.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A deepfake video detection method based on multi-modal deep learning method</article-title>
          ,
          <source>in: 2021 2nd International Conference on Electronics, Communications and Information Technology (CECIT)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>33</lpage>
          . doi:10.1109/CECIT53797.2021.00014.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kataoka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Satoh</surname>
          </string-name>
          ,
          <article-title>Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?</article-title>
          ,
          <year>2018</year>
          . arXiv:1711.09577.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          ,
          <article-title>Not made for each other- audio-visual dissonance-based deepfake detection and localization</article-title>
          ,
          <year>2021</year>
          . arXiv:2005.14405.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>