<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CAU KU deep fake detection system for ADD 2023 challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soyul Han</string-name>
          <email>soyul5458@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taein Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunmook Choi</string-name>
          <email>felixchoi@korea.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaejin Seo</string-name>
          <email>seojaejin@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanghyeok Chung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumi Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seungsang Oh</string-name>
          <email>seungsang@korea.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Il-Youp Kwak</string-name>
          <email>ikwak2@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chung-Ang University</institution>
          ,
          <addr-line>84, Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea University</institution>
          ,
          <addr-line>145 Anam-ro, Seongbuk-gu, Seoul 02841</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>The paper presents the participation of the CAU_KU team in the ADD 2023 Challenge, specifically in track 1.2 (audio fake game - detection track) and track 3 (deepfake algorithm recognition track). Various deep learning models were explored using features from the pretrained wav2vec2 network, as well as CQT, mel-spectrogram, etc. We modified the representation extraction component of the AASIST model to incorporate 2D spectrograms (wav2vec2 or CQT) and attempted different deep learning models, with model ensembling employed to create the final model. For track 1.2, our submitted ensemble model for round 1 utilized the CQT-LCNN and CQT-AASIST models. For round 2, our model used the CQT-LCNN, CQT-AASIST, and W2V2-GMM models. For track 3, we ensembled the CQT-LCNN, CQT-OFD, and AASIST models. Additionally, we applied the OpenMax algorithm to detect unknown deepfake attacks. Our best submissions achieved EERs of 23.44% and 21.26% on rounds 1 and 2 of track 1.2, respectively, and ranked 3rd in track 1.2.</p>
      </abstract>
      <kwd-group>
        <kwd>audio deep synthesis</kwd>
        <kwd>audio deepfake detection</kwd>
        <kwd>deep learning</kwd>
        <kwd>deepfake algorithm recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our main contributions are as follows:</p>
      <p>• Modified the representation extraction part of the AASIST model, utilizing W2V2 and CQT.</p>
      <p>• Experimented with models that ranked 3rd in <xref ref-type="bibr" rid="ref3">the previous ADD 2022</xref> challenge [4], such as LCNN, ResMax, and OFD.</p>
      <p>• Conducted experiments using the Gaussian mixture model (GMM) with the W2V2 feature, as well as traditional features such as MFCC and CQT.</p>
      <p>• Applied the OpenMax algorithm for track 3.</p>
      <sec id="sec-1-1">
        <title>2.3. Models</title>
        <sec id="sec-1-1-1">
          <title>For the first round of Track 1.2, the ensemble model</title>
          <p>submitted by our team comprised the CQT-LCNN and
CQT-AASIST models. For the second round, the
ensemble model consisted of the CQT-LCNN, CQT-AASIST,
and W2V2-GMM models. These submissions achieved
EERs of 23.44% and 21.26% on round 1 and 2 of Track 1.2,
respectively, and ranked 3rd in this track.</p>
          <p>In Track 3, we considered an ensemble of three models:
CQT-LCNN, CQT-OFD, and AASIST. To detect new attack
types, the OpenMax algorithm was applied. Our system
achieved an F1-score of 0.7205 for Track 3.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Feature engineering</title>
        <p>In this study, we conducted experiments utilizing four widely used audio feature extraction methods: CQT, mel-spectrogram, MFCC, and W2V2 [5]. Each method possesses distinct advantages and limitations, rendering it suitable for specific applications. CQT uses a constant Q factor to ensure higher frequency resolution at low frequencies and lower resolution at high frequencies, and it has demonstrated effectiveness in deepfake detection tasks. The mel-spectrogram is obtained by applying mel filterbanks to the power spectrum of the audio signal. MFCC is another popular feature extraction method used in speech processing and music analysis. W2V2 is a state-of-the-art speech recognition method that learns powerful representations from speech audio alone and achieves impressive results with significantly less labeled data than previous methods. The first-placed team in track 1 (deepfake detection track) of <xref ref-type="bibr" rid="ref3">the ADD 2022</xref> challenge demonstrated the usefulness of the W2V2 pretrained network [2]. By applying the discrete cosine transform (DCT) to the CQT or mel-spectrogram features, we obtain more compressed representations: Constant Q Cepstral Coefficients (CQCC) or MFCC. In deep learning scenarios, raw representations such as the mel-spectrogram and CQT often lead to higher accuracy. Thus, for our deep learning models, we opted for mel-spectrogram and CQT features rather than CQCC and MFCC features.</p>
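        <p>For illustration, the following sketch extracts the three spectral features with librosa; the sample rate, hop length, and bin counts shown are assumptions made for this example rather than the exact settings of our system.</p>
        <preformat>
# Hedged feature-extraction sketch (librosa); parameter values are illustrative.
import numpy as np
import librosa

def extract_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    # Constant-Q transform: constant Q factor, finer resolution at low frequencies.
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=96)))
    # Mel-spectrogram: mel filterbanks applied to the power spectrum.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80))
    # MFCC: DCT of the log mel-spectrogram, a more compressed representation.
    mfcc = librosa.feature.mfcc(S=mel, n_mfcc=20)
    return cqt, mel, mfcc
        </preformat>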
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data augmentation</title>
        <p>We explored several augmentation techniques such as mixup [6], SpecAugment [7], FFM [4], FilterAugment [8], and cutout [9]. These techniques had previously shown promise in improving performance in <xref ref-type="bibr" rid="ref3">the ADD 2022</xref> challenge [10]. However, in the context of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, incorporating these augmentation techniques did not yield substantial improvements in performance.</p>
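        <p>As an example of one such technique, the sketch below applies mixup to a batch of spectrograms and one-hot labels; the Beta parameter and tensor layout are assumptions made for illustration only.</p>
        <preformat>
# Hedged mixup sketch (PyTorch); alpha and the batch layout are illustrative.
import torch

def mixup(x, y, alpha=0.4):
    # x: (batch, channels, freq, time) spectrograms; y: one-hot labels.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
        </preformat>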
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Models</title>
        <sec id="sec-2-3-1">
          <title>2.3.1. LCNN model</title>
          <p>[Figure 1: (a) LCNN block; (b) LCNN model.]</p>
          <p>The efficacy of the LCNN model has been demonstrated in previous research through its notable performance in the ASVspoof 2017, 2019, and 2021 challenges [11, 12, 13]. Our implementation of the LCNN model, as depicted in Figure 1(b) [14], consists of 9 layers, akin to the Light CNN-9 model. However, we modified the architecture by substituting the fully connected layer of the original Light CNN-9 model [14] with a global average pooling layer, a batch normalization layer, and a dropout layer. In Track 1.2, the final dense layer of our LCNN model outputs two values, representing the labels “spoofing” and “genuine.” In Track 3, the output dense layer had a size of 7, representing the seven known deepfake algorithms, and was activated using the softmax activation function. Figure 1(a) describes the LCNN block, whose hyperparameters denote the filter size, the kernel size, and whether batch normalization is used. The block performs the MFM (Max-Feature-Map) operation using two convolution layers and optionally applies a batch normalization layer, indicated by the dashed block, when the batch normalization flag is set to 1.</p>
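          <p>A minimal sketch of the MFM operation used in the LCNN block is shown below; the channel counts, kernel size, and batch-normalization flag are illustrative assumptions, not the exact configuration of Figure 1.</p>
          <preformat>
# Hedged LCNN-block sketch with the Max-Feature-Map (MFM) operation (PyTorch).
import torch
import torch.nn as nn

class MFMConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, use_bn=True):
        super().__init__()
        # Produce 2*out_ch channels, then keep the element-wise maximum of the
        # two halves, which is equivalent to two parallel convolution branches.
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch) if use_bn else nn.Identity()

    def forward(self, x):
        a, b = torch.chunk(self.conv(x), 2, dim=1)
        return self.bn(torch.max(a, b))
          </preformat>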
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. AASIST model and our proposed AASIST variant</title>
          <p>AASIST is an extended version of RawGAT-ST [15] that is based on a graph neural network [16]. AASIST has achieved state-of-the-art performance on the ASVspoof 2019 challenge dataset for the logical access (LA) scenario.</p>
          <p>We propose modifications to the representation extraction part of the AASIST model. We conducted experiments in which this extraction part was replaced by either a W2V2 pretrained model or CQT features, as shown in Figure 2. In the figure, the upper component of the representation extraction part depicts the original AASIST model. The middle component represents the model utilizing W2V2, with fine-tuning of the last transformer layers of the W2V2 pretrained model. The lower component represents the model using CQT features.</p>
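          <p>A rough sketch of such a W2V2 front end is given below, using the Hugging Face transformers library; the checkpoint name, the number of fine-tuned layers, and the way the hidden states are handed to the back end are assumptions for illustration, not our exact implementation.</p>
          <preformat>
# Hedged W2V2 front-end sketch (transformers); checkpoint and layer choice are
# illustrative assumptions, and attribute paths may differ across library versions.
import torch
from transformers import Wav2Vec2Model

w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-1b")
for p in w2v2.parameters():
    p.requires_grad = False                     # freeze the pretrained network ...
for p in w2v2.encoder.layers[-2:].parameters():
    p.requires_grad = True                      # ... except the last transformer layers

def w2v2_frontend(waveform):
    # waveform: (batch, samples) raw audio at 16 kHz.
    hidden = w2v2(waveform).last_hidden_state   # (batch, frames, dim)
    return hidden.transpose(1, 2)               # 2D, spectrogram-like representation
          </preformat>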
        </sec>
        <sec id="sec-2-3-3">
          <title>2.3.3. GMM model</title>
          <p>The Gaussian mixture model (GMM) is a probabilistic model that represents data as a combination of multiple Gaussian distributions [17]. During <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, we observed that the performance of deep learning models on the test set did not meet the anticipated level of success. This led us to consider the traditional machine-learning-based GMM, which has been widely employed in ASVspoof 2015 [18] and ASVspoof 2017 [19]. In addition, considering the need for simpler models to prevent overfitting, we recognized the GMM as a suitable way to model features extracted through W2V2 pretrained networks in a straightforward yet effective manner. We considered various features, such as MFCC, CQT, and W2V2, as inputs to the GMM model.</p>
        <sec id="sec-2-1-1">
          <title>In OFD model, each block divides the feature map</title>
          <p>into multiple parts along the frequency axis allowing for
overlap. In contrast, Non-OFD model partitions the
feature map along the frequency axis without any overlap.
Both models consist of six blocks, each characterized by
three hyperparameters: the number of splits, the
presence or absence of overlap, and the activation function.
In OFD model, all six blocks are split with overlap, while
in Non-OFD model, no overlaps occur between blocks.
The activation function can be either ReLU or MFM. For
instance, “OFD with (1, 2, 3, 4, 5, 6)-ReLU” refers
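          <p>A small sketch of the overlapped frequency split performed by an OFD-style block is given below; the overlap ratio and the band-width computation are assumptions for illustration.</p>
          <preformat>
# Hedged sketch of overlapped frequency splitting for an OFD-style block (PyTorch).
import torch

def split_frequency(x, n_splits, overlap=0.5):
    # x: (batch, channels, freq, time); returns a list of overlapping frequency bands.
    if n_splits in (0, 1):
        return [x]                                   # no split for this block
    freq = x.size(2)
    band = int(freq / (n_splits - (n_splits - 1) * overlap))
    step = int(band * (1.0 - overlap))
    return [x[:, :, i * step : i * step + band, :] for i in range(n_splits)]
          </preformat>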
        </sec>
        <sec id="sec-2-3-5">
          <title>2.3.5. Other models</title>
          <p>Additionally, we conducted experiments with several alternative methods, including ResMax [21, 22], BC-ResMax, and DDWS [23]. However, these methods exhibited accuracy comparable to the LCNN model, and due to limited time, we were unable to dedicate further investigation to these models.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.4. OpenMax for unknown attack detection</title>
        <p>OpenMax [24] is an algorithm designed for open set
recognition, specifically targeting the identification of
utterances belonging to the unknown class. The algorithm
consists of two steps: preparation and inference.</p>
        <p>During the preparation step, a model is trained using the known classes from the training set. Following the training phase, final-layer logit vectors (seven-dimensional) are computed for the correctly classified training samples. The mean vector μ_j of the logit vectors corresponding to each class j = 0, 1, …, 6 is computed, and the distance between the logit vector of each correctly classified training sample and the mean vector of its class is determined. A Weibull distribution is then fitted for each class using the libMR [25] FitHigh function on the η samples with the largest distance to the mean vector.</p>
        <p>In the inference step, the final-layer logit vectors are obtained for all test samples. For each logit vector v = (v_0, …, v_6), the probability ω_j of not belonging to class j is computed from the fitted Weibull distribution for every j = 0, 1, …, 6. The logit vector is then updated as</p>
        <p>ṽ = ( (1 − ω_0) v_0, …, (1 − ω_6) v_6, Σ_{j=0..6} ω_j v_j ),</p>
        <p>and the softmax of ṽ serves as the output of the OpenMax algorithm.</p>
        <p>To handle uncertain predictions, a threshold ε is set. For each test sample, if max_{j ∈ {0, …, 7}} softmax(ṽ)_j ≤ ε, or if the unknown class (j = 7) has the largest probability, then its predicted class is set to 7.</p>
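        <p>A compact sketch of the recalibration described above is given below; scipy's Weibull fit is used here as a stand-in for the libMR FitHigh routine, and the tail size is an assumed value.</p>
        <preformat>
# Hedged OpenMax sketch (numpy/scipy); scipy stands in for libMR FitHigh,
# and the tail size is an illustrative assumption.
import numpy as np
from scipy.stats import weibull_min

def fit_tail(distances, tail_size=20):
    # Fit a Weibull distribution to the largest distances from the class mean.
    tail = np.sort(distances)[-tail_size:]
    return weibull_min.fit(tail, floc=0)

def openmax(v, means, weibulls):
    # v: 7-dim logit vector; means[j]: class-mean logits; weibulls[j]: fitted params.
    omega = np.array([weibull_min.cdf(np.linalg.norm(v - means[j]), *weibulls[j])
                      for j in range(7)])
    v_known = (1.0 - omega) * v                  # rescaled known-class logits
    v_unknown = np.sum(omega * v)                # mass shifted to the unknown class
    v_tilde = np.append(v_known, v_unknown)
    e = np.exp(v_tilde - v_tilde.max())
    return e / e.sum()                           # softmax over the 8 classes
        </preformat>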
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>3.1.1. ADD 2023 challenge datasets</p>
        <p><xref ref-type="bibr" rid="ref2">The ADD 2023</xref> challenge consists of three tracks, and we describe the datasets for Track 1.2 and Track 3 [1]. Track 1.2 aims to detect fake audio, which refers to realistic and natural-sounding fake voice audio that can deceive deepfake detection models. This track is divided into two rounds, both featuring nearly identical detection tasks. Table 1 shows the number of samples in the training, development, and test sets (rounds 1 and 2). Track 3 aims to recognize deepfake speech algorithms. The training and development sets have seven labeled categories (0, 1, 2, ..., 6), one of which is real speech and the other six of which are fake speech algorithms; notably, which label corresponds to real speech is not disclosed. The test set has eight categories, but no label information is provided. Seven of them align with the “known” classes in the training and development sets, while the remaining category represents the unknown fake class, labeled 7. Table 2 shows the number of samples in the training, development, and test sets of the track 3 dataset.</p>
        <p>The ASVspoof 2019 challenge [26] focuses on TTS, VC, and replay spoofing attacks, and its dataset consists of logical access (LA) and physical access (PA) scenarios derived from the VCTK base corpus [27]. Our focus lies primarily on the LA data, which uses 17 TTS and VC systems to produce both genuine and fake speech samples. The dataset is partitioned into three subsets: training, development, and evaluation; the evaluation data contains approximately 71K utterances with unknown attacks.</p>
        <p>To assess the performance of our experimental models, we conducted evaluations on two databases: <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge dataset and the ASVspoof 2019 LA dataset. Model performance was evaluated using the equal error rate (EER), which indicates the point at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal. A lower EER value generally indicates better performance.</p>
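        <p>For reference, a minimal EER computation over a set of scores is sketched below; the convention that higher scores indicate genuine speech is an assumption of the sketch.</p>
        <preformat>
# Hedged EER sketch (numpy); assumes higher scores mean "genuine".
import numpy as np

def compute_eer(scores, labels):
    # labels: 1 for genuine, 0 for spoof; sweep a threshold over all observed scores.
    thresholds = np.sort(np.unique(scores))
    genuine, spoof = scores[labels == 1], scores[labels == 0]
    far = np.array([(spoof >= t).mean() for t in thresholds])          # false acceptance rate
    frr = np.array([1.0 - (genuine >= t).mean() for t in thresholds])  # false rejection rate
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
        </preformat>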
        <p>The CQT-LCNN model was trained using 9-second samples, a batch size of 16, and 10 epochs. To fit the 9-second length, audio signals longer than 9 seconds were trimmed, and signals shorter than 9 seconds were repeated from the beginning to match the desired length. For training the GMM model, the entire length of each audio signal was used for extracting MFCC and CQT features, while 13.67 seconds of audio were used for W2V2 feature extraction to match the fixed input length of the pretrained network, which is set to 246,000 samples. To simplify the structure of the GMM model, we assumed a diagonal covariance matrix.</p>
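        <p>The trimming and repetition described above can be written as a small helper; the 9-second target follows the description above, while the 16 kHz sample rate and the helper itself are illustrative assumptions.</p>
        <preformat>
# Hedged sketch of fixing an utterance to a target length (numpy).
import numpy as np

def fix_length(y, target_len=9 * 16000):
    if y.shape[0] >= target_len:
        return y[:target_len]                    # trim signals longer than the target
    n_repeat = int(np.ceil(target_len / y.shape[0]))
    return np.tile(y, n_repeat)[:target_len]     # repeat short signals from the beginning
        </preformat>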
        <p>To stabilize the convergence of the model parameters, the learning rate was initially set to 1e-3 and subsequently reduced to 1e-5 using a sigmoidal decay function. For the ASVspoof 2019 dataset, we trained the models using only the training data. However, for the models submitted to the challenge, some sub-models were trained using both the training and development sets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Experimental results on ADD 2023 dataset for track 1.2</title>
        <p>Many of the models exhibited favorable performance on the training and development data. However, notable declines in performance were observed when evaluating the models on the actual test data, indicating that the models suffer from overfitting to the training and development data. To address this issue, techniques such as data augmentation and reducing the model size can be considered.</p>
        <p>3.3.1. Use of data augmentation techniques</p>
        <p>We utilized various data augmentation techniques such as mixup, FFM (LF, HF, and RF), FilterAugment (FA) [8], and cutout. However, the application of data augmentation did not yield substantial improvement when evaluated on the test data. Table 3 shows the results of applying data augmentation to the CQT-LCNN and BC-ResMax models.</p>
        <p>It was difficult to draw conclusions about the effectiveness of data augmentation based on the experimental results.</p>
        <p>3.3.2. GMM-based models</p>
        <p>Deep learning-based models were observed to suffer from serious overfitting in terms of test set accuracy. To address this concern, we experimented with GMM-based models, which had demonstrated success in prior ASVspoof challenges <xref ref-type="bibr" rid="ref48">(2015 and 2017)</xref> and are known for their ability to handle overfitting. We created deepfake detection models using GMMs with W2V2, CQT, and MFCC features. For the W2V2 features, we experimented with two W2V2 pretrained models: one trained on the 960 hours of audio of the LibriSpeech corpus (LS-960) and the other trained on the 60k hours of LibriVox data (LV-60k). We varied the number-of-components parameter of the W2V2-LV60k-GMM, exploring values of 16, 32, 64, and 128. Table 4 presents the experimental results. W2V2-LV60k-GMM demonstrated better performance than W2V2-LS960-GMM in terms of both test EER and dev EER. Although the W2V2-LV60k-GMM models exhibited higher dev EER than the other deep learning-based methods, they yielded better results in terms of test EER; the simpler structure of the GMM-based model appeared to mitigate the overfitting issue to some extent. Additionally, we conducted experiments with CQT-GMM and MFCC-GMM models, but W2V2-LV60k-GMM exhibited the best performance.</p>
        <p>As in RawNet2, the Raw-AASIST model uses fixed sinc filters to extract features from raw audio, while the W2V2-AASIST and CQT-AASIST models substitute the fixed sinc filters with W2V2 and CQT, respectively. We used the XLS-R (1B) version as the pretrained model for W2V2-AASIST [28]. Table 5 presents the performance of the three AASIST-based models and compares them across different numbers of training epochs. Training the models for many epochs led to overfitting on the test data, resulting in a decrease in the dev EER but an increase in the test EER. The CQT-AASIST (10ep) model refers to the model trained for 10 epochs. Although models trained for more epochs exhibited lower dev EER, overfitting was evident in the test EER. Therefore, despite the higher dev EER, we chose to use models trained for a small number of epochs, specifically between 5 and 10, for the AASIST-based models.</p>
        <p>[Table 5: Model performance comparison for AASIST-based models (Dev. EER / Test EER R1 / Test EER R2): Raw-AASIST 0.79% / 37.36%; W2V2-AASIST 0.12% / 39.83%; CQT-AASIST (10ep); CQT-AASIST (20ep).]</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Experimental results on ASVspoof 2019 LA dataset</title>
        <p>Table 6 presents the performance of the models developed for <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge on the ASVspoof 2019 LA dataset (ASV2019). The EER (ASV) column indicates the performance evaluated on the evaluation data after training on the ASV2019 training data. The EER (ADD-R1) and EER (ADD-R2) columns indicate the performance on the test data of rounds 1 and 2 of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, respectively. Among our experimental models, the W2V2-GMM model showed the best performance on ADD-R2 with a 26.28% EER; however, it exhibited poor performance on ASV2019, achieving a 9.8% EER. The CQT-LCNN and CQT-AASIST models, which performed well on ADD-R1 and ADD-R2, achieved EERs of 1.93% and 2.36%, respectively, on ASV2019. The W2V2-AASIST model showed exceptional performance with a 0.21% EER on ASV2019, but performed poorly on ADD-R1. The MFCC-LCNN model demonstrated good performance on ADD-R2, but poor performance on ASV2019.</p>
        <p>[Table 6: Model performance comparison on the ASVspoof 2019 LA dataset and the ADD 2023 test data (rounds 1 and 2) for the CQT-LCNN, W2V2-GMM, AASIST [3], CQT-AASIST, W2V2-AASIST, CQT-OFD, and MFCC-LCNN models.]</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Experimental results on ADD 2023 dataset for track 3</title>
        <p>For Track 3, the three models were ensembled with the ratios specified in Table 8, and they were slightly modified to adapt them from spoofing detection to algorithm recognition. The CQT-LCNN model remains unchanged, with an output dense layer of size 7 and softmax activation. For the OFD model, the (2, 2, 0, 0, 0, 0)-ReLU configuration is used, and two additional dense layers with 128 and 64 nodes are added just before the final layer so that the features from the CNN backbone can be used for classifying algorithms. Lastly, the AASIST model [3] was used with its output dense layer modified to size 7 with softmax activation. The ensemble of these three models achieved a 0.7205 test F1-score.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Table 7 describes the details of the three top-performing single systems, including their EERs on the final evaluation data (R1 and R2) in track 1.2 of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, as well as the EERs of our two ensemble systems. In Round 1, our final model consisted of an ensemble of the CQT-LCNN and CQT-AASIST models in a 1:1 ratio, achieving a 23.44% EER. In Round 2, we submitted an ensemble of the CQT-LCNN, CQT-AASIST, and W2V2-GMM models, weighted according to their respective accuracies, achieving a 21.26% EER.</p>
      <p>[Table 7: EER on the final evaluation data for track 1.2, covering the single systems LCNN (CQT), AASIST (CQT), and GMM (W2V2-LV60k) and the two ensemble systems.]</p>
      <p>This paper presents the models employed by our CAU_KU team participating in Track 1.2 and Track 3 of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge. We utilized various deepfake detection models, including the W2V2 pretrained model and a modified AASIST architecture. In Track 1.2, Round 1, our submission consisted of an ensemble model comprising the CQT-LCNN and CQT-AASIST models, achieving a 23.44% EER. In Round 2, our submission was an ensemble model combining the CQT-LCNN, CQT-AASIST, and W2V2-GMM models, achieving a 21.26% EER. For Track 3, we developed an ensemble model using the CQT-LCNN, CQT-OFD, and AASIST models, achieving a 0.7205 F1-score.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>
        This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (RS-2023-00208284, 2020R1C1C1A01013020) and by the Institute for Information &amp; communications Technology Planning &amp; Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00033, 50%, Study on Quantum Security Evaluation of Cryptography based on Computational Quantum Complexity).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2022</year>
          :
          <article-title>the first audio [1</article-title>
          ]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          , C. Y.
          <article-title>deep synthesis detection challenge</article-title>
          , in: 2022 IEEE
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2023</year>
          :
          <article-title>Signal Processing (ICASSP), IEEE</article-title>
          , IEEE, Singapore,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>the second audio deepfake detection challenge</article-title>
          ,
          <source>in: 2022</source>
          , pp.
          <fpage>9216</fpage>
          -
          <lpage>9220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>IJCAI 2023 Workshop on Deepfake Audio Detection</source>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Malykh</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Kozlov,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>and Analysis (DADA</surname>
          </string-name>
          <year>2023</year>
          ), volume
          <volume>0</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>0</fpage>
          -
          <lpage>0</lpage>
          .
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudashev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shchemelinin</surname>
          </string-name>
          , Audio replay attack [2]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Martín-Doñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Álvarez</surname>
          </string-name>
          ,
          <article-title>The vicomtech au- detection with deep learning frameworks</article-title>
          ,
          <source>in: Proc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>dio deepfake detection system based on wav2vec2 Interspeech</source>
          <year>2017</year>
          , ISCA, Stockholm,
          <year>2017</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>for the 2022 add challenge</article-title>
          ,
          <source>in: ICASSP</source>
          <year>2022</year>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Tseren,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>- 2022 IEEE International Conference on Acous- M. Volkova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Gorlanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kozlov</surname>
          </string-name>
          , STC
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          ,
          <source>Speech and Signal Processing (ICASSP)</source>
          ,
          <source>Antispoofing Systems for the ASVspoof2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>9241</fpage>
          -
          <lpage>9245</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP43922. Challenge, in
          <source>: Proc. Interspeech</source>
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <year>2022</year>
          .9747768. ISCA, Graz,
          <year>2019</year>
          , pp.
          <fpage>1033</fpage>
          -
          <lpage>1037</lpage>
          . URL: [3]
          <string-name>
            <given-names>J.-w.</given-names>
            <surname>Jung</surname>
          </string-name>
          , H.-S. Heo,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          , H.-j. Shim, J. S. http://dx.doi.org/10.21437/Interspeech.2019-
          <volume>1768</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>B.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Evans</surname>
          </string-name>
          , Aasist: Au- doi:10.21437/Interspeech.2019-
          <volume>1768</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>dio anti-spoofing using integrated spectro-temporal [13]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Svishchev</surname>
          </string-name>
          , M. Volkova,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>graph attention networks</article-title>
          , in: ICASSP 2022
          <string-name>
            <surname>- A. Chirkovskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kondratev</surname>
          </string-name>
          , G. Lavrentyeva,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          2022 IEEE International Conference on Acoustics,
          <source>STC Antispoofing Systems for the ASVspoof2021</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Speech and Signal Processing</surname>
          </string-name>
          (ICASSP), IEEE, Brno, Challenge, in
          <source>: Proc. 2021 Edition of the Automatic</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>6367</fpage>
          -
          <lpage>6371</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP43922. Speaker Verification and Spoofing Countermea-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <year>2022</year>
          .9747766. sures Challenge, ISCA, Brno,
          <year>2021</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>67</lpage>
          . [4]
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          , S. Han,
          <string-name>
            <surname>S</surname>
          </string-name>
          . Oh, doi:10.21437/ASVSPOOF.2021-
          <volume>10</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>Low-quality fake audio detection through fre-</article-title>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>A light cnn for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>1st International Workshop on Deepfake Detection Transactions on Information Forensics and Security</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>for Audio</surname>
            <given-names>Multimedia</given-names>
          </string-name>
          , DDAM '22,
          <string-name>
            <surname>Association</surname>
            <given-names>for</given-names>
          </string-name>
          13 (
          <year>2018</year>
          )
          <fpage>2884</fpage>
          -
          <lpage>2896</lpage>
          . doi:
          <volume>10</volume>
          .1109/TIFS.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Computing</given-names>
            <surname>Machinery</surname>
          </string-name>
          , New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>2833032</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          9-
          <fpage>17</fpage>
          . URL: https://doi.org/10.1145/3552466.3556533. [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>doi:10.1145/3552466</source>
          .3556533. A.
          <string-name>
            <surname>Larcher</surname>
          </string-name>
          ,
          <article-title>End-to-end anti-spoofing with rawnet2</article-title>
          , [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          , wav2vec in: ICASSP 2021
          <article-title>-</article-title>
          2021 IEEE International Confer-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>2.0: A framework for self-supervised learning ence on Acoustics, Speech</article-title>
          and Signal Processing
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>of speech representations</article-title>
          , in: H. Larochelle, (ICASSP), IEEE,
          <year>2021</year>
          , pp.
          <fpage>6369</fpage>
          -
          <lpage>6373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.), [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Veličković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Casanova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Romero,
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Systems</surname>
          </string-name>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , preprint arXiv:
          <volume>1710</volume>
          .10903 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          pp.
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          . URL: https://proceedings. [17]
          <string-name>
            <surname>C. M. Bishop</surname>
            ,
            <given-names>Pattern</given-names>
          </string-name>
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          and Machine
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>neurips.cc/paper_files/paper/2020/file/ Learning (Information Science and Statistics),</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          92d1e1eb1cd6f9fba3227870bb6d7f07-
          <fpage>Paper</fpage>
          .pdf. Springer-Verlag, Berlin, Heidelberg,
          <year>2006</year>
          . [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cisse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
            Lopez-Paz, [18]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Kinnunen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yamagishi</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          ,
          <year>2017</year>
          . C. Hanilci,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sizov</surname>
          </string-name>
          ,
          <year>Asvspoof 2015</year>
          : [7]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C.-
          <string-name>
            <surname>C. Chiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Zoph</surname>
          </string-name>
          ,
          <article-title>The first automatic speaker verification spoofing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>augmentation method for automatic speech recog- speech 2015, ISCA</article-title>
          , Dresden,
          <year>2015</year>
          , pp.
          <fpage>2037</fpage>
          -
          <lpage>2041</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          nition, in
          <source>: Proc. Interspeech</source>
          <year>2019</year>
          , ISCA, Graz,
          <year>2019</year>
          , [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Delgado</surname>
          </string-name>
          , M. Todisco,
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          pp.
          <fpage>2613</fpage>
          -
          <lpage>2617</lpage>
          . N.
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yamagishi</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , The asvspoof
          <year>2017</year>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <surname>Filteraugment:</surname>
          </string-name>
          <article-title>An challenge: Assessing the limits of replay spoofing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>acoustic environmental data augmentation method, attack detection</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2017</year>
          , ISCA,
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>in: ICASSP</source>
          <year>2022</year>
          -2022 IEEE International Confer- Stockholm,
          <year>2017</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>ence on Acoustics, Speech and Signal Processing</source>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          , Overlapped frequency-
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>(ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>4308</fpage>
          -
          <lpage>4312</lpage>
          .
          <article-title>distributed network: Frequency-aware voice spoof[9] T</article-title>
          . DeVries, G. W. Taylor,
          <article-title>Improved regularization of ing countermeasure</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <article-title>convolutional neural networks with cutout</article-title>
          ,
          <year>2017</year>
          . ISCA, Incheon,
          <year>2022</year>
          , pp.
          <fpage>3558</fpage>
          -
          <lpage>3562</lpage>
          . [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nie</surname>
          </string-name>
          , H. Ma,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          , [21]
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Huh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <given-names>Max</given-names>
            <surname>Feature Map</surname>
          </string-name>
          , in: 25th International Confer-
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Society</surname>
          </string-name>
          , Milan,
          <year>2021</year>
          , pp.
          <fpage>4837</fpage>
          -
          <lpage>4844</lpage>
          . [22]
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , H.-J.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>convolution</surname>
          </string-name>
          , IEEE Access (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>ACCESS.</surname>
          </string-name>
          <year>2023</year>
          .
          <volume>3275790</volume>
          . [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          , Light-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>26th International Conference on Pattern Recog-</mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Quebec</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>477</fpage>
          -
          <lpage>483</lpage>
          . [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bendale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Boult</surname>
          </string-name>
          , Towards open set deep net-
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>works</surname>
          </string-name>
          ,
          <source>2016 IEEE Conference on Computer Vision</source>
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>and Pattern Recognition (CVPR</surname>
          </string-name>
          ) (
          <year>2015</year>
          )
          <fpage>1563</fpage>
          -
          <lpage>1572</lpage>
          . [25]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Micheals</surname>
          </string-name>
          , T. E. Boult,
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <source>Analysis and Machine Intelligence</source>
          <volume>33</volume>
          (
          <year>2011</year>
          )
          <fpage>1689</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          1695. doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2011</year>
          .
          <volume>54</volume>
          . [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vestman</surname>
          </string-name>
          , M. Sahidul-
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          2019:
          <article-title>Future Horizons in Spoofed</article-title>
          and Fake
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <given-names>Audio</given-names>
            <surname>Detection</surname>
          </string-name>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>ISCA</surname>
          </string-name>
          , Graz,
          <year>2019</year>
          , pp.
          <fpage>1008</fpage>
          -
          <lpage>1012</lpage>
          . URL: http://
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          dx.doi.org/10.21437/Interspeech.2019-
          <fpage>2249</fpage>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          21437/Interspeech.2019-
          <volume>2249</volume>
          . [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Veaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>MacDonald</surname>
          </string-name>
          , et al.,
          <source>Cstr</source>
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>voice cloning toolkit</source>
          ,
          <year>2017</year>
          . URL: https://datashare.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          ed.ac.uk/handle/10283/2651. [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <source>arXiv:2111.09296</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>