<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CAU KU deep fake detection system for ADD 2023 challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soyul Han</string-name>
          <email>soyul5458@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taein Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunmook Choi</string-name>
          <email>felixchoi@korea.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaejin Seo</string-name>
          <email>seojaejin@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanghyeok Chung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumi Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seungsang Oh</string-name>
          <email>seungsang@korea.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Il-Youp Kwak</string-name>
          <email>ikwak2@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chung-Ang University</institution>
          ,
          <addr-line>84, Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea University</institution>
          ,
          <addr-line>145 Anam-ro, Seongbuk-gu, Seoul 02841</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>The paper presents the participation of the CAU_KU team in the ADD 2023 Challenge, specifically in track 1.2 (audio fake game - detection track) and track 3 (deepfake algorithm recognition track). Various deep learning models were explored using features from the pretrained wav2vec2 network, as well as CQT, mel-spectrogram, etc. We modified the representation extraction component of the AASIST model to incorporate 2D spectrograms (wav2vec2 or CQT) and attempted different deep learning models, with model ensembling employed to create the final model. For track 1.2, our submitted ensemble model for round 1 utilized the CQT-LCNN and CQT-AASIST models. For round 2, our model used the CQT-LCNN, CQT-AASIST, and W2V2-GMM models. For track 3, we ensembled the CQT-LCNN, CQT-OFD, and AASIST models. Additionally, we applied the OpenMax algorithm to detect unknown deepfake attacks. Our best submissions achieved EERs of 23.44% and 21.26% on rounds 1 and 2 of track 1.2, respectively, and ranked 3rd in track 1.2.</p>
      </abstract>
      <kwd-group>
        <kwd>audio deep synthesis</kwd>
        <kwd>audio deepfake detection</kwd>
        <kwd>deep learning</kwd>
        <kwd>deepfake algorithm recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our main contributions are as follows:</p>
      <p>• Modified the representation extraction part of the AASIST model, utilizing W2V2 and CQT.</p>
      <p>• Experimented with models that ranked 3rd in <xref ref-type="bibr" rid="ref3">the previous ADD 2022</xref> challenge [4], such as LCNN, ResMax, and OFD.</p>
      <p>• Conducted experiments using the Gaussian mixture model (GMM) with the W2V2 feature, as well as traditional features such as MFCC and CQT.</p>
      <p>• Applied the OpenMax algorithm for track 3.</p>
      <sec id="sec-1-1">
        <title>2.3. Models</title>
        <sec id="sec-1-1-1">
          <title>For the first round of Track 1.2, the ensemble model</title>
          <p>submitted by our team comprised the CQT-LCNN and
CQT-AASIST models. For the second round, the
ensemble model consisted of the CQT-LCNN, CQT-AASIST,
and W2V2-GMM models. These submissions achieved
EERs of 23.44% and 21.26% on round 1 and 2 of Track 1.2,
respectively, and ranked 3rd in this track.</p>
          <p>In Track 3, we considered an ensemble of three models:
CQT-LCNN, CQT-OFD, and AASIST. To detect new attack
types, the OpenMax algorithm was applied. Our system
achieved an F1-score of 0.7205 for Track 3.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Feature engineering</title>
        <p>In this study, we conducted experiments utilizing four widely used audio feature extraction methods: CQT, mel-spectrogram, MFCC, and W2V2 [5]. Each method possesses distinct advantages and limitations, rendering it suitable for specific applications. CQT uses a constant Q factor to ensure higher frequency resolution at low frequencies and lower resolution at high frequencies, and it has demonstrated effectiveness in deepfake detection tasks. The mel-spectrogram is obtained by applying mel filterbanks to the power spectrum of the audio signal. MFCC is another popular feature extraction method used in speech processing and music analysis. W2V2 is a state-of-the-art speech recognition method that learns powerful representations from speech audio alone and achieves impressive results with significantly less labeled data than previous methods. The first-placed team in track 1 (deepfake detection track) of <xref ref-type="bibr" rid="ref3">the ADD 2022</xref> challenge demonstrated the usefulness of the W2V2 pretrained network [2]. By applying the discrete cosine transform (DCT) to the CQT or mel-spectrogram features, we obtain more compressed representations: Constant Q Cepstral Coefficients (CQCC) or MFCC. In deep learning scenarios, raw representations such as the mel-spectrogram and CQT often lead to higher accuracy. Thus, for our deep learning models, we opted for mel-spectrogram and CQT features rather than CQCC and MFCC features.</p>
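        <p>For illustration, the following sketch extracts the three spectral features with librosa; the sample rate, hop length, and bin counts shown are assumptions made for this example rather than the exact settings of our system.</p>
        <preformat>
# Hedged feature-extraction sketch (librosa); parameter values are illustrative.
import numpy as np
import librosa

def extract_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    # Constant-Q transform: constant Q factor, finer resolution at low frequencies.
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=96)))
    # Mel-spectrogram: mel filterbanks applied to the power spectrum.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80))
    # MFCC: DCT of the log mel-spectrogram, a more compressed representation.
    mfcc = librosa.feature.mfcc(S=mel, n_mfcc=20)
    return cqt, mel, mfcc
        </preformat>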
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data augmentation</title>
        <p>We explored several augmentation techniques such as mixup [6], SpecAugment [7], FFM [4], FilterAugment [8], and cutout [9]. These techniques had previously shown promise in improving performance in <xref ref-type="bibr" rid="ref3">the ADD 2022</xref> challenge [10]. However, in the context of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, incorporating these augmentation techniques did not yield substantial improvements in performance.</p>
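        <p>As an example of one such technique, the sketch below applies mixup to a batch of spectrograms and one-hot labels; the Beta parameter and tensor layout are assumptions made for illustration only.</p>
        <preformat>
# Hedged mixup sketch (PyTorch); alpha and the batch layout are illustrative.
import torch

def mixup(x, y, alpha=0.4):
    # x: (batch, channels, freq, time) spectrograms; y: one-hot labels.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
        </preformat>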
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Models</title>
        <sec id="sec-2-3-1">
          <title>2.3.1. LCNN model</title>
          <p>[Figure 1: (a) LCNN block; (b) LCNN model.]</p>
          <p>The efficacy of the LCNN model has been demonstrated in previous research through its notable performance in the ASVspoof 2017, 2019, and 2021 challenges [11, 12, 13]. Our implementation of the LCNN model, as depicted in Figure 1(b) [14], consists of 9 layers, akin to the Light CNN-9 model. However, we modified the architecture by substituting the fully connected layer of the original Light CNN-9 model [14] with a global average pooling layer, a batch normalization layer, and a dropout layer. In Track 1.2, the final dense layer of our LCNN model outputs two values, representing the labels “spoofing” and “genuine.” In Track 3, the output dense layer had a size of 7, representing the seven known deepfake algorithms, and was activated using the softmax activation function. Figure 1(a) describes the LCNN block, whose hyperparameters denote the filter size, the kernel size, and whether batch normalization is used. The block performs the MFM (Max-Feature-Map) operation using two convolution layers and optionally applies a batch normalization layer, indicated by the dashed block, when the batch normalization flag is set to 1.</p>
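          <p>A minimal sketch of the MFM operation used in the LCNN block is shown below; the channel counts, kernel size, and batch-normalization flag are illustrative assumptions, not the exact configuration of Figure 1.</p>
          <preformat>
# Hedged LCNN-block sketch with the Max-Feature-Map (MFM) operation (PyTorch).
import torch
import torch.nn as nn

class MFMConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, use_bn=True):
        super().__init__()
        # Produce 2*out_ch channels, then keep the element-wise maximum of the
        # two halves, which is equivalent to two parallel convolution branches.
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch) if use_bn else nn.Identity()

    def forward(self, x):
        a, b = torch.chunk(self.conv(x), 2, dim=1)
        return self.bn(torch.max(a, b))
          </preformat>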
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. AASIST model and our proposed AASIST variant</title>
          <p>AASIST is an extended version of RawGAT-ST [15] that is based on a graph neural network [16]. AASIST has achieved state-of-the-art performance on the ASVspoof 2019 challenge dataset for the logical access (LA) scenario.</p>
          <p>We propose modifications to the representation extraction part of the AASIST model. We conducted experiments in which this extraction part was replaced by either a W2V2 pretrained model or CQT features, as shown in Figure 2. In the figure, the upper component of the representation extraction part depicts the original AASIST model. The middle component represents the model utilizing W2V2, with fine-tuning of the last transformer layers of the W2V2 pretrained model. The lower component represents the model using CQT features.</p>
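          <p>A rough sketch of such a W2V2 front end is given below, using the Hugging Face transformers library; the checkpoint name, the number of fine-tuned layers, and the way the hidden states are handed to the back end are assumptions for illustration, not our exact implementation.</p>
          <preformat>
# Hedged W2V2 front-end sketch (transformers); checkpoint and layer choice are
# illustrative assumptions, and attribute paths may differ across library versions.
import torch
from transformers import Wav2Vec2Model

w2v2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-1b")
for p in w2v2.parameters():
    p.requires_grad = False                     # freeze the pretrained network ...
for p in w2v2.encoder.layers[-2:].parameters():
    p.requires_grad = True                      # ... except the last transformer layers

def w2v2_frontend(waveform):
    # waveform: (batch, samples) raw audio at 16 kHz.
    hidden = w2v2(waveform).last_hidden_state   # (batch, frames, dim)
    return hidden.transpose(1, 2)               # 2D, spectrogram-like representation
          </preformat>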
        </sec>
        <sec id="sec-2-3-3">
          <title>2.3.3. GMM model</title>
          <p>The Gaussian mixture model (GMM) is a probabilistic model that represents data as a combination of multiple Gaussian distributions [17]. During <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, we observed that the performance of deep learning models on the test set did not meet the anticipated level of success. This led us to consider the traditional machine-learning-based GMM, which has been widely employed in ASVspoof 2015 [18] and ASVspoof 2017 [19]. In addition, considering the need for simpler models to prevent overfitting, we recognized the GMM as a suitable way to model features extracted through W2V2 pretrained networks in a straightforward yet effective manner. We considered various features, such as MFCC, CQT, and W2V2, as inputs to the GMM model.</p>
        <sec id="sec-2-1-1">
          <title>In OFD model, each block divides the feature map</title>
          <p>into multiple parts along the frequency axis allowing for
overlap. In contrast, Non-OFD model partitions the
feature map along the frequency axis without any overlap.
Both models consist of six blocks, each characterized by
three hyperparameters: the number of splits, the
presence or absence of overlap, and the activation function.
In OFD model, all six blocks are split with overlap, while
in Non-OFD model, no overlaps occur between blocks.
The activation function can be either ReLU or MFM. For
instance, “OFD with (1, 2, 3, 4, 5, 6)-ReLU” refers
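          <p>A small sketch of the overlapped frequency split performed by an OFD-style block is given below; the overlap ratio and the band-width computation are assumptions for illustration.</p>
          <preformat>
# Hedged sketch of overlapped frequency splitting for an OFD-style block (PyTorch).
import torch

def split_frequency(x, n_splits, overlap=0.5):
    # x: (batch, channels, freq, time); returns a list of overlapping frequency bands.
    if n_splits in (0, 1):
        return [x]                                   # no split for this block
    freq = x.size(2)
    band = int(freq / (n_splits - (n_splits - 1) * overlap))
    step = int(band * (1.0 - overlap))
    return [x[:, :, i * step : i * step + band, :] for i in range(n_splits)]
          </preformat>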
        </sec>
        <sec id="sec-2-3-5">
          <title>2.3.5. Other models</title>
          <p>Additionally, we conducted experiments with several alternative methods, including ResMax [21, 22], BC-ResMax, and DDWS [23]. However, these methods exhibited accuracy comparable to the LCNN model, and due to limited time, we were unable to dedicate further investigation to these models.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.4. OpenMax for unknown attack detection</title>
        <p>OpenMax [24] is an algorithm designed for open set
recognition, specifically targeting the identification of
utterances belonging to the unknown class. The algorithm
consists of two steps: preparation and inference.</p>
        <p>During the preparation step, a model is trained using the known classes from the training set. Following the training phase, final-layer logit vectors (seven-dimensional) are computed for the correctly classified training samples. The mean vector μ_j of the logit vectors corresponding to each class j = 0, 1, …, 6 is computed, and the distance between the logit vector of each correctly classified training sample and the mean vector of its class is determined. A Weibull distribution is then fitted for each class using the libMR [25] FitHigh function on the η samples with the largest distance to the mean vector.</p>
        <p>In the inference step, the final-layer logit vectors are obtained for all test samples. For each logit vector v = (v_0, …, v_6), the probability ω_j of not belonging to class j is computed from the fitted Weibull distribution for every j = 0, 1, …, 6. The logit vector is then updated as</p>
        <p>ṽ = ( (1 − ω_0) v_0, …, (1 − ω_6) v_6, Σ_{j=0..6} ω_j v_j ),</p>
        <p>and the softmax of ṽ serves as the output of the OpenMax algorithm.</p>
        <p>To handle uncertain predictions, a threshold ε is set. For each test sample, if max_{j ∈ {0, …, 7}} softmax(ṽ)_j ≤ ε, or if the unknown class (j = 7) has the largest probability, then its predicted class is set to 7.</p>
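        <p>A compact sketch of the recalibration described above is given below; scipy's Weibull fit is used here as a stand-in for the libMR FitHigh routine, and the tail size is an assumed value.</p>
        <preformat>
# Hedged OpenMax sketch (numpy/scipy); scipy stands in for libMR FitHigh,
# and the tail size is an illustrative assumption.
import numpy as np
from scipy.stats import weibull_min

def fit_tail(distances, tail_size=20):
    # Fit a Weibull distribution to the largest distances from the class mean.
    tail = np.sort(distances)[-tail_size:]
    return weibull_min.fit(tail, floc=0)

def openmax(v, means, weibulls):
    # v: 7-dim logit vector; means[j]: class-mean logits; weibulls[j]: fitted params.
    omega = np.array([weibull_min.cdf(np.linalg.norm(v - means[j]), *weibulls[j])
                      for j in range(7)])
    v_known = (1.0 - omega) * v                  # rescaled known-class logits
    v_unknown = np.sum(omega * v)                # mass shifted to the unknown class
    v_tilde = np.append(v_known, v_unknown)
    e = np.exp(v_tilde - v_tilde.max())
    return e / e.sum()                           # softmax over the 8 classes
        </preformat>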
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>3.1.1. ADD 2023 challenge datasets</p>
        <p><xref ref-type="bibr" rid="ref2">The ADD 2023</xref> challenge consists of three tracks, and we describe the datasets for Track 1.2 and Track 3 [1]. Track 1.2 aims to detect fake audio, which refers to realistic and natural-sounding fake voice audio that can deceive deepfake detection models. This track is divided into two rounds, both featuring nearly identical detection tasks. Table 1 shows the number of samples in the training, development, and test sets (rounds 1 and 2). Track 3 aims to recognize deepfake speech algorithms. The training and development sets have seven labeled categories (0, 1, 2, ..., 6), one of which is real speech and the other six of which are fake speech algorithms; notably, which label corresponds to real speech is not disclosed. The test set has eight categories, but no label information is provided. Seven of them align with the “known” classes in the training and development sets, while the remaining category represents the unknown fake class, labeled 7. Table 2 shows the number of samples in the training, development, and test sets of the track 3 dataset.</p>
        <p>The ASVspoof 2019 challenge [26] focuses on TTS, VC, and replay spoofing attacks, and its dataset consists of logical access (LA) and physical access (PA) scenarios derived from the VCTK base corpus [27]. Our focus lies primarily on the LA data, which uses 17 TTS and VC systems to produce both genuine and fake speech samples. The dataset is partitioned into three subsets: training, development, and evaluation; the evaluation data contains approximately 71K utterances with unknown attacks.</p>
        <p>To assess the performance of our experimental models, we conducted evaluations on two databases: <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge dataset and the ASVspoof 2019 LA dataset. Model performance was evaluated using the equal error rate (EER), which indicates the point at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal. A lower EER value generally indicates better performance.</p>
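        <p>For reference, a minimal EER computation over a set of scores is sketched below; the convention that higher scores indicate genuine speech is an assumption of the sketch.</p>
        <preformat>
# Hedged EER sketch (numpy); assumes higher scores mean "genuine".
import numpy as np

def compute_eer(scores, labels):
    # labels: 1 for genuine, 0 for spoof; sweep a threshold over all observed scores.
    thresholds = np.sort(np.unique(scores))
    genuine, spoof = scores[labels == 1], scores[labels == 0]
    far = np.array([(spoof >= t).mean() for t in thresholds])          # false acceptance rate
    frr = np.array([1.0 - (genuine >= t).mean() for t in thresholds])  # false rejection rate
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
        </preformat>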
        <p>The CQT-LCNN model was trained using 9-second samples, a batch size of 16, and 10 epochs. To fit the 9-second length, audio signals longer than 9 seconds were trimmed, and signals shorter than 9 seconds were repeated from the beginning to match the desired length. For training the GMM model, the entire length of each audio signal was used for extracting MFCC and CQT features, while 13.67 seconds of audio were used for W2V2 feature extraction to match the fixed input length of the pretrained network, which is set to 246,000 samples. To simplify the structure of the GMM model, we assumed a diagonal covariance matrix.</p>
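        <p>The trimming and repetition described above can be written as a small helper; the 9-second target follows the description above, while the 16 kHz sample rate and the helper itself are illustrative assumptions.</p>
        <preformat>
# Hedged sketch of fixing an utterance to a target length (numpy).
import numpy as np

def fix_length(y, target_len=9 * 16000):
    if y.shape[0] >= target_len:
        return y[:target_len]                    # trim signals longer than the target
    n_repeat = int(np.ceil(target_len / y.shape[0]))
    return np.tile(y, n_repeat)[:target_len]     # repeat short signals from the beginning
        </preformat>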
        <p>To stabilize the convergence of the model parameters, the learning rate was initially set to 1e-3 and subsequently reduced to 1e-5 using a sigmoidal decay function. For the ASVspoof 2019 dataset, we trained the models using only the training data. However, for the models submitted to the challenge, some sub-models were trained using both the training and development sets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Experimental results on ADD 2023 dataset for track 1.2</title>
        <p>Many of the models exhibited favorable performance on the training and development data. However, notable declines in performance were observed when evaluating the models on the actual test data, indicating that the models suffer from overfitting to the training and development data. To address this issue, techniques such as data augmentation and reducing the model size can be considered.</p>
        <p>3.3.1. Use of data augmentation techniques</p>
        <p>We utilized various data augmentation techniques such as mixup, FFM (LF, HF, and RF), FilterAugment (FA) [8], and cutout. However, the application of data augmentation did not yield substantial improvement when evaluated on the test data. Table 3 shows the results of applying data augmentation to the CQT-LCNN and BC-ResMax models.</p>
        <p>It was difficult to draw conclusions about the effectiveness of data augmentation based on the experimental results.</p>
        <p>3.3.2. GMM-based models</p>
        <p>Deep learning-based models were observed to suffer from serious overfitting in terms of test set accuracy. To address this concern, we experimented with GMM-based models, which had demonstrated success in prior ASVspoof challenges <xref ref-type="bibr" rid="ref48">(2015 and 2017)</xref> and are known for their ability to handle overfitting. We created deepfake detection models using GMMs with W2V2, CQT, and MFCC features. For the W2V2 features, we experimented with two W2V2 pretrained models: one trained on the 960 hours of audio of the LibriSpeech corpus (LS-960) and the other trained on the 60k hours of LibriVox data (LV-60k). We varied the number-of-components parameter of the W2V2-LV60k-GMM, exploring values of 16, 32, 64, and 128. Table 4 presents the experimental results. W2V2-LV60k-GMM demonstrated better performance than W2V2-LS960-GMM in terms of both test EER and dev EER. Although the W2V2-LV60k-GMM models exhibited higher dev EER than the other deep learning-based methods, they yielded better results in terms of test EER; the simpler structure of the GMM-based model appeared to mitigate the overfitting issue to some extent. Additionally, we conducted experiments with CQT-GMM and MFCC-GMM models, but W2V2-LV60k-GMM exhibited the best performance.</p>
        <p>As in RawNet2, the Raw-AASIST model uses fixed sinc filters to extract features from raw audio, while the W2V2-AASIST and CQT-AASIST models substitute the fixed sinc filters with W2V2 and CQT, respectively. We used the XLS-R (1B) version as the pretrained model for W2V2-AASIST [28]. Table 5 presents the performance of the three AASIST-based models and compares them across different numbers of training epochs. Training the models for many epochs led to overfitting on the test data, resulting in a decrease in the dev EER but an increase in the test EER. The CQT-AASIST (10ep) model refers to the model trained for 10 epochs. Although models trained for more epochs exhibited lower dev EER, overfitting was evident in the test EER. Therefore, despite the higher dev EER, we chose to use models trained for a small number of epochs, specifically between 5 and 10, for the AASIST-based models.</p>
        <p>[Table 5: Model performance comparison for AASIST-based models (Dev. EER / Test EER R1 / Test EER R2): Raw-AASIST 0.79% / 37.36%; W2V2-AASIST 0.12% / 39.83%; CQT-AASIST (10ep); CQT-AASIST (20ep).]</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Experimental results on ASVspoof 2019 LA dataset</title>
        <p>Table 6 presents the performance of the models developed for <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge on the ASVspoof 2019 LA dataset (ASV2019). The EER (ASV) column indicates the performance evaluated on the evaluation data after training on the ASV2019 training data. The EER (ADD-R1) and EER (ADD-R2) columns indicate the performance on the test data of rounds 1 and 2 of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, respectively. Among our experimental models, the W2V2-GMM model showed the best performance on ADD-R2 with a 26.28% EER; however, it exhibited poor performance on ASV2019, achieving a 9.8% EER. The CQT-LCNN and CQT-AASIST models, which performed well on ADD-R1 and ADD-R2, achieved EERs of 1.93% and 2.36%, respectively, on ASV2019. The W2V2-AASIST model showed exceptional performance with a 0.21% EER on ASV2019, but performed poorly on ADD-R1. The MFCC-LCNN model demonstrated good performance on ADD-R2, but poor performance on ASV2019.</p>
        <p>[Table 6: Model performance comparison on the ASVspoof 2019 LA dataset and the ADD 2023 test data (rounds 1 and 2) for the CQT-LCNN, W2V2-GMM, AASIST [3], CQT-AASIST, W2V2-AASIST, CQT-OFD, and MFCC-LCNN models.]</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Experimental results on ADD 2023 dataset for track 3</title>
        <p>For Track 3, the three models were ensembled with the ratios specified in Table 8, and they were slightly modified to adapt them from spoofing detection to algorithm recognition. The CQT-LCNN model remains unchanged, with an output dense layer of size 7 and softmax activation. For the OFD model, the (2, 2, 0, 0, 0, 0)-ReLU configuration is used, and two additional dense layers with 128 and 64 nodes are added just before the final layer so that the features from the CNN backbone can be used for classifying algorithms. Lastly, the AASIST model [3] was used with its output dense layer modified to size 7 with softmax activation. The ensemble of these three models achieved a 0.7205 test F1-score.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Table 7 describes the details of the three top-performing single systems, including their EERs on the final evaluation data (R1 and R2) in track 1.2 of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge, as well as the EERs of our two ensemble systems. In Round 1, our final model consisted of an ensemble of the CQT-LCNN and CQT-AASIST models in a 1:1 ratio, achieving a 23.44% EER. In Round 2, we submitted an ensemble of the CQT-LCNN, CQT-AASIST, and W2V2-GMM models, weighted according to their respective accuracies, achieving a 21.26% EER.</p>
      <p>[Table 7: EER on the final evaluation data for track 1.2, covering the single systems LCNN (CQT), AASIST (CQT), and GMM (W2V2-LV60k) and the two ensemble systems.]</p>
      <p>This paper presents the models employed by our CAU_KU team participating in Track 1.2 and Track 3 of <xref ref-type="bibr" rid="ref2">the ADD 2023</xref> challenge. We utilized various deepfake detection models, including the W2V2 pretrained model and a modified AASIST architecture. In Track 1.2, Round 1, our submission consisted of an ensemble model comprising the CQT-LCNN and CQT-AASIST models, achieving a 23.44% EER. In Round 2, our submission was an ensemble model combining the CQT-LCNN, CQT-AASIST, and W2V2-GMM models, achieving a 21.26% EER. For Track 3, we developed an ensemble model using the CQT-LCNN, CQT-OFD, and AASIST models, achieving a 0.7205 F1-score.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>
        This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (RS-2023-00208284, 2020R1C1C1A01013020) and by the Institute for Information &amp; communications Technology Planning &amp; Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00033, 50%, Study on Quantum Security Evaluation of Cryptography based on Computational Quantum Complexity).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2022</year>
          :
          <article-title>the first audio [1</article-title>
          ]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          , C. Y.
          <article-title>deep synthesis detection challenge</article-title>
          , in: 2022 IEEE
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2023</year>
          :
          <article-title>Signal Processing (ICASSP), IEEE</article-title>
          , IEEE, Singapore,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>the second audio deepfake detection challenge</article-title>
          ,
          <source>in: 2022</source>
          , pp.
          <fpage>9216</fpage>
          -
          <lpage>9220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>IJCAI 2023 Workshop on Deepfake Audio Detection</source>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Malykh</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Kozlov,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>and Analysis (DADA</surname>
          </string-name>
          <year>2023</year>
          ), volume
          <volume>0</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>0</fpage>
          -
          <lpage>0</lpage>
          .
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudashev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shchemelinin</surname>
          </string-name>
          , Audio replay attack [2]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Martín-Doñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Álvarez</surname>
          </string-name>
          ,
          <article-title>The vicomtech au- detection with deep learning frameworks</article-title>
          ,
          <source>in: Proc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>dio deepfake detection system based on wav2vec2 Interspeech</source>
          <year>2017</year>
          , ISCA, Stockholm,
          <year>2017</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>for the 2022 add challenge</article-title>
          ,
          <source>in: ICASSP</source>
          <year>2022</year>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Tseren,
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>- 2022 IEEE International Conference on Acous- M. Volkova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Gorlanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kozlov</surname>
          </string-name>
          , STC
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          ,
          <source>Speech and Signal Processing (ICASSP)</source>
          ,
          <source>Antispoofing Systems for the ASVspoof2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>9241</fpage>
          -
          <lpage>9245</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP43922. Challenge, in
          <source>: Proc. Interspeech</source>
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <year>2022</year>
          .9747768. ISCA, Graz,
          <year>2019</year>
          , pp.
          <fpage>1033</fpage>
          -
          <lpage>1037</lpage>
          . URL: [3]
          <string-name>
            <given-names>J.-w.</given-names>
            <surname>Jung</surname>
          </string-name>
          , H.-S. Heo,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          , H.-j. Shim, J. S. http://dx.doi.org/10.21437/Interspeech.2019-
          <volume>1768</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>B.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Evans</surname>
          </string-name>
          , Aasist: Au- doi:10.21437/Interspeech.2019-
          <volume>1768</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>dio anti-spoofing using integrated spectro-temporal [13]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Svishchev</surname>
          </string-name>
          , M. Volkova,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>graph attention networks</article-title>
          , in: ICASSP 2022
          <string-name>
            <surname>- A. Chirkovskiy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kondratev</surname>
          </string-name>
          , G. Lavrentyeva,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          2022 IEEE International Conference on Acoustics,
          <source>STC Antispoofing Systems for the ASVspoof2021</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Speech and Signal Processing</surname>
          </string-name>
          (ICASSP), IEEE, Brno, Challenge, in
          <source>: Proc. 2021 Edition of the Automatic</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>6367</fpage>
          -
          <lpage>6371</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP43922. Speaker Verification and Spoofing Countermea-
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <year>2022</year>
          .9747766. sures Challenge, ISCA, Brno,
          <year>2021</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>67</lpage>
          . [4]
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          , S. Han,
          <string-name>
            <surname>S</surname>
          </string-name>
          . Oh, doi:10.21437/ASVSPOOF.2021-
          <volume>10</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>Low-quality fake audio detection through fre-</article-title>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>A light cnn for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>1st International Workshop on Deepfake Detection Transactions on Information Forensics and Security</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>for Audio</surname>
            <given-names>Multimedia</given-names>
          </string-name>
          , DDAM '22,
          <string-name>
            <surname>Association</surname>
            <given-names>for</given-names>
          </string-name>
          13 (
          <year>2018</year>
          )
          <fpage>2884</fpage>
          -
          <lpage>2896</lpage>
          . doi:
          <volume>10</volume>
          .1109/TIFS.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Computing</given-names>
            <surname>Machinery</surname>
          </string-name>
          , New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>2833032</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          9-
          <fpage>17</fpage>
          . URL: https://doi.org/10.1145/3552466.3556533. [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>doi:10.1145/3552466</source>
          .3556533. A.
          <string-name>
            <surname>Larcher</surname>
          </string-name>
          ,
          <article-title>End-to-end anti-spoofing with rawnet2</article-title>
          , [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          , wav2vec in: ICASSP 2021
          <article-title>-</article-title>
          2021 IEEE International Confer-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>2.0: A framework for self-supervised learning ence on Acoustics, Speech</article-title>
          and Signal Processing
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>of speech representations</article-title>
          , in: H. Larochelle, (ICASSP), IEEE,
          <year>2021</year>
          , pp.
          <fpage>6369</fpage>
          -
          <lpage>6373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.), [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Veličković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Casanova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Romero,
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Systems</surname>
          </string-name>
          , volume
          <volume>33</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2020</year>
          , preprint arXiv:
          <volume>1710</volume>
          .10903 (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          pp.
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          . URL: https://proceedings. [17]
          <string-name>
            <surname>C. M. Bishop</surname>
            ,
            <given-names>Pattern</given-names>
          </string-name>
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          and Machine
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>neurips.cc/paper_files/paper/2020/file/ Learning (Information Science and Statistics),</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          92d1e1eb1cd6f9fba3227870bb6d7f07-
          <fpage>Paper</fpage>
          .pdf. Springer-Verlag, Berlin, Heidelberg,
          <year>2006</year>
          . [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cisse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
            Lopez-Paz, [18]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Kinnunen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yamagishi</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          ,
          <year>2017</year>
          . C. Hanilci,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sizov</surname>
          </string-name>
          ,
          <year>Asvspoof 2015</year>
          : [7]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C.-
          <string-name>
            <surname>C. Chiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Zoph</surname>
          </string-name>
          ,
          <article-title>The first automatic speaker verification spoofing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>augmentation method for automatic speech recog- speech 2015, ISCA</article-title>
          , Dresden,
          <year>2015</year>
          , pp.
          <fpage>2037</fpage>
          -
          <lpage>2041</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          nition, in
          <source>: Proc. Interspeech</source>
          <year>2019</year>
          , ISCA, Graz,
          <year>2019</year>
          , [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Delgado</surname>
          </string-name>
          , M. Todisco,
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          pp.
          <fpage>2613</fpage>
          -
          <lpage>2617</lpage>
          . N.
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yamagishi</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , The asvspoof
          <year>2017</year>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <surname>Filteraugment:</surname>
          </string-name>
          <article-title>An challenge: Assessing the limits of replay spoofing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>acoustic environmental data augmentation method, attack detection</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2017</year>
          , ISCA,
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>in: ICASSP</source>
          <year>2022</year>
          -2022 IEEE International Confer- Stockholm,
          <year>2017</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>ence on Acoustics, Speech and Signal Processing</source>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          , Overlapped frequency-
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>(ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>4308</fpage>
          -
          <lpage>4312</lpage>
          .
          <article-title>distributed network: Frequency-aware voice spoof[9] T</article-title>
          . DeVries, G. W. Taylor,
          <article-title>Improved regularization of ing countermeasure</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <article-title>convolutional neural networks with cutout</article-title>
          ,
          <year>2017</year>
          . ISCA, Incheon,
          <year>2022</year>
          , pp.
          <fpage>3558</fpage>
          -
          <lpage>3562</lpage>
          . [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nie</surname>
          </string-name>
          , H. Ma,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          , [21]
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Huh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <given-names>Max</given-names>
            <surname>Feature Map</surname>
          </string-name>
          , in: 25th International Confer-
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Society</surname>
          </string-name>
          , Milan,
          <year>2021</year>
          , pp.
          <fpage>4837</fpage>
          -
          <lpage>4844</lpage>
          . [22]
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kwag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , H.-J.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>convolution</surname>
          </string-name>
          , IEEE Access (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>ACCESS.</surname>
          </string-name>
          <year>2023</year>
          .
          <volume>3275790</volume>
          . [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Kwak</surname>
          </string-name>
          , Light-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>26th International Conference on Pattern Recog-</mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Quebec</surname>
          </string-name>
          ,
          <year>2022</year>
          , pp.
          <fpage>477</fpage>
          -
          <lpage>483</lpage>
          . [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bendale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Boult</surname>
          </string-name>
          , Towards open set deep net-
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>works</surname>
          </string-name>
          ,
          <source>2016 IEEE Conference on Computer Vision</source>
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <string-name>
            <surname>and Pattern Recognition (CVPR</surname>
          </string-name>
          ) (
          <year>2015</year>
          )
          <fpage>1563</fpage>
          -
          <lpage>1572</lpage>
          . [25]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Micheals</surname>
          </string-name>
          , T. E. Boult,
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <source>Analysis and Machine Intelligence</source>
          <volume>33</volume>
          (
          <year>2011</year>
          )
          <fpage>1689</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          1695. doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2011</year>
          .
          <volume>54</volume>
          . [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vestman</surname>
          </string-name>
          , M. Sahidul-
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          2019:
          <article-title>Future Horizons in Spoofed</article-title>
          and Fake
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <given-names>Audio</given-names>
            <surname>Detection</surname>
          </string-name>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>ISCA</surname>
          </string-name>
          , Graz,
          <year>2019</year>
          , pp.
          <fpage>1008</fpage>
          -
          <lpage>1012</lpage>
          . URL: http://
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          dx.doi.org/10.21437/Interspeech.2019-
          <fpage>2249</fpage>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          21437/Interspeech.2019-
          <volume>2249</volume>
          . [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Veaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>MacDonald</surname>
          </string-name>
          , et al.,
          <source>Cstr</source>
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>voice cloning toolkit</source>
          ,
          <year>2017</year>
          . URL: https://datashare.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          ed.ac.uk/handle/10283/2651. [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <source>arXiv:2111.09296</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>