<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jingze Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuxiang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhuo Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zengqiang Shang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>WenChao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengyuan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics</institution>
          ,
          <addr-line>CAS, No.21 North 4th Ring West Road, Haidian District, Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.1 Yanqihu East Road, Huairou District, Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>89</fpage>
      <lpage>94</lpage>
      <abstract>
        <p>The development of deep speech generation technology has increased the risk of people being exposed to malicious or misleading information. From a defensive perspective, merely distinguishing between genuine and fake utterances is not enough. At the vocoder level, the artifacts in different frequency bands make it possible to distinguish between different synthesis methods. A reliable model should not only classify synthesis algorithms correctly, but also be able to identify samples that have not been seen. The second Audio Deepfake Detection Challenge (ADD2023) set up Track3 (Deepfake Algorithms Recognition) to simulate such a scenario. The challenge motivates researchers to construct systems that are robust enough for In-Distribution (ID) and Out-Of-Distribution (OOD) utterances. Cosine similarity based kNN distance is introduced in this work to separate unknown samples from known ones. Together with data augmentation methods and logits based model fusion, our system wins first place in ADD2023 Track3.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake Detection</kwd>
        <kwd>Algorithms Recognition</kwd>
        <kwd>Out-Of-Distribution Detection</kwd>
        <kwd>ADD Challenge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Nearest Neighbors (kNN) distance is adopted to detect OOD data in [17]. We find that, for the task of detecting spoofing algorithms, the kNN distance based on cosine similarity can effectively detect samples from OOD algorithms. Therefore, the kNN distance is introduced in this work to construct a class calibration module, which improves the performance of the basic models significantly. In addition, we investigate different data augmentation and model fusion methods. All these methods help us achieve first place in ADD2023 Track3.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>The proposed work is based on Track 3 (Deepfake Algorithms Recognition) of the ADD2023 Challenge. In this section, we investigate the basis of deepfake algorithms recognition, namely the artifacts introduced by vocoders, which are located in different frequency bands. In addition, OOD samples exist in the Track3 test set, so a kNN-based OOD detection method is also proposed to identify samples from the unknown counterfeit class.</p>
      <sec id="sec-2-1">
        <title>2.1. Vocoder Artifacts</title>
        <p>Before recognizing deepfake algorithms, what needs to be demonstrated is whether the utterances generated by different synthesis methods are distinguishable, and on what level they are distinguishable.</p>
        <p>The vocoder is a key component in the process of generating fake utterances: it converts acoustic features into sampling points, and its quality determines the quality of the generated utterance. Vocoder residual artifacts located in different frequency bands can serve as markers for deepfake algorithms. For instance, non-ideal upsampling filters leave aliasing artifacts in the high-frequency part [18]. Figure 1 (the average amplitude of different vocoders in each frequency band) shows the impact of different vocoders on utterances at the frequency level. We reconstruct the same batch of natural speech with different vocoders and calculate the average energy of the frames at each frequency point. From Figure 1, it can be seen that the artifacts carried by different vocoders are located in different frequency bands. Therefore, features that encode time-frequency information can be utilized to recognize deepfake algorithms.</p>
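        <p>As a rough illustration of this analysis, the following Python sketch averages the magnitude spectrum of a set of vocoded utterances per frequency bin; the STFT parameters are assumptions, not necessarily the values used for Figure 1.</p>
        <preformat>
import numpy as np
import librosa

def average_band_energy(wav_paths, n_fft=1024, hop_length=256):
    """Average magnitude per frequency bin over all frames of a set of
    utterances (illustrative sketch of the analysis behind Figure 1)."""
    acc = np.zeros(n_fft // 2 + 1)
    frames = 0
    for path in wav_paths:
        y, _ = librosa.load(path, sr=None)                       # load one utterance
        spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
        acc += spec.sum(axis=1)                                   # accumulate per-bin magnitude
        frames += spec.shape[1]
    return acc / frames                                           # one value per frequency bin
        </preformat>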
      </sec>
      <sec id="sec-2-2">
        <title>2.2. kNN-based OOD Detection</title>
        <p>The proposed kNN-based OOD detection method is a distance-based method, which leverages the distance between embeddings extracted by trained DNN-based models. The basis of the proposed method is an intuitive assumption that samples from the same class are closer in distance, while samples from different classes are farther apart.</p>
        <p>We denote the sets of In-Distribution (ID) data and Out-Of-Distribution (OOD) data as D_in and D_out, respectively. The purpose of the algorithm is to determine whether a sample x ∈ X, where X denotes the input space, comes from D_in or D_out. For this binary classification task, a direct solution is to set a mapping function S(x) and a threshold λ:</p>
        <disp-formula>
          <tex-math>x \in \mathcal{D}_{\mathrm{out}} \ \text{if} \ S(x) \ge \lambda, \qquad x \in \mathcal{D}_{\mathrm{in}} \ \text{if} \ S(x) &lt; \lambda .</tex-math>
        </disp-formula>
        <p>Based on the assumption that samples from different classes are farther apart, in this work we leverage the k-th nearest neighbor (kNN) distance as the output of the mapping function S(x), inspired by [17]. Compared to the 1st nearest neighbor (1NN) distance, the kNN distance under an appropriate k-value is less susceptible to noisy samples. Cosine similarity is adopted to calculate the distance between the feature embeddings and is defined as</p>
        <disp-formula>
          <tex-math>S_{i,j} = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert},</tex-math>
        </disp-formula>
        <p>where e_i and e_j are the embeddings of utterances extracted by the models. Figure 2 shows the density of the kNN distance between embeddings for ID data and OOD data, where the ID data comes from one known class of the ADD2023 Track3 training set and the OOD data comes from the other classes. The kNN cosine distance of ID data is smaller than that of OOD data. Therefore, a threshold-based criterion can be used to determine whether an input utterance is OOD. The pipeline of the method can be summarized as follows: (1) train a multi-class DNN-based classifier with the training dataset D_train; (2) use the trained model to pre-classify the test set D_test; (3) extract the feature embeddings of each sample from D_train and D_test; (4) select an appropriate k-value, calculate the kNN cosine distance within D_train for each class, and estimate a threshold; (5) calculate the kNN cosine distance between D_test and D_train for each class, and attribute the OOD samples to a new unknown class based on the threshold-based criterion.</p>
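        <p>A minimal sketch of step (5) is given below, assuming the embeddings of one known class from D_train form a reference bank and that the k-value and threshold have already been estimated from the training data.</p>
        <preformat>
import numpy as np

def knn_cosine_distance(query, bank, k):
    """k-th smallest cosine distance from each query embedding to a bank of
    embeddings from one known class (shapes: query (n, d), bank (m, d))."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    dist = 1.0 - q @ b.T                        # cosine distance matrix, shape (n, m)
    return np.sort(dist, axis=1)[:, k - 1]      # k-th nearest-neighbor distance

def flag_ood(query, bank, k, threshold):
    # Samples whose kNN cosine distance exceeds the threshold are attributed
    # to the new unknown class.
    return knn_cosine_distance(query, bank, k) > threshold
        </preformat>
      </sec>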
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Metrics</title>
        <sec id="sec-3-1-1">
          <p>We used the training, development and test datasets of ADD2023 Challenge Track 3 (Deepfake Algorithms Recognition) [14] to validate our work. The training and development sets include 6 types of counterfeit speech generated with different deepfake algorithms and 1 type of genuine speech. The test set includes the 7 classes from the training and development sets, and an unknown counterfeit speech class. The detailed information about the dataset is shown in Table 1.</p>
          <p>The F1-score, defined as F1 = 2PR/(P+R), is adopted as the evaluation metric, where P and R represent precision and recall, respectively, averaged over the N classes:</p>
          <disp-formula>
            <tex-math>P = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i}, \qquad R = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i}.</tex-math>
          </disp-formula>
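          <p>For concreteness, a small sketch of these class-averaged metrics is shown below; it is illustrative only and assumes integer class labels.</p>
          <preformat>
import numpy as np

def macro_precision_recall_f1(y_true, y_pred, num_classes):
    """Class-averaged precision and recall, and the resulting F1-score."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    precision = np.mean(tp / np.maximum(tp + fp, 1))   # guard against empty classes
    recall = np.mean(tp / np.maximum(tp + fn, 1))
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
          </preformat>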
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Augmentation</title>
        <p>To augment the training data, we utilized a data augmentation method that is common in speech spoofing detection tasks: adding noise and reverberation from the MUSAN [19] and RIRs [20] datasets to the original speech. In addition, we add acoustic scenes as additive noise to improve the robustness of the methods under various noisy scenarios. The acoustic scenes are randomly selected from the TAU Urban Acoustic Scenes database [21].</p>
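        <p>A minimal sketch of these two operations, assuming noise segments and room impulse responses have already been loaded as NumPy arrays and that the SNR is chosen by the training pipeline, is given below.</p>
        <preformat>
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise segment (e.g. from MUSAN or TAU scenes) into speech at a
    given signal-to-noise ratio in dB."""
    noise = np.resize(noise, speech.shape)               # repeat/trim to speech length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response (e.g. from the RIRs
    dataset) and keep the original length."""
    return np.convolve(speech, rir)[: len(speech)]
        </preformat>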
        <p>Since ADD2023 Track 3 includes OOD data, it is necessary to mitigate the common issue of overconfidence in deep neural networks. Therefore, we also introduce CutMix [<xref ref-type="bibr" rid="ref22">22</xref>] as a data augmentation method. The operation of CutMix can be described as</p>
        <disp-formula>
          <tex-math>\tilde{x} = M \odot x_A + (\mathbf{1}_{F \times T} - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B,</tex-math>
        </disp-formula>
        <p>where x_A and x_B ∈ R^{F×T} are two-dimensional time-frequency features extracted from utterances randomly selected from the training set, and y_A and y_B are the labels of the selected samples. M ∈ {0, 1}^{F×T} denotes a binary mask indicating where to drop out and fill in from the two features, 1_{F×T} is a binary mask filled with ones, and ⊙ is element-wise multiplication. (x̃, ỹ) denotes the newly generated training sample. CutMix cuts and pastes two speech samples from different classes at the two-dimensional time-frequency feature level, allowing the DNN model to learn a better decision boundary. In addition, CutMix can improve the model’s ability to distinguish OOD data [<xref ref-type="bibr" rid="ref22">22</xref>].</p>
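        <p>A brief sketch of CutMix applied to a pair of time-frequency features is shown below; the Beta(1, 1) mixing ratio and the rectangular region sampling follow the common CutMix recipe and are assumptions rather than the exact settings used in this work.</p>
        <preformat>
import numpy as np

def cutmix(x_a, x_b, y_a, y_b, rng=None):
    """CutMix for (F, T) time-frequency features with one-hot labels."""
    rng = rng or np.random.default_rng()
    f_dim, t_dim = x_a.shape
    lam = rng.beta(1.0, 1.0)                             # combination ratio
    cut_f = int(f_dim * np.sqrt(1.0 - lam))              # size of the replaced region
    cut_t = int(t_dim * np.sqrt(1.0 - lam))
    f0 = rng.integers(0, f_dim - cut_f + 1)
    t0 = rng.integers(0, t_dim - cut_t + 1)

    mask = np.ones((f_dim, t_dim), dtype=float)          # M: 1 keeps x_a, 0 takes x_b
    mask[f0:f0 + cut_f, t0:t0 + cut_t] = 0.0

    x_new = mask * x_a + (1.0 - mask) * x_b              # x~ = M (*) x_A + (1 - M) (*) x_B
    lam = mask.mean()                                     # ratio from the actual kept area
    y_new = lam * y_a + (1.0 - lam) * y_b                 # y~ = lam * y_A + (1 - lam) * y_B
    return x_new, y_new
        </preformat>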
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Architecture</title>
        <sec id="sec-3-3-1">
          <p>Since the method introduced in this work, detecting OOD data based on kNN, is model-agnostic, we train various model architectures. By doing so, the complementarity between different models can be exploited through model fusion in order to enhance performance.</p>
          <p>Similar to the traditional pipeline of speech spoofing detection tasks, in this deepfake algorithm recognition task we divide the model into a front-end for feature extraction and a back-end for classification. For the front-end, we choose a hand-crafted feature, STFT, and an unsupervised pre-trained feature extractor, Wav2Vec2 [23]. For the back-end, three kinds of model architectures are adopted: SENet [4], LCNN-LSTM [24] and TDNN [25]. SENet is an integration of ResNet with the squeeze-and-excitation (SE) [26] block; SENet18 and SENet34, which differ in the number of blocks, are both adopted in our work.</p>
          <p>The STFT feature is a two-dimensional time-frequency feature, so convolution-based models can learn the patterns that exist in both dimensions. Although Wav2Vec2 also extracts two-dimensional features from an utterance, the features at each time frame are context representations rather than patterns that can be learned by convolutional kernels. Therefore, SENet-based back-ends are cascaded to the STFT front-end, while LCNN-LSTM and TDNN, which contain structures able to extract temporal information, are cascaded to the Wav2Vec2-based front-end.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training Strategy</title>
        <p>All DNN-based models are trained with the Adam optimizer [27], adopted with β_1 = 0.9, β_2 = 0.9, ε = 10^-8 and weight decay 10^-4. The angular margin based softmax loss (A-softmax) [28] is adopted as the loss function to be optimized. For the models with the STFT-based front-end, the learning rate is initialized as 3 × 10^-4; as a scheduler, StepLR is used with a step size of 10 epochs and a coefficient of 0.5. For the Wav2Vec2-based feature extractor, the learning rate is fixed at 10^-6. All models are trained for 100 epochs, and the model with the lowest loss on the dev set is selected as the final model.</p>
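        <p>A rough PyTorch sketch of this optimizer and scheduler setup is shown below; the model and the A-softmax loss are placeholders, and whether a scheduler is applied to the Wav2Vec2 extractor is not specified here.</p>
        <preformat>
import torch

def build_optimizer(model, stft_frontend=True):
    """Optimizer/scheduler setup following the hyperparameters above."""
    lr = 3e-4 if stft_frontend else 1e-6           # STFT-based models vs. Wav2Vec2 extractor
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.9),
        eps=1e-8,
        weight_decay=1e-4,
    )
    # Halve the learning rate every 10 epochs (used for the STFT-based models).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    return optimizer, scheduler
        </preformat>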
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Model Fusion</title>
        <sec id="sec-3-5-1">
          <p>Since the proposed OOD detection method is model-agnostic, we introduce a logits-based model fusion method to leverage the complementarity between different models. Logits output by different models are weighted and then added. For the samples that are identified as OOD data by the kNN-based detector, the original maximum logit value is assigned to the new unknown class, and the logit of the original maximum class index is set to zero.</p>
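          <p>A minimal sketch of this fusion rule is given below; the per-model weights and the OOD flags from the kNN-based detector are inputs, and appending an extra column for the unknown class is an implementation detail assumed here.</p>
          <preformat>
import numpy as np

def fuse_logits(logits_list, weights, is_ood):
    """Weighted logit fusion with kNN-based reassignment to the unknown class.

    logits_list: per-model arrays of shape (num_samples, num_known_classes).
    is_ood: boolean array from the kNN-based detector, one entry per sample.
    Returns class indices, where index num_known_classes denotes the unknown class.
    """
    fused = sum(w * l for w, l in zip(weights, logits_list))      # weighted sum of logits
    n, c = fused.shape
    out = np.concatenate([fused, np.full((n, 1), -np.inf)], axis=1)
    for i in np.flatnonzero(np.asarray(is_ood)):
        top = int(np.argmax(fused[i]))
        out[i, c] = fused[i, top]        # give the unknown class the original max logit
        out[i, top] = 0.0                # zero out the original max class
    return out.argmax(axis=1)
          </preformat>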
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result and Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Results of Data Augmentation</title>
        <sec id="sec-4-1-1">
          <title>Two data augmentation methods are introduced in this work, namely additive noise and cutmix. Under the same DNN model (STFT+SENet34), the results of data augmentation are shown in Table 2. It should be noted that all</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Results of Model Fusion</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <title>4.4. Results of Submitted System</title>
        <sec id="sec-5-1-1">
          <title>This paper describes the system developed for ADD2023</title>
          <p>Track3. Five single-models with diferent front-ends and
back-ends are constructed as basic classifiers for the
deepfake algorithms recognition task. kNN distance is
efective in separating ID samples and OOD samples.
Therefore, an OOD detection module based on kNN distance
is introduced and improve the performance of
singlemodels significantly. Introducing additive noise during
the training process makes single-model more robust.
After fusing these models at the logits level, our final
system achieves first place in ADD2023 Track3.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>This work is partially supported by the National Key Research and Development Program of China (No. 2021YFC3320103).</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Interspeech.</surname>
          </string-name>
          <year>2022</year>
          -
          <volume>129</volume>
          . [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <year>wav2vec</year>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H. Ma, T. Wang,
          <volume>2</volume>
          .0:
          <article-title>A framework for self-supervised learning of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>tecting vocoder fingerprints of fake audio</article-title>
          ,
          <source>in: mation Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Proceedings of the 1st International Workshop on</source>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          , A comparative study
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery,
          <article-title>New synthetic speech detection</article-title>
          ,
          <source>in: Proc. Inter-</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>York</surname>
          </string-name>
          , NY, USA,
          <year>2022</year>
          , p.
          <fpage>61</fpage>
          -
          <lpage>68</lpage>
          . URL: https://doi.org/ speech
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>4259</fpage>
          -
          <lpage>4263</lpage>
          . doi:
          <volume>10</volume>
          .21437/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          10.1145/3552466.3556525. doi:
          <volume>10</volume>
          .1145/3552466. Interspeech.
          <year>2021</year>
          -
          <volume>702</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          3556525. [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Desplanques</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thienpondt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Demuynck</surname>
          </string-name>
          , [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chenglong</surname>
          </string-name>
          Ecapa-tdnn:
          <article-title>Emphasized channel attention</article-title>
          , propa-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , ifcation,
          <source>Proc. Interspeech</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>3830</fpage>
          -
          <lpage>3834</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Add</source>
          <year>2023</year>
          :
          <article-title>the second audio deepfake detection chal-</article-title>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <surname>Squeeze-</surname>
          </string-name>
          and
          <string-name>
            <surname>-excitation</surname>
          </string-name>
          net-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          lenge, in: IJCAI 2023 Workshop on Deepfake Audio works,
          <source>in: Proceedings of the IEEE conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Detection</surname>
            and
            <given-names>Analysis (DADA</given-names>
          </string-name>
          <year>2023</year>
          ),
          <year>2023</year>
          . computer vision and pattern recognition,
          <year>2018</year>
          , pp. [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          , Enhancing the reliabil-
          <volume>7132</volume>
          -7141.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>ity of out-of-distribution image detection in neu-</article-title>
          [27]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>ral networks</article-title>
          , in: 6th International Conference on optimization, in: Y. Bengio, Y. LeCun (Eds.),
          <fpage>3rd</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Learning</given-names>
            <surname>Representations</surname>
          </string-name>
          ,
          <source>ICLR</source>
          <year>2018</year>
          , Vancouver, International Conference on Learning Representa-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>BC</surname>
          </string-name>
          , Canada, April 30 - May 3,
          <year>2018</year>
          , Conference tions,
          <source>ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-9,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Track</given-names>
            <surname>Proceedings</surname>
          </string-name>
          , OpenReview.net,
          <year>2018</year>
          . URL:
          <year>2015</year>
          , Conference Track Proceedings,
          <year>2015</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          https://openreview.net/forum?id=H1VGkIxRZ. http://arxiv.org/abs/1412.6980. [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bendale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Boult</surname>
          </string-name>
          , Towards open set deep [28]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Raj</surname>
          </string-name>
          , L. Song,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>on computer vision and pattern recognition, 2016, recognition</article-title>
          , in: Proceedings of the IEEE conference
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          pp.
          <fpage>1563</fpage>
          -
          <lpage>1572</lpage>
          .
          <article-title>on computer vision</article-title>
          and pattern recognition,
          <year>2017</year>
          , [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Out-</surname>
          </string-name>
          of-distribution pp.
          <fpage>212</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>20827</fpage>
          -
          <lpage>20840</lpage>
          . [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Wang,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          , Analy-
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>form generation models, Applied Acoustics 203</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          (
          <year>2023</year>
          )
          <fpage>109183</fpage>
          . [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , MUSAN:
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>abs/1510</source>
          .08484 (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          1510.08484. arXiv:
          <volume>1510</volume>
          .
          <fpage>08484</fpage>
          . [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peddinti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Seltzer</surname>
          </string-name>
          , S. Khudan-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>speech for robust speech recognition</article-title>
          , in: 2017 IEEE
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Signal</given-names>
            <surname>Processing (ICASSP),</surname>
          </string-name>
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>5220</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          5224. [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mesaros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heittola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          , A multi-
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>tion of Acoustic Scenes and Events 2018 Workshop</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>(DCASE2018)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>13</lpage>
          . [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          , D. Han,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>puter vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6023</fpage>
          -
          <lpage>6032</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>