
Single domain generalization for audio deepfake detection

Yuankun Xie

Haonan Cheng

Yutian Wang

Long Ye

State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China

2023

58 63

Audio deepfake detection (ADD) is a prominent problem in artificial intelligence. With diverse spoofing attacks emerging continually, generalization of ADD algorithms to unknown domains and robustness in complex environments have become key points for this field. However, when only limited and low-quality learning data is available, as in the case of ADD 2023 Challenge Track 1.2, achieving good generalization and robustness remains an open issue. In this paper, we propose a Shuffle Mix Aggregation and Separation Domain Generalization (SM-ASDG) method which enables single-domain generalization. Specifically, we first design a pre-processing module to improve the robustness of the method against low-quality data. Next, we split the single domain into multiple data domains via the proposed data shuffle module. Finally, a well-generalized feature space is constructed through the designed feature extractor and MixStyle domain classifier. The proposed SM-ASDG obtains a weighted equal error rate (WEER) of 23.17% on ADD Challenge Track 1.2, which ranks in the top 5 of the challenge.

Keywords: audio deepfake detection, single domain generalization, self-supervised representation, ADD challenge

1. Introduction

Audio deepfake detection (ADD) is an important yet challenging task, which has raised several concerns due to its high societal impact [1, 2, 3]. This task aims to accurately classify real and fake audio, and one of its main challenges is to remain accurate in the face of unknown spoofing methods or low-quality audio.

In recent years, several works [4, 5] have achieved promising results on intra-domain datasets. However, the performance of these methods degrades significantly when they are extended to cross-domain scenarios [6]. This is mainly because these methods do not take sufficient account of unknown domains and degraded audio quality. Consequently, generalization and robustness have become two key concerns for ADD.

To address generalization and robustness issues, some methods [2, 3] adopt data augmentation schemes to improve model performance by learning diverse audio features over a larger amount of data. Specifically, Piotr et al. [7] utilize a combination of three deepfake and spoofing datasets to increase training stability. However, larger datasets also lead to higher computational costs. Moreover, as forgery techniques are constantly updated, there are always unknown attack methods outside the domain. Therefore, it is not sufficient to rely on data augmentation, and further strategies for improving generalizability are required.

To this end, several methods propose the domain invariant representation learning (DIRL) strategy [8, 9, 10] in order to overcome the issue of generalizing to unseen target domains with limited source data. The DIRL strategy aims to reduce representation differences between multiple different source domains to ensure domain invariance. However, for situations where multiple source domains are not available, as in the case of the ADD 2023 Audio fake game (FG) Challenge [11] where there is only one acceptable training set, the DIRL strategy cannot be applied effectively. In addition, the performance of ADD methods degrades significantly when a large amount of noise, reverberation and other disturbances are mixed into the source domain data. Therefore, how to construct ADD models with good generalizability and robustness based on single-domain, low-quality data is an open problem that remains to be explored.

In this paper, we introduce a novel Shuffle Mix Aggregation and Separation Domain Generalization (SM-ASDG) method for single-domain ADD. The key idea of our approach is the assumption that, in an ideal classification feature space, the data distribution of real audio can be clustered in a single set, while the data distribution of fake audio should be more scattered. This is because different types of attacks have a larger impact on spoofed audio, although different recording devices or channels also have some impact on real audio. Based on this idea, we propose a modified DIRL strategy that can be applied to a single source domain. To be specific, the proposed SM-ASDG contains a total of four modules, namely pre-processing, data shuffle, feature extractor and MixStyle domain classifier. First, the pre-processing module contains three carefully designed pre-processing strategies to eliminate the effects of noise and other factors on the model and to improve the robustness of the algorithm. Second, the data shuffle module is introduced to approximate a multi-source domain situation by splitting the single domain. Then, we construct a feature extractor based on W2V2-XLS-R [12]. Finally, we propose a MixStyle domain classifier that mixes feature statistics of training samples across source domains. By this means, the model can diversify the style information at the bottom layers of the network. Our proposed SM-ASDG method achieves outstanding results in the ADD 2023 Audio FG Challenge, demonstrating the effectiveness of our method. In summary, our contributions are as follows:

• We propose SM-ASDG, a highly efficient audio deepfake detection method which achieves a top-5 rank in the ADD 2023 Challenge Track 1.2.
• A modified DIRL strategy is proposed for the situation where only a single source domain is available. The proposed domain generalization strategy can improve performance by 9% to 11% on different models.
• The effects of a series of pre-processing strategies are explored. In addition to common pre-processing methods such as noise addition and reverberation, we also explore the effect of silent frames on forgery identification performance.

IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis (DADA 2023), August 19, 2023, Macao, S.A.R.
* Corresponding author.
† These authors contributed equally.
Email: xieyuankun@cuc.edu.cn (Y. Xie); haonancheng@cuc.edu.cn (H. Cheng); wangyutian@cuc.edu.cn (Y. Wang); yelong@cuc.edu.cn (L. Ye)
URL: https://haonancheng.cn/ (H. Cheng)
ORCID: 0000-0002-8366-9011 (Y. Xie); 0000-0003-3407-4318 (H. Cheng)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

[Figure: Overview of the Shuffle Mix Aggregation and Separation Domain Generalization (SM-ASDG) pipeline. ADD 2023 Dataset → Pre-processing (Low-pass Filter, Amplitude Adjustment, Noise Augmentation) → Data Shuffle → Feature Extractor → MixStyle Domain Classifier → Feature Space (Triplet Loss, BCE Loss, Adversarial Loss) → Audio Score.]

2. Proposed Method

2.1. Preprocessing

To address the effect of codec variabilities, we first adopt a low-pass filter [13]. This is because, in complex speech scenarios, focusing on the low-frequency speech components can often make the model more effective. Specifically, we utilize a Chebyshev Type I low-pass filter to convert the original 16 kHz signal into a low-pass filtered signal. We set the order of the filter to 8, with the maximum ripple and the critical frequency set to 0.05 and 4 kHz, respectively.

We further adjust the amplitude of the signals, based on the observation that the amplitude of genuine speech differs from that of spoofed speech. In the training set, we observe that genuine speech has a higher amplitude than spoofed speech. This may cause the model to classify high-amplitude speech as genuine and low-amplitude speech as spoofed during inference. Thus, we compute the average amplitude of genuine and fake speech and increase the amplitude of each fake utterance in the training set to match the average amplitude of genuine speech, thereby equalizing their average amplitudes in the training process.

To enhance the robustness of the model in noisy conditions, we introduce a noise augmentation strategy in the pre-processing. We add noise from MUSAN [14] and reverberation from RIR [15] to the original speech, which is a highly effective strategy in speech recognition and speaker verification.
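The low-pass filtering and amplitude adjustment steps of the pre-processing module can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the ripple value 0.05 is expressed in dB, uses SciPy's `cheby1` design routine, and interprets "amplitude" as the mean absolute sample value. The function names `lowpass_cheby1` and `match_fake_amplitude` are hypothetical.

```python
import numpy as np
from scipy.signal import cheby1, sosfilt

SAMPLE_RATE = 16_000  # Hz, the sampling rate stated in the text


def lowpass_cheby1(wave, order=8, ripple_db=0.05, cutoff_hz=4_000.0, fs=SAMPLE_RATE):
    """Apply a Chebyshev Type I low-pass filter (ripple assumed to be in dB)."""
    # Second-order sections are numerically safer than (b, a) for order 8.
    sos = cheby1(order, ripple_db, cutoff_hz, btype="low", output="sos", fs=fs)
    return sosfilt(sos, wave)


def match_fake_amplitude(genuine, fake):
    """Scale every fake clip to the average amplitude of the genuine clips.

    Amplitude is taken as the mean absolute sample value (an assumption).
    """
    target = np.mean([np.mean(np.abs(g)) for g in genuine])
    scaled = []
    for clip in fake:
        level = np.mean(np.abs(clip))
        # Leave all-zero clips untouched to avoid dividing by zero.
        scaled.append(clip * (target / level) if level > 0 else clip)
    return scaled
```

With these settings, a tone well below 4 kHz passes almost unchanged while a tone at 6 kHz is strongly attenuated, which is the behaviour the section relies on.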

2.2. Data Shuffle

To improve the generalization ability of the model, we randomly divide the training data into three different domains. Randomly shuffling the domains enriches the style information of each domain, allowing the domain adversarial loss to aggregate all real speech from various styles. In the experimental section, we further verify that randomly shuffling the domains is more effective than directly grouping the validation set into one domain and the training set into two domains.
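A random split into three pseudo-domains can be sketched as below. The helper name `shuffle_into_domains`, the fixed seed, and the choice of near-equal domain sizes are assumptions made for illustration; the text only states that the split is random and yields three domains.

```python
import random


def shuffle_into_domains(items, num_domains=3, seed=0):
    """Randomly partition `items` into `num_domains` near-equal pseudo-domains."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Round-robin slicing keeps domain sizes within one item of each other.
    return [shuffled[d::num_domains] for d in range(num_domains)]


# Example: ten utterance IDs split into three domains.
domains = shuffle_into_domains(range(10))
```

Each utterance lands in exactly one domain, and the domain index can then serve as the label for the domain discriminator.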

2.3. Feature Extractor

We first extract features via a W2V2-based front-end, which is trained using a contrastive method with a masked feature encoder. The front-end consists of a feature encoder with seven CNN layers to process speech signals of different lengths, followed by a Transformer network with 24 layers, 16 attention heads, and an embedding size of 1024 to obtain context representations. Consequently, the last hidden states from the

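Table 1: Architecture of the MixStyle domain classifier (Module column; the ConvFilter column entries are missing from this copy).

Module
Input
MixStyle
Conv2d/MFM/BN
Conv2d/MFM/Pool/BN
Conv2d/MFM/BN
Conv2d/MFM/Pool
Conv2d/MFM/BN
Conv2d/MFM/BN
Conv2d/MFM/BN
Conv2d/MFM/Pool
Reshape/Transformer
Flatten/FC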

2.4. MixStyle Domain Classifier

After obtaining the W2V2 feature from the feature extractor, we propose a MixStyle domain classifier to generate the feature space by optimizing three different loss functions. The detailed architecture is described in Table 1 and is modified from the traditional LCNN [16]. In the architecture, MFM denotes the Max-Feature-Map layer, which selects the critical feature channels for the ADD task, and BN denotes Batch Normalization. After the MixStyle domain classifier, we obtain a feature space of shape (16, 512).

Through the mixing of training instance styles, we can implicitly synthesize novel domains, which increases the domain diversity of the source domains and ultimately improves the generalizability of the trained model. Given an input batch $x$, we first randomly choose a reference batch $\tilde{x}$ from $x$. Then, MixStyle computes the mixed feature statistics as follows:

$$\gamma_{mix} = \lambda\,\sigma(x) + (1-\lambda)\,\sigma(\tilde{x}), \qquad \beta_{mix} = \lambda\,\mu(x) + (1-\lambda)\,\mu(\tilde{x}), \tag{1}$$

where $\lambda$ is the weight sampled from the Beta distribution $\mathrm{Beta}(\alpha, \alpha)$. We set $\alpha$ to 0.1 in our paper. Then, the style-normalized feature is computed from the mixed feature statistics:

$$\mathrm{MixStyle}(x) = \gamma_{mix}\,\frac{x - \mu(x)}{\sigma(x)} + \beta_{mix}. \tag{2}$$
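The MixStyle operation described above can be sketched in NumPy as follows. This is an illustrative version under stated assumptions: features are shaped (batch, channels, height, width), statistics are taken per instance and channel over the two trailing axes, and the reference batch is a random permutation of the input batch. The `lam` and `perm` arguments are hypothetical hooks added only to make the behaviour easy to check.

```python
import numpy as np


def mixstyle(x, alpha=0.1, eps=1e-6, rng=None, lam=None, perm=None):
    """Mix per-instance feature statistics across a batch of shape (B, C, H, W)."""
    rng = np.random.default_rng() if rng is None else rng
    batch = x.shape[0]
    # Per-instance, per-channel mean and standard deviation.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    x_norm = (x - mu) / sigma
    if lam is None:
        # One mixing weight per instance, drawn from Beta(alpha, alpha).
        lam = rng.beta(alpha, alpha, size=(batch, 1, 1, 1))
    if perm is None:
        # The reference batch is the input batch in shuffled order.
        perm = rng.permutation(batch)
    gamma_mix = lam * sigma + (1 - lam) * sigma[perm]
    beta_mix = lam * mu + (1 - lam) * mu[perm]
    return gamma_mix * x_norm + beta_mix
```

With a weight of 1 the input is returned unchanged, and with a weight of 0 each instance takes on the mean and scale of its reference instance. A small α such as 0.1 concentrates the sampled weights near these two extremes.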

2.5. Loss Function

BCE loss. First, our main task is binary classification, which is to determine whether the features obtained are genuine or spoofed. We use several FC layers to reduce the feature dimension from 512 to 1 and compute the Binary Cross Entropy (BCE) loss for classification. It is worth mentioning that feature normalization and weight normalization are used in this process, which balance the numerical values of features and weights from speech signals across different domains and facilitate the convergence of the model.

Triplet loss. The idea behind our proposed ASDG strategy is that real speech from different domains should be aggregated, while spoofed speech should be separated. The triplet mining method is suitable for this goal, and the loss is defined as follows:

$$L_{trip} = \sum_{i} \left( \left\| f(x_i^{a}) - f(x_i^{r}) \right\|_2^2 - \left\| f(x_i^{a}) - f(x_i^{f}) \right\|_2^2 + m \right), \tag{3}$$

where $x_i^{a}$, $x_i^{r}$, and $x_i^{f}$ represent the anchor sample, real sample, and fake sample. By minimizing $L_{trip}$, the Euclidean distance between the anchor and the real sample decreases while the anchor moves further away from the fake sample. We set the margin $m$ to 0.1.
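The triplet objective can be illustrated with a few lines of NumPy. This sketch assumes embeddings are stacked row-wise with one triplet per row; the function name `triplet_loss` and the optional `clamp` flag are hypothetical. The formula in the text has no explicit hinge, so clamping at zero, which many triplet-loss implementations apply, is off by default here.

```python
import numpy as np


def triplet_loss(anchor, real, fake, margin=0.1, clamp=False):
    """Sum over triplets of ||a - r||^2 - ||a - f||^2 + margin."""
    d_real = np.sum((anchor - real) ** 2, axis=1)  # squared anchor-real distance
    d_fake = np.sum((anchor - fake) ** 2, axis=1)  # squared anchor-fake distance
    terms = d_real - d_fake + margin
    if clamp:
        # Optional hinge: triplets that already satisfy the margin contribute 0.
        terms = np.maximum(terms, 0.0)
    return float(np.sum(terms))
```

For example, an anchor at the origin with a real sample at distance 3 and a fake sample at distance 1 gives 9 - 1 + 0.1 = 8.1, a large penalty that pushes the real sample closer and the fake sample away.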

Adversarial loss. In the feature space, the distribution of real speech should be aggregated regardless of domain. Thus, we design a single-side domain discriminator with a Gradient Reversal Layer (GRL) [17]. Let $P(X_r)$ denote the distribution of real features and $Y_D$ denote the domain of $X_r$. The adversarial loss function of the domain discriminator is defined as follows:

$$\min_{D} \max_{G} L_{adv}(G, D) = -\mathbb{E}_{x \sim P(X_r),\, y \sim Y_D} \sum_{n=1}^{3} \mathbb{1}(n = y)\, \log D(G(x)), \tag{4}$$

where $y$ denotes the domain label, $G$ the feature generator, and $D$ the domain discriminator. The feature generator is trained to learn a robust feature that fools the domain discriminator, that is, to maximize $L_{adv}$. In the meantime, the discriminator is trained to identify the feature domain by minimizing $L_{adv}$. To achieve this goal, we use the GRL, which reverses the gradient during backpropagation by multiplying it by a negative dynamic coefficient. This makes the discriminator unable to identify the domain of the real feature, which leads to the aggregation of genuine speech in the feature space without being divided by domains.
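The action of the GRL can be written compactly. The block below is a sketch in the notation of Ganin and Lempitsky [17]; the symbols $R_\eta$, $\eta$ and $p$, as well as the schedule constant 10, come from that work and are assumptions here, since the text above only states that a negative dynamic coefficient is used.

```latex
% Forward pass: identity. Backward pass: the gradient is scaled by -\eta.
R_\eta(h) = h, \qquad \frac{\partial R_\eta}{\partial h} = -\eta\, I
% Schedule from [17], with p \in [0, 1] the fraction of training completed:
\eta = \frac{2}{1 + \exp(-10\, p)} - 1
```

Under this schedule $\eta$ grows from 0 towards 1 during training, so the adversarial signal reaching the feature generator is weak early on, when the discriminator is still unreliable.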

Total loss. The total loss for our system is defined as follows:

$$L = L_{BCE} + \lambda_1 L_{trip} + \lambda_2 L_{adv}, \tag{5}$$

where $L_{BCE}$, $L_{trip}$, and $L_{adv}$ are the BCE, triplet, and adversarial losses, and $\lambda_1$ and $\lambda_2$ are set to 0.1 to balance the values of the three different losses. By utilizing the total loss $L$, we can construct an optimal classification feature space for the ADD task, where genuine speech signals from diverse domains are clustered together while fake speech signals are separated from them.

Table 2: Performance comparison with the state-of-the-art ADD models on the ADD-FG dataset (the round 1 and round 2 EER columns are missing from this copy).

Method          Feature
AASIST [5]      Raw Audio
ResNet18 [18]   LFCC
LCNN [19]       Mel
LCNN [19]       W2V2
SM-ASDG         W2V2

3. Experiments

3.1. Dataset and metrics

All experiments are conducted on the ADD 2023 Audio FG-D datasets. There are 27,084 audio clips in the training set and 28,324 audio clips in the development set. We divide the dataset as 90%/10% for training and validation, respectively. The audio amplitude in the training set is inconsistent, the audio contains noise, and there are repeated tail segments without valid information. The audio in the test set is much more complex, including noise, reverberation, background music, and a large number of silent clips. Therefore, improving the generalization and robustness of methods is the core challenge.

Weighted equal error rate (WEER) is used as the evaluation metric, which is defined as

$$\mathrm{WEER} = w_1\, \mathrm{EER}_1 + w_2\, \mathrm{EER}_2, \tag{6}$$

where $w_1 = 0.4$ and $w_2 = 0.6$ represent the weights for the equal error rate (EER) obtained in round 1 ($\mathrm{EER}_1$) and round 2 ($\mathrm{EER}_2$) of ADD Challenge Track 1.2, respectively.

3.2. Implementation details

All training audio files are trimmed or padded to 4 s. For the baseline AASIST, the input is the raw waveform of about 4 s (64000 samples). For the baseline ResNet18, we use 80-dimensional LFCCs with a shape of (80, 404) as the front-end. During training, the parameters of the W2V2 front-end are frozen. After the front-end, we obtain the last hidden states vector with a shape of (201, 1024) as the input of the back-end. For the training strategy, the Adam optimizer is adopted with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-9}$ and a weight decay of $10^{-4}$. The learning rate is initialized as $10^{-5}$ and halved every 5 epochs.

3.3. Ablation studies on architecture

Impact of backbone models and features. We first investigate the impact of the backbone models and features. As shown in Table 2, we compare with three baseline backbone models: AASIST [5], ResNet18 [18] and LCNN [19]. Furthermore, we compare the W2V2-based feature and a manual feature connected to the same LCNN back-end. It can be observed that the W2V2-based feature shows better performance than the manual feature. This is because W2V2 is trained on a large amount of real utterances from different source domains, which can enhance its discriminative capability in complex scenarios. Moreover, the results show that our SM-ASDG model outperforms all backbone models.

Impact of MixStyle. We further investigate the impact of the MixStyle units. As shown in Table 3, “SM-ASDG w/o MIX” denotes removing the MixStyle layer from our full model. It can be observed that removing MixStyle raises the EER by 2.91% in round 1 and 4.50% in round 2. This demonstrates the effectiveness of our MixStyle domain classifier module. This is because the bottom layers of a CNN correspond to style information and the top layers correspond to label information. MixStyle enables the diversification of the bottom style information of LCNN. That is, our model can generate diverse new styles of real speech and fake speech to enhance the ability of domain generalization.

Visualization for feature. To analyze the effect of MixStyle and our proposed ASDG backbone, we visualize the distribution of different hidden features using t-SNE [20]. As shown in Figure 2, we randomly select 360 samples from three source data domains. In each domain, we select 60 samples for real utterances and 60 for fake utterances. Figure 2 (a) and (b) demonstrate that the hidden feature distributions become more distinct after applying MixStyle, indicating that MixStyle facilitates the diversification of the bottom style information in LCNN. The feature space depicted in Figure 2 (c) aligns with our conception of an ideal feature space by ASDG, where genuine speech signals are clustered together, while synthetic ones are segregated.

[Figure 2: t-SNE visualization of hidden feature distributions, panels (a), (b) and (c); legend: Real, Fake.]

3.4. Ablation studies on pre-processing

To improve the robustness and generalization of the model, we explore a series of pre-processing strategies, including data shuffle, noise augmentation, low-pass filtering, amplitude adjustment, and region of interest (ROI) detection.

Does data shuffle help? We first explore the efficacy of the data shuffle strategy. As shown in Table 4, we design two domain segmentation schemes, namely data shuffle and direct division. Direct division refers to directly using the test set and validation set as two separate source domains. The two domain segmentation schemes are used in four variants, namely the ASDG model and the ASDG model with different data augmentation strategies. “Rawboost” denotes the raw data boosting and augmentation strategy [3]. We utilize the best-performing strategy in ASVspoof2021LA, which combines linear and non-linear convolutive noise with impulsive, signal-dependent noise. “RM” means adding reverberation and noise from the RIRs [15] and MUSAN [14] datasets to the audio of the training set in a Kaldi-like [21] manner. In each pair of comparison data (the red row in Table 4 and its upper row), we can observe that the shuffle strategy can effectively improve the forensic performance. Data shuffle can reduce the order and pattern in the dataset, thus improving the generalization of the model.

Does noise augmentation help? Since the test data contain a large amount of noise and background music that are not available in the source domain dataset, we incorporate a noise augmentation strategy in the pre-processing stage, that is, introducing noise during training to improve the robustness of the model. It can be seen from Table 5 (the fourth row from top) that when the noise augmentation strategy is removed, the EER rises by 9.02% in round 1 and 9.43% in round 2. In addition, the effectiveness of different noise augmentation strategies can also be seen in Table 4 (as shown in the red rows), as the RM strategy maximizes the robustness of the model. So we ultimately choose the RM strategy as the noise augmentation strategy.

Does low-pass filter help? To cope with complex speech scenarios, we add low-pass filters to focus on the core frequencies of the speech. The result shown in Table 5 (the third row from top) indicates that low-pass filters can help improve the forgery detection performance.

Does amplitude adjustment help? The amplitude level of the data samples varies greatly in the training and testing datasets. However, we find in our experiments that the amplitude of the audio is learned by the model and affects the classification results. Therefore, we adjust the amplitudes uniformly to the same interval range during both testing and training. As shown in Table 5 (the second row from top), audio normalization has an obvious effect on the performance.

Does ROI detection help? Due to the large number of silent segments that do not contain speech content in the test data, a straightforward idea to improve performance is to detect only speech segments, that is, to detect ROIs. However, as shown in Table 5 (the red row, “SM-ASDG with ROI”), when ROI detection is added, the EER rises by 2.19% in round 1 and 3.86% in round 2. This is mainly because silent segments also contain artifact information for distinguishing real and fake audio [22]. Therefore, simply eliminating silent segments does not improve the generalization and robustness of the model.

Table 5: EER (%) in round 1 and round 2 for different pre-processing strategies.

Pre-processing strategy           EER_1    EER_2
SM-ASDG w/o MIX                   26.97    27.09
SM-ASDG w/o [MIX+AMP]             27.53    27.80
SM-ASDG w/o [MIX+AMP+LP]          28.05    29.24
SM-ASDG w/o [MIX+AMP+LP+NA]       37.07    38.67
SM-ASDG with ROI                  26.25    26.45
SM-ASDG                           24.06    22.59

4. Conclusion

In this paper, we propose SM-ASDG, a novel shuffle mix aggregation and separation domain generalization method for single-domain ADD. The proposed method achieves a WEER of 23.17% in the ADD 2023 Track 1.2 final ranking, which places it among the top-5 performing methods. The outstanding robustness and generalization of the proposed SM-ASDG model are due to our carefully designed pre-processing module, data shuffle and MixStyle domain classification module. In future work, we plan to embed more high-level semantic features of audio, such as sentiment features, into the model to further improve generalization.
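The WEER quoted above can be related to the per-round EERs through the metric defined in Section 3.1, with weights 0.4 and 0.6. The short sketch below applies it to the full-model EERs of 24.06% and 22.59% reported in the pre-processing ablation. Treating those two values as the inputs to the final score is an assumption made for illustration, and the function and argument names are hypothetical.

```python
def weer(eer_round1, eer_round2, w1=0.4, w2=0.6):
    """Weighted equal error rate from the two per-round EERs (both in %)."""
    return w1 * eer_round1 + w2 * eer_round2


# Full SM-ASDG model: 0.4 * 24.06 + 0.6 * 22.59
full_model = weer(24.06, 22.59)
print(f"{full_model:.3f}")  # prints 23.178
```

The result, about 23.18%, is in line with the reported WEER of 23.17%. Because round 2 carries the larger weight, improvements in round 2 move the final score more than equal improvements in round 1.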

References

[1] Khan, K. M. Malik, J. Ryan, M. Saravanan, Voice spoofing countermeasures: Taxonomy, state-of-the-art, experimental analysis of generalizability, open challenges, and the way forward, arXiv preprint arXiv:2210.00417 (2022).

[2] Cohen, Rimon, E. Aflalo, Permuter, A study on data augmentation in voice anti-spoofing, Speech Communication 141 (2022) 56-67.

[3] Tak, Kamble, Patino, Todisco, Evans, Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing, in: Proceedings of the ICASSP, IEEE, 2022, pp. 6382-6386.

[4] Tak, Patino, Todisco, Nautsch, N. Evans, A. Larcher, End-to-end anti-spoofing with RawNet2, in: Proceedings of the ICASSP, 2021, pp. 6369-6373.

[5] J.-w. Jung, H.-S. Heo, Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, N. Evans, AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks, in: Proceedings of the ICASSP, 2022, pp. 6367-6371.

[6] Müller, Czempin, Diekmann, Froghyar, K. Böttinger, Does Audio Deepfake Detection Generalize?, in: Proceedings of the Interspeech, 2022, pp. 2783-2787.

[7] Piotr, Marcin, Piotr, Attack agnostic dataset: Towards generalization and stabilization of audio deepfake detection, in: Proceedings of the Interspeech, 2022, pp. 4023-4027.

[8] Matsuura, T. Harada, Domain generalization using a mixture of multiple latent domains, in: Proceedings of the AAAI, volume 34, 2020, pp. 11749-11756.

[9] Zhou, Luo, Gao, Li, Lei, Leng, Selective domain-invariant feature alignment network for face anti-spoofing, IEEE Transactions on Information Forensics and Security 16 (2021) 5352-5365.

[10] Wang, Yu, Deng, Li, Gao, Wang, Domain generalization via shuffled style assembly for face anti-spoofing, in: Proceedings of the CVPR, 2022, pp. 4123-4133.

[11] Jiangyan, Jianhua, Ruibo, Xinrui, Chenglong, Tao, Chuyuan, Xiaohui, Yan, Yong, Le, Junzuo, G. Hao, Zhengqi, Shan, Zhen, L. Haizhou, ADD 2023: the second audio deepfake detection challenge, in: Proceedings of the IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023.

[12] Babu, Wang, Tjandra, Lakhotia, Xu, Goyal, Singh, P. von Platen, Saraf, Pino, et al., XLS-R: Self-supervised cross-lingual speech representation learning at scale, arXiv preprint arXiv:2111.09296 (2021).

[13] Wang, Nishizaki, Li, Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities, arXiv preprint arXiv:2211.06546 (2022).

[14] David, Guoguo, P. Daniel, MUSAN: A music, speech, and noise corpus, arXiv:1510.08484 (2015).

[15] Tom, Vijayaditya, P. Daniel, S. Michael L, K. Sanjeev, A study on data augmentation of reverberant speech for robust speech recognition, in: Proceedings of the ICASSP, 2017, pp. 5220-5224.

[16] Wang, Yamagishi, A comparative study on recent neural spoofing countermeasures for synthetic speech detection, arXiv preprint arXiv:2103.11326 (2021).

[17] Ganin, Lempitsky, Unsupervised domain adaptation by backpropagation, in: Proceedings of the ICML, 2015, pp. 1180-1189.

[18] Ma, Ren, S. Xu, RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform, in: Proceedings of the Interspeech, 2021, pp. 4144-4148.

[19] Lavrentyeva, Novoselov, S., M. Volkova, A. Gorlanov, A. Kozlov, STC antispoofing systems for the ASVspoof2019 challenge, arXiv preprint arXiv:1904.05576 (2019).

[20] Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).

[21] Mirco, Titouan, Yoshua, The pytorch-kaldi speech recognition toolkit, in: Proceedings of the ICASSP, 2019, pp. 6465-6469.

[22] Yuxiang, Wenchao, Pengyuan, The effect of silence and dual-band fusion in anti-spoofing system, in: Proceedings of the Interspeech, 2021.