
August

Multi-perspective Information Fusion Res2Net with Random Specmix for Fake Speech Detection

Shunbo Dong

Jun Xue

Cunhang Fan

Kang Zhu

Yujie Chen

Zhao Lv

Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University (AHU), 1 Jiulong Road, Hefei, 230601, China

2023


In this paper, we propose the multi-perspective information fusion (MPIF) Res2Net with random Specmix for fake speech detection (FSD). The main purpose of this system is to improve the model's ability to learn precise forgery information for the FSD task in low-quality scenarios. Random Specmix, a data augmentation method, improves the generalization ability of the model and enhances its ability to locate discriminative information. Specmix cuts and pastes the frequency-dimension information of spectrograms within the same batch of samples without introducing other data, which helps the model to locate the truly useful information. At the same time, we randomly select the samples to be augmented, which limits the impact of data augmentation that would otherwise change all the data. Once the model has been helped to locate information, it is also important to reduce unnecessary information. The role of MPIF-Res2Net is to reduce redundant, interfering information. Deceptive information from a single perspective is always similar, so a model that learns this similar information produces redundant spoofing clues that interfere with the truly discriminative information. The proposed MPIF-Res2Net fuses information from different perspectives, making the information learned by the model more diverse, thereby reducing the redundancy caused by similar information and avoiding interference with the learning of discriminative information. The results on the ASVspoof 2021 LA dataset demonstrate the effectiveness of our proposed method, which achieves an EER of 3.29% and a min t-DCF of 0.2557.

Keywords: multi-perspective information fusion, fake speech detection task, random Specmix strategy

1. Introduction

This enables the model to acquire more comprehensive feature information. Li et al. [18] investigated the effectiveness of Res2Net in conjunction with different acoustic features. Li et al. [19] proposed a channel-wise gating mechanism to suppress channels with lower correlations, which they considered not useful. However, the models mentioned above may not perform as well in low-quality scenarios, as their experiments were conducted only in clean scenarios.

In this work, we propose the multi-perspective information fusion Res2Net (MPIF-Res2Net) with random Specmix. Spoofing information from a single perspective is always similar during the learning process, which causes redundant information and blurs the truly discriminative information. The MPIF module fuses the information from different receptive fields to reduce the redundant spoofing cues and enhance the robustness of the system in poor-quality scenarios. Specmix can increase the diversity of the training data, thereby improving the generalization ability of the model. The generated spectrogram incorporates information from other spectrograms, allowing the model to pay attention to the noteworthy information. It performs cut and paste among spectrograms without introducing data that was not present in the original dataset, so its impact on the original dataset is modest. A data augmentation (DA) method applied to all the samples may affect the distribution characteristics of the original data. To address this issue, we randomly choose samples in advance according to the probability $p_{th}$ to conduct Specmix, which prevents excessive augmentation from weakening the fitting ability of the system. Specifically, we randomly cover part of the frequency information with the corresponding frequency information of another sample in the same batch. This approach improves the performance. Our proposed method has been shown to be effective on the ASVspoof 2021 LA dataset, achieving an EER of 3.29% and a min t-DCF of 0.2557.

2. Methodology

2.1. Proposed method

In this section, we introduce the structure of the proposed MPIF-Res2Net, which reduces the redundancy caused by learning single-perspective forgery information by integrating information from multiple perspectives. Convolutional operations with a single kernel size learn similar forgery clues, producing too much redundant information and obscuring the important discriminative information. Therefore, the MPIF-Res2Net, shown on the left of Figure 1(b), is proposed to fuse information from convolutional operations with different kernel sizes. The architecture of Res2Net is shown in Figure 1(a). The output of the 1 × 1 convolution is split into $n$ equal parts along the channel dimension, denoted as $x_i$, where $i$ is an integer from 1 to $n$. Each part has $w$ channels (Eq. 1):

$$w = \frac{\#channel}{n} \quad (1)$$

where $\#channel$ means the total number of channels. Res2Net uses the residual-like connection to perform addition between the channel groups. The following formulation can be used to describe this process:

$$y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le n \end{cases} \quad (2)$$

Figure 1: (a) Res2Net backbone. (b) MPIF-Res2Net.

Table 1: The proposed MPIF-Res2Net model architecture and configuration. The dimensions are arranged in the order of channels, frequency, and time. BN denotes batch normalization and ReLU denotes rectified linear unit; MPIF and SE are the multi-perspective information module and the squeeze-and-excitation layer, respectively.

| Layer | Configuration | Output shape |
|---|---|---|
| Front-end | Input: 27000 samples; F0 subband | (45, 600) (F, T) |
| Pre-processing | Channel expansion | (1, 45, 600) |
| Pre-processing | Conv2D_1, BN & ReLU | (16, 45, 600) |
| Layer1 & Layer3 | Blocks of Conv2D_1, Conv2D_3 / MPIF, Conv2D_1, SE | Layer1: (32, 45, 600); Layer3: (128, 12, 150) |
| Layer2 & Layer4 | Blocks of Conv2D_1, Conv2D_3 / MPIF, Conv2D_1, SE | Layer2: (64, 23, 300); Layer4: (256, 6, 75) |
| Output | Avgpool2D(1, 1); AngleLinear | |

As shown on the right of Figure 1(b), the MPIF module in the current channel group performs different convolution operations on $x_i$ or $x_i + y_{i-1}$ to obtain spoofing information from different perspectives. First, $x_i$ is sent into convolution operations with different dilation parameters $d$, where $d \in \{1, 2\}$, at the beginning of the MPIF module (Eq. 4). The results then pass through a convolution (Conv2D) that recalculates the energy distribution of each channel, are normalized by the Sigmoid function, and are finally averaged by a pooling layer to obtain the weight of each channel.
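The weighting step described above (recalibration, Sigmoid, average pooling) and the subsequent weighted sum of the branches can be illustrated with a small sketch. This is only a toy illustration under stated assumptions, not the authors' implementation: the two branch outputs stand in for the results of the convolutions with dilation 1 and 2, the recalibrating convolution is replaced by a hypothetical pointwise function `recalibrate`, and the names `channel_weights` and `fuse_branches` are invented for this example.

```python
import math


def sigmoid(value):
    """Logistic function used to squash recalibrated activations into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def channel_weights(feature, recalibrate):
    """Return one weight per channel of `feature`.

    `feature` is a nested list shaped [channels][frequency][time].
    Each weight is the average over (frequency, time) of
    sigmoid(recalibrate(v)); `recalibrate` is a pointwise stand-in
    (an assumption) for the recalibrating convolution.
    """
    weights = []
    for channel in feature:
        values = [sigmoid(recalibrate(v)) for row in channel for v in row]
        weights.append(sum(values) / len(values))
    return weights


def fuse_branches(branches, recalibrate=lambda v: v):
    """Sum same-shaped branch features after scaling each channel by its weight."""
    n_channels = len(branches[0])
    n_rows = len(branches[0][0])
    n_cols = len(branches[0][0][0])
    fused = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_channels)]
    for feature in branches:
        weights = channel_weights(feature, recalibrate)
        for c in range(n_channels):
            for f in range(n_rows):
                for t in range(n_cols):
                    fused[c][f][t] += weights[c] * feature[c][f][t]
    return fused


# Two toy branches with one channel, one frequency bin and two frames.
branch_dilation_1 = [[[0.0, 0.0]]]
branch_dilation_2 = [[[2.0, 2.0]]]
fused_feature = fuse_branches([branch_dilation_1, branch_dilation_2])
print(fused_feature)
```

In this toy setting a channel with larger recalibrated activations receives a weight closer to 1, so it contributes more to the fused feature than a channel whose activations are near zero.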

The purpose of using dilated convolution is to enlarge the receptive field, so that each convolution output contains information from a larger range while the number of parameters and the computation cost stay constant.
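A short worked example may make this trade-off concrete. The sketch below uses the standard relation for dilated convolutions, in which a kernel of size k with dilation d spans d(k - 1) + 1 input positions, while the number of weights depends only on k and the channel counts. The function names and the channel count of 16 are assumptions chosen for illustration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input positions spanned along one axis by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights of a square 2D convolution (bias excluded).

    The dilation does not appear here: it only spreads the same taps apart.
    """
    return in_channels * out_channels * kernel_size * kernel_size


for dilation in (1, 2):
    span = effective_kernel_size(3, dilation)
    weights = conv_weight_count(16, 16, 3)
    print(f"dilation={dilation}: spans {span}x{span} inputs with {weights} weights")
```

With a 3 × 3 kernel, dilation 1 covers a 3 × 3 region and dilation 2 covers a 5 × 5 region, and both use the same number of weights.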

The weighting factor of each channel is calculated by Eq. 5.

$$F_d = \mathrm{Conv2D}_d(x_i) \quad (4)$$

$$\omega_c = \mathrm{AvgPool}(\sigma(\mathrm{Conv2D}(f_c))) \quad (5)$$

$\mathrm{Conv2D}_d$ at the beginning of the MPIF module takes $x_i$ as input and outputs $F_d$. $f_c$ is the $c$-th channel in $F_d$. $\mathrm{Conv2D}$ is a convolutional operation that recalculates the energy distribution. $\sigma$ is the Sigmoid function. $\mathrm{AvgPool}$ denotes the AvgPool2D function.

After obtaining the weighting factor $\omega_c$, we multiply $F_d$ by $\Omega_d$ and sum up the results. The result $\mathrm{MPIF}(x_i)$ of the $i$-th channel group is then obtained as follows:

$$\mathrm{MPIF}(x_i) = \sum_{d=1}^{2} F_d \times \Omega_d \quad (6)$$

where $\Omega_d \in \mathbb{R}^{C \times 1 \times 1}$ is the weight matrix composed of the channel weights $\omega_c$, and $C$ is the number of channels of $F_d$.

2.3. Random Specmix Strategy

In this work, we use a random Specmix strategy to help the model locate the discriminative information and to enhance the generalization of the model. For the training of deep neural networks, the raw audio is usually transformed from the time domain into the time-frequency domain. Inspired by [20], we conduct Specmix on the frequency dimension of the F0 subband [30], which is a subband of the amplitude spectrum, and the maximum span of the Specmix operation is no more than 10. Specmix cuts and pastes spectrograms among themselves within the same batch to help the model focus on the discriminative regions that are worth attending to. Different from [20], there is no Specmix operation on the labels. We cover the information on the frequency dimension with the corresponding parts of other samples in the same batch. At the same time, to avoid conducting Specmix on all the samples, and inspired by [21], we randomly choose speech samples in advance according to the hyperparameter $p_{th}$ to conduct the Specmix operation. For a batch of speech samples, the probability of them undergoing random Specmix is $p$; when $p$ is larger than $p_{th}$, Specmix is conducted on them, and otherwise it is not. In the evaluation phase, we do not use the random Specmix strategy.

3. Experiments and Results

3.1. Experimental Setup

Learning a robust countermeasure suited to low-quality scenarios is challenging when the training set does not contain the same interference conditions. In this work, we use the Rawboost [11] DA method to train the model, which can enhance the accuracy in low-quality scenarios. More precisely, impulsive signal-dependent (ISD) additive noise and stationary signal-independent (SSI) additive noise are added to the raw waveform. After the STFT operation with a window length of 1728 and a hop length of 130, we obtain a spectrogram with 865 frequency bins. We then truncate or concatenate the spectrogram to fix the number of frames at 600. We use the 0-400 Hz LPS feature, i.e., the first 45 dimensions (0-45), as our F0 subband feature. The resulting feature size of the F0 subband is 45 × 600. Then, we determine whether to conduct random Specmix on the samples in the current batch by setting the hyperparameter $p_{th}$, the probability threshold for conducting the random Specmix strategy. Considering that the F0 feature is a subband of the amplitude spectrum, we set the maximum span for Specmix to be no more than 10.

In this article, we propose MPIF-Res2Net to fuse the information from different perspectives to reduce the redundant spoofing cues, and we introduce random Specmix to improve the generalization ability of the model. Table 1 presents the design of MPIF-Res2Net, which includes details on channels, convolution kernels, and repetition frequency. In our experiments, Adam is used as the optimizer with the following parameter settings: $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$, and a weight decay of $10^{-4}$. The number of epochs is set to 32, the number of channel groups to 8, and the batch size to 16.

3.2. Dataset

The data in the ASVspoof 2019 logical access (LA) dataset is divided into three subsets: training set, development set, and evaluation set. The spoofed speech in the training and development sets comes from six speech synthesis and voice conversion technologies, which are known attack types. The evaluation set contains audio generated by 11 unknown attack types. We trained our model on the ASVspoof 2019 training set and selected the best performing model on the development set. As stated in [33], the ASVspoof 2021 LA dataset is designed for developing anti-spoofing methods that can effectively adapt to unknown channel variations, and it does not provide new matched training or development data. The speech samples in the ASVspoof 2021 evaluation set were transmitted via actual telephone systems using various bandwidths and codecs. The data in the 2019 LA training and development subsets has not undergone similar encoding and transmission; these subsets contain only clean data.

Table 3: Results comparison with fusion systems on the ASVspoof 2021 dataset.

| System | min t-DCF | EER (%) |
|---|---|---|
| T23 [22] | 0.2177 | 1.32 |
| T20 [23] | 0.2608 | 3.21 |
| T04 [24] | 0.2747 | 5.58 |
| T06 [25] | 0.2853 | 5.66 |
| T35 [22] | 0.2480 | 2.77 |
| T19 [22] | 0.2495 | 3.13 |
| Fusion systems [27] | 0.2882 | 4.66 |
| MPIF-Res2Net (ours) | 0.2557 | 3.29 |

Table 4: Results comparison with single systems on the ASVspoof 2021 dataset.

Equal error rate (EER) and minimum tandem detection cost function (min t-DCF) are used as the metrics.
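For readers less familiar with the first metric, the EER is the operating point at which the false acceptance rate (spoofed speech accepted as bona fide) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of that definition, assuming that higher scores mean "more likely bona fide". It is not the official ASVspoof scoring tool, the function name `compute_eer` and the toy scores are made up for illustration, and min t-DCF is not covered here.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping every observed score.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra threshold that rejects everything
    best_gap = None
    eer = None
    for threshold in thresholds:
        # False acceptance: spoofed trials that pass the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials that fall below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Toy scores: one spoofed trial (0.7) outranks two bona fide trials (0.6, 0.4).
toy_bonafide = [0.9, 0.6, 0.4, 0.8]
toy_spoof = [0.1, 0.5, 0.7, 0.2]
print(f"EER = {100 * compute_eer(toy_bonafide, toy_spoof):.2f}%")
```

On real evaluation sets the two error curves rarely cross exactly at an observed score, which is why the sketch averages FAR and FRR at the closest threshold instead of requiring equality.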

3.3. Experimental Results

3.3.1. Ablation Study

First, different values of the probability threshold $p_{th}$ are evaluated to find the best setting. Table 2 shows the EER results of the random Specmix strategy for different values of $p_{th}$. The MPIF-Res2Net with $p_{th}$ = 0.5 performs best, with an EER of 3.29% and a min t-DCF of 0.2557, which indicates a relatively high reliability of the countermeasure system when it is combined with an ASV system.

For experiments involving information from a single receptive field, we set up two models, Res2Net_k3 and Res2Net_k5, whose kernel size and dilation are 3 and 1, and 3 and 2, respectively. The EER of MPIF-Res2Net with $p_{th}$ equal to 1 is 3.70%, whereas the corresponding EERs of the other two systems are 4.58% and 4.49%, respectively. This verifies that the proposed MPIF-Res2Net, by fusing information from different perspectives, reduces the redundancy caused by learning similar spoofing clues with a single kernel size. The EER results of Res2Net_k3 and Res2Net_k5 with Specmix demonstrate that Specmix helps the model to locate the forgery information and improves its generalization performance.

For the random Specmix strategy, the MPIF-Res2Net with $p_{th}$ = 0 obtains an EER of 4.04%. Setting $p_{th}$ to 0 means that Specmix is conducted on all of the samples, which indicates that applying Specmix to every sample seriously degrades the performance of the system. Overall, the random Specmix has improved the model's generalization ability and enhanced its performance. Experimental results show that our proposed MPIF-Res2Net with random Specmix can improve performance for the FSD task in low-quality scenarios.

3.3.2. Performance Comparison With Other Systems

Table 3 shows the results of different fusion systems on the ASVspoof 2021 LA dataset. Although the fusion systems T23 [22], T20 [23], T35 [22] and T19 [22] achieve a lower EER than our proposed MPIF-Res2Net, their fusion schemes are very complicated. For example, T23 is composed of 12 separately trained systems that are fused with finely tuned weights at the score level. The method we propose is based on a single system, which is less complicated than the fusion systems. Table 4 shows the EER results of single systems. The best EER among the other systems is 8.05%, so our method improves the performance by 59% relative to the RawNet2 [29] system.

4. Conclusion

In this paper, we achieve accurate and useful information discrimination from two aspects. On the one hand, Specmix helps the model to focus on the location of key information in the sample by mixing information between samples, and it randomly selects samples for the Specmix operation, which effectively avoids the performance degradation caused by the destruction of the original data. On the other hand, MPIF-Res2Net reduces the redundant information caused by learning similar information from a single perspective by fusing information from multiple perspectives, removing the influence of redundant information on the learning of key information. The effectiveness of our proposed method was verified by the experimental results.

References

[1] Naika. An overview of automatic speaker verification system. In: Intelligent Computing and Information and Communication: Proceedings of 2nd International Conference, ICICC 2017. Springer Singapore, 2018: 603-610.

[2] Wu, Kinnunen, Evans, et al. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In: Sixteenth Annual Conference of the International Speech Communication Association. 2015.

[3] Kinnunen, Sahidullah, Delgado, et al. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. 2017.

[4] Todisco, Wang, Vestman, et al. ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441, 2019.

[5] Yamagishi, Wang, Todisco, et al. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection. arXiv preprint arXiv:2109.00537, 2021.

[6] Yi, Fu, Tao, Nie, H. Ma, Wang, Wang, Z. Tian, Bai, Fan, et al. Add 2022: the first audio deep synthesis detection challenge. In: ICASSP 2022. IEEE, 2022: 9216-9220.

[7] Arif, Javed, Alhameed, et al. Voice spoofing countermeasure for logical access attacks detection. IEEE Access, 2021, 9: 162857-162868.

[8] Zhang, Jiang, Duan Z. One-class learning towards synthetic voice spoofing detection. IEEE Signal Processing Letters, 2021, 28: 937-941.

[9] Das R K, Yang, Li. Long range acoustic and deep features perspective on ASVspoof 2019. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019: 1018-1025.

[10] Nautsch, Wang, Evans, et al. ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2021, 3(2): 252-265.

[11] Tak, Kamble, Patino, et al. Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 6382-6386.

[12] Park D S, Chan, Zhang, et al. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

[13] Kwak I Y, Choi, Yang, et al. CAU_KU team's submission to ADD 2022 Challenge task 1: Low-quality fake audio detection through frequency feature masking. arXiv preprint arXiv:2202.04328, 2022.

[14] Lavrentyeva, Novoselov, Malykh, et al. Audio replay attack detection with deep learning frameworks. In: Proc. of Interspeech 2017. Grenoble, France: ISCA, 2017: 82-86.

[15] Alzantot, Wang, Srivastava M B. Deep residual neural networks for audio spoofing detection. arXiv preprint arXiv:1907.00501, 2019.

[16] Lai C I, Chen, Villalba, et al. ASSERT: Anti-spoofing with squeeze-excitation and residual networks. arXiv preprint arXiv:1904.01120, 2019.

[17] Gao S H, Cheng M M, Zhao, et al. Res2net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(2): 652-662.

[18] Li, Li, Weng, et al. Replay and synthetic speech detection with res2net architecture. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 6354-6358.

[19] Li, Wu, Lu, Liu, Meng. Channel-wise gated res2net: Towards robust detection of synthetic speech attacks. In: Proc. Interspeech 2021, 2021.

[20] Kim, Han D K, Ko H. Specmix: A mixed sample data augmentation method for training with time-frequency domain features. arXiv preprint arXiv:2108.03020, 2021.

[21] Zhong, Zheng, Kang, et al. Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(07): 13001-13008.

[22] Tomilov, Svishchev, Volkova, A. Chirkovskiy, A. Kondratev, G. Lavrentyeva. STC Antispoofing Systems for the ASVspoof2021 Challenge. In: Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021: 61-67.

[23] Chen, Khoury, Phatak, G. Sivaraman. Pindrop Labs' Submission to the ASVspoof 2021 Challenge. In: Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021: 89-93.

[24] J. Cáceres, R. Font, Grau, Molina. The Biometric Vox System for the ASVspoof 2021 Challenge. In: Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021: 68-74.

[25] W. H. Kang, Alam, Fathan. CRIM's System Description for the ASVSpoof2021 Challenge. In: Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021: 100-106.

[26] Liu, Wang, Sahidullah, Patino, Delgado, T. Kinnunen, M. Todisco, Yamagishi, Evans, Nautsch, et al. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. arXiv preprint arXiv:2210.02437, 2022.

[27] Cohen, Rimon, E. Aflalo, H. H. Permuter. A study on data augmentation in voice anti-spoofing. Speech Communication, 2022, 141: 56-67.

[28] Yamagishi, Wang, Todisco, Sahidullah, Patino, Nautsch, Liu, K. A. Lee, Kinnunen, Evans, et al. Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection. In: ASVspoof 2021 Workshop - Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021.

[29] Wang, Qin, Zhu, Wang, Zhang, Li. The dku-cmri system for the asvspoof 2021 challenge: vocoder based replay channel response estimation. In: Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021: 16-21.

[30] Xue, Fan, Lv, et al. Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features. In: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia. 2022: 19-26.

[31] Hu, Shen, G. Sun. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7132-7141.

[32] Xue, Fan, Yi, et al. Learning from yourself: A self-distillation method for fake speech detection. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023: 1-5.

[33] Liu, Wang, Sahidullah, et al. ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild. arXiv preprint arXiv:2210.02437, 2022.