<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-grained Backend Fusion for Manipulation Region Location of Partially Fake Audio</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jun Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mengjie Luo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoqin Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shushan Qiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yumei Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Microelectronics of the Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nanjing Institute of Intelligent Technology</institution>
          ,
          <addr-line>Nanjing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>43</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>Fake audio detection is an important research area to prevent the misuse of speech synthesis and voice conversion technologies. While progress has been made in detecting partially fake audio at the utterance level, accurately locating the manipulation region at the segment level remains challenging. Aiming to promote the development of manipulation region location of partially fake audio, ADD 2023 is organized and its Track 2 seeks to locate the fake clips. This paper introduces our system submitted to ADD 2023 Track 2, combining AASIST-based and Wav2Vec2-based subsystems through multi-grained backend fusion. With the proposed method, the bias of AASIST towards the fake class and the bias of Wav2Vec2 towards the genuine class are mitigated. Our system achieves a final score of 59.12%, a relative increase of 40.7% compared to the best baseline system in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake Audio Detection</kwd>
        <kwd>Audio Deepfake Detection</kwd>
        <kwd>Partially Fake</kwd>
        <kwd>Wav2Vec2</kwd>
        <kwd>AASIST</kwd>
        <kwd>Backend Fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The advancement of speech synthesis and voice conversion (VC) technologies has significantly enhanced the quality and naturalness of synthesized speech [1, 2, 3, 4]. However, an issue of potential technology abuse, such as telecom fraud, may be brought up. Consequently, there is a growing concern about fake audio detection (FAD), where synthesized audio produced for inappropriate uses is defined as fake audio or spoofing attacks.</p>
      <p>The ASVspoof challenges have gathered attention from researchers who aim to protect automatic speaker verification (ASV) systems from spoofing attacks [5, 6, 7, 8]. ASVspoof 2015, 2017 and 2019 focused on the logical access (LA) task, the physical access (PA) task, or both. The LA task involved detecting spoofing audio generated by statistical or neural text-to-speech (TTS) and VC methods, while the PA task aimed to distinguish replay audio implemented in various simulated acoustic environments. In ASVspoof 2021, a new deepfake track was introduced to detect compressed manipulated audio, aiming to enhance system robustness. Furthermore, the spoofing-aware speaker verification (SASV) challenge in 2022 attempted to jointly optimize FAD and ASV systems instead of utilizing a FAD system as a gate in front of the ASV system [9]. Among these challenges, ResNet-based frameworks [10] have been widely adopted in the ASVspoof challenges [11, 12], and AASIST [13] served as a baseline model and was employed by several top-ranked participants in SASV 2022 who sought to achieve a low equal error rate of FAD [14, 15, 16, 17].</p>
      <p>Previous challenges have primarily focused on detecting fully fake audio at the utterance level, without addressing realistic scenarios involving partially fake audio. Partially fake audio refers to fake audio with small fake clips hidden in genuine speech [18, 19]. To address this gap, the ADD challenges were launched to encourage researchers to explore new frameworks for detecting partially fake audio [20, 21]. In ADD 2022 (Audio Deep Synthesis Detection Challenge), Track 2 targeted detecting partially fake audio at the utterance level. In ADD 2023 (Audio Deepfake Detection Challenge), the goal of Track 2 is localizing manipulated clips within a speech sentence. In ADD 2022 Track 2, the best partially fake audio detection system at the utterance level was based on pretrained self-supervised Wav2Vec2 [22, 23], but it fails to spot fake clips [24]. On the other hand, methods focusing on frame-wise boundary detection of manipulated clips have shown the capability to locate fake clips [25, 26].</p>
      <p>This paper presents our system for the manipulation region location of partially fake audio in ADD 2023 Track 2. The backend fused system combines AASIST for detecting fake audio at the utterance level and Wav2Vec2 at the frame level. The main contribution of this paper is the proposal of multi-grained backend fusion, which aims to mitigate the biases of AASIST towards the fake class and of Wav2Vec2 towards the genuine class. Our submitted system achieves a final score of 59.12%, a relative increase of 40.7% compared to the best baseline system.</p>
      <p>The rest of this paper is as follows. The proposed
method is described in Section 2. Section 3 details the
experiment settings. Experimental results and analysis
are discussed in Section 4. Finally, Section 5 concludes
the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Task Definition</title>
        <p>FAD at the utterance level is a binary classification task to detect whether a sentence is genuine or fake. In contrast, manipulation region location identifies fake segments at a finer granularity. In Track 2 of ADD 2023, the duration of each segment is 10ms. Therefore, given an utterance X with N segments, represented as X = [x_1, x_2, ..., x_N], the output at the utterance level should be y ∈ {0, 1}, while the output at the segment level is a vector y = [y_1, y_2, ..., y_N] ∈ {0, 1}^N, where 0 denotes fake and 1 denotes genuine. Besides, since the duration of a segment is similar to that of a frame commonly used in speech processing, models generating frame-level outputs can be used to detect segments.</p>
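        <p>As a minimal illustration of the two label granularities (a sketch under the paper's 0 = fake / 1 = genuine convention, with hypothetical variable names), the utterance-level label can be derived from the segment-level vector:</p>
        <preformat>
import numpy as np

# Hypothetical segment-level labels for an utterance with N = 8 segments
# (10 ms each); 0 denotes fake, 1 denotes genuine.
y_seg = np.array([1, 1, 1, 0, 0, 1, 1, 1])

# An utterance is genuine (1) only if every segment is genuine,
# otherwise it is fake (0).
y_utt = int(np.all(y_seg == 1))

print(y_seg, y_utt)  # [1 1 1 0 0 1 1 1] 0
        </preformat>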
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Proposed System</title>
        <sec id="sec-2-2-1">
          <title>AASIST-based subsystem at the utterance level</title>
          <p>AASIST is an end-to-end architecture based on graph attention networks, proposed to detect different spoofing attacks [13]. The raw waveform is adopted as input, with a minimum of 64,600 samples, about 4s at a sampling rate of 16kHz. While the original AASIST (https://github.com/clovaai/aasist) aims to classify genuine and spoofed utterances with a binary classifier, the classifier is replaced by a 5-class FC (fully connected) layer to detect 4 types of fake audio along with a genuine class. In the training and development sets of ADD 2023 Track 2, we refer to the 4 fake forms as Fake01, Fake101, Fake10 and Fake0, where 0 denotes the presence of manipulated fake clips. Finally, the logits of the last FC layer are fed into a softmax function.</p>
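          <p>The head replacement can be sketched as follows (a simplified illustration, not the authors' exact code; the AASIST backbone is stood in for by a placeholder module, and the 160-dimensional embedding size is an assumption):</p>
          <preformat>
import torch
import torch.nn as nn

class FiveClassHead(nn.Module):
    """Utterance-level classifier: one genuine class plus four partially fake forms."""
    def __init__(self, encoder: nn.Module, emb_dim: int = 160, n_classes: int = 5):
        super().__init__()
        self.encoder = encoder                    # AASIST backbone producing one embedding per utterance
        self.fc = nn.Linear(emb_dim, n_classes)   # replaces the original binary classifier

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        emb = self.encoder(waveform)              # (batch, emb_dim)
        return torch.softmax(self.fc(emb), dim=-1)

# Placeholder encoder so the sketch runs end to end.
dummy_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(160))
model = FiveClassHead(dummy_encoder)
probs = model(torch.randn(2, 64600))              # two 4-second utterances at 16 kHz
print(probs.shape)                                # torch.Size([2, 5])
          </preformat>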
        </sec>
        <sec id="sec-2-2-2">
          <title>Wav2Vec2-based subsystem at the frame level</title>
          <p>To address the limitation of AASIST in fake clip location, a Wav2Vec2-based subsystem is employed to determine the authenticity of each frame. A self-supervised pretrained model called XLS-R-300M with 300M parameters (https://huggingface.co/facebook/wav2vec2-xls-r-300m) is utilized to capture contextualized acoustic representations [23]. Similar to AASIST, Wav2Vec2 also takes the raw waveform as input. It generates frame representations at a hop length of 20ms, with a frame length of 25ms, given an input sampling rate of 16kHz. The last hidden output of Wav2Vec2 is passed through a dropout layer, followed by a binary linear layer for frame classification.</p>
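          <p>A minimal sketch of this frame classifier, assuming the Hugging Face transformers implementation of XLS-R (the dropout rate and layer names are illustrative, not the authors' exact configuration):</p>
          <preformat>
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class FrameClassifier(nn.Module):
    """Binary genuine/fake decision for every ~20 ms Wav2Vec2 frame."""
    def __init__(self, ckpt: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(ckpt)
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(self.backbone.config.hidden_size, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(waveform).last_hidden_state   # (batch, frames, hidden)
        return self.head(self.dropout(hidden))               # (batch, frames, 2)

model = FrameClassifier()
logits = model(torch.randn(1, 64000))   # 4 s of 16 kHz audio
print(logits.shape)                     # roughly (1, 199, 2)
          </preformat>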
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.2. Multi-grained Backend Fusion</title>
          <p>Manipulation region location system. As depicted in Figure 1(c), the manipulation region location system consists of an AASIST-based subsystem and a Wav2Vec2-based subsystem. By fusing multi-grained results from these subsystems, the system aims to mitigate the biases observed in the experiments detailed in Section 4. The alignment of utterance results to the frame level involves two main steps. Firstly, the probabilities of all types of fake audio are summed, converting the 5-class output into a binary classification. Then the binary classification outputs at the utterance level are expanded along the time domain to match the number of frames in the Wav2Vec2 outputs. The expanded utterance outputs are combined with the frame outputs by weighted fusion, and the argmax function is applied to determine the authenticity of each frame.</p>
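          <p>The alignment and weighted fusion can be sketched as follows (a simplified numpy illustration; the fusion weights and array names are hypothetical, and the class order of the 5-class head is assumed to be [genuine, Fake01, Fake101, Fake10, Fake0]):</p>
          <preformat>
import numpy as np

def fuse(utt_probs5, frame_probs, w_utt=0.4, w_frame=0.6):
    """utt_probs5: (5,) softmax of the AASIST 5-class head for one utterance.
    frame_probs: (T, 2) softmax of the Wav2Vec2 head, columns [fake, genuine].
    Returns per-frame decisions, 0 = fake, 1 = genuine."""
    # Step 1: collapse the 5-class output to binary [fake, genuine].
    p_fake = utt_probs5[1:].sum()
    utt_binary = np.array([p_fake, utt_probs5[0]])
    # Step 2: expand the utterance result along time to match the frame count.
    utt_expanded = np.tile(utt_binary, (frame_probs.shape[0], 1))
    # Step 3: weighted fusion, then argmax per frame.
    fused = w_utt * utt_expanded + w_frame * frame_probs
    return fused.argmax(axis=1)

utt_probs5 = np.array([0.10, 0.05, 0.70, 0.10, 0.05])   # utterance looks fake
frame_probs = np.tile([0.2, 0.8], (10, 1))              # frames look genuine
print(fuse(utt_probs5, frame_probs))
          </preformat>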
          <p>FAD system at the utterance level. To enhance the sentence accuracy, average fusion of AASIST models trained on different datasets is utilized. The averaged utterance result is then fused with the frame output of a Wav2Vec2-based subsystem. Following the definition in ADD 2023, if any frame is identified as fake, the label of fake is assigned to the entire utterance; only when all frames are classified as genuine is the utterance labeled as genuine.</p>
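          <p>A compact sketch of the average fusion across AASIST models together with the ADD 2023 utterance rule (hypothetical variable names and illustrative probabilities):</p>
          <preformat>
import numpy as np

# Utterance-level probabilities [fake, genuine] from several AASIST models
# trained on different datasets (illustrative numbers); average fusion over models.
aasist_probs = np.array([[0.60, 0.40],
                         [0.72, 0.28],
                         [0.55, 0.45]])
utt_probs = aasist_probs.mean(axis=0)   # then fed into the frame-level fusion above

# Given the fused per-frame decisions (0 = fake, 1 = genuine), the ADD 2023
# rule labels the utterance genuine only if every frame is genuine.
frame_decisions = np.array([1, 1, 0, 1, 1])
utterance_label = int(frame_decisions.min() == 1)   # 0: fake
print(utt_probs, utterance_label)
          </preformat>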
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Settings</title>
      <sec id="sec-3-1">
        <title>3.1. Data Preparation</title>
        <p>Various datasets are used for training. The sampling rate of all data is 16kHz. The details of the datasets provided by ADD 2023 Track 2 are presented in Table 1. They include a train set used for model fitting, a development set used for early stopping during training, and a test set whose labels are unknown and which is used to evaluate the FAD system. The distributions of genuine and fake utterances in both the train set and the development set are balanced. However, the percentages of each fake type vary, with Fake101 being the majority and Fake0 being the minority. To enhance the generalization capability of our system, new training data is constructed as outlined in Table 2. The RS set contains the individual genuine sentences obtained by splitting continuous real segments from each genuine or partially fake sentence in the train set, while the FS set contains fake sentences generated by splitting continuous fake clips from each fake sentence in the train set. Three traditional vocoders, namely GL (Griffin-Lim) [27] (https://librosa.org/doc/main/generated/librosa.griffinlim.html), STRAIGHT [28] (https://github.com/HidekiKawahara/legacy_STRAIGHT) and WORLD [29] (http://www.isc.meiji.ac.jp/~mmorise/world/english/download.html), are employed to synthesize fake audio from the real segments of RS. Additionally, utterances in MidAug are created by randomly inserting newly constructed fake clips into the audio of RS. In MidAug, the duration range of fake clips is [0.2s, 3s], and any utterances shorter than 0.2s are discarded.</p>
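        <p>As an illustration of the vocoder-based construction, the Griffin-Lim variant can be approximated with librosa (a sketch, not the authors' exact pipeline; the file path and STFT parameters are assumptions):</p>
        <preformat>
import numpy as np
import librosa

# Load a genuine segment from RS (path is hypothetical) and re-synthesize it
# from its magnitude spectrogram only, discarding the original phase.
wav, sr = librosa.load("rs_segment.wav", sr=16000)
mag = np.abs(librosa.stft(wav, n_fft=512, hop_length=160))
fake_clip = librosa.griffinlim(mag, n_iter=32, hop_length=160)

# fake_clip now carries Griffin-Lim phase artifacts and is treated as a
# newly constructed fake clip, e.g. for insertion when building MidAug.
        </preformat>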
        <p>During training, online data augmentation is employed. The MUSAN dataset [30] is utilized to add background noise and music, while the RIR database [31] is used to simulate reverberation. Dynamic padding is applied. Additionally, the duration of audio is fixed to 5s during the training of AASIST, whereas the full length is used when testing. For Wav2Vec2, 4s is mainly employed as the maximum duration both for training and testing.</p>
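        <p>A hedged sketch of this online augmentation (the file paths, SNR range and scipy-based convolution are assumptions, not the authors' exact settings):</p>
        <preformat>
import numpy as np
import librosa
from scipy.signal import fftconvolve

def augment(wav, noise_path, rir_path, snr_db_range=(5, 20)):
    """Add a MUSAN noise/music file at a random SNR, then convolve an RIR."""
    noise, _ = librosa.load(noise_path, sr=16000)
    noise = np.resize(noise, wav.shape)                 # crop or tile to length
    snr_db = np.random.uniform(*snr_db_range)
    wav_pow = np.mean(wav ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(wav_pow / (noise_pow * 10 ** (snr_db / 10)))
    noisy = wav + scale * noise

    rir, _ = librosa.load(rir_path, sr=16000)
    reverbed = fftconvolve(noisy, rir, mode="full")[: len(wav)]
    return reverbed / (np.max(np.abs(reverbed)) + 1e-12)
        </preformat>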
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Training</title>
        <p>The system is mainly built on top of [32], and each model is trained on an Nvidia 3090 GPU. Cross entropy is adopted as the loss function and Adam [33] as the optimizer. The training batch size is 16. Baseline subsystems are trained on the train set from ADD 2023 Track 2 for a maximum of 50 epochs. The initial learning rate is 1e-3, and it decreases by 5% after every epoch. To quickly converge to the new data, we finetune the baseline models with the lowest loss on the development set for another 20 epochs. The finetuning learning rate starts from 1e-4.</p>
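        <p>The optimization schedule can be reproduced roughly as follows (a PyTorch sketch; the model and batch are placeholders, not the authors' code):</p>
        <preformat>
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # placeholder for an AASIST / Wav2Vec2 head
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Learning rate decreases by 5% after every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(50):                        # max 50 epochs, batch size 16
    x = torch.randn(16, 10)                    # placeholder batch
    y = torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
# Finetuning on new data restarts from lr = 1e-4 for another 20 epochs.
        </preformat>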
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Metrics</title>
        <p>The sentence accuracy A and the segment F1 score are simultaneously adopted as evaluation metrics for ADD 2023 Track 2. Taking fake as positive and genuine as negative, TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative samples.</p>
        <p>At the utterance level, the TP, TN, FP and FN samples denote utterances, and A is defined as
A = (TP + TN) / (TP + TN + FP + FN).</p>
        <p>The metrics at the segment level aim to measure the ability of models to correctly identify fake clips within fake audio [21], including the segment precision P, the segment recall R, and the segment F1 score. They are defined as follows:
P = TP / (TP + FP),
R = TP / (TP + FN),
F1 = 2 P R / (P + R),
where the TP, FP and FN samples denote segments.</p>
        <p>The final Score of ADD 2023 Track 2 is defined as
Score = 0.3 × A + 0.7 × F1.</p>
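        <p>The metrics can be computed directly from the label vectors, as in the sketch below (numpy, hypothetical labels; fake is the positive class and is encoded as 0 in the paper's convention):</p>
        <preformat>
import numpy as np

def segment_scores(y_true, y_pred):
    """Segment-level precision, recall and F1 with fake (label 0) as positive."""
    tp = np.sum((y_pred == 0) * (y_true == 0))
    fp = np.sum((y_pred == 0) * (y_true == 1))
    fn = np.sum((y_pred == 1) * (y_true == 0))
    p = tp / max(tp + fp, 1)
    r = tp / max(tp + fn, 1)
    f1 = 2 * p * r / max(p + r, 1e-12)
    return p, r, f1

def final_score(sentence_acc, segment_f1):
    return 0.3 * sentence_acc + 0.7 * segment_f1

y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])
print(segment_scores(y_true, y_pred), final_score(0.74, 0.52))
        </preformat>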
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>The AASIST and Wav2Vec2 subsystems are trained with additional constructed data, but the number of Wav2Vec2 experiments is limited due to the unacceptable training and evaluation time. FW1 is the result of average fusion of Wav2Vec2 at the frame level. B3 and FS1-FS9 are the results of the fused systems shown in Figure 1 (b) and (c), where B3 is a baseline; the AASIST subsystems are chosen based on the sentence accuracy A and the final Score, and Wav2Vec2 based on the segment F1. The weighted fusion factors of each subsystem are provided.</p>
      <p>Baselines. Comparing the results obtained from B1 and B2, it can be observed that AASIST performs better in terms of the sentence accuracy A and the final Score, while Wav2Vec2 achieves the higher score at the segment level. The reasons are mainly that AASIST at the utterance level tends to use global information, and that the process of transforming utterance outputs of AASIST to frame outputs makes fake segments the majority, leading to misidentification of genuine segments. Conversely, as the genuine segments are the majority in the train set, Wav2Vec2 at the frame level has a bias towards the genuine class. To address the biases in AASIST and Wav2Vec2, B3 utilizes multi-grained fusion. Although most metrics in B1 and part of those in B2 decrease, relative improvements of 402.2% over B2 and 37.6% over B1 are obtained, revealing that the deviations of AASIST towards the fake class and of Wav2Vec2 towards the genuine class are lessened to some extent.</p>
      <p>AASIST. When only one kind of constructed data is added to the train set, A3 with STRAIGHT exhibits a notable improvement, a relative 47.7% increase compared to B1. The highest sentence accuracy of 94.13% is achieved by A10, indicating that the generalization can be improved by using all of the training data. Finally, through the fusion of top-ranked AASIST subsystems, the sentence accuracy rises to 73.36% in FA3, a segment-level score of 21.87% is reached in FA6, the segment F1 rises to 35.09% in FA5, and the final Score reaches 46.40% in FA3.</p>
      <p>Wav2Vec2. When all available data is utilized in W2, there is an improvement in all metrics compared to B2, with a growth of 28.2% in A, 9.6% in P, 283.0% in R, 215.4% in F1, and 94.5% in the final Score. The highest sentence accuracy of 81.43% is achieved by combining W1 and W2 in FW1.</p>
      <p>Multi-grained Backend Fusion. As discussed in Baselines, though the performance of AASIST and Wav2Vec2 is improved by adding more constructed data or by fusing subsystems separately, there remain biases of AASIST towards the fake class and of Wav2Vec2 towards the genuine class. The selection of top-performing subsystems aims to mitigate these biases through multi-grained backend fusion. However, it can be observed that the adoption of FW1, such as in FS7 with a Score of 50.18%, performs inferior to the fused systems with a single Wav2Vec2. This could be attributed to the decrease in the segment recall, as the confidence of real segments generated by the fused Wav2Vec2 increases. Additionally, as shown in Table 3, the best segment F1 of 52.35% is achieved by combining A10 and W2, both single subsystems trained with all data, indicating the importance of model generalization. Conversely, the best sentence accuracy among FS1-FS9 is acquired in FS5, with a relatively high segment recall of 63.70%, suggesting the significance of recognizing fake audio when evaluating. The best Score of a single fused system, 58.65%, is obtained in FS2, with a balanced A and F1. Finally, the submitted results for ADD 2023 Track 2 utilize the results of FS5 at the utterance level and FS3 at the segment level, achieving a Score of 59.12%.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, a system based on multi-grained backend fusion is proposed to locate the manipulation region of partially fake audio. The performance is improved by mitigating the biases brought by AASIST at the utterance level and Wav2Vec2 at the frame level. Our method achieves a sentence accuracy A of 74.52%, a segment F1 of 52.53%, and a final Score of 59.12%. Compared to the best baseline system B1 with a Score of 42.02%, the proposed system achieves a relative improvement of 40.7%.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          , H. Delgado,
          <string-name>
            <surname>ASVspoof</surname>
          </string-name>
          <year>2021</year>
          : accelerating
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2022</year>
          :
          <article-title>the first audio</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>in: Proc. 2021 Edition of the Automatic Speaker Ver- 2022</source>
          , pp.
          <fpage>9216</fpage>
          -
          <lpage>9220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          ification and Spoofing Countermeasures Challenge, [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <year>2021</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          .
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [9]
          <string-name>
            <surname>J. weon Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          , H. jin Shim, H.-S. Heo,
          <string-name>
            <given-names>B.-J. H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2023</year>
          :
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>SASV 2022: The First Spoofing-Aware Speaker</surname>
          </string-name>
          Veri- cepted
          <source>by IJCAI 2023 Workshop on Deepfake Audio</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          ifcation Challenge,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          ,
          <article-title>Detection and Analysis (DADA</article-title>
          <year>2023</year>
          )
          <article-title>(</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          pp.
          <fpage>2893</fpage>
          -
          <lpage>2897</lpage>
          . [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          , [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <source>Deep wav2vec 2</source>
          .
          <article-title>0: A framework for self-supervised</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>residual learning for image recognition (2015). learning of speech representations (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>arXiv:1512.03385</source>
          . arXiv:
          <year>2006</year>
          .
          <volume>11477</volume>
          . [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          , [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <year>Asvspoof 2019</year>
          :
          <article-title>Spoofing A</article-title>
          .
          <string-name>
            <surname>Baevski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Auli</surname>
          </string-name>
          , Xls-r: Self-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>converted and replayed speech, IEEE Transactions learning at scale (</article-title>
          <year>2021</year>
          ). arXiv:
          <volume>2111</volume>
          .
          <fpage>09296</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>on Biometrics, Behavior, and Identity Science</source>
          <volume>3</volume>
          [24]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          , P. Hu, Fake audio detec-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (
          <year>2021</year>
          )
          <fpage>252</fpage>
          -
          <lpage>265</lpage>
          .
          <article-title>tion based on unsupervised pretraining models</article-title>
          , in: [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <surname>H. Del- ICASSP</surname>
          </string-name>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>9231</fpage>
          -
          <lpage>9235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>gado</surname>
            , T. Kinnunen,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Todisco</surname>
            , J. Yamagishi, [25]
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-C. Kuo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-H. Hung</surname>
          </string-name>
          , H.-Y. Lee,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <year>Asvspoof 2021</year>
          :
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsao</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-M. Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Meng</surname>
          </string-name>
          , Partially fake audio
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>the wild (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2210</volume>
          .02437. ery, in: ICASSP, IEEE,
          <year>2022</year>
          , pp.
          <fpage>9236</fpage>
          -
          <lpage>9240</lpage>
          . [13]
          <string-name>
            <surname>J. weon Jung</surname>
            ,
            <given-names>H.</given-names>
            -S. Heo, H.
          </string-name>
          <string-name>
            <surname>Tak</surname>
            , H. jin Shim, [26]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , Waveform boundary detec-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>Aasist: Audio anti-spoofing using integrated 2023</article-title>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>spectro-temporal graph attention networks (</article-title>
          <year>2021</year>
          ). [27]
          <string-name>
            <given-names>N.</given-names>
            <surname>Perraudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Balazs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Søndergaard</surname>
          </string-name>
          , A fast
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>arXiv:2110</source>
          .01200.
          <article-title>grifin-lim algorithm</article-title>
          , in: 2013 IEEE Workshop [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>The DKU- on Applications of Signal</source>
          Processing to Audio and
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>OPPO System for the 2022 Spoofing-Aware Speaker Acoustics</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Verification</given-names>
            <surname>Challenge</surname>
          </string-name>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          , [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kawahara</surname>
          </string-name>
          , Straight, exploitation of the other
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>4396</fpage>
          -
          <lpage>4400</lpage>
          . aspect of vocoder: Perceptually isomorphic decom[15]
          <string-name>
            <surname>J.-H. Choi</surname>
            ,
            <given-names>J.-Y.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.-R.</given-names>
          </string-name>
          <string-name>
            <surname>Jeoung</surname>
            ,
            <given-names>J.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , position
          <article-title>of speech sounds, Acoustical science</article-title>
          and
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>HYU Submission for the SASV Challenge</source>
          <year>2022</year>
          : Re- technology
          <volume>27</volume>
          (
          <year>2006</year>
          )
          <fpage>349</fpage>
          -
          <lpage>353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>forming Speaker Embeddings with</surname>
            Spoofing-Aware [29]
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Morise</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Yokomori</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Ozawa</surname>
          </string-name>
          , World: a
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Conditioning</surname>
          </string-name>
          , in
          <source>: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp. vocoder
          <article-title>-based high-quality speech synthesis sys-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          2873-
          <fpage>2877</fpage>
          .
          <article-title>tem for real-time applications</article-title>
          ,
          <source>IEICE TRANSAC</source>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Norm-constrained
          <source>Score- TIONS on Information and Systems</source>
          <volume>99</volume>
          (
          <year>2016</year>
          )
          <fpage>1877</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>level Ensemble for Spoofing Aware Speaker Verifi-</article-title>
          <year>1884</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          cation, in
          <source>: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>4371</fpage>
          -
          <lpage>[</lpage>
          30]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , Musan:
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          4375.
          <article-title>A music, speech, and noise corpus (</article-title>
          <year>2015</year>
          ). [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          , Backend arXiv:
          <volume>1510</volume>
          .
          <fpage>08484</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <article-title>Ensemble for Speaker Verification</article-title>
          and Spoofing [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peddinti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Seltzer</surname>
          </string-name>
          , S. Khudan-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Countermeasure</surname>
          </string-name>
          , in
          <source>: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          ,
          <article-title>pur, A study on data augmentation of reverberant</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          pp.
          <fpage>4381</fpage>
          -
          <lpage>4385</lpage>
          .
          <article-title>speech for robust speech recognition</article-title>
          , in: ICASSP, [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>5220</fpage>
          -
          <lpage>5224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>Half-truth: A partially fake audio detection dataset</article-title>
          , [32]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          , H.-S. Heo,
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <source>arXiv preprint arXiv:2104.03617</source>
          (
          <year>2021</year>
          ). S. Choe,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , I. Han, In De[19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          , J. Yi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <article-title>Wang, fence of Metric Learning for Speaker Recognition,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Fad: A chinese dataset for in:</article-title>
          <source>Proc. Interspeech</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>2977</fpage>
          -
          <lpage>2981</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>fake audio detection (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2207</volume>
          .
          <fpage>12308</fpage>
          . [33]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          , Adam: A method for stochastic [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nie</surname>
          </string-name>
          , H. Ma,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , T. Wang, optimization (
          <year>2014</year>
          ). arXiv:
          <volume>1412</volume>
          .
          <fpage>6980</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>