<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADD 2023: the Second Audio Deepfake Detection Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiangyan Yi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianhua Tao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruibo Fu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinrui Yan</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chenglong Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chu Yuan Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaohui Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Zhao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Ren</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Xu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junzuo Zhou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Gu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhengqi Wen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shan Liang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zheng Lian</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuai Nie</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haizhou Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Automation, Tsinghua University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical and Computer Engineering, National University of Singapore</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The Chinese University of Hong Kong</institution>
          ,
          <country country="HK">Hong Kong</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>125</fpage>
      <lpage>130</lpage>
      <abstract>
        <p>Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings on audio deepfake detection tasks are also reported.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio deepfake</kwd>
        <kwd>fake detection</kwd>
        <kwd>audio fake game</kwd>
        <kwd>manipulation region location</kwd>
        <kwd>deepfake algorithm recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last decades, the development of artificial intelligence has brought forth great improvements in speech synthesis [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] and voice conversion [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] technologies. The resulting models are able to generate realistic and human-like speech. The technology nevertheless poses a serious threat to society if someone misuses it [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Therefore, audio deepfake detection is an emerging topic of interest, and an increasing number of efforts have been made recently to detect deepfake audio [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref8 ref9">8, 9, 10, 11, 12, 13, 14</xref>
        ].
      </p>
      <p>
        A series of challenges, including the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2021) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the First Audio Deepfake Detection Challenge (ADD 2022) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], have played a critical role in fostering research in this area. ASVspoof 2021 introduced a new task on audio deepfake (DF) detection, accelerating progress in deepfake audio detection. To address more challenges in the real world, ADD 2022 (http://addchallenge.cn/add2022) included three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). However, some limitations still existed in ADD 2022: the techniques used in the challenge focused mostly on binary classification between real and fake audio, and there were limited rounds of evaluation for the FG track.
      </p>
      <p>Moreover, there is also an interest in surpassing the constraints of binary real/fake classification, and in actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Therefore, we launched the second Audio Deepfake Detection Challenge (ADD 2023, http://addchallenge.cn/add2023) to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analysing deepfake utterances. In the following sections, we describe the datasets and evaluation metrics designed for the different subchallenges. Finally, we briefly report on the results submitted by the ADD 2023 participants to further explore the current state and future directions of real-world audio deepfake detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Subchallenges</title>
      <p>
        The ADD 2023 challenge includes three subchallenges: audio fake game (FG) [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], manipulation region location (RL) and deepfake algorithm recognition (AR). The RL and AR subchallenges are new to ADD.
      </p>
      <p>Track 1. Audio fake game (FG): different from ADD 2022, ADD 2023 has two rounds of evaluation for the generation task and two rounds of evaluation for the detection task.</p>
      <p>Track 1.1 Generation task (FG-G): aims to generate fake audio that can fool the fake detection models of Track 1.2.</p>
      <p>Track 1.2 Detection task (FG-D): attempts to detect fake utterances, especially the fake samples generated in Track 1.1.</p>
      <p>
        Track 2. Manipulation region location (RL): focuses on locating the manipulated regions in a partially fake audio, in which the original utterances are manipulated with real or generated audio [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Track 3. Deepfake algorithm recognition (AR): aims to recognize the algorithms behind the deepfake utterances; the evaluation dataset includes samples from an unknown deepfake algorithm [
        <xref ref-type="bibr" rid="ref17">17, 18</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Training and dev sets</title>
      <p>The training and dev. sets of ADD 2023 contain four
subsets, as summarized in Table 1.</p>
      <p>Track 1.1: We use the AISHELL-3 [19] corpus, a large-scale Chinese speech corpus containing over 88,000 utterances and comprising 85 hours of speech.</p>
      <p>Track 1.2: We use the same training and dev. sets as Track 3.2 of ADD 2022, including real and fake utterances based on AISHELL-3.</p>
      <p>Track 2: The dataset consists of real utterances and partially fake utterances. The fake utterances are generated by manipulating the original genuine utterances with real or synthesized audio.</p>
      <p>Track 3: The training and dev. sets include 7 classes (1 real and 6 counterfeit), as shown in Table 2. The 7 categories are labeled 0 to 6. The fake audio is taken from speech synthesized by different speech generation algorithms and tools.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Test sets</title>
        <p>The test sets of ADD 2023 are more challenging than those of the previous challenge. The numbers of utterances in the four test subsets are shown in Table 3; for example, the first-round Track 1.2 test set contains 80,000 real and 31,976 fake utterances.</p>
        <p>Track 1.1: It consists of test sets for two rounds; in each round, two speakers, one male and one female, are randomly selected from the AISHELL-3 dataset. There are 499 text entries in the test set file, and the text content of each line corresponds to an audio file to be generated for each target speaker ID.</p>
        <p>Track 1.2: The real audio of the test sets for the two rounds is drawn from sources including AISHELL-1 [20], THCHS-30 [21], etc. The fake audio consists of audio generated using TTS and voice conversion techniques, as well as a portion of the audio generated from the two rounds of Track 1.1 submissions.</p>
        <p>Track 2: The test set includes unseen partially fake and real utterances. Additional noise addition and format conversion were applied on top of these utterances.</p>
        <p>Track 3: The test set includes 8 classes: the 7 classes included in the training and dev. sets plus an unknown counterfeit class, as shown in Table 2. The unknown-category data was synthesized by an unknown speech generation tool.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation metrics</title>
      <p>
        Track 1.1 aims to generate fake audio that can fool the detection models; therefore, the deception success rate (DSR) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is chosen as the metric. The goal of Track 1.2 is audio deepfake detection, so the weighted equal error rate (WEER) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is used as the metric. To better evaluate the manipulation region location performance of Track 2, the final score is the weighted sum of sentence accuracy and segment F1-score [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For Track 3, participants should recognize both the known and the unknown algorithms of the deepfake utterances; therefore, we utilize the macro-average F1-score [22] used in open set recognition.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Track 1.1 FG-G</title>
        <p>DSR reflects the degree to which the audio deepfake detection models are deceived by the generated utterances, and is defined as follows:</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>DSR = \frac{W}{N \times M}</tex-math>
        </disp-formula>
        <p>
          where W is the count of samples wrongly detected by all the detection models, each operating at its own equal error rate (EER) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] point, N is the total number of evaluation samples, and M is the number of detection models. For the first round, the DSR against the Track 1.2 submissions forms the totality of the generation performance metric, whereas in the second round, weighted consideration is also given to the DSR against the baseline model we release, effectively:
        </p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>WDSR = \alpha \cdot DSR_{R1} + \beta \cdot WDSR_{R2}</tex-math>
        </disp-formula>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>WDSR_{R2} = \gamma \cdot DSR_{R2} + \delta \cdot DSR_{R2,baseline}</tex-math>
        </disp-formula>
        <p>where α = 0.4, β = 0.6, γ = 0.4 and δ = 0.6 are the weights given to each DSR. DSR<sub>R1</sub> and WDSR<sub>R2</sub> represent the generation performance metrics for the first and second rounds, respectively. DSR<sub>R2</sub> and DSR<sub>R2,baseline</sub> refer to the DSR achieved by using the synthesized speech submitted by the participants to attack the models submitted in Track 1.2 and the detection baseline model (https://github.com/asvspoof-challenge/2021/tree/main/DF/Baseline-RawNet2) provided by the organizers, respectively.</p>
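        <p>To make the metric concrete, the following Python sketch (illustrative only, not the official scoring code; all variable names are assumptions) computes the DSR of a submission from the decisions of a set of detection models, each operating at its own EER threshold, and combines the two rounds as in Eqs. (2)-(3):</p>
        <preformat>
# Illustrative sketch of Eqs. (1)-(3). `decisions[m][n]` is True when
# detection model m (at its own EER threshold) was fooled by sample n.
def dsr(decisions):
    n_models = len(decisions)                          # M
    n_samples = len(decisions[0])                      # N
    wrong = sum(sum(model) for model in decisions)     # W: fooled decisions
    return wrong / (n_samples * n_models)              # Eq. (1)

def weighted_dsr(dsr_r1, dsr_r2, dsr_r2_baseline,
                 alpha=0.4, beta=0.6, gamma=0.4, delta=0.6):
    wdsr_r2 = gamma * dsr_r2 + delta * dsr_r2_baseline  # Eq. (3)
    return alpha * dsr_r1 + beta * wdsr_r2              # Eq. (2)
        </preformat>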
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Track 1.2 FG-D</title>
        <p>The WEER is defined as:</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>WEER = \alpha \cdot EER_{R1} + \beta \cdot EER_{R2}</tex-math>
        </disp-formula>
        <p>where α = 0.4 and β = 0.6 are the weights of the EER<sub>R1</sub> obtained in the first round and the EER<sub>R2</sub> obtained in the second round, respectively. The EER is defined and calculated in the same way as in ADD 2022.</p>
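        <p>The following minimal numpy sketch (assuming higher detector scores mean "more likely genuine") illustrates how an EER can be estimated from bona fide and spoof scores and how the two rounds are combined into the WEER; the official protocol computes the EER exactly as in ADD 2022:</p>
        <preformat>
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate sketch; inputs are 1-D numpy arrays of scores."""
    thresholds = np.sort(np.unique(np.concatenate([bonafide_scores,
                                                   spoof_scores])))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores &lt; t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))   # point where FAR and FRR meet
    return (far[i] + frr[i]) / 2.0

def weighted_eer(eer_round1, eer_round2, alpha=0.4, beta=0.6):
    # Eq. (4): WEER combines the two evaluation rounds.
    return alpha * eer_round1 + beta * eer_round2
        </preformat>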
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Track 2 RL</title>
        <p>For Track 2, sentence accuracy measures the ability of the model to correctly distinguish between genuine and fake audio, and is defined as follows:</p>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math>A_{sentence} = \frac{TP + TN}{TP + TN + FP + FN}</tex-math>
        </disp-formula>
        <p>where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative samples. Additionally, we use segment precision P<sub>segment</sub>, segment recall R<sub>segment</sub> and segment F1-score F1<sub>segment</sub> to measure the ability of the model to correctly identify the fake regions within fake audio, defined respectively as:</p>
        <disp-formula id="eq6">
          <label>(6)</label>
          <tex-math>P_{segment} = \frac{TP}{TP + FP}</tex-math>
        </disp-formula>
        <disp-formula id="eq7">
          <label>(7)</label>
          <tex-math>R_{segment} = \frac{TP}{TP + FN}</tex-math>
        </disp-formula>
        <disp-formula id="eq8">
          <label>(8)</label>
          <tex-math>F1_{segment} = \frac{2 \times P_{segment} \times R_{segment}}{P_{segment} + R_{segment}}</tex-math>
        </disp-formula>
        <p>The final score is the weighted sum of sentence accuracy and segment F1-score, as shown below:</p>
        <disp-formula id="eq9">
          <label>(9)</label>
          <tex-math>Score = \alpha \cdot A_{sentence} + \beta \cdot F1_{segment}</tex-math>
        </disp-formula>
        <p>where α = 0.3 and β = 0.7 represent the weights of A<sub>sentence</sub> and F1<sub>segment</sub>.</p>
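        <p>As an illustrative Python sketch only (the official evaluation tool may segment and align the audio differently), the Track 2 score can be computed from utterance-level decisions and frame-level fake/genuine labels as follows:</p>
        <preformat>
# Hypothetical sketch of Eqs. (5)-(9); argument names are assumptions.
def track2_score(utt_labels, utt_preds, seg_labels, seg_preds,
                 alpha=0.3, beta=0.7):
    # Sentence accuracy: utterance-level real/fake decisions (Eq. 5).
    correct = sum(1 for y, p in zip(utt_labels, utt_preds) if y == p)
    a_sentence = correct / len(utt_labels)

    # Segment F1: frame-level fake (1) / genuine (0) labels (Eqs. 6-8).
    tp = sum(1 for y, p in zip(seg_labels, seg_preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(seg_labels, seg_preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(seg_labels, seg_preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_segment = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)

    return alpha * a_sentence + beta * f1_segment   # Eq. (9)
        </preformat>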
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Track 3 AR</title>
        <p>For the algorithm recognition task in Track 3, we use the macro-average F1-score, defined as:</p>
        <disp-formula id="eq10">
          <label>(10)</label>
          <tex-math>P_{macro} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}</tex-math>
        </disp-formula>
        <disp-formula id="eq11">
          <label>(11)</label>
          <tex-math>R_{macro} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}</tex-math>
        </disp-formula>
        <disp-formula id="eq12">
          <label>(12)</label>
          <tex-math>F1_{macro} = \frac{2 \times P_{macro} \times R_{macro}}{P_{macro} + R_{macro}}</tex-math>
        </disp-formula>
        <p>where N denotes the number of known classes, and TP<sub>i</sub>, FP<sub>i</sub> and FN<sub>i</sub> denote the numbers of true positive, false positive and false negative samples of class i [22]. Note that while the formulae iterate only over the known classes, FP<sub>i</sub> and FN<sub>i</sub> take unknown-class samples into consideration.</p>
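        <p>A minimal Python sketch of this open-set macro-average F1 is shown below; it assumes integer class labels in which classes 0 to 6 are known and any other label denotes the unknown class, and it is illustrative rather than the official scoring code:</p>
        <preformat>
# Hypothetical sketch of Eqs. (10)-(12): the average runs over the known
# classes only, but samples labeled or predicted as "unknown" still count
# towards the FP_i / FN_i of the known classes.
def macro_f1_open_set(labels, preds, known_classes):
    precisions, recalls = [], []
    for c in known_classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p_macro = sum(precisions) / len(known_classes)
    r_macro = sum(recalls) / len(known_classes)
    if p_macro + r_macro == 0:
        return 0.0
    return 2 * p_macro * r_macro / (p_macro + r_macro)

# Example: classes 0-6 are known, any other label is the unknown class.
# score = macro_f1_open_set(labels, preds, known_classes=range(7))
        </preformat>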
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Challenge results</title>
      <p>ADD 2023 received challenge data requests from 145 teams from 12 countries. Participants submitted task results and received scores through the CodaLab website. In this section, we report on the detection baselines provided by the organizers and on the results and analysis of the participants' submissions.</p>
      <sec id="sec-4-1">
        <title>4.1. Detection baselines</title>
        <p>ADD 2023 provides six baseline systems, summarized in Table 4. For the detection task of Track 1.2, we present three different detection systems. The first system is a GMM-based system that operates on linear frequency cepstral coefficients (LFCCs) [23] (baseline S01). The LFCC feature extraction is the same as that of ASVspoof 2021, where the window length and shift are set to 30 ms and 15 ms. The second system, LFCC-LCNN (baseline S02), operates on LFCC features with a light convolutional neural network (LCNN) [24]; here the frame length and shift are set to 20 ms and 10 ms, and the back end is based on the LCNN reported in [24]. The third system operates on wav2vec2 features with an LCNN (baseline S03). The wav2vec2 [25] pretrained model variant “wav2vec XLSR”, trained on 56k hours of audio in 53 languages using additional linear transformations and a larger context network, is used as the feature extractor.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Description of detection baseline systems</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>ID</th>
                <th>Model</th>
                <th>Features</th>
                <th>Task</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>S01</td><td>GMM</td><td>LFCC</td><td>Track 1.2</td></tr>
              <tr><td>S02</td><td>LCNN</td><td>LFCC</td><td>Track 1.2</td></tr>
              <tr><td>S03</td><td>LCNN</td><td>Wav2vec2</td><td>Track 1.2</td></tr>
              <tr><td>S04</td><td>LCNN</td><td>LFCC</td><td>Track 2</td></tr>
              <tr><td>S05</td><td>ResNet (Softmax with threshold)</td><td>LFCC</td><td>Track 3</td></tr>
              <tr><td>S06</td><td>ResNet (OpenMax)</td><td>LFCC</td><td>Track 3</td></tr>
            </tbody>
          </table>
        </table-wrap>
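        <p>For illustration, a minimal numpy/scipy sketch of LFCC extraction (linearly spaced triangular filterbank, log compression and DCT) with the 20 ms frame length and 10 ms shift used by S02 is given below; the reference ASVspoof 2021 implementation may differ in windowing, filter count and normalization details:</p>
        <preformat>
import numpy as np
from scipy.fftpack import dct

def lfcc(signal, sr=16000, frame_ms=20, shift_ms=10, n_filters=20, n_ceps=20):
    """Minimal LFCC sketch; assumes `signal` holds at least one full frame."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    n_fft = 512
    window = np.hamming(frame_len)

    # Frame the signal and compute the power spectrum.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Linearly spaced triangular filterbank (this is what makes it an "L"FCC).
    edges = np.linspace(0, sr / 2, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    feat = np.log(power @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]
        </preformat>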
        <p>For the detection task of Track 2, the front-end LFCC feature extraction settings of baseline system S04 are the same as those of S02. For the back-end model architecture, we remove all pooling layers from the conventional LCNN to ensure that the output size aligns with the segment labels. For the recognition task of Track 3, we introduce two different recognition systems (baselines S05 and S06). Both baselines are LFCC-ResNet based systems, with LFCCs extracted in the same way as for baseline system S01. The model structure is a ResNet [26]; S05 rejects unknown samples with a Softmax output and a confidence threshold, while S06 uses OpenMax [27].</p>
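        <p>The difference between the two Track 3 baselines lies in how unknown algorithms are rejected. A minimal sketch of the Softmax-with-threshold rule used by an S05-style system is shown below (the threshold value and the unknown label are illustrative assumptions); OpenMax, used by S06, instead recalibrates the activations with per-class extreme-value models before thresholding [27]:</p>
        <preformat>
import numpy as np

def classify_open_set(logits, threshold=0.5, unknown_label=7):
    """Softmax-with-threshold decision (illustrative threshold): predict the
    most likely known algorithm, or the unknown class if the winning
    probability falls below the threshold."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else unknown_label
        </preformat>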
        </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and analysis</title>
        <p>The four tracks of ADD 2023 all received sufficient submissions, and the summary data of the rankings are shown in Tables 5, 6, 7 and 8. The ID number of each participating team is determined by its ranking order.</p>
        <p>Tracks 2 and 3 are the first subchallenges on fake region location and algorithm recognition in the field of deepfake audio detection. For Track 1.1, we received 14 submissions. The average WDSR of all submissions was 27.11%, and the two-round combined performance of the best team was 44.97%. Track 1.2 received a total of 49 submissions, of which 11 achieved a WEER below that of the best baseline S01; the best team had a WEER of 12.45%. The average WEER of all submissions was 49.94%. For Track 2, 11 teams scored higher than the baseline S04, with the highest score being 67.13%. The average score of the 16 submissions was 48.82%. These results show that fake region location remains challenging.</p>
        <p>For Track 3, nine teams performed better than the baseline systems S05 and S06. Although the best team achieved an F1-score of 89.63%, the average F1-score of Track 3 is still low. We hope that the challenge data and evaluation results of Track 3 will further encourage researchers to explore new deepfake audio algorithm recognition methods.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper provides an overview of the ADD 2023 Challenge, which consists of three subchallenges spanning four tracks. In order to better simulate real-world conditions, the challenge introduces two new tasks and more difficult datasets. The results indicate that the fake region location task and the algorithm recognition task are still challenging, especially the fake region location track. The solutions of the participants and further analysis will be presented at the ADD 2023 workshop. In future competitions, we plan to optimize the datasets and competition rules, aiming to promote more advanced research in the deepfake audio community.</p>
      </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61831022, No. U21B2010, No. 62101553, No. 61971419, No. 62006223, No. 62276259, No. 62201572, No. 62206278), the Beijing Municipal Science &amp; Technology Commission, Administrative Commission of Zhongguancun Science Park (No. Z211100004821013), and the Open Research Projects of Zhejiang Lab (No. 2021KH0AB06). Thanks to AISHELL (https://www.aishelltech.com) for providing the open-source datasets for this challenge.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Soong</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>A survey on neural speech synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2106.15561</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vovk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gogoryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sadekova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kudinov</surname>
          </string-name>
          ,
          <article-title>Grad-tts: A diffusion probabilistic model for text-to-speech</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8599</fpage>
          -
          <lpage>8608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <article-title>Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5530</fpage>
          -
          <lpage>5540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sisman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An overview of voice conversion and its challenges: From statistical modeling to deep learning</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>132</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling, The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods, in: Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 2018, pp. 195-202. doi:10.21437/Odyssey.2018-28.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kinnunen, Z.-H. Ling, T. Toda, Voice Conversion Challenge 2020 - Intra-lingual semi-parallel and cross-lingual voice conversion, in: Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. 80-98. doi:10.21437/VCC_BC.2020-14.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Harwell, Remember the 'deepfake cheerleader mom'? Prosecutors now admit they can't prove fake-video claims, March 14, 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, et al., ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge, in: Proc. of INTERSPEECH, 2015.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Kinnunen, M. Sahidullah, H. Delgado, N. Evans, M. Todisco, et al., The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection, in: Proc. of INTERSPEECH, 2017.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, K. Lee, ASVspoof 2019: Future horizons in spoofed and fake audio detection, in: Proc. of INTERSPEECH, 2019.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection, 2021.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, et al., ADD 2022: the first audio deep synthesis detection challenge, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 9216-9220.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, R. Fu, Half-truth: A partially fake audio detection dataset, in: Proc. of INTERSPEECH, 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, C. Wang, Continual learning for fake audio detection, in: Proc. of INTERSPEECH, 2021.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <year>Dfgc 2021</year>
          :
          <article-title>A deepfake game competition</article-title>
          ,
          <source>in: 2021 IEEE International Joint Conference on Biometrics (IJCB)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <year>Dfgc 2022</year>
          :
          <article-title>The second deepfake game competition</article-title>
          ,
          <source>in: 2022 IEEE International Joint Conference on Biometrics (IJCB)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, T. Wang, S. Wang, R. Fu, An initial investigation for detecting vocoder fingerprints of fake audio, in: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, 2022, pp. 61-68.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, Z. Tian, R. Fu, System fingerprints detection for deepfake audio: An initial dataset and investigation, arXiv preprint arXiv:2208.10489 (2022).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Y. Shi, H. Bu, X. Xu, S. Zhang, M. Li, AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines, arXiv preprint arXiv:2010.11567 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Bu, J. Du, X. Na, B. Wu, H. Zheng, AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline, in: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1-5. doi:10.1109/ICSDA.2017.8384449.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] D. Wang, X. Zhang, THCHS-30: A free Chinese speech corpus, arXiv preprint arXiv:1512.01882 (2015).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] C. Geng, S.-J. Huang, S. Chen, Recent advances in open set recognition: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020) 3614-3631.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Sahidullah, T. Kinnunen, C. Hanilçi, A comparison of features for synthetic speech detection, in: Proc. of INTERSPEECH, 2015.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Z. Wu, R. K. Das, J. Yang, H. Li, Light convolutional neural network with feature genuinization for detection of synthetic speech attacks, in: Proc. of INTERSPEECH, 2020.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449-12460.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Bendale, T. E. Boult, Towards open set deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563-1572.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>