<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADD 2023: the Second Audio Deepfake Detection Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiangyan Yi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianhua Tao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruibo Fu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinrui Yan</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chenglong Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tao Wang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chu Yuan Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaohui Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Zhao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Ren</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Xu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junzuo Zhou</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Gu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhengqi Wen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shan Liang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zheng Lian</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuai Nie</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haizhou Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Automation, Tsinghua University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical and Computer Engineering, National University of Singapore</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The Chinese University of Hong Kong</institution>
          ,
          <country country="HK">Hong Kong</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>125</fpage>
      <lpage>130</lpage>
      <abstract>
        <p>Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings on audio deepfake detection tasks are also reported.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio deepfake</kwd>
        <kwd>fake detection</kwd>
        <kwd>audio fake game</kwd>
        <kwd>manipulation region location</kwd>
        <kwd>deepfake algorithm recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last decades, the development of artificial intelligence has brought forth great improvements in speech synthesis [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] and voice conversion [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] technologies. The resulting models are able to generate realistic and human-like speech. The technology nevertheless poses a serious threat to society if someone misuses it [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Therefore, audio deepfake detection is an emerging topic of interest, and an increasing number of efforts have been made recently to detect deepfake audio [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref8 ref9">8, 9, 10, 11, 12, 13, 14</xref>
        ].
      </p>
      <p>
        A series of challenges, including the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2021) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the First Audio Deepfake Detection Challenge (ADD 2022) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], have played a critical role in fostering research in this area. ASVspoof 2021 introduced a new task on audio deepfake (DF) detection, accelerating progress in deepfake audio detection. To address more challenges in the real world, ADD 2022 (http://addchallenge.cn/add2022) included three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). However, some limitations still existed in ADD 2022: the techniques used in the challenge focused mostly on binary classification between real and fake audio, and there were limited rounds of evaluation for the FG track.
      </p>
      <p>Moreover, there is also an interest in surpassing the constraints of binary real/fake classification, and in actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Therefore, we launched the second Audio Deepfake Detection Challenge (ADD 2023, http://addchallenge.cn/add2023) to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analysing deepfake utterances. In the following sections, we describe the datasets and evaluation metrics designed for the different subchallenges. Finally, we briefly report on the results submitted by the ADD 2023 participants to further explore the current state and future directions of real-world audio deepfake detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Subchallenges</title>
      <p>
        The ADD 2023 challenge includes three subchallenges: audio fake game (FG) [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], manipulation region location (RL) and deepfake algorithm recognition (AR). The RL and AR subchallenges are new to ADD.
      </p>
      <p>Track 1. Audio fake game (FG): different from ADD 2022, ADD 2023 has two rounds of evaluation for the generation task and two rounds of evaluation for the detection task.</p>
      <p>Track 1.1 Generation task (FG-G): aims to generate fake audio that can fool the fake detection models of Track 1.2.</p>
      <p>Track 1.2 Detection task (FG-D): attempts to detect fake utterances, especially the fake samples generated in Track 1.1.</p>
      <p>
        Track 2. Manipulation region location (RL): focuses on locating the manipulated regions in a partially fake audio, in which the original utterances are manipulated with real or generated audio [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Track 3. Deepfake algorithm recognition (AR): aims to recognize the algorithms behind the deepfake utterances; the evaluation dataset includes samples from an unknown deepfake algorithm [
        <xref ref-type="bibr" rid="ref17">17, 18</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Training and dev sets</title>
      <p>The training and dev. sets of ADD 2023 contain four
subsets, as summarized in Table 1.</p>
      <p>Track 1.1: We use the AISHELL-3 [19] corpus, a large-scale Chinese speech corpus containing over 88,000 utterances and comprising 85 hours of speech.</p>
      <p>Track 1.2: We use the same training and dev. sets as Track 3.2 of ADD 2022, including real and fake utterances based on AISHELL-3.</p>
      <p>Track 2: The dataset consists of real utterances and partially fake utterances. The fake utterances are generated by manipulating the original genuine utterances with real or synthesized audio.</p>
      <p>Track 3: The training and dev. sets include 7 classes (1 real and 6 counterfeit), as shown in Table 2. The 7 categories are labeled 0 to 6. The fake audio is taken from speech synthesized by different speech generation algorithms and tools.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Test sets</title>
        <p>The test sets of ADD 2023 are more challenging than those of the previous challenge. The numbers of utterances in the four test subsets are shown in Table 3; for example, the first-round Track 1.2 test set contains 80,000 real and 31,976 fake utterances.</p>
        <p>Track 1.1: It consists of test sets for two rounds; in each round, two speakers, one male and one female, are randomly selected from the AISHELL-3 dataset. There are 499 text entries in the test set file, and the text content of each line corresponds to an audio file to be generated for each target speaker ID.</p>
        <p>Track 1.2: The real audio of the test sets for the two rounds is drawn from sources including AISHELL-1 [20], THCHS-30 [21], etc. The fake audio consists of audio generated using TTS and voice conversion techniques, as well as a portion of the audio generated from the two rounds of Track 1.1 submissions.</p>
        <p>Track 2: The test set includes unseen partially fake and real utterances. Additional noise addition and format conversion were applied on top of these utterances.</p>
        <p>Track 3: The test set includes 8 classes: the 7 classes included in the training and dev. sets plus an unknown counterfeit class, as shown in Table 2. The unknown-category data was synthesized by an unknown speech generation tool.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation metrics</title>
      <p>
        Track 1.1 aims to generate fake audio that can fool the detection models; therefore, the deception success rate (DSR) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is chosen as the metric. The goal of Track 1.2 is audio deepfake detection, so the weighted equal error rate (WEER) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is used as the metric. To better evaluate the manipulation region location performance of Track 2, the final score is the weighted sum of sentence accuracy and segment F1-score [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For Track 3, participants should recognize both the known and the unknown algorithms of the deepfake utterances; therefore, we utilize the macro-average F1-score [22] used in open set recognition.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Track 1.1 FG-G</title>
        <p>DSR reflects the degree to which the audio deepfake detection models are deceived by the generated utterances, and is defined as follows:</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>DSR = \frac{W}{N \times M}</tex-math>
        </disp-formula>
        <p>
          where W is the count of samples wrongly detected by all the detection models, each operating at its own equal error rate (EER) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] point, N is the total number of evaluation samples, and M is the number of detection models. For the first round, the DSR against the Track 1.2 submissions forms the totality of the generation performance metric, whereas in the second round, weighted consideration is also given to the DSR against the baseline model we release, effectively:
        </p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>WDSR = \alpha \cdot DSR_{R1} + \beta \cdot WDSR_{R2}</tex-math>
        </disp-formula>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>WDSR_{R2} = \gamma \cdot DSR_{R2} + \delta \cdot DSR_{R2,baseline}</tex-math>
        </disp-formula>
        <p>where α = 0.4, β = 0.6, γ = 0.4 and δ = 0.6 are the weights given to each DSR. DSR<sub>R1</sub> and WDSR<sub>R2</sub> represent the generation performance metrics for the first and second rounds, respectively. DSR<sub>R2</sub> and DSR<sub>R2,baseline</sub> refer to the DSR achieved by using the synthesized speech submitted by the participants to attack the models submitted in Track 1.2 and the detection baseline model (https://github.com/asvspoof-challenge/2021/tree/main/DF/Baseline-RawNet2) provided by the organizers, respectively.</p>
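        <p>To make the metric concrete, the following Python sketch (illustrative only, not the official scoring code; all variable names are assumptions) computes the DSR of a submission from the decisions of a set of detection models, each operating at its own EER threshold, and combines the two rounds as in Eqs. (2)-(3):</p>
        <preformat>
# Illustrative sketch of Eqs. (1)-(3). `decisions[m][n]` is True when
# detection model m (at its own EER threshold) was fooled by sample n.
def dsr(decisions):
    n_models = len(decisions)                          # M
    n_samples = len(decisions[0])                      # N
    wrong = sum(sum(model) for model in decisions)     # W: fooled decisions
    return wrong / (n_samples * n_models)              # Eq. (1)

def weighted_dsr(dsr_r1, dsr_r2, dsr_r2_baseline,
                 alpha=0.4, beta=0.6, gamma=0.4, delta=0.6):
    wdsr_r2 = gamma * dsr_r2 + delta * dsr_r2_baseline  # Eq. (3)
    return alpha * dsr_r1 + beta * wdsr_r2              # Eq. (2)
        </preformat>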
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Track 1.2 FG-D</title>
        <p>The WEER is defined as:</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>WEER = \alpha \cdot EER_{R1} + \beta \cdot EER_{R2}</tex-math>
        </disp-formula>
        <p>where α = 0.4 and β = 0.6 are the weights of the EER<sub>R1</sub> obtained in the first round and the EER<sub>R2</sub> obtained in the second round, respectively. The EER is defined and calculated in the same way as in ADD 2022.</p>
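        <p>The following minimal numpy sketch (assuming higher detector scores mean "more likely genuine") illustrates how an EER can be estimated from bona fide and spoof scores and how the two rounds are combined into the WEER; the official protocol computes the EER exactly as in ADD 2022:</p>
        <preformat>
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate sketch; inputs are 1-D numpy arrays of scores."""
    thresholds = np.sort(np.unique(np.concatenate([bonafide_scores,
                                                   spoof_scores])))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bonafide_scores &lt; t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))   # point where FAR and FRR meet
    return (far[i] + frr[i]) / 2.0

def weighted_eer(eer_round1, eer_round2, alpha=0.4, beta=0.6):
    # Eq. (4): WEER combines the two evaluation rounds.
    return alpha * eer_round1 + beta * eer_round2
        </preformat>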
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Track 2 RL</title>
        <p>For Track 2, sentence accuracy measures the ability of the model to correctly distinguish between genuine and fake audio, and is defined as follows:</p>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math>A_{sentence} = \frac{TP + TN}{TP + TN + FP + FN}</tex-math>
        </disp-formula>
        <p>where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative samples. Additionally, we use segment precision P<sub>segment</sub>, segment recall R<sub>segment</sub> and segment F1-score F1<sub>segment</sub> to measure the ability of the model to correctly identify the fake regions within fake audio, defined respectively as:</p>
        <disp-formula id="eq6">
          <label>(6)</label>
          <tex-math>P_{segment} = \frac{TP}{TP + FP}</tex-math>
        </disp-formula>
        <disp-formula id="eq7">
          <label>(7)</label>
          <tex-math>R_{segment} = \frac{TP}{TP + FN}</tex-math>
        </disp-formula>
        <disp-formula id="eq8">
          <label>(8)</label>
          <tex-math>F1_{segment} = \frac{2 \times P_{segment} \times R_{segment}}{P_{segment} + R_{segment}}</tex-math>
        </disp-formula>
        <p>The final score is the weighted sum of sentence accuracy and segment F1-score, as shown below:</p>
        <disp-formula id="eq9">
          <label>(9)</label>
          <tex-math>Score = \alpha \cdot A_{sentence} + \beta \cdot F1_{segment}</tex-math>
        </disp-formula>
        <p>where α = 0.3 and β = 0.7 represent the weights of A<sub>sentence</sub> and F1<sub>segment</sub>.</p>
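        <p>As an illustrative Python sketch only (the official evaluation tool may segment and align the audio differently), the Track 2 score can be computed from utterance-level decisions and frame-level fake/genuine labels as follows:</p>
        <preformat>
# Hypothetical sketch of Eqs. (5)-(9); argument names are assumptions.
def track2_score(utt_labels, utt_preds, seg_labels, seg_preds,
                 alpha=0.3, beta=0.7):
    # Sentence accuracy: utterance-level real/fake decisions (Eq. 5).
    correct = sum(1 for y, p in zip(utt_labels, utt_preds) if y == p)
    a_sentence = correct / len(utt_labels)

    # Segment F1: frame-level fake (1) / genuine (0) labels (Eqs. 6-8).
    tp = sum(1 for y, p in zip(seg_labels, seg_preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(seg_labels, seg_preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(seg_labels, seg_preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_segment = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)

    return alpha * a_sentence + beta * f1_segment   # Eq. (9)
        </preformat>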
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Track 3 AR</title>
        <p>For the algorithm recognition task in Track 3, we use the macro-average F1-score, defined as:</p>
        <disp-formula id="eq10">
          <label>(10)</label>
          <tex-math>P_{macro} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}</tex-math>
        </disp-formula>
        <disp-formula id="eq11">
          <label>(11)</label>
          <tex-math>R_{macro} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}</tex-math>
        </disp-formula>
        <disp-formula id="eq12">
          <label>(12)</label>
          <tex-math>F1_{macro} = \frac{2 \times P_{macro} \times R_{macro}}{P_{macro} + R_{macro}}</tex-math>
        </disp-formula>
        <p>where N denotes the number of known classes, and TP<sub>i</sub>, FP<sub>i</sub> and FN<sub>i</sub> denote the numbers of true positive, false positive and false negative samples of class i [22]. Note that while the formulae iterate only over the known classes, FP<sub>i</sub> and FN<sub>i</sub> take unknown-class samples into consideration.</p>
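        <p>A minimal Python sketch of this open-set macro-average F1 is shown below; it assumes integer class labels in which classes 0 to 6 are known and any other label denotes the unknown class, and it is illustrative rather than the official scoring code:</p>
        <preformat>
# Hypothetical sketch of Eqs. (10)-(12): the average runs over the known
# classes only, but samples labeled or predicted as "unknown" still count
# towards the FP_i / FN_i of the known classes.
def macro_f1_open_set(labels, preds, known_classes):
    precisions, recalls = [], []
    for c in known_classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p_macro = sum(precisions) / len(known_classes)
    r_macro = sum(recalls) / len(known_classes)
    if p_macro + r_macro == 0:
        return 0.0
    return 2 * p_macro * r_macro / (p_macro + r_macro)

# Example: classes 0-6 are known, any other label is the unknown class.
# score = macro_f1_open_set(labels, preds, known_classes=range(7))
        </preformat>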
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Challenge results</title>
      <p>ADD 2023 received challenge data requests from 145 teams from 12 countries. Participants submitted task results and received scores through the CodaLab website. In this section, we report on the detection baselines provided by the organizers and on the results and analysis of the participants' submissions.</p>
      <sec id="sec-4-1">
        <title>4.1. Detection baselines</title>
        <p>ADD 2023 provides six baseline systems, summarized in Table 4. For the detection task of Track 1.2, we present three different detection systems. The first system is a GMM-based system that operates on linear frequency cepstral coefficients (LFCCs) [23] (baseline S01). The LFCC feature extraction is the same as that of ASVspoof 2021, where the window length and shift are set to 30 ms and 15 ms. The second system, LFCC-LCNN (baseline S02), operates on LFCC features with a light convolutional neural network (LCNN) [24]; here the frame length and shift are set to 20 ms and 10 ms, and the back end is based on the LCNN reported in [24]. The third system operates on wav2vec2 features with an LCNN (baseline S03). The wav2vec2 [25] pretrained model variant “wav2vec XLSR”, trained on 56k hours of audio in 53 languages using additional linear transformations and a larger context network, is used as the feature extractor.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Description of detection baseline systems</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>ID</th>
                <th>Model</th>
                <th>Features</th>
                <th>Task</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>S01</td><td>GMM</td><td>LFCC</td><td>Track 1.2</td></tr>
              <tr><td>S02</td><td>LCNN</td><td>LFCC</td><td>Track 1.2</td></tr>
              <tr><td>S03</td><td>LCNN</td><td>Wav2vec2</td><td>Track 1.2</td></tr>
              <tr><td>S04</td><td>LCNN</td><td>LFCC</td><td>Track 2</td></tr>
              <tr><td>S05</td><td>ResNet (Softmax with threshold)</td><td>LFCC</td><td>Track 3</td></tr>
              <tr><td>S06</td><td>ResNet (OpenMax)</td><td>LFCC</td><td>Track 3</td></tr>
            </tbody>
          </table>
        </table-wrap>
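        <p>For illustration, a minimal numpy/scipy sketch of LFCC extraction (linearly spaced triangular filterbank, log compression and DCT) with the 20 ms frame length and 10 ms shift used by S02 is given below; the reference ASVspoof 2021 implementation may differ in windowing, filter count and normalization details:</p>
        <preformat>
import numpy as np
from scipy.fftpack import dct

def lfcc(signal, sr=16000, frame_ms=20, shift_ms=10, n_filters=20, n_ceps=20):
    """Minimal LFCC sketch; assumes `signal` holds at least one full frame."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    n_fft = 512
    window = np.hamming(frame_len)

    # Frame the signal and compute the power spectrum.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Linearly spaced triangular filterbank (this is what makes it an "L"FCC).
    edges = np.linspace(0, sr / 2, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    feat = np.log(power @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]
        </preformat>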
        <p>For the detection task of Track 2, the front-end LFCC feature extraction settings of baseline system S04 are the same as those of S02. For the back-end model architecture, we remove all pooling layers from the conventional LCNN to ensure that the output size aligns with the segment labels. For the recognition task of Track 3, we introduce two different recognition systems (baselines S05 and S06). Both baselines are LFCC-ResNet based systems, with LFCCs extracted in the same way as for baseline system S01. The model structure is a ResNet [26]; S05 rejects unknown samples with a Softmax output and a confidence threshold, while S06 uses OpenMax [27].</p>
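        <p>The difference between the two Track 3 baselines lies in how unknown algorithms are rejected. A minimal sketch of the Softmax-with-threshold rule used by an S05-style system is shown below (the threshold value and the unknown label are illustrative assumptions); OpenMax, used by S06, instead recalibrates the activations with per-class extreme-value models before thresholding [27]:</p>
        <preformat>
import numpy as np

def classify_open_set(logits, threshold=0.5, unknown_label=7):
    """Softmax-with-threshold decision (illustrative threshold): predict the
    most likely known algorithm, or the unknown class if the winning
    probability falls below the threshold."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else unknown_label
        </preformat>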
        </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and analysis</title>
        <p>The four tracks of ADD 2023 all received sufficient submissions, and the summary data of the rankings are shown in Tables 5, 6, 7 and 8. The ID number of each participating team is determined by its ranking order.</p>
        <p>Tracks 2 and 3 are the first subchallenges on fake region location and algorithm recognition in the field of deepfake audio detection. For Track 1.1, we received 14 submissions. The average WDSR of all submissions was 27.11%, and the two-round combined performance of the best team was 44.97%. Track 1.2 received a total of 49 submissions, of which 11 achieved a WEER below that of the best baseline S01; the best team had a WEER of 12.45%. The average WEER of all submissions was 49.94%. For Track 2, 11 teams scored higher than the baseline S04, with the highest score being 67.13%. The average score of the 16 submissions was 48.82%. These results show that fake region location remains challenging.</p>
        <p>For Track 3, nine teams performed better than the baseline systems S05 and S06. Although the best team achieved an F1-score of 89.63%, the average F1-score of Track 3 is still low. We hope that the challenge data and evaluation results of Track 3 will further encourage researchers to explore new deepfake audio algorithm recognition methods.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper provides an overview of the ADD 2023 Challenge, which consists of three subchallenges spanning four tracks. In order to better simulate real-world conditions, the challenge introduces two new tasks and more difficult datasets. The results indicate that the fake region location task and the algorithm recognition task are still challenging, especially the fake region location track. The solutions of the participants and further analysis will be presented at the ADD 2023 workshop. In future competitions, we plan to optimize the datasets and competition rules, aiming to promote more advanced research in the deepfake audio community.</p>
      </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61831022, No. U21B2010, No. 62101553, No. 61971419, No. 62006223, No. 62276259, No. 62201572, No. 62206278), the Beijing Municipal Science &amp; Technology Commission, Administrative Commission of Zhongguancun Science Park (No. Z211100004821013), and the Open Research Projects of Zhejiang Lab (No. 2021KH0AB06). Thanks to AISHELL (https://www.aishelltech.com) for providing the open-source datasets for this challenge.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Soong</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>A survey on neural speech synthesis</article-title>
          ,
          <source>arXiv preprint arXiv:2106.15561</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vovk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gogoryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sadekova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kudinov</surname>
          </string-name>
          ,
          <article-title>Grad-tts: A diffusion probabilistic model for text-to-speech</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8599</fpage>
          -
          <lpage>8608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <article-title>Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5530</fpage>
          -
          <lpage>5540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sisman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An overview of voice conversion and its challenges: From statistical modeling to deep learning</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>132</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling, The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods, in: Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 2018, pp. 195-202. doi:10.21437/Odyssey.2018-28.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kinnunen, Z.-H. Ling, T. Toda, Voice Conversion Challenge 2020 - Intra-lingual semi-parallel and cross-lingual voice conversion, in: Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. 80-98. doi:10.21437/VCC_BC.2020-14.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Harwell, Remember the 'deepfake cheerleader mom'? Prosecutors now admit they can't prove fake-video claims, March 14, 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, et al., ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge, in: Proc. of INTERSPEECH, 2015.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Kinnunen, M. Sahidullah, H. Delgado, N. Evans, M. Todisco, et al., The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection, in: Proc. of INTERSPEECH, 2017.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, K. Lee, ASVspoof 2019: Future horizons in spoofed and fake audio detection, in: Proc. of INTERSPEECH, 2019.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection, 2021.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, et al., ADD 2022: the first audio deep synthesis detection challenge, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 9216-9220.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, R. Fu, Half-truth: A partially fake audio detection dataset, in: Proc. of INTERSPEECH, 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, C. Wang, Continual learning for fake audio detection, in: Proc. of INTERSPEECH, 2021.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <year>Dfgc 2021</year>
          :
          <article-title>A deepfake game competition</article-title>
          ,
          <source>in: 2021 IEEE International Joint Conference on Biometrics (IJCB)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <year>Dfgc 2022</year>
          :
          <article-title>The second deepfake game competition</article-title>
          ,
          <source>in: 2022 IEEE International Joint Conference on Biometrics (IJCB)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, T. Wang, S. Wang, R. Fu, An initial investigation for detecting vocoder fingerprints of fake audio, in: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, 2022, pp. 61-68.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, Z. Tian, R. Fu, System fingerprints detection for deepfake audio: An initial dataset and investigation, arXiv preprint arXiv:2208.10489 (2022).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Y. Shi, H. Bu, X. Xu, S. Zhang, M. Li, AISHELL-3: A multi-speaker Mandarin TTS corpus and the baselines, arXiv preprint arXiv:2010.11567 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Bu, J. Du, X. Na, B. Wu, H. Zheng, AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline, in: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1-5. doi:10.1109/ICSDA.2017.8384449.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] D. Wang, X. Zhang, THCHS-30: A free Chinese speech corpus, arXiv preprint arXiv:1512.01882 (2015).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] C. Geng, S.-J. Huang, S. Chen, Recent advances in open set recognition: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020) 3614-3631.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Sahidullah, T. Kinnunen, C. Hanilçi, A comparison of features for synthetic speech detection, in: Proc. of INTERSPEECH, 2015.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Z. Wu, R. K. Das, J. Yang, H. Li, Light convolutional neural network with feature genuinization for detection of synthetic speech attacks, in: Proc. of INTERSPEECH, 2020.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449-12460.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Bendale, T. E. Boult, Towards open set deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1563-1572.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>