<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-grained Backend Fusion for Manipulation Region Location of Partially Fake Audio</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jun Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mengjie Luo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoqin Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shushan Qiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yumei Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Microelectronics of the Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nanjing Institute of Intelligent Technology</institution>
          ,
          <addr-line>Nanjing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>43</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>Fake audio detection is an important research area to prevent the misuse of speech synthesis and voice conversion technologies. While progress has been made in detecting partially fake audio at the utterance level, accurately locating the manipulation region at the segment level remains challenging. Aiming to promote the development of manipulation region location of partially fake audio, ADD 2023 is organized and its Track 2 seeks to locate the fake clips. This paper introduces our system submitted to ADD 2023 Track 2, combining AASIST-based and Wav2Vec2-based subsystems through multi-grained backend fusion. With the proposed method, the bias of AASIST towards the fake class and the bias of Wav2Vec2 towards the genuine class are mitigated. Our system achieves a final score of 59.12%, a relative increase of 40.7% compared to the best baseline system in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake Audio Detection</kwd>
        <kwd>Audio Deepfake Detection</kwd>
        <kwd>Partially Fake</kwd>
        <kwd>Wav2Vec2</kwd>
        <kwd>AASIST</kwd>
        <kwd>Backend Fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The advancement of speech synthesis and voice conversion (VC) technologies has significantly enhanced the quality and naturalness of synthesized speech [1, 2, 3, 4]. However, an issue of potential technology abuse, such as telecom fraud, may be brought up. Consequently, there is a growing concern about fake audio detection (FAD), where synthesized audio produced for inappropriate uses is defined as fake audio or spoofing attacks.</p>
      <p>The ASVspoof challenges have gathered attention from researchers who aim to protect automatic speaker verification (ASV) systems from spoofing attacks [5, 6, 7, 8]. ASVspoof 2015, 2017 and 2019 focused on the logical access (LA) task, the physical access (PA) task, or both. The LA task involved detecting spoofing audio generated by statistical or neural text-to-speech (TTS) and VC methods, while the PA task aimed to distinguish replay audio implemented in various simulated acoustic environments. In ASVspoof 2021, a new deepfake track was introduced to detect compressed manipulated audio, aiming to enhance system robustness. Furthermore, the spoofing-aware speaker verification (SASV) challenge in 2022 attempted to jointly optimize FAD and ASV systems instead of utilizing a FAD system as a gate in front of the ASV system [9]. Among these challenges, ResNet-based frameworks [10] have been widely adopted in the ASVspoof challenges [11, 12], and AASIST [13] served as a baseline model and was employed by several top-ranked participants in SASV 2022 who sought to achieve a low equal error rate of FAD [14, 15, 16, 17].</p>
      <p>Previous challenges have primarily focused on detecting fully fake audio at the utterance level, without addressing realistic scenarios involving partially fake audio. Partially fake audio refers to fake audio with small fake clips hidden in genuine speech [18, 19]. To address this gap, the ADD challenges were launched to encourage researchers to explore new frameworks for detecting partially fake audio [20, 21]. In ADD 2022 (Audio Deep Synthesis Detection Challenge), Track 2 targeted detecting partially fake audio at the utterance level. In ADD 2023 (Audio Deepfake Detection Challenge), the goal of Track 2 is localizing manipulated clips within a speech sentence. In ADD 2022 Track 2, the best partially fake audio detection system at the utterance level was based on pretrained self-supervised Wav2Vec2 [22, 23], but it fails to spot fake clips [24]. On the other hand, methods focusing on frame-wise boundary detection of manipulated clips have shown the capability to locate fake clips [25, 26].</p>
      <p>This paper presents our system for the manipulation region location of partially fake audio in ADD 2023 Track 2. The backend fused system combines AASIST for detecting fake audio at the utterance level and Wav2Vec2 at the frame level. The main contribution of this paper is the proposal of multi-grained backend fusion, which aims to mitigate the biases of AASIST towards the fake class and of Wav2Vec2 towards the genuine class. Our submitted system achieves a final score of 59.12%, a relative increase of 40.7% compared to the best baseline system.</p>
      <p>The rest of this paper is as follows. The proposed
method is described in Section 2. Section 3 details the
experiment settings. Experimental results and analysis
are discussed in Section 4. Finally, Section 5 concludes
the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Task Definition</title>
        <p>FAD at the utterance level is a binary classification task to detect whether a sentence is genuine or fake. In contrast, manipulation region location identifies fake segments at a finer granularity. In Track 2 of ADD 2023, the duration of each segment is 10ms. Therefore, given an utterance X with N segments, represented as X = [x_1, x_2, ..., x_N], the output at the utterance level should be y ∈ {0, 1}, while the output at the segment level is a vector y = [y_1, y_2, ..., y_N] ∈ {0, 1}^N, where 0 denotes fake and 1 denotes genuine. Besides, since the duration of a segment is similar to that of a frame commonly used in speech processing, models generating frame-level outputs can be used to detect segments.</p>
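        <p>As a minimal illustration of the two label granularities (a sketch under the paper's 0 = fake / 1 = genuine convention, with hypothetical variable names), the utterance-level label can be derived from the segment-level vector:</p>
        <preformat>
import numpy as np

# Hypothetical segment-level labels for an utterance with N = 8 segments
# (10 ms each); 0 denotes fake, 1 denotes genuine.
y_seg = np.array([1, 1, 1, 0, 0, 1, 1, 1])

# An utterance is genuine (1) only if every segment is genuine,
# otherwise it is fake (0).
y_utt = int(np.all(y_seg == 1))

print(y_seg, y_utt)  # [1 1 1 0 0 1 1 1] 0
        </preformat>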
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Proposed System</title>
        <sec id="sec-2-2-1">
          <title>AASIST-based subsystem at the utterance level</title>
          <p>AASIST is an end-to-end architecture based on graph attention networks, proposed to detect different spoofing attacks [13]. The raw waveform is adopted as input, with a minimum of 64,600 samples, about 4s at a sampling rate of 16kHz. While the original AASIST (https://github.com/clovaai/aasist) aims to classify genuine and spoofed utterances with a binary classifier, the classifier is replaced by a 5-class FC (fully connected) layer to detect 4 types of fake audio along with a genuine class. In the training and development sets of ADD 2023 Track 2, we refer to the 4 fake forms as Fake01, Fake101, Fake10 and Fake0, where 0 denotes the presence of manipulated fake clips. Finally, the logits of the last FC layer are fed into a softmax function.</p>
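          <p>The head replacement can be sketched as follows (a simplified illustration, not the authors' exact code; the AASIST backbone is stood in for by a placeholder module, and the 160-dimensional embedding size is an assumption):</p>
          <preformat>
import torch
import torch.nn as nn

class FiveClassHead(nn.Module):
    """Utterance-level classifier: one genuine class plus four partially fake forms."""
    def __init__(self, encoder: nn.Module, emb_dim: int = 160, n_classes: int = 5):
        super().__init__()
        self.encoder = encoder                    # AASIST backbone producing one embedding per utterance
        self.fc = nn.Linear(emb_dim, n_classes)   # replaces the original binary classifier

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        emb = self.encoder(waveform)              # (batch, emb_dim)
        return torch.softmax(self.fc(emb), dim=-1)

# Placeholder encoder so the sketch runs end to end.
dummy_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(160))
model = FiveClassHead(dummy_encoder)
probs = model(torch.randn(2, 64600))              # two 4-second utterances at 16 kHz
print(probs.shape)                                # torch.Size([2, 5])
          </preformat>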
        </sec>
        <sec id="sec-2-2-2">
          <title>Wav2Vec2-based subsystem at the frame level</title>
          <p>To address the limitation of AASIST in fake clip location, a Wav2Vec2-based subsystem is employed to determine the authenticity of each frame. A self-supervised pretrained model called XLS-R-300M with 300M parameters (https://huggingface.co/facebook/wav2vec2-xls-r-300m) is utilized to capture contextualized acoustic representations [23]. Similar to AASIST, Wav2Vec2 also takes the raw waveform as input. It generates frame representations at a hop length of 20ms, with a frame length of 25ms, given an input sampling rate of 16kHz. The last hidden output of Wav2Vec2 is passed through a dropout layer, followed by a binary linear layer for frame classification.</p>
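          <p>A minimal sketch of this frame classifier, assuming the Hugging Face transformers implementation of XLS-R (the dropout rate and layer names are illustrative, not the authors' exact configuration):</p>
          <preformat>
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class FrameClassifier(nn.Module):
    """Binary genuine/fake decision for every ~20 ms Wav2Vec2 frame."""
    def __init__(self, ckpt: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(ckpt)
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(self.backbone.config.hidden_size, 2)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(waveform).last_hidden_state   # (batch, frames, hidden)
        return self.head(self.dropout(hidden))               # (batch, frames, 2)

model = FrameClassifier()
logits = model(torch.randn(1, 64000))   # 4 s of 16 kHz audio
print(logits.shape)                     # roughly (1, 199, 2)
          </preformat>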
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.2. Multi-grained Backend Fusion</title>
          <p>Manipulation region location system. As depicted in Figure 1(c), the manipulation region location system consists of an AASIST-based subsystem and a Wav2Vec2-based subsystem. By fusing multi-grained results from these subsystems, the system aims to mitigate the biases observed in the experiments detailed in Section 4. The alignment of utterance results to the frame level involves two main steps. Firstly, the probabilities of all types of fake audio are summed, converting the 5-class output into a binary classification. Then the binary classification outputs at the utterance level are expanded along the time domain to match the number of frames in the Wav2Vec2 outputs. The expanded utterance outputs are combined with the frame outputs by weighted fusion, and the argmax function is applied to determine the authenticity of each frame.</p>
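          <p>The alignment and weighted fusion can be sketched as follows (a simplified numpy illustration; the fusion weights and array names are hypothetical, and the class order of the 5-class head is assumed to be [genuine, Fake01, Fake101, Fake10, Fake0]):</p>
          <preformat>
import numpy as np

def fuse(utt_probs5, frame_probs, w_utt=0.4, w_frame=0.6):
    """utt_probs5: (5,) softmax of the AASIST 5-class head for one utterance.
    frame_probs: (T, 2) softmax of the Wav2Vec2 head, columns [fake, genuine].
    Returns per-frame decisions, 0 = fake, 1 = genuine."""
    # Step 1: collapse the 5-class output to binary [fake, genuine].
    p_fake = utt_probs5[1:].sum()
    utt_binary = np.array([p_fake, utt_probs5[0]])
    # Step 2: expand the utterance result along time to match the frame count.
    utt_expanded = np.tile(utt_binary, (frame_probs.shape[0], 1))
    # Step 3: weighted fusion, then argmax per frame.
    fused = w_utt * utt_expanded + w_frame * frame_probs
    return fused.argmax(axis=1)

utt_probs5 = np.array([0.10, 0.05, 0.70, 0.10, 0.05])   # utterance looks fake
frame_probs = np.tile([0.2, 0.8], (10, 1))              # frames look genuine
print(fuse(utt_probs5, frame_probs))
          </preformat>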
          <p>FAD system at the utterance level. To enhance the sentence accuracy, average fusion of AASIST models trained on different datasets is utilized. The averaged utterance result is then fused with the frame output of a Wav2Vec2-based subsystem. Following the definition in ADD 2023, if any frame is identified as fake, the label of fake is assigned to the entire utterance; only when all frames are classified as genuine is the utterance labeled as genuine.</p>
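          <p>A compact sketch of the average fusion across AASIST models together with the ADD 2023 utterance rule (hypothetical variable names and illustrative probabilities):</p>
          <preformat>
import numpy as np

# Utterance-level probabilities [fake, genuine] from several AASIST models
# trained on different datasets (illustrative numbers); average fusion over models.
aasist_probs = np.array([[0.60, 0.40],
                         [0.72, 0.28],
                         [0.55, 0.45]])
utt_probs = aasist_probs.mean(axis=0)   # then fed into the frame-level fusion above

# Given the fused per-frame decisions (0 = fake, 1 = genuine), the ADD 2023
# rule labels the utterance genuine only if every frame is genuine.
frame_decisions = np.array([1, 1, 0, 1, 1])
utterance_label = int(frame_decisions.min() == 1)   # 0: fake
print(utt_probs, utterance_label)
          </preformat>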
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Settings</title>
      <sec id="sec-3-1">
        <title>3.1. Data Preparation</title>
        <p>Various datasets are used for training. The sampling rate of all data is 16kHz. The details of the datasets provided by ADD 2023 Track 2 are presented in Table 1. They include a train set used for model fitting, a development set used for early stopping during training, and a test set whose labels are unknown and which is used to evaluate the FAD system. The distributions of genuine and fake utterances in both the train set and the development set are balanced. However, the percentages of each fake type vary, with Fake101 being the majority and Fake0 being the minority. To enhance the generalization capability of our system, new training data is constructed as outlined in Table 2. The RS set contains the individual genuine sentences obtained by splitting continuous real segments from each genuine or partially fake sentence in the train set, while the FS set contains fake sentences generated by splitting continuous fake clips from each fake sentence in the train set. Three traditional vocoders, namely GL (Griffin-Lim) [27] (https://librosa.org/doc/main/generated/librosa.griffinlim.html), STRAIGHT [28] (https://github.com/HidekiKawahara/legacy_STRAIGHT) and WORLD [29] (http://www.isc.meiji.ac.jp/~mmorise/world/english/download.html), are employed to synthesize fake audio from the real segments of RS. Additionally, utterances in MidAug are created by randomly inserting newly constructed fake clips into the audio of RS. In MidAug, the duration range of fake clips is [0.2s, 3s], and any utterances shorter than 0.2s are discarded.</p>
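        <p>As an illustration of the vocoder-based construction, the Griffin-Lim variant can be approximated with librosa (a sketch, not the authors' exact pipeline; the file path and STFT parameters are assumptions):</p>
        <preformat>
import numpy as np
import librosa

# Load a genuine segment from RS (path is hypothetical) and re-synthesize it
# from its magnitude spectrogram only, discarding the original phase.
wav, sr = librosa.load("rs_segment.wav", sr=16000)
mag = np.abs(librosa.stft(wav, n_fft=512, hop_length=160))
fake_clip = librosa.griffinlim(mag, n_iter=32, hop_length=160)

# fake_clip now carries Griffin-Lim phase artifacts and is treated as a
# newly constructed fake clip, e.g. for insertion when building MidAug.
        </preformat>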
        <p>During training, online data augmentation is employed. The MUSAN dataset [30] is utilized to add background noise and music, while the RIR database [31] is used to simulate reverberation. Dynamic padding is applied. Additionally, the duration of audio is fixed to 5s during the training of AASIST, whereas the full length is used when testing. For Wav2Vec2, 4s is mainly employed as the maximum duration both for training and testing.</p>
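        <p>A hedged sketch of this online augmentation (the file paths, SNR range and scipy-based convolution are assumptions, not the authors' exact settings):</p>
        <preformat>
import numpy as np
import librosa
from scipy.signal import fftconvolve

def augment(wav, noise_path, rir_path, snr_db_range=(5, 20)):
    """Add a MUSAN noise/music file at a random SNR, then convolve an RIR."""
    noise, _ = librosa.load(noise_path, sr=16000)
    noise = np.resize(noise, wav.shape)                 # crop or tile to length
    snr_db = np.random.uniform(*snr_db_range)
    wav_pow = np.mean(wav ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(wav_pow / (noise_pow * 10 ** (snr_db / 10)))
    noisy = wav + scale * noise

    rir, _ = librosa.load(rir_path, sr=16000)
    reverbed = fftconvolve(noisy, rir, mode="full")[: len(wav)]
    return reverbed / (np.max(np.abs(reverbed)) + 1e-12)
        </preformat>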
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Training</title>
        <p>The system is mainly built on top of [32], and each model is trained on an Nvidia 3090 GPU. Cross entropy is adopted as the loss function and Adam [33] as the optimizer. The training batch size is 16. Baseline subsystems are trained on the train set from ADD 2023 Track 2 for a maximum of 50 epochs. The initial learning rate is 1e-3, and it decreases by 5% after every epoch. To quickly converge to the new data, we finetune the baseline models with the lowest loss on the development set for another 20 epochs. The finetuning learning rate starts from 1e-4.</p>
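        <p>The optimization schedule can be reproduced roughly as follows (a PyTorch sketch; the model and batch are placeholders, not the authors' code):</p>
        <preformat>
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # placeholder for an AASIST / Wav2Vec2 head
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Learning rate decreases by 5% after every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(50):                        # max 50 epochs, batch size 16
    x = torch.randn(16, 10)                    # placeholder batch
    y = torch.randint(0, 2, (16,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
# Finetuning on new data restarts from lr = 1e-4 for another 20 epochs.
        </preformat>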
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Metrics</title>
        <p>The sentence accuracy A and the segment F1 score are simultaneously adopted as evaluation metrics for ADD 2023 Track 2. Taking fake as positive and genuine as negative, TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative samples.</p>
        <p>At the utterance level, the TP, TN, FP and FN samples denote utterances, and A is defined as
A = (TP + TN) / (TP + TN + FP + FN).</p>
        <p>The metrics at the segment level aim to measure the ability of models to correctly identify fake clips within fake audio [21], including the segment precision P, the segment recall R, and the segment F1 score. They are defined as follows:
P = TP / (TP + FP),
R = TP / (TP + FN),
F1 = 2 P R / (P + R),
where the TP, FP and FN samples denote segments.</p>
        <p>The final Score of ADD 2023 Track 2 is defined as
Score = 0.3 × A + 0.7 × F1.</p>
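        <p>The metrics can be computed directly from the label vectors, as in the sketch below (numpy, hypothetical labels; fake is the positive class and is encoded as 0 in the paper's convention):</p>
        <preformat>
import numpy as np

def segment_scores(y_true, y_pred):
    """Segment-level precision, recall and F1 with fake (label 0) as positive."""
    tp = np.sum((y_pred == 0) * (y_true == 0))
    fp = np.sum((y_pred == 0) * (y_true == 1))
    fn = np.sum((y_pred == 1) * (y_true == 0))
    p = tp / max(tp + fp, 1)
    r = tp / max(tp + fn, 1)
    f1 = 2 * p * r / max(p + r, 1e-12)
    return p, r, f1

def final_score(sentence_acc, segment_f1):
    return 0.3 * sentence_acc + 0.7 * segment_f1

y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])
print(segment_scores(y_true, y_pred), final_score(0.74, 0.52))
        </preformat>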
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>The AASIST and Wav2Vec2 subsystems are trained with additional constructed data, but the number of Wav2Vec2 experiments is limited due to the unacceptable training and evaluation time. FW1 is the result of average fusion of Wav2Vec2 at the frame level. B3 and FS1-FS9 are the results of the fused systems shown in Figure 1 (b) and (c), where B3 is a baseline; the AASIST subsystems are chosen based on the sentence accuracy A and the final Score, and Wav2Vec2 based on the segment F1. The weighted fusion factors of each subsystem are provided.</p>
      <p>Baselines. Comparing the results obtained from B1 and B2, it can be observed that AASIST performs better in terms of the sentence accuracy A and the final Score, while Wav2Vec2 achieves the higher score at the segment level. The reasons are mainly that AASIST at the utterance level tends to use global information, and that the process of transforming utterance outputs of AASIST to frame outputs makes fake segments the majority, leading to misidentification of genuine segments. Conversely, as the genuine segments are the majority in the train set, Wav2Vec2 at the frame level has a bias towards the genuine class. To address the biases in AASIST and Wav2Vec2, B3 utilizes multi-grained fusion. Although most metrics in B1 and part of those in B2 decrease, relative improvements of 402.2% over B2 and 37.6% over B1 are obtained, revealing that the deviations of AASIST towards the fake class and of Wav2Vec2 towards the genuine class are lessened to some extent.</p>
      <p>AASIST. When only one kind of constructed data is added to the train set, A3 with STRAIGHT exhibits a notable improvement, a relative 47.7% increase compared to B1. The highest sentence accuracy of 94.13% is achieved by A10, indicating that the generalization can be improved by using all of the training data. Finally, through the fusion of top-ranked AASIST subsystems, the sentence accuracy rises to 73.36% in FA3, a segment-level score of 21.87% is reached in FA6, the segment F1 rises to 35.09% in FA5, and the final Score reaches 46.40% in FA3.</p>
      <p>Wav2Vec2. When all available data is utilized in W2, there is an improvement in all metrics compared to B2, with a growth of 28.2% in A, 9.6% in P, 283.0% in R, 215.4% in F1, and 94.5% in the final Score. The highest sentence accuracy of 81.43% is achieved by combining W1 and W2 in FW1.</p>
      <p>Multi-grained Backend Fusion. As discussed in Baselines, though the performance of AASIST and Wav2Vec2 is improved by adding more constructed data or by fusing subsystems separately, there remain biases of AASIST towards the fake class and of Wav2Vec2 towards the genuine class. The selection of top-performing subsystems aims to mitigate these biases through multi-grained backend fusion. However, it can be observed that the adoption of FW1, such as in FS7 with a Score of 50.18%, performs inferior to the fused systems with a single Wav2Vec2. This could be attributed to the decrease in the segment recall, as the confidence of real segments generated by the fused Wav2Vec2 increases. Additionally, as shown in Table 3, the best segment F1 of 52.35% is achieved by combining A10 and W2, both single subsystems trained with all data, indicating the importance of model generalization. Conversely, the best sentence accuracy among FS1-FS9 is acquired in FS5, with a relatively high segment recall of 63.70%, suggesting the significance of recognizing fake audio when evaluating. The best Score of a single fused system, 58.65%, is obtained in FS2, with a balanced A and F1. Finally, the submitted results for ADD 2023 Track 2 utilize the results of FS5 at the utterance level and FS3 at the segment level, achieving a Score of 59.12%.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, a system based on multi-grained backend fusion is proposed to locate the manipulation region of partially fake audio. The performance is improved by mitigating the biases brought by AASIST at the utterance level and Wav2Vec2 at the frame level. Our method achieves a sentence accuracy A of 74.52%, a segment F1 of 52.53%, and a final Score of 59.12%. Compared to the best baseline system B1 with a Score of 42.02%, the proposed system achieves a relative improvement of 40.7%.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          , H. Delgado,
          <string-name>
            <surname>ASVspoof</surname>
          </string-name>
          <year>2021</year>
          : accelerating
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2022</year>
          :
          <article-title>the first audio</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>in: Proc. 2021 Edition of the Automatic Speaker Ver- 2022</source>
          , pp.
          <fpage>9216</fpage>
          -
          <lpage>9220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          ification and Spoofing Countermeasures Challenge, [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <year>2021</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          .
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , [9]
          <string-name>
            <surname>J. weon Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          , H. jin Shim, H.-S. Heo,
          <string-name>
            <given-names>B.-J. H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <year>Add 2023</year>
          :
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>SASV 2022: The First Spoofing-Aware Speaker</surname>
          </string-name>
          Veri- cepted
          <source>by IJCAI 2023 Workshop on Deepfake Audio</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          ifcation Challenge,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          ,
          <article-title>Detection and Analysis (DADA</article-title>
          <year>2023</year>
          )
          <article-title>(</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          pp.
          <fpage>2893</fpage>
          -
          <lpage>2897</lpage>
          . [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          , [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <source>Deep wav2vec 2</source>
          .
          <article-title>0: A framework for self-supervised</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>residual learning for image recognition (2015). learning of speech representations (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>arXiv:1512.03385</source>
          . arXiv:
          <year>2006</year>
          .
          <volume>11477</volume>
          . [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          , [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <year>Asvspoof 2019</year>
          :
          <article-title>Spoofing A</article-title>
          .
          <string-name>
            <surname>Baevski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Auli</surname>
          </string-name>
          , Xls-r: Self-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>converted and replayed speech, IEEE Transactions learning at scale (</article-title>
          <year>2021</year>
          ). arXiv:
          <volume>2111</volume>
          .
          <fpage>09296</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>on Biometrics, Behavior, and Identity Science</source>
          <volume>3</volume>
          [24]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          , P. Hu, Fake audio detec-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (
          <year>2021</year>
          )
          <fpage>252</fpage>
          -
          <lpage>265</lpage>
          .
          <article-title>tion based on unsupervised pretraining models</article-title>
          , in: [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <surname>H. Del- ICASSP</surname>
          </string-name>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>9231</fpage>
          -
          <lpage>9235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>gado</surname>
            , T. Kinnunen,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Todisco</surname>
            , J. Yamagishi, [25]
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-C. Kuo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-H. Hung</surname>
          </string-name>
          , H.-Y. Lee,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <year>Asvspoof 2021</year>
          :
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsao</surname>
          </string-name>
          , H.
          <string-name>
            <surname>-M. Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Meng</surname>
          </string-name>
          , Partially fake audio
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>the wild (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2210</volume>
          .02437. ery, in: ICASSP, IEEE,
          <year>2022</year>
          , pp.
          <fpage>9236</fpage>
          -
          <lpage>9240</lpage>
          . [13]
          <string-name>
            <surname>J. weon Jung</surname>
            ,
            <given-names>H.</given-names>
            -S. Heo, H.
          </string-name>
          <string-name>
            <surname>Tak</surname>
            , H. jin Shim, [26]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          , Waveform boundary detec-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>Aasist: Audio anti-spoofing using integrated 2023</article-title>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>spectro-temporal graph attention networks (</article-title>
          <year>2021</year>
          ). [27]
          <string-name>
            <given-names>N.</given-names>
            <surname>Perraudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Balazs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Søndergaard</surname>
          </string-name>
          , A fast
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>arXiv:2110</source>
          .01200.
          <article-title>grifin-lim algorithm</article-title>
          , in: 2013 IEEE Workshop [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>The DKU- on Applications of Signal</source>
          Processing to Audio and
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>OPPO System for the 2022 Spoofing-Aware Speaker Acoustics</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Verification</given-names>
            <surname>Challenge</surname>
          </string-name>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          , [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kawahara</surname>
          </string-name>
          , Straight, exploitation of the other
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>4396</fpage>
          -
          <lpage>4400</lpage>
          . aspect of vocoder: Perceptually isomorphic decom[15]
          <string-name>
            <surname>J.-H. Choi</surname>
            ,
            <given-names>J.-Y.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.-R.</given-names>
          </string-name>
          <string-name>
            <surname>Jeoung</surname>
            ,
            <given-names>J.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , position
          <article-title>of speech sounds, Acoustical science</article-title>
          and
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>HYU Submission for the SASV Challenge</source>
          <year>2022</year>
          : Re- technology
          <volume>27</volume>
          (
          <year>2006</year>
          )
          <fpage>349</fpage>
          -
          <lpage>353</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>forming Speaker Embeddings with</surname>
            Spoofing-Aware [29]
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Morise</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Yokomori</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Ozawa</surname>
          </string-name>
          , World: a
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Conditioning</surname>
          </string-name>
          , in
          <source>: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp. vocoder
          <article-title>-based high-quality speech synthesis sys-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          2873-
          <fpage>2877</fpage>
          .
          <article-title>tem for real-time applications</article-title>
          ,
          <source>IEICE TRANSAC</source>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Norm-constrained
          <source>Score- TIONS on Information and Systems</source>
          <volume>99</volume>
          (
          <year>2016</year>
          )
          <fpage>1877</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>level Ensemble for Spoofing Aware Speaker Verifi-</article-title>
          <year>1884</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          cation, in
          <source>: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>4371</fpage>
          -
          <lpage>[</lpage>
          30]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , Musan:
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          4375.
          <article-title>A music, speech, and noise corpus (</article-title>
          <year>2015</year>
          ). [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          , Backend arXiv:
          <volume>1510</volume>
          .
          <fpage>08484</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <article-title>Ensemble for Speaker Verification</article-title>
          and Spoofing [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peddinti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Seltzer</surname>
          </string-name>
          , S. Khudan-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Countermeasure</surname>
          </string-name>
          , in
          <source>: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          ,
          <article-title>pur, A study on data augmentation of reverberant</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          pp.
          <fpage>4381</fpage>
          -
          <lpage>4385</lpage>
          .
          <article-title>speech for robust speech recognition</article-title>
          , in: ICASSP, [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>5220</fpage>
          -
          <lpage>5224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>Half-truth: A partially fake audio detection dataset</article-title>
          , [32]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          , H.-S. Heo,
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <source>arXiv preprint arXiv:2104.03617</source>
          (
          <year>2021</year>
          ). S. Choe,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , I. Han, In De[19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          , J. Yi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <article-title>Wang, fence of Metric Learning for Speaker Recognition,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Fad: A chinese dataset for in:</article-title>
          <source>Proc. Interspeech</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>2977</fpage>
          -
          <lpage>2981</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>fake audio detection (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2207</volume>
          .
          <fpage>12308</fpage>
          . [33]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          , Adam: A method for stochastic [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nie</surname>
          </string-name>
          , H. Ma,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , T. Wang, optimization (
          <year>2014</year>
          ). arXiv:
          <volume>1412</volume>
          .
          <fpage>6980</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>