<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The defender's perspective on automatic speaker verification: An overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haibin Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiawen Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lingwei Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helen Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hung-yi Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Systems Engineering &amp; Engineering Management, The Chinese University of Hong Kong</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate Institute of Communication Engineering, National Taiwan University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Automatic speaker verification (ASV) plays a critical role in security-sensitive environments. Regrettably, the reliability of ASV has been undermined by the emergence of spoofing attacks, such as replay and synthetic speech, as well as adversarial attacks and the relatively new partially fake speech. While there are several review papers that cover replay and synthetic speech, and adversarial attacks, there is a notable gap in a comprehensive review that addresses defense against adversarial attacks and the recently emerged partially fake speech. Thus, the aim of this paper is to provide a thorough and systematic overview of the defense methods used against these types of attacks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The past few years have witnessed significant advances
in ASV, and this technique is now widely integrated into
daily life, including voice activation in smartphones and
e-banking authentication. However, ASV is seriously vulnerable
to malicious spoofing attacks, including tactics such as replay
and synthetic speech, adversarial attacks, and the recently
emerged partially fake speech.</p>
      <p>Figure 1: The partially fake audio generation process. A
small clip is selected from the user's utterance, the content is
recognized using Automatic Speech Recognition (ASR) and edited to
manipulate the meaning of the recognized speech. The fake clip is
then generated using Text-to-Speech (TTS) or Voice Conversion (VC),
and inserted into the genuine utterance to generate the partially
fake speech.</p>
      <p>
        While there are several review papers that cover replay and synthetic speech [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], and adversarial attacks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], there is a notable gap in a comprehensive review that addresses defense methods against
adversarial attacks and the recently emerged partially fake speech.
The objective of this paper is to provide a thorough and
systematic overview of the defense methods used against
these two types of attacks. It is hoped that this overview will
inspire further research within the ASV community.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Attacks</title>
      <sec id="sec-2-1">
        <title>2.1. Partially fake speech</title>
        <p>
          The first Audio Deep Synthesis Detection challenge (ADD
2022) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduced a brand new kind of attack, known as
the partially fake speech attack [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The ASVspoof
challenge [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
          ] focuses on generating spoofing speech
in its entirety, ignoring the scenario of partially fake
speech, where small fake clips are hidden within a piece
of real speech. The generation of partially fake audio
involves the insertion of only small clips of synthetic
speech into the real speech as shown in Figure 1,
resulting in even more stealthy fake speech containing a
significant amount of the genuine user’s audio.
        </p>
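        <p>To make the generation process in Figure 1 concrete, the following is a minimal sketch (not the pipeline of any specific cited work) of splicing a synthesized clip into a genuine waveform, with a short linear crossfade to smooth the transition boundaries; the waveforms, insertion point, and sample rate are hypothetical placeholders.</p>
        <preformat>
import numpy as np

def insert_fake_clip(genuine, fake_clip, insert_sample, fade_ms=10.0, sr=16000):
    """Splice a synthesized clip into a genuine waveform, crossfading both
    junctions to reduce audible discontinuities at the boundaries."""
    fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)

    head = genuine[:insert_sample].copy()
    tail = genuine[insert_sample:].copy()
    clip = fake_clip.copy()

    # Blend the end of the genuine head with the start of the fake clip.
    head[-fade:] = head[-fade:] * (1 - ramp) + clip[:fade] * ramp
    # Blend the end of the fake clip with the start of the genuine tail.
    clip[-fade:] = clip[-fade:] * (1 - ramp) + tail[:fade] * ramp

    return np.concatenate([head, clip[fade:], tail[fade:]])

# Hypothetical example: 3 s of genuine speech, a 0.5 s synthesized clip
# inserted at the 1 s mark (16 kHz sampling rate assumed).
genuine = np.random.randn(3 * 16000).astype(np.float32)
fake_clip = np.random.randn(8000).astype(np.float32)
partially_fake = insert_fake_clip(genuine, fake_clip, insert_sample=16000)
        </preformat>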
        <p>
          Previous studies [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ] have shown that it is challenging to differentiate between partially fake and genuine
audios by directly using existing state-of-the-art
countermeasure models fostered by the ASVspoof challenge [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
          ]. These countermeasure models address the
problem of identifying whether an entire audio utterance
is genuine or fabricated. However, they are not equipped
to identify anomalous regions within a single utterance.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Adversarial attacks</title>
        <p>Figure 2: A tiny adversarial noise is added to the original wave to obtain the adversarial wave, which fools the ASV system into falsely accepting the attacker.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] exposes the adversarial weakness of countermeasure models, and [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] further enhances the transferability of adversarial attacks through model ensemble.
        </p>
        <sec id="sec-3-0">
          <title>3. Defense methods</title>
          <sec id="sec-3-1">
            <title>3.1. Tackle partially fake speech attacks</title>
            <p>
              Partially fake speech attacks are generated as shown in Figure 1. As this kind of attack is brand new, there have
been only a few initiatives to handle it, and we
categorize these efforts into two categories as shown in
Figure 3: transition boundary detection [
              <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
              ] and segment level classification [
              <xref ref-type="bibr" rid="ref19 ref20 ref7 ref8">7, 8, 19, 20</xref>
              ].
            </p>
            <p>Figure 3: The two categories of methods to tackle partially fake speech attacks. The black and red parts of the utterance are real and fake, respectively. The first approach, illustrated in the blue block, focuses on detecting the transition boundaries between the genuine and fake segments. The second approach, depicted in the orange block, endeavors to distinguish between genuine and fake short segments.</p>
            <sec id="sec-3-1-1">
              <title>3.1.1. SSL-based feature extractor</title>
              <p>
                Before delving into the two main approaches, let us first
examine the feature engineering aspect of the task. Lv
et al. [
                <xref ref-type="bibr" rid="ref21">21</xref>
                ] are the pioneers in utilizing self-supervised
learning (SSL) models to tackle partially fake speech
attacks. Rather than using traditional acoustic features,
they adopt XLS-R [
                <xref ref-type="bibr" rid="ref22">22</xref>
                ], a self-supervised learning model, as the feature extractor. Their method [
                <xref ref-type="bibr" rid="ref21">21</xref>
                ], which involves simply adding a lightweight prediction
head on top of the XLS-R model and fine-tuning the large
XLS-R model, ultimately achieved first place out of 33
international teams in the ADD challenge [
                <xref ref-type="bibr" rid="ref6">6</xref>
                ].
              </p>
              <p>
                Their efforts [
                <xref ref-type="bibr" rid="ref21">21</xref>
                ] have taught us a valuable lesson:
the acoustic features extracted by a fine-tuned
self-supervised learning model can be incredibly helpful for
detecting partially fake speech. It is worth noting that the
two main approaches introduced below can also harness
the power of self-supervised learning models, provided
there are sufficient computing resources available.
              </p>
            </sec>
          </sec>
        </sec>
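        <p>To make the recipe of [21] more concrete, the sketch below wraps a generic SSL encoder with a lightweight prediction head; the encoder class, its 1024-dimensional output, and the pooling choice are illustrative assumptions rather than the authors' exact implementation.</p>
        <preformat>
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Placeholder standing in for a pretrained SSL model such as XLS-R."""
    def forward(self, wav):                      # wav: (batch, samples)
        n_frames = wav.shape[1] // 320           # roughly 20 ms hop assumed
        return torch.randn(wav.shape[0], n_frames, 1024)

class SSLDetector(nn.Module):
    """SSL encoder plus a lightweight head predicting genuine (0) vs. fake (1)."""
    def __init__(self, ssl_encoder, feat_dim=1024, finetune_encoder=True):
        super().__init__()
        self.encoder = ssl_encoder
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 2))
        if not finetune_encoder:                 # optionally freeze the backbone
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, wav):
        feats = self.encoder(wav)                # (batch, frames, feat_dim)
        pooled = feats.mean(dim=1)               # simple temporal average pooling
        return self.head(pooled)                 # (batch, 2) logits

model = SSLDetector(DummyEncoder())
logits = model(torch.randn(4, 32000))            # four 2-second utterances
        </preformat>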
        <sec id="sec-2-2-1">
          <title>3.1.2. Transition boundary detection</title>
          <p>
            [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] is the first to introduce the transition boundary
detection task for partially fake audio detection. The
transition boundaries contain artifacts, such as discontinuity
in speech and inconsistencies in ambient noise. Inspired
by the extraction-based question-answering models [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] used in natural language processing (NLP), we refer to the
boundary detection task as a question-answering or fake
span discovery proxy task. In this task, the model is
required to answer the question "where is the fake clip?" in
a piece of partially fake audio. Extraction-based
question-answering models in NLP typically take a question and a
passage as input, construct representations for the
passage and the question, match the question and passage
embeddings, and then output the start and end positions
of the answer within the passage. In our case, the passage
is the partially fake utterance, and the answer is the start
and end time of the fake clip. As depicted in the blue
block of Figure 3, when the model is presented with a
boundary frame between a real (black) and a fake (red)
clip, it should predict "1". Conversely, when the model is
presented with a non-boundary frame, it should predict
"0". By training the model on the question-answering
proxy task, the model can learn to find the concatenation
boundaries with discontinuity and identify fake clips
within an utterance, thus improving its ability to
distinguish between audios with and without fake clips. The
proposed method placed second out of 33
international teams in the ADD challenge [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], even without the assistance of self-supervised learning features.
          </p>
          <p>
            Wang et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] divide the entire utterance into several
chunks, and extract acoustic features from each
chunk to feed into the deep learning model. The model is
then tasked with determining whether a boundary exists
within the given chunk by predicting "1" if the chunk
contains a boundary, or "0" if it does not. Through
training, the model gains the ability to identify clues such as
speech discontinuity or inconsistencies in ambient noise,
allowing it to effectively highlight potential boundaries.
          </p>
          <p>
            Cai et al. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] propose to introduce a self-supervised
learning model for frame-level boundary detection to
detect partially fake speech. They modify the method
in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] to further boost the detection performance: 1) instead
of solely focusing on transition boundaries that
indicate inconsistency and discontinuity, [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] proposes setting the frames near the boundaries as boundaries
as well, to increase robustness; 2) [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] employs wav2vec 2.0 [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], a self-supervised learning model, as the feature extractor
and also fine-tunes the feature extractor during training.
Utilizing the features from wav2vec 2.0 improves the
performance by a relative 58.25% compared to traditional
acoustic features extracted by digital signal processing
front-ends.
          </p>
          <p>
            The main takeaway from this subsection is
that transition boundaries can serve as a useful cue to
identify partially fake audio, as they indicate discontinuity
and inconsistency in speech. By tasking models with
detecting these boundaries, they can learn to identify
these cues and detect partially fake speech.
          </p>
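          <p>As an illustration of the frame-level targets used by the boundary detection proxy task, the helper below marks frames at (or near) known concatenation points as "1" and all others as "0"; the frame rate, widening width, and example boundary positions are illustrative assumptions.</p>
          <preformat>
import numpy as np

def boundary_labels(n_frames, boundary_frames, widen=2):
    """Return a 0/1 vector: 1 for frames within `widen` frames of a
    concatenation boundary, 0 elsewhere."""
    labels = np.zeros(n_frames, dtype=np.int64)
    for b in boundary_frames:
        lo, hi = max(0, b - widen), min(n_frames, b + widen + 1)
        labels[lo:hi] = 1
    return labels

# Example: a 4 s utterance at 50 frames/s (20 ms frames) with a fake clip
# spliced in between frames 80 and 120.
labels = boundary_labels(n_frames=200, boundary_frames=[80, 120])
print(labels[78:83], int(labels.sum()))
          </preformat>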
        </sec>
        <sec id="sec-2-2-2">
          <title>3.1.3. Segment level classification</title>
          <p>
            The goal of segment level classification is to distinguish
between genuine and fake segments. The short segments
have different time resolutions, ranging from 1 frame
(around 20 ms) to the entire utterance. Segments that
only contain genuine speech will be labeled as "1", while
all other segments will be labeled as "0", as shown in the
orange block of Figure 3. Zhang et al. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] make the initial
attempt to conduct segment level classification for
partially fake speech detection with a fixed time resolution.
In their subsequent work [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ], they propose to train
the countermeasure model with both utterance level
classification and segment level classification. To
further boost the countermeasure's performance, they [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]
introduce self-supervised learning models [
            <xref ref-type="bibr" rid="ref24 ref25">25, 24</xref>
            ]
as the front-end feature extractor, and enable the model
to learn segment level classification with different time
resolutions, ranging from 1 frame to the entire utterance.
          </p>
          <p>
            The time resolution used in segment level classification
is a crucial hyperparameter for training. If the segment's
frame number is too small, the model may not extract
enough information to distinguish between genuine and
fake segments. On the other hand, if the frame number
is too large, the proportion of fake frames may be too
small, resulting in fake frames being dominated by
genuine frames. Enabling the model to learn from different
time resolutions [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is a reasonable solution to bypass
this hyperparameter search. Note that in Figure 1, the
inserted red clip can be from other genuine users. The
segment level classification works [
            <xref ref-type="bibr" rid="ref19 ref20 ref8">8, 19, 20</xref>
            ] do not take
this condition into account, as in their produced datasets
the inserted clips are always fake.
          </p>
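          <p>The labeling convention described above can be illustrated with a small helper that derives segment labels at several time resolutions from a frame-level genuine/fake mask; the frame counts and the fake-region position are illustrative assumptions.</p>
          <preformat>
import numpy as np

def segment_labels(frame_is_genuine, seg_len):
    """Label each non-overlapping segment of `seg_len` frames as 1 only if
    every frame inside it is genuine, otherwise 0."""
    n_segs = len(frame_is_genuine) // seg_len
    trimmed = frame_is_genuine[:n_segs * seg_len].reshape(n_segs, seg_len)
    return trimmed.all(axis=1).astype(np.int64)

# Frame-level mask for a 200-frame utterance whose frames 80-119 are fake.
frame_is_genuine = np.ones(200, dtype=bool)
frame_is_genuine[80:120] = False

# The same utterance labeled at several time resolutions (in frames).
for seg_len in (1, 10, 50, 200):
    print(seg_len, segment_labels(frame_is_genuine, seg_len))
          </preformat>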
        </sec>
        <sec id="sec-2-2-3">
          <title>3.2.1. Model enhancement</title>
          <p>
            [
            <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
            ] adopt adversarial training to alleviate the
vulnerability of ASV against adversarial attacks. Wu
et al. [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] also investigate improving the adversarial
robustness for countermeasures by adversarial training.
          </p>
          <p>Model enhancement methods involve modifying the
model’s parameters, and they can usually work together
with purification and detection methods.</p>
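          <p>A minimal sketch of adversarial training in the spirit of [26, 27, 28, 29] is given below: each mini-batch is augmented with perturbed inputs generated on the fly. A single-step FGSM perturbation and a toy waveform classifier are simplifying assumptions; the cited works differ in attack algorithms and model architectures.</p>
          <preformat>
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, epsilon=0.001):
    """One-step FGSM perturbation of a waveform batch."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y, epsilon=0.001):
    """Train on clean and adversarial versions of the same mini-batch."""
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = (nn.functional.cross_entropy(model(x), y)
            + nn.functional.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy classifier over raw waveforms (placeholder for an ASV or countermeasure).
model = nn.Sequential(nn.Linear(16000, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = adversarial_training_step(model, opt, torch.randn(8, 16000),
                                 torch.randint(0, 2, (8,)))
          </preformat>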
        </sec>
        <sec id="sec-2-2-4">
          <title>3.2.2. Adversarial sample purification</title>
          <p>Previous efforts for purification can be classified into 5
categories: lossy pre-processing, adding noise, generative
method, denoising method, and filtering.</p>
          <p>
            The "lossy pre-processing" approach treats adversarial
perturbations as redundant information and discards it to
improve the model's adversarial robustness. Chen et al.
[
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] consider adversarial perturbations as redundant
information and use lossy speech compression techniques
to mitigate these perturbations. Quantization [
            <xref ref-type="bibr" rid="ref31 ref30">31, 30</xref>
            ]
involves rounding each audio sample point to the nearest
integer multiple of a quantization factor, which can impact the
fragile adversarial perturbations. Chen et al. [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] propose to
do k-means [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] on the acoustic features to get clusters
of acoustic features, and use the clusters to represent the
acoustic features.
          </p>
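          <p>The lossy pre-processing and filtering ideas above can be illustrated with generic waveform transforms (quantization to a fixed step, median filtering, and additive Gaussian noise); these are simple signal-processing sketches, not the exact configurations used in the cited works.</p>
          <preformat>
import numpy as np
from scipy.signal import medfilt

def quantize(wav, step=1.0 / 256):
    """Round each sample to the nearest multiple of `step` (lossy pre-processing)."""
    return np.round(wav / step) * step

def median_smooth(wav, kernel=5):
    """Local smoothing with a median filter."""
    return medfilt(wav, kernel_size=kernel)

def add_gaussian(wav, sigma=0.002):
    """Randomized-smoothing-style additive Gaussian noise."""
    return wav + np.random.normal(0.0, sigma, size=wav.shape)

wav = 0.1 * np.random.randn(16000)
purified = add_gaussian(median_smooth(quantize(wav)))
          </preformat>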
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Defense against adversarial attacks</title>
        <p>We propose to classify the defense methods into three
categories, with the timeline of related works shown in the
corresponding figure: 1) model enhancement improves the robustness of the
model itself, typically through adversarial training with
effective adversarial examples; 2) adversarial sample
purification aims to alleviate the superficial adversarial
noise and transform adversarial samples into genuine
samples; 3) adversarial sample detection aims to
distinguish between adversarial and genuine samples, allowing
the identification and removal of adversarial samples.</p>
        <p>
          The "adding noise" approach aims to disrupt and
neutralize adversarial perturbations by introducing
additional noise, typically Gaussian. Randomized smoothing
[
          <xref ref-type="bibr" rid="ref31 ref33 ref30 ref34">31, 33, 30, 34</xref>
          ] involves adding random Gaussian noise
to the input utterances before sending them to the ASV
to counter the adversarial perturbations. [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] adopts
the idea of "voting for the right answer" to prevent risky
decisions of ASV in blind spot areas. To achieve this, they
sample the neighbors of a given utterance by random
sampling using Gaussian noise, and allow the neighbors
to vote on whether the utterance should be accepted by
the ASV model or not, rather than relying solely on the
prediction of the single utterance. Olivier et al. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] propose
an enhanced version that adds Gaussian noise to the
high-frequency region rather than the entire utterance.
        </p>
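        <p>A minimal sketch of the voting idea attributed to [35] is shown below: Gaussian-noise neighbors of the test utterance are scored against the enrollment, and the utterance is accepted only if a majority of neighbors exceed the decision threshold. The ASV scoring function, noise level, and threshold are placeholders.</p>
        <preformat>
import numpy as np

def vote_accept(asv_score, enroll, test_wav, threshold,
                n_neighbors=16, sigma=0.002):
    """Accept only if most noisy neighbors of the test utterance score
    above the ASV decision threshold."""
    votes = 0
    for _ in range(n_neighbors):
        neighbor = test_wav + np.random.normal(0.0, sigma, size=test_wav.shape)
        votes += int(asv_score(enroll, neighbor) >= threshold)
    return votes > n_neighbors // 2

def toy_score(enroll, test):
    """Placeholder scorer: cosine similarity between raw waveforms."""
    denom = np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-8
    return float(np.dot(enroll, test) / denom)

enroll = np.random.randn(16000)
test = enroll + np.random.normal(0.0, 0.01, 16000)
accepted = vote_accept(toy_score, enroll, test, threshold=0.5)
        </preformat>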
        <p>
          The "denoising method" treats adversarial noise as a
specific kind of noise and aims to estimate and eliminate
it. Chang et al. [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] suggest using a denoising algorithm
tailored for Gaussian noise, and they contend that
the denoising algorithm can also cleanse the adversarial
noise. Zhang et al. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] propose to employ an adversarial
separation network, which is trained using
adversarial-genuine data pairs, to estimate and purify the
adversarial noise. This method requires prior knowledge
of adversarial sample generation.
        </p>
        <p>
          The "generative method" approach typically involves
training a generative model to model the genuine data
manifolds and using this model to pull the adversarial
samples towards the genuine data manifolds. Wu et al.
[
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] propose SSLM-based reconstruction to alleviate
the superficial adversarial noise and maintain key
information for genuine samples. They [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] utilize
self-supervised learning models to extract key features
from the adversarial samples, and do reconstruction to
pull the inputs to the genuine data manifold. Joshi et
al. [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] use the encoder of a VAE [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] to project testing
data onto a latent posterior that aligns with the genuine
manifold. They then use the decoder to re-generate the
input data based on the hidden embedding sampled from
the latent posterior, thereby purifying superficial
adversarial noise. Joshi et al. [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] also borrow DefenseGAN
from computer vision [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. DefenseGAN projects
the testing data, either adversarial or genuine, into the
low-dimensional manifold of genuine data to get the
hidden embeddings, and then re-generates the testing data by
the generator using such embeddings.
        </p>
        <p>
          "Filtering", also known as local smoothing, helps
smooth and alleviate the superficial adversarial
perturbations. Local smoothing involves applying Gaussian,
mean, and median filters to the waveform to purify the
adversarial noise. [
          <xref ref-type="bibr" rid="ref31 ref30">31, 30</xref>
          ] and [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] utilize local smoothing to defend ASV and countermeasures, respectively.
        </p>
        <sec id="sec-2-3-1">
          <title>3.2.3. Adversarial sample detection</title>
          <p>
            The detection methods can be classified into two
categories based on whether they require prior knowledge
about adversarial sample generation: attack-dependent
or attack-independent detection methods.
          </p>
          <p>
            The attack-dependent methods usually leverage
deep learning models to implicitly find cues to
differentiate between specific kinds of adversarial samples
and genuine samples, using both adversarial and genuine
data. Li et al. [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ] propose to train a detector using the
binary classification loss to distinguish the adversarial
and genuine samples. They find their detector is unable
to detect unseen adversarial samples derived by other
adversarial attack algorithms that are not used during
training. Based on the observation that different kinds of adversarial
samples attain different attack signatures, Villalba et al. [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ] propose to train an x-vector [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] system to extract the
bottleneck features as the attack signatures using various
types of adversarial samples. After training the x-vector
system, attack signatures will be extracted for different
types of attacks. During inference, the testing utterance
is inputted, and the x-vector feature extractor will extract
the hidden embeddings. These embeddings are then
compared with the enrolled attack signatures to determine
whether the testing utterance is an adversarial sample
or not. To further improve the performance of the
attack signature extractor, Joshi et al. [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] propose training
the attack signature extractor using adversarial
perturbations instead of adversarial examples. They argue that
the adversarial perturbations eliminate redundant
information from the adversarial samples. They then train an
adversarial perturbation estimator to extract adversarial
perturbations from the input utterance and use the attack
signature extractor to extract hidden features to detect
the adversarial samples.
          </p>
          <p>
            Attack-independent methods treat the detection of
adversarial samples as an anomaly detection problem.
Genuine data samples always exhibit some properties that
are absent or different for adversarial samples.
Therefore, attack-independent detection methods can exploit
the inconsistency of these internal properties to
distinguish between adversarial and genuine samples. Wu et al.
[
            <xref ref-type="bibr" rid="ref38">38</xref>
            ] leverage the ASV score difference before and after
putting the testing utterance into SSLMs as an indicator
to differentiate between adversarial and genuine samples.
Specifically, for genuine samples, the ASV score
difference before and after putting the utterance into SSLMs
is small, while for adversarial samples, the difference
is large. Peng et al. [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] propose to detect adversarial
samples using twin ASV models, including one premier
model that is exposed to attackers and is fragile under
adversarial attack, and one mirror model that is robust to
adversarial attacks and cannot be accessed by attackers.
When a genuine sample is inputted, both the premier
and mirror models produce similar predictions.
However, when an adversarial sample is inputted, the models
produce different predictions. Peng et al. [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] leverage
this score inconsistency between genuine and adversarial
samples to detect adversarial samples. Wu et al. [
            <xref ref-type="bibr" rid="ref45">45</xref>
            ] utilize vocoders to re-synthesize the input utterance and
find that the difference between the ASV scores for the
original and re-synthesized utterance is a good indicator
for discrimination between genuine and adversarial
samples. To be specific, the score difference for adversarial
samples is large, while it is small for genuine samples.
Chen et al. [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] utilize two kinds of hand-crafted masks
to detect adversarial samples: they mask parts of the
input speech features. They claim the masked parts
contain less speaker information and won't affect the ASV
scores for genuine samples too much, but will greatly
impact the adversarial samples. By comparing the absolute
difference of scores before and after masking, they are
able to detect adversarial examples. The two masks used
are MLFB-H, which masks the high frequencies of
LogFBank, and MLFB-D, which masks the time-frequency bins
whose absolute values of their one-order difference along
the frequency axis are smaller than a threshold. Chen et
al. [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ] further enhance the detection performance by
learning such a mask matrix with a deep recurrent network,
rather than using hand-crafted masks.
          </p>
        </sec>
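        <p>The attack-independent detectors based on score differences ([38], [45]) share a simple template: score the original input against the enrollment, score a transformed copy (SSLM reconstruction or vocoder re-synthesis), and flag the input as adversarial when the difference exceeds a tuned threshold. The sketch below uses placeholder scoring and re-synthesis functions purely for illustration.</p>
        <preformat>
import numpy as np

def score_difference_detector(asv_score, transform, enroll, test_wav, tau=0.2):
    """Flag `test_wav` as adversarial if the ASV score changes by more than
    `tau` after re-synthesising (or reconstructing) the utterance."""
    original = asv_score(enroll, test_wav)
    resynth = asv_score(enroll, transform(test_wav))
    return abs(original - resynth) > tau

def toy_score(enroll, test):
    """Placeholder scorer: cosine similarity between raw waveforms."""
    denom = np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-8
    return float(np.dot(enroll, test) / denom)

def toy_resynthesis(wav, k=8):
    """Placeholder re-synthesis: a crude moving-average smoothing."""
    return np.convolve(wav, np.ones(k) / k, mode="same")

enroll = np.random.randn(16000)
probe = enroll + np.random.normal(0.0, 0.01, 16000)
is_adversarial = score_difference_detector(toy_score, toy_resynthesis, enroll, probe)
        </preformat>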
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Future directions</title>
      <p>
        For the future directions of partially fake speech attacks:
1). Data collection. The collection of data is a crucial
component in developing an effective defense system
against partially fabricated speech. Only 100k utterances
are collected by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for partially fake detection, and the
transition boundaries are not stealthy enough. To this
end, there exists a pressing need to investigate the
generation of more data with discreet transition boundaries,
while carefully considering the linguistic and acoustic
characteristics involved. This undertaking is of great
significance and warrants further exploration. 2). Reduce
training efforts. The state-of-the-art (SOTA)
methodology for partially fake speech detection involves
fine-tuning of the entire SSLMs. The SSLM in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] has
2 billion parameters, which presents a challenge for
academic researchers when attempting to fine-tune the
model. Several works have emerged that offer promising
avenues for minimizing training efforts while maximizing
the benefits of SSLMs, including linear probing, adapter,
and prompt techniques; a sketch of linear probing is shown below.
Exploring these approaches may
significantly enhance the efficiency of adopting SSLMs
for partially fake speech detection. 3). Model
compression. The current state-of-the-art detection method relies
heavily on large-scale SSLMs. The parameter number of
the SSLM used in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is 2 billion. Therefore,
investigating approaches to reduce the model size is a
crucial research endeavor. This issue warrants
considerable attention, as it has significant implications for the
scalability, computational efficiency, and generalizability
of partially fake speech detection systems.
      </p>
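      <p>Linear probing, one of the low-cost adaptation routes mentioned above, amounts to freezing the SSL backbone and training only a small head. The sketch below uses a placeholder backbone; the feature dimension and head size are illustrative assumptions.</p>
      <preformat>
import torch.nn as nn

def make_linear_probe(ssl_encoder, feat_dim=1024, n_classes=2):
    """Freeze the SSL backbone and return a small trainable head."""
    for p in ssl_encoder.parameters():
        p.requires_grad = False              # the backbone stays fixed
    head = nn.Linear(feat_dim, n_classes)    # only these weights are trained
    trainable = sum(p.numel() for p in head.parameters())
    frozen = sum(p.numel() for p in ssl_encoder.parameters())
    print(f"training {trainable} parameters, keeping {frozen} frozen")
    return head

# Placeholder backbone standing in for a multi-billion-parameter SSLM.
backbone = nn.Sequential(nn.Linear(400, 1024), nn.ReLU(), nn.Linear(1024, 1024))
probe_head = make_linear_probe(backbone)
      </preformat>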
      <p>The re-synthesis-based adversarial sample detection
methods achieve the SOTA [45, 46, 47]. An effective
audio re-synthesis method for adversarial sample
detection must possess two critical properties. Firstly, the
score variations between the original and re-synthesized
utterances should be minimal for genuine samples.
Secondly, the score variations between the original and
re-synthesized utterances for adversarial samples should
be substantial. Investigating approaches for refining the
design of audio re-synthesis methods to further optimize
these properties represents a valuable research direction.
By enhancing the efficacy of the audio re-synthesis
method, it would be possible to improve the reliability
and accuracy of detection systems.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This paper reviews the defense methods against
adversarial attacks and the recently emerged partially fake
speech attacks. We hope the comprehensive review
and comparisons can inspire future works to boost the
robustness of ASV. Further investigation is needed to
explore the future directions outlined in Section 4.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          , et al.,
          <year>Asvspoof 2021</year>
          :
          <article-title>accelerating progress in spoofed and deepfake speech detection</article-title>
          ,
          <source>arXiv preprint arXiv:2109.00537</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          , et al.,
          <year>Asvspoof 2019</year>
          :
          <article-title>Future horizons in spoofed and fake audio detection</article-title>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>05441</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          , et al.,
          <article-title>The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <year>Asvspoof 2015</year>
          :
          <article-title>the first automatic speaker verification spoofing and countermeasures challenge</article-title>
          ,
          <source>in: Sixteenth Annual Conference of the ISCA</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attack and defense strategies of speaker recognition systems: A survey</article-title>
          ,
          <source>Electronics</source>
          <volume>11</volume>
          (
          <year>2022</year>
          )
          <fpage>2183</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          , et al.,
          <year>Add 2022</year>
          :
          <article-title>the first audio deep synthesis detection challenge</article-title>
          ,
          <source>in: IEEE ICASSP</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          , et al.,
          <article-title>Half-truth: A partially fake audio detection dataset</article-title>
          ,
          <source>in: Interspeech</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1654</fpage>
          -
          <lpage>1658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>An initial investigation for detecting partially spoofed audio</article-title>
          ,
          <source>arXiv preprint arXiv:2104.02518</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          , et al.,
          <article-title>Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems</article-title>
          , in: IEEE SP,
          <year>2021</year>
          , pp.
          <fpage>730</fpage>
          -
          <lpage>747</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>R. K. Das</surname>
          </string-name>
          , et al.,
          <article-title>The attacker's perspective on automatic speaker verification: An overview</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>08849</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kreuk</surname>
          </string-name>
          , et al.,
          <article-title>Fooling end-to-end speaker verification with adversarial examples</article-title>
          ,
          <source>in: IEEE ICASSP</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1962</fpage>
          -
          <lpage>1966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attacks on gmm i-vector based speaker verification systems</article-title>
          , in: IEEE ICASSP,
          <year>2020</year>
          , pp.
          <fpage>6579</fpage>
          -
          <lpage>6583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          , et al.,
          <article-title>x-vectors meet adversarial attacks: ing audio adversarial examples for speaker recogBenchmarking adversarial robustness in speaker nition</article-title>
          ,
          <source>IEEE TDSC</source>
          (
          <year>2022</year>
          ). verification, ISCA Interspeech (
          <year>2020</year>
          )
          <fpage>4233</fpage>
          -
          <lpage>4237</lpage>
          . [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Who is real bob? adversarial attacks</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attacks on spoofing coun- on speaker recognition systems, arXiv preprint termeasures of automatic speaker verification</article-title>
          , in: arXiv:
          <year>1911</year>
          .
          <year>01840</year>
          (
          <year>2019</year>
          ).
          <source>IEEE ASRU</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>319</lpage>
          . [32]
          <string-name>
            <surname>J. A. Hartigan M. A. Wong</surname>
          </string-name>
          , Algorithm as 136:
          <article-title>A k-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Black-box attacks on spoofing coun- means clustering algorithm, Journal of the royal statermeasures using transferability of adversarial ex- tistical society</article-title>
          . series c (applied statistics)
          <volume>28</volume>
          (
          <year>1979</year>
          )
          <article-title>amples</article-title>
          .,
          <source>in: ISCA Interspeech</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4238</fpage>
          -
          <lpage>4242</lpage>
          .
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <source>Partially fake audio detection by</source>
          [33]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          , et al.,
          <article-title>Defending against adversarial self-attention-based fake span discovery, in: IEEE attacks in speaker verification systems</article-title>
          , in: IEEE ICASSP, IEEE,
          <year>2022</year>
          , pp.
          <fpage>9236</fpage>
          -
          <lpage>9240</lpage>
          . IPCCC,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          , et al.,
          <source>Waveform boundary detection</source>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attacks and defenses for partially spoofed audio, arXiv preprint for speaker identification systems</article-title>
          ,
          <source>arXiv preprint arXiv:2211.00226</source>
          (
          <year>2022</year>
          ). arXiv:
          <volume>2101</volume>
          .08909 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <source>Synthetic voice detection and audio</source>
          [35]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Voting for the right answer: Adversarsplicing detection using se-res2net-conformer ar- ial defense for speaker verification, arXiv preprint chitecture</article-title>
          ,
          <source>arXiv preprint arXiv:2210.03581</source>
          (
          <year>2022</year>
          ). arXiv:
          <volume>2106</volume>
          .07868 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Multi-task learning in utterance-</article-title>
          [36]
          <string-name>
            <given-names>R.</given-names>
            <surname>Olivier</surname>
          </string-name>
          , et al.,
          <article-title>High-frequency adversarial delevel and segmental-level spoof detection, arXiv fense for speech and audio</article-title>
          , in: IEEE ICASSP,
          <year>2021</year>
          . preprint arXiv:
          <volume>2107</volume>
          .14132 (
          <year>2021</year>
          ). [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <source>Adversarial separation network</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>The partialspoof database and coun- for speaker recognition</article-title>
          .,
          <source>in: Interspeech</source>
          ,
          <year>2020</year>
          .
          <article-title>termeasures for the detection of short fake speech</article-title>
          [38]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Improving the adversarial robustness segments embedded in an utterance, IEEE/ACM for speaker verification by self-supervised learnTransactions on Audio, Speech, and Language Pro- ing</article-title>
          , IEEE/ACM Transactions on Audio, Speech, cessing (
          <year>2022</year>
          ).
          <source>and Language Processing</source>
          <volume>30</volume>
          (
          <year>2021</year>
          )
          <fpage>202</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <source>Fake audio detection based on unsuper-</source>
          [39]
          <string-name>
            <surname>D. P. Kingma M. Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational vised pretraining models</article-title>
          , in: IEEE ICASSP, IEEE, bayes,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
          <year>2022</year>
          , pp.
          <fpage>9231</fpage>
          -
          <lpage>9235</lpage>
          . [40]
          <string-name>
            <given-names>P.</given-names>
            <surname>Samangouei</surname>
          </string-name>
          , et al.,
          <article-title>Defense-gan: Protecting clas-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Xls-</surname>
          </string-name>
          r:
          <article-title>Self-supervised cross-lingual sifiers against adversarial attacks using generative speech representation learning at scale, arXiv models</article-title>
          , arXiv preprint arXiv:
          <year>1805</year>
          .
          <volume>06605</volume>
          (
          <year>2018</year>
          ). preprint arXiv:
          <volume>2111</volume>
          .09296 (
          <year>2021</year>
          ). [41]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <source>Investigating robustness of adversarial</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>A. M. N. Allam M. H. Haggag</surname>
          </string-name>
          ,
          <article-title>The question an- samples detection for automatic speaker verificaswering systems: A survey</article-title>
          ,
          <source>IJRRIS</source>
          <volume>2</volume>
          (
          <year>2012</year>
          ). tion, arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>06186</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          , et al.,
          <year>wav2vec</year>
          2.
          <article-title>0: A framework for</article-title>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          , et al.,
          <article-title>Representation learning to clasself-supervised learning of speech representations, sify and detect adversarial attacks against speaker Advances in neural information processing systems and speech recognition systems</article-title>
          ,
          <source>arXiv preprint 33</source>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          . arXiv:
          <volume>2107</volume>
          .04448 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Wavlm:</surname>
            Large-scale self-supervised [43]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Joshi</surname>
          </string-name>
          , et al.,
          <article-title>Advest: Adversarial perturbation pre-training for full stack speech processing, IEEE estimation to classify and detect adversarial atJournal of Selected Topics in Signal Processing 16 tacks against speaker identification, arXiv preprint (</article-title>
          <year>2022</year>
          )
          <fpage>1505</fpage>
          -
          <lpage>1518</lpage>
          . arXiv:
          <volume>2204</volume>
          .03848 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jati</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attack</article-title>
          and defense strate- [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          , et al.,
          <article-title>Pairing weak with strong: Twin gies for deep speaker recognition systems, Com- models for defending against adversarial attack on puter</article-title>
          <source>Speech &amp; Language</source>
          <volume>68</volume>
          (
          <year>2021</year>
          )
          <article-title>101199</article-title>
          . speaker verification., in: Interspeech,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Du</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Sirenattack</surname>
            : Generating adversarial [45]
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial sample detection for audio for end-to-end acoustic systems, in: ACM speaker verification by neural vocoders</article-title>
          , in: IEEE ASIACCS,
          <year>2020</year>
          . ICASSP,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial regularization for end-</article-title>
          [46]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Masking speech feature to detect to-end robust speaker verification</article-title>
          ., in: Interspeech,
          <article-title>adversarial examples for speaker verification</article-title>
          ,
          <source>in: 2019. IEEE APSIPA ASC</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <source>Defense against adversarial attacks</source>
          [47]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Lmd: A learnable mask network to on spoofing countermeasures of asv, arXiv preprint detect adversarial examples for speaker verification</article-title>
          , arXiv:
          <year>2003</year>
          .
          <volume>03065</volume>
          (
          <year>2020</year>
          ).
          <source>arXiv preprint arXiv:2211.00825</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Towards understanding</article-title>
          and mitigat-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>