<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The defender's perspective on automatic speaker verification: An overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haibin Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiawen Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lingwei Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helen Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hung-yi Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Systems Engineering &amp; Engineering Management, The Chinese University of Hong Kong</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate Institute of Communication Engineering, National Taiwan University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Automatic speaker verification (ASV) plays a critical role in security-sensitive environments. Regrettably, the reliability of ASV has been undermined by the emergence of spoofing attacks, such as replay and synthetic speech, as well as adversarial attacks and the relatively new partially fake speech. While there are several review papers that cover replay and synthetic speech, and adversarial attacks, there is a notable gap in a comprehensive review that addresses defense against adversarial attacks and the recently emerged partially fake speech. Thus, the aim of this paper is to provide a thorough and systematic overview of the defense methods used against these types of attacks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The past few years have witnessed significant advances
in ASV, and this technique is now widely integrated into
daily life, including voice activation in smartphones and
e-banking authentication. However, ASV is seriously vulnerable
to malicious spoofing attacks, including tactics such as replay
and synthetic speech, adversarial attacks, and the recently
emerged partially fake speech.</p>
      <p>Figure 1: The partially fake audio generation process. A
small clip is selected from the user's utterance, the content is
recognized using Automatic Speech Recognition (ASR) and edited to
manipulate the meaning of the recognized speech. The fake clip is
then generated using Text-to-Speech (TTS) or Voice Conversion (VC),
and inserted into the genuine utterance to generate the partially
fake speech.</p>
      <p>
        While there are several review papers that cover replay and synthetic speech [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], and adversarial attacks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], there is a notable gap in a comprehensive review that addresses defense methods against
adversarial attacks and the recently emerged partially fake speech.
The objective of this paper is to provide a thorough and
systematic overview of the defense methods used against
these two types of attacks. It is hoped that this overview will
inspire further research within the ASV community.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Attacks</title>
      <sec id="sec-2-1">
        <title>2.1. Partially fake speech</title>
        <p>
          The first Audio Deep Synthesis Detection challenge (ADD
2022) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduced a brand new kind of attack, known as
the partially fake speech attack [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The ASVspoof
challenge [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
          ] focuses on generating spoofing speech
in its entirety, ignoring the scenario of partially fake
speech, where small fake clips are hidden within a piece
of real speech. The generation of partially fake audio
involves the insertion of only small clips of synthetic
speech into the real speech as shown in Figure 1,
resulting in even more stealthy fake speech containing a
significant amount of the genuine user’s audio.
        </p>
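        <p>To make the generation process in Figure 1 concrete, the following is a minimal sketch (not the pipeline of any specific cited work) of splicing a synthesized clip into a genuine waveform, with a short linear crossfade to smooth the transition boundaries; the waveforms, insertion point, and sample rate are hypothetical placeholders.</p>
        <preformat>
import numpy as np

def insert_fake_clip(genuine, fake_clip, insert_sample, fade_ms=10.0, sr=16000):
    """Splice a synthesized clip into a genuine waveform, crossfading both
    junctions to reduce audible discontinuities at the boundaries."""
    fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)

    head = genuine[:insert_sample].copy()
    tail = genuine[insert_sample:].copy()
    clip = fake_clip.copy()

    # Blend the end of the genuine head with the start of the fake clip.
    head[-fade:] = head[-fade:] * (1 - ramp) + clip[:fade] * ramp
    # Blend the end of the fake clip with the start of the genuine tail.
    clip[-fade:] = clip[-fade:] * (1 - ramp) + tail[:fade] * ramp

    return np.concatenate([head, clip[fade:], tail[fade:]])

# Hypothetical example: 3 s of genuine speech, a 0.5 s synthesized clip
# inserted at the 1 s mark (16 kHz sampling rate assumed).
genuine = np.random.randn(3 * 16000).astype(np.float32)
fake_clip = np.random.randn(8000).astype(np.float32)
partially_fake = insert_fake_clip(genuine, fake_clip, insert_sample=16000)
        </preformat>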
        <p>
          Previous studies [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ] have shown that it is challenging to differentiate between partially fake and genuine
audios by directly using existing state-of-the-art
countermeasure models fostered by the ASVspoof challenge [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
          ]. These countermeasure models address the
problem of identifying whether an entire audio utterance
is genuine or fabricated. However, they are not equipped
to identify anomalous regions within a single utterance.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Adversarial attacks</title>
        <p>Figure 2: A tiny adversarial noise is added to the original wave to obtain the adversarial wave, which fools the ASV system into falsely accepting the attacker.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] exposes the adversarial weakness of countermeasure models, and [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] further enhances the transferability of adversarial attacks through model ensemble.
        </p>
        <sec id="sec-3-0">
          <title>3. Defense methods</title>
          <sec id="sec-3-1">
            <title>3.1. Tackle partially fake speech attacks</title>
            <p>
              Partially fake speech attacks are generated as shown in Figure 1. As this kind of attack is brand new, there have
been only a few initiatives to handle it, and we
categorize these efforts into two categories as shown in
Figure 3: transition boundary detection [
              <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
              ] and segment level classification [
              <xref ref-type="bibr" rid="ref19 ref20 ref7 ref8">7, 8, 19, 20</xref>
              ].
            </p>
            <p>Figure 3: The two categories of methods to tackle partially fake speech attacks. The black and red parts of the utterance are real and fake, respectively. The first approach, illustrated in the blue block, focuses on detecting the transition boundaries between the genuine and fake segments. The second approach, depicted in the orange block, endeavors to distinguish between genuine and fake short segments.</p>
            <sec id="sec-3-1-1">
              <title>3.1.1. SSL-based feature extractor</title>
              <p>
                Before delving into the two main approaches, let us first
examine the feature engineering aspect of the task. Lv
et al. [
                <xref ref-type="bibr" rid="ref21">21</xref>
                ] are the pioneers in utilizing self-supervised
learning (SSL) models to tackle partially fake speech
attacks. Rather than using traditional acoustic features,
they adopt XLS-R [
                <xref ref-type="bibr" rid="ref22">22</xref>
                ], a self-supervised learning model, as the feature extractor. Their method [
                <xref ref-type="bibr" rid="ref21">21</xref>
                ], which involves simply adding a lightweight prediction
head on top of the XLS-R model and fine-tuning the large
XLS-R model, ultimately achieved first place out of 33
international teams in the ADD challenge [
                <xref ref-type="bibr" rid="ref6">6</xref>
                ].
              </p>
              <p>
                Their efforts [
                <xref ref-type="bibr" rid="ref21">21</xref>
                ] have taught us a valuable lesson:
the acoustic features extracted by a fine-tuned
self-supervised learning model can be incredibly helpful for
detecting partially fake speech. It is worth noting that the
two main approaches introduced below can also harness
the power of self-supervised learning models, provided
there are sufficient computing resources available.
              </p>
            </sec>
          </sec>
        </sec>
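        <p>To make the recipe of [21] more concrete, the sketch below wraps a generic SSL encoder with a lightweight prediction head; the encoder class, its 1024-dimensional output, and the pooling choice are illustrative assumptions rather than the authors' exact implementation.</p>
        <preformat>
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Placeholder standing in for a pretrained SSL model such as XLS-R."""
    def forward(self, wav):                      # wav: (batch, samples)
        n_frames = wav.shape[1] // 320           # roughly 20 ms hop assumed
        return torch.randn(wav.shape[0], n_frames, 1024)

class SSLDetector(nn.Module):
    """SSL encoder plus a lightweight head predicting genuine (0) vs. fake (1)."""
    def __init__(self, ssl_encoder, feat_dim=1024, finetune_encoder=True):
        super().__init__()
        self.encoder = ssl_encoder
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 2))
        if not finetune_encoder:                 # optionally freeze the backbone
            for p in self.encoder.parameters():
                p.requires_grad = False

    def forward(self, wav):
        feats = self.encoder(wav)                # (batch, frames, feat_dim)
        pooled = feats.mean(dim=1)               # simple temporal average pooling
        return self.head(pooled)                 # (batch, 2) logits

model = SSLDetector(DummyEncoder())
logits = model(torch.randn(4, 32000))            # four 2-second utterances
        </preformat>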
        <sec id="sec-2-2-1">
          <title>3.1.2. Transition boundary detection</title>
          <p>
            [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] is the first to introduce the transition boundary
detection task for partially fake audio detection. The
transition boundaries contain artifacts, such as discontinuity
in speech and inconsistencies in ambient noise. Inspired
by the extraction-based question-answering models [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] used in natural language processing (NLP), we refer to the
boundary detection task as a question-answering or fake
span discovery proxy task. In this task, the model is
required to answer the question "where is the fake clip?" in
a piece of partially fake audio. Extraction-based
question-answering models in NLP typically take a question and a
passage as input, construct representations for the
passage and the question, match the question and passage
embeddings, and then output the start and end positions
of the answer within the passage. In our case, the passage
is the partially fake utterance, and the answer is the start
and end time of the fake clip. As depicted in the blue
block of Figure 3, when the model is presented with a
boundary frame between a real (black) and a fake (red)
clip, it should predict "1". Conversely, when the model is
presented with a non-boundary frame, it should predict
"0". By training the model on the question-answering
proxy task, the model can learn to find the concatenation
boundaries with discontinuity and identify fake clips
within an utterance, thus improving its ability to
distinguish between audios with and without fake clips. The
proposed method placed second out of 33
international teams in the ADD challenge [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], even without the assistance of self-supervised learning features.
          </p>
          <p>
            Wang et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] divide the entire utterance into several
chunks, and extract acoustic features from each
chunk to feed into the deep learning model. The model is
then tasked with determining whether a boundary exists
within the given chunk by predicting "1" if the chunk
contains a boundary, or "0" if it does not. Through
training, the model gains the ability to identify clues such as
speech discontinuity or inconsistencies in ambient noise,
allowing it to effectively highlight potential boundaries.
          </p>
          <p>
            Cai et al. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] propose to introduce a self-supervised
learning model for frame-level boundary detection to
detect partially fake speech. They modify the method
in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] to further boost the detection performance: 1) instead
of solely focusing on transition boundaries that
indicate inconsistency and discontinuity, [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] proposes setting the frames near the boundaries as boundaries
as well, to increase robustness; 2) [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] employs wav2vec 2.0 [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], a self-supervised learning model, as the feature extractor
and also fine-tunes the feature extractor during training.
Utilizing the features from wav2vec 2.0 improves the
performance by a relative 58.25% compared to traditional
acoustic features extracted by digital signal processing
front-ends.
          </p>
          <p>
            The main takeaway from this subsection is
that transition boundaries can serve as a useful cue to
identify partially fake audio, as they indicate discontinuity
and inconsistency in speech. By tasking models with
detecting these boundaries, they can learn to identify
these cues and detect partially fake speech.
          </p>
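          <p>As an illustration of the frame-level targets used by the boundary detection proxy task, the helper below marks frames at (or near) known concatenation points as "1" and all others as "0"; the frame rate, widening width, and example boundary positions are illustrative assumptions.</p>
          <preformat>
import numpy as np

def boundary_labels(n_frames, boundary_frames, widen=2):
    """Return a 0/1 vector: 1 for frames within `widen` frames of a
    concatenation boundary, 0 elsewhere."""
    labels = np.zeros(n_frames, dtype=np.int64)
    for b in boundary_frames:
        lo, hi = max(0, b - widen), min(n_frames, b + widen + 1)
        labels[lo:hi] = 1
    return labels

# Example: a 4 s utterance at 50 frames/s (20 ms frames) with a fake clip
# spliced in between frames 80 and 120.
labels = boundary_labels(n_frames=200, boundary_frames=[80, 120])
print(labels[78:83], int(labels.sum()))
          </preformat>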
        </sec>
        <sec id="sec-2-2-2">
          <title>3.1.3. Segment level classification</title>
          <p>
            The goal of segment level classification is to distinguish
between genuine and fake segments. The short segments
have different time resolutions, ranging from 1 frame
(around 20 ms) to the entire utterance. Segments that
only contain genuine speech will be labeled as "1", while
all other segments will be labeled as "0", as shown in the
orange block of Figure 3. Zhang et al. [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] make the initial
attempt to conduct segment level classification for
partially fake speech detection with a fixed time resolution.
In their subsequent work [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ], they propose to train
the countermeasure model with both utterance level
classification and segment level classification. To
further boost the countermeasure's performance, they [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]
introduce self-supervised learning models [
            <xref ref-type="bibr" rid="ref24 ref25">25, 24</xref>
            ]
as the front-end feature extractor, and enable the model
to learn segment level classification with different time
resolutions, ranging from 1 frame to the entire utterance.
          </p>
          <p>
            The time resolution used in segment level classification
is a crucial hyperparameter for training. If the segment's
frame number is too small, the model may not extract
enough information to distinguish between genuine and
fake segments. On the other hand, if the frame number
is too large, the proportion of fake frames may be too
small, resulting in fake frames being dominated by
genuine frames. Enabling the model to learn from different
time resolutions [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] is a reasonable solution to bypass
this hyperparameter search. Note that in Figure 1, the
inserted red clip can be from other genuine users. The
segment level classification works [
            <xref ref-type="bibr" rid="ref19 ref20 ref8">8, 19, 20</xref>
            ] do not take
this condition into account, as in their produced datasets
the inserted clips are always fake.
          </p>
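          <p>The labeling convention described above can be illustrated with a small helper that derives segment labels at several time resolutions from a frame-level genuine/fake mask; the frame counts and the fake-region position are illustrative assumptions.</p>
          <preformat>
import numpy as np

def segment_labels(frame_is_genuine, seg_len):
    """Label each non-overlapping segment of `seg_len` frames as 1 only if
    every frame inside it is genuine, otherwise 0."""
    n_segs = len(frame_is_genuine) // seg_len
    trimmed = frame_is_genuine[:n_segs * seg_len].reshape(n_segs, seg_len)
    return trimmed.all(axis=1).astype(np.int64)

# Frame-level mask for a 200-frame utterance whose frames 80-119 are fake.
frame_is_genuine = np.ones(200, dtype=bool)
frame_is_genuine[80:120] = False

# The same utterance labeled at several time resolutions (in frames).
for seg_len in (1, 10, 50, 200):
    print(seg_len, segment_labels(frame_is_genuine, seg_len))
          </preformat>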
        </sec>
        <sec id="sec-2-2-3">
          <title>3.2.1. Model enhancement</title>
          <p>
            [
            <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
            ] adopt adversarial training to alleviate the
vulnerability of ASV against adversarial attacks. Wu
et al. [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] also investigate improving the adversarial
robustness for countermeasures by adversarial training.
          </p>
          <p>Model enhancement methods involve modifying the
model’s parameters, and they can usually work together
with purification and detection methods.</p>
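          <p>A minimal sketch of adversarial training in the spirit of [26, 27, 28, 29] is given below: each mini-batch is augmented with perturbed inputs generated on the fly. A single-step FGSM perturbation and a toy waveform classifier are simplifying assumptions; the cited works differ in attack algorithms and model architectures.</p>
          <preformat>
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, epsilon=0.001):
    """One-step FGSM perturbation of a waveform batch."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def adversarial_training_step(model, optimizer, x, y, epsilon=0.001):
    """Train on clean and adversarial versions of the same mini-batch."""
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()
    loss = (nn.functional.cross_entropy(model(x), y)
            + nn.functional.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy classifier over raw waveforms (placeholder for an ASV or countermeasure).
model = nn.Sequential(nn.Linear(16000, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = adversarial_training_step(model, opt, torch.randn(8, 16000),
                                 torch.randint(0, 2, (8,)))
          </preformat>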
        </sec>
        <sec id="sec-2-2-4">
          <title>3.2.2. Adversarial sample purification</title>
          <p>Previous efforts for purification can be classified into 5
categories: lossy pre-processing, adding noise, generative
method, denoising method, and filtering.</p>
          <p>
            The "lossy pre-processing" approach treats adversarial
perturbations as redundant information and discards it to
improve the model's adversarial robustness. Chen et al.
[
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] consider adversarial perturbations as redundant
information and use lossy speech compression techniques
to mitigate these perturbations. Quantization [
            <xref ref-type="bibr" rid="ref31 ref30">31, 30</xref>
            ]
involves rounding each audio sample point to the nearest
integer multiple of a quantization factor, which can impact the
fragile adversarial perturbations. Chen et al. [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] propose to
do k-means [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] on the acoustic features to get clusters
of acoustic features, and use the clusters to represent the
acoustic features.
          </p>
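          <p>The lossy pre-processing and filtering ideas above can be illustrated with generic waveform transforms (quantization to a fixed step, median filtering, and additive Gaussian noise); these are simple signal-processing sketches, not the exact configurations used in the cited works.</p>
          <preformat>
import numpy as np
from scipy.signal import medfilt

def quantize(wav, step=1.0 / 256):
    """Round each sample to the nearest multiple of `step` (lossy pre-processing)."""
    return np.round(wav / step) * step

def median_smooth(wav, kernel=5):
    """Local smoothing with a median filter."""
    return medfilt(wav, kernel_size=kernel)

def add_gaussian(wav, sigma=0.002):
    """Randomized-smoothing-style additive Gaussian noise."""
    return wav + np.random.normal(0.0, sigma, size=wav.shape)

wav = 0.1 * np.random.randn(16000)
purified = add_gaussian(median_smooth(quantize(wav)))
          </preformat>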
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.2. Defense against adversarial attacks</title>
        <p>We propose to classify the defense methods into three
categories, with the timeline of related works shown in the
corresponding figure: 1) model enhancement improves the robustness of the
model itself, typically through adversarial training with
effective adversarial examples; 2) adversarial sample
purification aims to alleviate the superficial adversarial
noise and transform adversarial samples into genuine
samples; 3) adversarial sample detection aims to
distinguish between adversarial and genuine samples, allowing
the identification and removal of adversarial samples.</p>
        <p>
          The "adding noise" approach aims to disrupt and
neutralize adversarial perturbations by introducing
additional noise, typically Gaussian. Randomized smoothing
[
          <xref ref-type="bibr" rid="ref31 ref33 ref30 ref34">31, 33, 30, 34</xref>
          ] involves adding random Gaussian noise
to the input utterances before sending them to the ASV
to counter the adversarial perturbations. [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] adopts
the idea of "voting for the right answer" to prevent risky
decisions of ASV in blind spot areas. To achieve this, they
sample the neighbors of a given utterance by random
sampling using Gaussian noise, and allow the neighbors
to vote on whether the utterance should be accepted by
the ASV model or not, rather than relying solely on the
prediction of the single utterance. Olivier et al. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] propose
an enhanced version that adds Gaussian noise to the
high-frequency region rather than the entire utterance.
        </p>
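        <p>A minimal sketch of the voting idea attributed to [35] is shown below: Gaussian-noise neighbors of the test utterance are scored against the enrollment, and the utterance is accepted only if a majority of neighbors exceed the decision threshold. The ASV scoring function, noise level, and threshold are placeholders.</p>
        <preformat>
import numpy as np

def vote_accept(asv_score, enroll, test_wav, threshold,
                n_neighbors=16, sigma=0.002):
    """Accept only if most noisy neighbors of the test utterance score
    above the ASV decision threshold."""
    votes = 0
    for _ in range(n_neighbors):
        neighbor = test_wav + np.random.normal(0.0, sigma, size=test_wav.shape)
        votes += int(asv_score(enroll, neighbor) >= threshold)
    return votes > n_neighbors // 2

def toy_score(enroll, test):
    """Placeholder scorer: cosine similarity between raw waveforms."""
    denom = np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-8
    return float(np.dot(enroll, test) / denom)

enroll = np.random.randn(16000)
test = enroll + np.random.normal(0.0, 0.01, 16000)
accepted = vote_accept(toy_score, enroll, test, threshold=0.5)
        </preformat>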
        <p>
          The "denoising method" treats adversarial noise as a
specific kind of noise and aims to estimate and eliminate
it. Chang et al. [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] suggest using a denoising algorithm
tailored for Gaussian noise, and they contend that
the denoising algorithm can also cleanse the adversarial
noise. Zhang et al. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] propose to employ an adversarial
separation network, which is trained using
adversarial-genuine data pairs, to estimate and purify the
adversarial noise. This method requires prior knowledge
of adversarial sample generation.
        </p>
        <p>
          The "generative method" approach typically involves
training a generative model to model the genuine data
manifolds and using this model to pull the adversarial
samples towards the genuine data manifolds. Wu et al.
[
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] propose SSLM-based reconstruction to alleviate
the superficial adversarial noise and maintain key
information for genuine samples. They [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] utilize
self-supervised learning models to extract key features
from the adversarial samples, and do reconstruction to
pull the inputs to the genuine data manifold. Joshi et
al. [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] use the encoder of a VAE [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] to project testing
data onto a latent posterior that aligns with the genuine
manifold. They then use the decoder to re-generate the
input data based on the hidden embedding sampled from
the latent posterior, thereby purifying superficial
adversarial noise. Joshi et al. [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] also borrow DefenseGAN
from computer vision [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. DefenseGAN projects
the testing data, either adversarial or genuine, into the
low-dimensional manifold of genuine data to get the
hidden embeddings, and then re-generates the testing data by
the generator using such embeddings.
        </p>
        <p>
          "Filtering", also known as local smoothing, helps
smooth and alleviate the superficial adversarial
perturbations. Local smoothing involves applying Gaussian,
mean, and median filters to the waveform to purify the
adversarial noise. [
          <xref ref-type="bibr" rid="ref31 ref30">31, 30</xref>
          ] and [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] utilize local smoothing to defend ASV and countermeasures, respectively.
        </p>
        <sec id="sec-2-3-1">
          <title>3.2.3. Adversarial sample detection</title>
          <p>
            The detection methods can be classified into two
categories based on whether they require prior knowledge
about adversarial sample generation: attack-dependent
or attack-independent detection methods.
          </p>
          <p>
            The attack-dependent methods usually leverage
deep learning models to implicitly find cues to
differentiate between specific kinds of adversarial samples
and genuine samples, using both adversarial and genuine
data. Li et al. [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ] propose to train a detector using the
binary classification loss to distinguish the adversarial
and genuine samples. They find their detector is unable
to detect unseen adversarial samples derived by other
adversarial attack algorithms that are not used during
training. Based on the observation that different kinds of adversarial
samples attain different attack signatures, Villalba et al. [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ] propose to train an x-vector [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] system to extract the
bottleneck features as the attack signatures using various
types of adversarial samples. After training the x-vector
system, attack signatures will be extracted for different
types of attacks. During inference, the testing utterance
is inputted, and the x-vector feature extractor will extract
the hidden embeddings. These embeddings are then
compared with the enrolled attack signatures to determine
whether the testing utterance is an adversarial sample
or not. To further improve the performance of the
attack signature extractor, Joshi et al. [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] propose training
the attack signature extractor using adversarial
perturbations instead of adversarial examples. They argue that
the adversarial perturbations eliminate redundant
information from the adversarial samples. They then train an
adversarial perturbation estimator to extract adversarial
perturbations from the input utterance and use the attack
signature extractor to extract hidden features to detect
the adversarial samples.
          </p>
          <p>
            Attack-independent methods treat the detection of
adversarial samples as an anomaly detection problem.
Genuine data samples always exhibit some properties that
are absent or different for adversarial samples.
Therefore, attack-independent detection methods can exploit
the inconsistency of these internal properties to
distinguish between adversarial and genuine samples. Wu et al.
[
            <xref ref-type="bibr" rid="ref38">38</xref>
            ] leverage the ASV score difference before and after
putting the testing utterance into SSLMs as an indicator
to differentiate between adversarial and genuine samples.
Specifically, for genuine samples, the ASV score
difference before and after putting the utterance into SSLMs
is small, while for adversarial samples, the difference
is large. Peng et al. [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] propose to detect adversarial
samples using twin ASV models, including one premier
model that is exposed to attackers and is fragile under
adversarial attack, and one mirror model that is robust to
adversarial attacks and cannot be accessed by attackers.
When a genuine sample is inputted, both the premier
and mirror models produce similar predictions.
However, when an adversarial sample is inputted, the models
produce different predictions. Peng et al. [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] leverage
this score inconsistency between genuine and adversarial
samples to detect adversarial samples. Wu et al. [
            <xref ref-type="bibr" rid="ref45">45</xref>
            ] utilize vocoders to re-synthesize the input utterance and
find that the difference between the ASV scores for the
original and re-synthesized utterance is a good indicator
for discrimination between genuine and adversarial
samples. To be specific, the score difference for adversarial
samples is large, while it is small for genuine samples.
Chen et al. [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] utilize two kinds of hand-crafted masks
to detect adversarial samples: they mask parts of the
input speech features. They claim the masked parts
contain less speaker information and won't affect the ASV
scores for genuine samples too much, but will greatly
impact the adversarial samples. By comparing the absolute
difference of scores before and after masking, they are
able to detect adversarial examples. The two masks used
are MLFB-H, which masks the high frequencies of
LogFBank, and MLFB-D, which masks the time-frequency bins
whose absolute values of their one-order difference along
the frequency axis are smaller than a threshold. Chen et
al. [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ] further enhance the detection performance by
learning such a mask matrix with a deep recurrent network,
rather than using hand-crafted masks.
          </p>
        </sec>
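        <p>The attack-independent detectors based on score differences ([38], [45]) share a simple template: score the original input against the enrollment, score a transformed copy (SSLM reconstruction or vocoder re-synthesis), and flag the input as adversarial when the difference exceeds a tuned threshold. The sketch below uses placeholder scoring and re-synthesis functions purely for illustration.</p>
        <preformat>
import numpy as np

def score_difference_detector(asv_score, transform, enroll, test_wav, tau=0.2):
    """Flag `test_wav` as adversarial if the ASV score changes by more than
    `tau` after re-synthesising (or reconstructing) the utterance."""
    original = asv_score(enroll, test_wav)
    resynth = asv_score(enroll, transform(test_wav))
    return abs(original - resynth) > tau

def toy_score(enroll, test):
    """Placeholder scorer: cosine similarity between raw waveforms."""
    denom = np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-8
    return float(np.dot(enroll, test) / denom)

def toy_resynthesis(wav, k=8):
    """Placeholder re-synthesis: a crude moving-average smoothing."""
    return np.convolve(wav, np.ones(k) / k, mode="same")

enroll = np.random.randn(16000)
probe = enroll + np.random.normal(0.0, 0.01, 16000)
is_adversarial = score_difference_detector(toy_score, toy_resynthesis, enroll, probe)
        </preformat>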
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Future directions</title>
      <p>
        For the future directions of partially fake speech attacks:
1). Data collection. The collection of data is a crucial
component in developing an effective defense system
against partially fabricated speech. Only 100k utterances
are collected by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for partially fake detection, and the
transition boundaries are not stealthy enough. To this
end, there exists a pressing need to investigate the
generation of more data with discreet transition boundaries,
while carefully considering the linguistic and acoustic
characteristics involved. This undertaking is of great
significance and warrants further exploration. 2). Reduce
training efforts. The state-of-the-art (SOTA)
methodology for partially fake speech detection involves
fine-tuning of the entire SSLMs. The SSLM in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] has
2 billion parameters, which presents a challenge for
academic researchers when attempting to fine-tune the
model. Several works have emerged that offer promising
avenues for minimizing training efforts while maximizing
the benefits of SSLMs, including linear probing, adapter,
and prompt techniques; a sketch of linear probing is shown below.
Exploring these approaches may
significantly enhance the efficiency of adopting SSLMs
for partially fake speech detection. 3). Model
compression. The current state-of-the-art detection method relies
heavily on large-scale SSLMs. The parameter number of
the SSLM used in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is 2 billion. Therefore,
investigating approaches to reduce the model size is a
crucial research endeavor. This issue warrants
considerable attention, as it has significant implications for the
scalability, computational efficiency, and generalizability
of partially fake speech detection systems.
      </p>
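      <p>Linear probing, one of the low-cost adaptation routes mentioned above, amounts to freezing the SSL backbone and training only a small head. The sketch below uses a placeholder backbone; the feature dimension and head size are illustrative assumptions.</p>
      <preformat>
import torch.nn as nn

def make_linear_probe(ssl_encoder, feat_dim=1024, n_classes=2):
    """Freeze the SSL backbone and return a small trainable head."""
    for p in ssl_encoder.parameters():
        p.requires_grad = False              # the backbone stays fixed
    head = nn.Linear(feat_dim, n_classes)    # only these weights are trained
    trainable = sum(p.numel() for p in head.parameters())
    frozen = sum(p.numel() for p in ssl_encoder.parameters())
    print(f"training {trainable} parameters, keeping {frozen} frozen")
    return head

# Placeholder backbone standing in for a multi-billion-parameter SSLM.
backbone = nn.Sequential(nn.Linear(400, 1024), nn.ReLU(), nn.Linear(1024, 1024))
probe_head = make_linear_probe(backbone)
      </preformat>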
      <p>The re-synthesis-based adversarial sample detection
methods achieve the SOTA [45, 46, 47]. An effective
audio re-synthesis method for adversarial sample
detection must possess two critical properties. Firstly, the
score variations between the original and re-synthesized
utterances should be minimal for genuine samples.
Secondly, the score variations between the original and
re-synthesized utterances for adversarial samples should
be substantial. Investigating approaches for refining the
design of audio re-synthesis methods to further optimize
these properties represents a valuable research direction.
By enhancing the efficacy of the audio re-synthesis
method, it would be possible to improve the reliability
and accuracy of detection systems.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This paper reviews the defense methods against
adversarial attacks and the recently emerged partially fake
speech attacks. We hope the comprehensive review
and comparisons can inspire future works to boost the
robustness of ASV. Further investigation is needed to
explore the future directions outlined in Section 4.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          , et al.,
          <year>Asvspoof 2021</year>
          :
          <article-title>accelerating progress in spoofed and deepfake speech detection</article-title>
          ,
          <source>arXiv preprint arXiv:2109.00537</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          , et al.,
          <year>Asvspoof 2019</year>
          :
          <article-title>Future horizons in spoofed and fake audio detection</article-title>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>05441</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          , et al.,
          <article-title>The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <year>Asvspoof 2015</year>
          :
          <article-title>the first automatic speaker verification spoofing and countermeasures challenge</article-title>
          ,
          <source>in: Sixteenth Annual Conference of the ISCA</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attack and defense strategies of speaker recognition systems: A survey</article-title>
          ,
          <source>Electronics</source>
          <volume>11</volume>
          (
          <year>2022</year>
          )
          <fpage>2183</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          , et al.,
          <year>Add 2022</year>
          :
          <article-title>the first audio deep synthesis detection challenge</article-title>
          ,
          <source>in: IEEE ICASSP</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          , et al.,
          <article-title>Half-truth: A partially fake audio detection dataset</article-title>
          ,
          <source>in: Interspeech</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1654</fpage>
          -
          <lpage>1658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>An initial investigation for detecting partially spoofed audio</article-title>
          ,
          <source>arXiv preprint arXiv:2104.02518</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          , et al.,
          <article-title>Sok: The faults in our asrs: An overview of attacks against automatic speech recognition and speaker identification systems</article-title>
          , in: IEEE SP,
          <year>2021</year>
          , pp.
          <fpage>730</fpage>
          -
          <lpage>747</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>R. K. Das</surname>
          </string-name>
          , et al.,
          <article-title>The attacker's perspective on automatic speaker verification: An overview</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>08849</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kreuk</surname>
          </string-name>
          , et al.,
          <article-title>Fooling end-to-end speaker verification with adversarial examples</article-title>
          ,
          <source>in: IEEE ICASSP</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1962</fpage>
          -
          <lpage>1966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attacks on gmm i-vector based speaker verification systems</article-title>
          , in: IEEE ICASSP,
          <year>2020</year>
          , pp.
          <fpage>6579</fpage>
          -
          <lpage>6583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          , et al.,
          <article-title>x-vectors meet adversarial attacks: ing audio adversarial examples for speaker recogBenchmarking adversarial robustness in speaker nition</article-title>
          ,
          <source>IEEE TDSC</source>
          (
          <year>2022</year>
          ). verification, ISCA Interspeech (
          <year>2020</year>
          )
          <fpage>4233</fpage>
          -
          <lpage>4237</lpage>
          . [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Who is real bob? adversarial attacks</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attacks on spoofing coun- on speaker recognition systems, arXiv preprint termeasures of automatic speaker verification</article-title>
          , in: arXiv:
          <year>1911</year>
          .
          <year>01840</year>
          (
          <year>2019</year>
          ).
          <source>IEEE ASRU</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>319</lpage>
          . [32]
          <string-name>
            <surname>J. A. Hartigan M. A. Wong</surname>
          </string-name>
          , Algorithm as 136:
          <article-title>A k-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Black-box attacks on spoofing coun- means clustering algorithm, Journal of the royal statermeasures using transferability of adversarial ex- tistical society</article-title>
          . series c (applied statistics)
          <volume>28</volume>
          (
          <year>1979</year>
          )
          <article-title>amples</article-title>
          .,
          <source>in: ISCA Interspeech</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4238</fpage>
          -
          <lpage>4242</lpage>
          .
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <source>Partially fake audio detection by</source>
          [33]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          , et al.,
          <article-title>Defending against adversarial self-attention-based fake span discovery, in: IEEE attacks in speaker verification systems</article-title>
          , in: IEEE ICASSP, IEEE,
          <year>2022</year>
          , pp.
          <fpage>9236</fpage>
          -
          <lpage>9240</lpage>
          . IPCCC,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          , et al.,
          <source>Waveform boundary detection</source>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Joshi</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attacks and defenses for partially spoofed audio, arXiv preprint for speaker identification systems</article-title>
          ,
          <source>arXiv preprint arXiv:2211.00226</source>
          (
          <year>2022</year>
          ). arXiv:
          <volume>2101</volume>
          .08909 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <source>Synthetic voice detection and audio</source>
          [35]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Voting for the right answer: Adversarsplicing detection using se-res2net-conformer ar- ial defense for speaker verification, arXiv preprint chitecture</article-title>
          ,
          <source>arXiv preprint arXiv:2210.03581</source>
          (
          <year>2022</year>
          ). arXiv:
          <volume>2106</volume>
          .07868 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Multi-task learning in utterance-</article-title>
          [36]
          <string-name>
            <given-names>R.</given-names>
            <surname>Olivier</surname>
          </string-name>
          , et al.,
          <article-title>High-frequency adversarial delevel and segmental-level spoof detection, arXiv fense for speech and audio</article-title>
          , in: IEEE ICASSP,
          <year>2021</year>
          . preprint arXiv:
          <volume>2107</volume>
          .14132 (
          <year>2021</year>
          ). [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <source>Adversarial separation network</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>The partialspoof database and coun- for speaker recognition</article-title>
          .,
          <source>in: Interspeech</source>
          ,
          <year>2020</year>
          .
          <article-title>termeasures for the detection of short fake speech</article-title>
          [38]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Improving the adversarial robustness segments embedded in an utterance, IEEE/ACM for speaker verification by self-supervised learnTransactions on Audio, Speech, and Language Pro- ing</article-title>
          , IEEE/ACM Transactions on Audio, Speech, cessing (
          <year>2022</year>
          ).
          <source>and Language Processing</source>
          <volume>30</volume>
          (
          <year>2021</year>
          )
          <fpage>202</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <source>Fake audio detection based on unsuper-</source>
          [39]
          <string-name>
            <surname>D. P. Kingma M. Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational vised pretraining models</article-title>
          , in: IEEE ICASSP, IEEE, bayes,
          <source>arXiv preprint arXiv:1312.6114</source>
          (
          <year>2013</year>
          ).
          <year>2022</year>
          , pp.
          <fpage>9231</fpage>
          -
          <lpage>9235</lpage>
          . [40]
          <string-name>
            <given-names>P.</given-names>
            <surname>Samangouei</surname>
          </string-name>
          , et al.,
          <article-title>Defense-gan: Protecting clas-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Xls-</surname>
          </string-name>
          r:
          <article-title>Self-supervised cross-lingual sifiers against adversarial attacks using generative speech representation learning at scale, arXiv models</article-title>
          , arXiv preprint arXiv:
          <year>1805</year>
          .
          <volume>06605</volume>
          (
          <year>2018</year>
          ). preprint arXiv:
          <volume>2111</volume>
          .09296 (
          <year>2021</year>
          ). [41]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <source>Investigating robustness of adversarial</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>A. M. N. Allam M. H. Haggag</surname>
          </string-name>
          ,
          <article-title>The question an- samples detection for automatic speaker verificaswering systems: A survey</article-title>
          ,
          <source>IJRRIS</source>
          <volume>2</volume>
          (
          <year>2012</year>
          ). tion, arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>06186</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          , et al.,
          <year>wav2vec</year>
          2.
          <article-title>0: A framework for</article-title>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          , et al.,
          <article-title>Representation learning to clasself-supervised learning of speech representations, sify and detect adversarial attacks against speaker Advances in neural information processing systems and speech recognition systems</article-title>
          ,
          <source>arXiv preprint 33</source>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          . arXiv:
          <volume>2107</volume>
          .04448 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Wavlm:</surname>
            Large-scale self-supervised [43]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Joshi</surname>
          </string-name>
          , et al.,
          <article-title>Advest: Adversarial perturbation pre-training for full stack speech processing, IEEE estimation to classify and detect adversarial atJournal of Selected Topics in Signal Processing 16 tacks against speaker identification, arXiv preprint (</article-title>
          <year>2022</year>
          )
          <fpage>1505</fpage>
          -
          <lpage>1518</lpage>
          . arXiv:
          <volume>2204</volume>
          .03848 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jati</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial attack</article-title>
          and defense strate- [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          , et al.,
          <article-title>Pairing weak with strong: Twin gies for deep speaker recognition systems, Com- models for defending against adversarial attack on puter</article-title>
          <source>Speech &amp; Language</source>
          <volume>68</volume>
          (
          <year>2021</year>
          )
          <article-title>101199</article-title>
          . speaker verification., in: Interspeech,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Du</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Sirenattack</surname>
            : Generating adversarial [45]
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial sample detection for audio for end-to-end acoustic systems, in: ACM speaker verification by neural vocoders</article-title>
          , in: IEEE ASIACCS,
          <year>2020</year>
          . ICASSP,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Adversarial regularization for end-</article-title>
          [46]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Masking speech feature to detect to-end robust speaker verification</article-title>
          ., in: Interspeech,
          <article-title>adversarial examples for speaker verification</article-title>
          ,
          <source>in: 2019. IEEE APSIPA ASC</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <source>Defense against adversarial attacks</source>
          [47]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Lmd: A learnable mask network to on spoofing countermeasures of asv, arXiv preprint detect adversarial examples for speaker verification</article-title>
          , arXiv:
          <year>2003</year>
          .
          <volume>03065</volume>
          (
          <year>2020</year>
          ).
          <source>arXiv preprint arXiv:2211.00825</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Towards understanding</article-title>
          and mitigat-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>