<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Single domain generalization for audio deepfake detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuankun Xie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haonan Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yutian Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Long Ye</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>State Key Laboratory of Media Convergence and Communication, Communication University of China</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>58</fpage>
      <lpage>63</lpage>
      <abstract>
        <p>Audio deepfake detection (ADD) is a prominent problem in artificial intelligence. With diverse spoofing attacks emerging continually, the generalization of ADD algorithms to unknown domains and their robustness in complex environments have become key issues for this field. However, when only limited and low-quality training data is available, as in the case of ADD 2023 Challenge Track 1.2, achieving good generalization and robustness remains an open issue. In this paper, we propose a Shuffle Mix Aggregation and Separation Domain Generalization (SM-ASDG) method, which enables single-domain generalization. Specifically, we first design a pre-processing module to improve the robustness of the method against low-quality data. Next, we split the single domain into multiple data domains via the proposed data shuffle module. Finally, a well-generalized feature space is constructed through the designed feature extractor and MixStyle domain classifier. The proposed SM-ASDG obtains a weighted equal error rate (WEER) of 23.17% on ADD Challenge Track 1.2, ranking in the top 5 of the challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio deepfake detection</kwd>
        <kwd>single domain generalization</kwd>
        <kwd>self-supervised representation</kwd>
        <kwd>ADD challenge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Audio deepfake detection (ADD) is an important yet challenging task that has raised serious concerns because of its high societal impact [<xref ref-type="bibr" rid="ref6 ref7 ref8">1, 2, 3</xref>]. The task aims to classify real and fake audio accurately, and one of its main challenges is to remain accurate in the face of unknown spoofing methods or low-quality audio.</p>
      <p>In recent years, several works [<xref ref-type="bibr" rid="ref10 ref9">4, 5</xref>] have achieved promising results on intra-domain datasets. However, the performance of these methods degrades significantly when they are extended to cross-domain scenarios [<xref ref-type="bibr" rid="ref11">6</xref>]. This is mainly because these methods do not take sufficient account of unknown domains and degraded audio quality. Consequently, generalization and robustness have become two key concerns for ADD.</p>
      <p>To address generalization and robustness issues, some methods [<xref ref-type="bibr" rid="ref7 ref8">2, 3</xref>] adopt data augmentation schemes to improve model performance by learning diverse audio features from a larger amount of data. Specifically, Piotr et al. [<xref ref-type="bibr" rid="ref12">7</xref>] utilize a combination of three deepfake and spoofing datasets to increase training stability. However, larger datasets also lead to higher computational costs. Moreover, as forgery techniques are constantly updated, there are always unknown attack methods outside the domain. Therefore, relying on data augmentation alone is not sufficient, and further strategies for improving generalizability are required.</p>
      <p>To this end, several methods adopt the domain-invariant representation learning (DIRL) strategy [<xref ref-type="bibr" rid="ref1 ref13 ref2">8, 9, 10</xref>] to overcome the problem of generalizing to unseen target domains with limited source data. The DIRL strategy aims to reduce the representation differences between multiple source domains to ensure domain invariance. However, when multiple source domains are not available, as in the ADD 2023 Audio Fake Game (FG) Challenge [<xref ref-type="bibr" rid="ref3">11</xref>], where there is only one acceptable training set, the DIRL strategy cannot be applied effectively. In addition, the performance of ADD methods degrades significantly when a large amount of noise, reverberation and other disturbances are mixed into the source domain data. Therefore, how to construct ADD models with good generalizability and robustness from single-domain, low-quality data is an open problem that remains to be explored.</p>
      <p>In this paper, we introduce a novel Shuffle Mix Aggregation and Separation Domain Generalization (SM-ASDG) method for single-domain ADD. The key idea of our approach is the assumption that, in an ideal classification feature space, the data distribution of real audio can be clustered in a single set, while the data distribution of fake audio should be more scattered. This is because different types of attacks have a larger impact on spoofed audio, although different recording devices or channels also have some impact on real audio. Based on this idea, we propose a modified DIRL strategy that can be applied to a single source domain. Specifically, the proposed SM-ASDG contains four modules: pre-processing, data shuffle, feature extractor and MixStyle domain classifier. First, the pre-processing module contains three carefully designed pre-processing strategies to eliminate the effects of noise and other factors on the model and to improve the robustness of the algorithm. Second, the data shuffle module is introduced to approximate a multi-source domain situation by splitting the single domain. Then, we construct a feature extractor based on W2V2-XLS-R [<xref ref-type="bibr" rid="ref4">12</xref>]. Finally, we propose a MixStyle domain classifier that mixes the feature statistics of training samples across source domains. In this way, the model can diversify the style information at the bottom layers of the network. Our proposed SM-ASDG method achieves outstanding results in the ADD 2023 Audio FG Challenge, demonstrating its effectiveness. In summary, our contributions are as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>We propose SM-ASDG, a highly efficient audio deepfake detection method which achieves a top-5 rank in ADD 2023 Challenge Track 1.2.</p>
        </list-item>
        <list-item>
          <p>A modified DIRL strategy is proposed for the situation where only a single source domain is available. The proposed domain generalization strategy can improve performance by 9% to 11% on different models.</p>
        </list-item>
        <list-item>
          <p>The effects of a series of pre-processing strategies are explored. In addition to common pre-processing methods such as noise addition and reverberation, we also explore the effect of silent frames on forgery identification performance.</p>
        </list-item>
      </list>
      <p>[Figure 1: Overview of the proposed SM-ASDG pipeline. The ADD 2023 dataset passes through pre-processing (low-pass filter, amplitude adjustment, noise augmentation), data shuffle, the feature extractor and the MixStyle domain classifier. The resulting feature space is trained with the triplet, BCE and adversarial losses and yields an audio score.]</p>
      <p>IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis (DADA 2023), August 19, 2023, Macao, S.A.R. * Corresponding author. † These authors contributed equally. Email: xieyuankun@cuc.edu.cn (Y. Xie); haonancheng@cuc.edu.cn (H. Cheng); wangyutian@cuc.edu.cn (Y. Wang); yelong@cuc.edu.cn (L. Ye). URL: https://haonancheng.cn/ (H. Cheng). ORCID: 0000-0002-8366-9011 (Y. Xie); 0000-0003-3407-4318 (H. Cheng). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <sec id="sec-2-1">
        <title>2.1. Preprocessing</title>
        <p>To address the effect of codec variability, we first adopt a low-pass filter [<xref ref-type="bibr" rid="ref5">13</xref>]. This is because, in complex speech scenarios, focusing on the low-frequency speech components can often make the model more effective. Specifically, we utilize a Chebyshev Type I low-pass filter to preprocess the original 16 kHz signal into a low-pass filtered signal. We set the order of the filter to 8, with the maximum ripple and critical frequency set to 0.05 and 4 kHz, respectively.</p>
        <p>We further adjust the amplitude of the signals, based on the observation that the amplitude of genuine speech differs from that of spoofed speech. In the training set, we observe that genuine speech has a higher amplitude than spoofed speech. This may cause the model to classify high-amplitude speech as genuine and low-amplitude speech as spoofed during inference. Thus, we compute the average amplitude of genuine and fake speech and increase the amplitude of each fake utterance in the training set to match the average amplitude of genuine speech, thereby equalizing their average amplitudes during training.</p>
        <p>To enhance the robustness of the model in noisy conditions, we introduce a noise augmentation strategy in the pre-processing. We add reverberation and noise obtained from MUSAN [14] and RIR [15] to the original speech, which is a highly effective strategy in speech recognition and speaker verification.
        </p>
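        <p>The amplitude adjustment step of the pre-processing can be illustrated with a minimal sketch. It assumes that "average amplitude" means the mean absolute sample value and that every fake clip is scaled by one shared gain; the text fixes neither choice, and the function names are hypothetical.</p>
        <preformat>
```python
def mean_abs_amplitude(clips):
    """Mean absolute sample value over all clips (assumed definition of amplitude)."""
    total = sum(abs(s) for clip in clips for s in clip)
    count = sum(len(clip) for clip in clips)
    return total / count


def match_fake_to_genuine(genuine_clips, fake_clips):
    """Scale every fake clip by one shared gain so both classes share a mean amplitude."""
    gain = mean_abs_amplitude(genuine_clips) / mean_abs_amplitude(fake_clips)
    scaled = [[s * gain for s in clip] for clip in fake_clips]
    return scaled, gain


# Toy waveforms standing in for training clips (hypothetical values).
genuine = [[0.4, -0.6, 0.5], [0.3, -0.2]]
fake = [[0.1, -0.2], [0.05, -0.05, 0.1]]

scaled_fake, gain = match_fake_to_genuine(genuine, fake)
print(gain, mean_abs_amplitude(scaled_fake))
```
        </preformat>
        <p>A per-clip gain that brings each fake utterance to the genuine average would be an equally plausible reading. The shared gain is used here because it preserves the relative loudness among fake clips.</p>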
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Shufle</title>
        <p>To improve the generalization ability of the model, we randomly divide the training data into three different domains. Randomly shuffling the domains enriches the style information of each domain, allowing the domain adversarial loss to aggregate all real speech from various styles. In the experimental section, we further verify that randomly shuffling the domains is more effective than directly grouping the validation set into one domain and the training set into two domains.</p>
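        <p>The random split into three domains can be sketched as follows. This is an illustrative sketch only: the function name, the fixed seed and the near-equal domain sizes are assumptions, since the text specifies only that the split is random and yields three domains.</p>
        <preformat>
```python
import random


def shuffle_into_domains(items, num_domains=3, seed=0):
    """Randomly partition items into num_domains pseudo-domains of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Striding over the shuffled list gives each pseudo-domain a random mix of clips.
    return [shuffled[d::num_domains] for d in range(num_domains)]


# Hypothetical clip identifiers standing in for the training set.
clip_ids = [f"clip_{i:04d}" for i in range(10)]
domains = shuffle_into_domains(clip_ids)
for index, domain in enumerate(domains):
    print(index, domain)
```
        </preformat>
        <p>Each clip then carries its pseudo-domain index as the domain label used by the domain discriminator.</p>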
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Feature Extractor</title>
        <p>We first extract features via a W2V2-based front-end, which is trained using a contrastive method with a masked feature encoder. The front-end has a feature encoder with seven CNN layers to process speech signals of different lengths, followed by a Transformer network with 24 layers, 16 attention heads, and an embedding size of 1024 to obtain context representations. Consequently, the last hidden states from the Transformer are taken as the output of the front-end.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Module</th><th>ConvFilter</th></tr>
            </thead>
            <tbody>
              <tr><td>Input</td><td/></tr>
              <tr><td>MixStyle</td><td/></tr>
              <tr><td>Conv2d/MFM/BN</td><td/></tr>
              <tr><td>Conv2d/MFM/Pool/BN</td><td/></tr>
              <tr><td>Conv2d/MFM/BN</td><td/></tr>
              <tr><td>Conv2d/MFM/Pool</td><td/></tr>
              <tr><td>Conv2d/MFM/BN</td><td/></tr>
              <tr><td>Conv2d/MFM/BN</td><td/></tr>
              <tr><td>Conv2d/MFM/BN</td><td/></tr>
              <tr><td>Conv2d/MFM/Pool</td><td/></tr>
              <tr><td>Reshape/Transformer</td><td/></tr>
              <tr><td>Flatten/FC</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. MixStyle Domain Classifier</title>
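        <p>After obtaining the W2V2 features from the feature extractor, we propose a MixStyle domain classifier that builds the feature space by optimizing three different loss functions. The detailed architecture is described in Table 1; it is modified from the traditional LCNN [16]. In the architecture, MFM denotes the Max-Feature-Map layer, which selects the channels of the feature that are critical for the ADD task, and BN denotes Batch Normalization. After the MixStyle domain classifier, we obtain a feature space of shape (16, 512).</p>
        <p>By mixing the styles of training instances, we can implicitly synthesize novel domains, which increases the domain diversity of the source domains and ultimately improves the generalizability of the trained model. Given an input batch <inline-formula><tex-math>x</tex-math></inline-formula>, we first randomly choose a reference batch <inline-formula><tex-math>\tilde{x}</tex-math></inline-formula> from <inline-formula><tex-math>x</tex-math></inline-formula>. Then, MixStyle computes the mixed feature statistics as follows:
          <disp-formula id="eq-1"><label>(1)</label><tex-math>\gamma_{\mathrm{mix}} = \lambda\,\sigma(x) + (1-\lambda)\,\sigma(\tilde{x}), \qquad \beta_{\mathrm{mix}} = \lambda\,\mu(x) + (1-\lambda)\,\mu(\tilde{x}),</tex-math></disp-formula>
          where <inline-formula><tex-math>\lambda</tex-math></inline-formula> is a weight sampled from the Beta distribution <inline-formula><tex-math>\mathrm{Beta}(\alpha, \alpha)</tex-math></inline-formula>, and <inline-formula><tex-math>\mu(\cdot)</tex-math></inline-formula> and <inline-formula><tex-math>\sigma(\cdot)</tex-math></inline-formula> denote the feature mean and standard deviation. We set <inline-formula><tex-math>\alpha</tex-math></inline-formula> to 0.1 in our paper. Then, the style-normalized feature is computed from the mixed feature statistics:
          <disp-formula id="eq-2"><label>(2)</label><tex-math>\mathrm{MixStyle}(x) = \gamma_{\mathrm{mix}}\,\frac{x-\mu(x)}{\sigma(x)} + \beta_{\mathrm{mix}}.</tex-math></disp-formula>
        </p>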
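        <p>The MixStyle computation can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it treats each sample as a single-channel feature vector, draws one mixing weight per batch, and adds a small stabilizer <monospace>eps</monospace> to the standard deviation. All three choices, and the function names, are assumptions made for illustration.</p>
        <preformat>
```python
import random
from statistics import fmean, pstdev


def mix_style(x, x_ref, lam, eps=1e-6):
    """Normalize x by its own statistics, then re-style it with mixed statistics."""
    mu, mu_ref = fmean(x), fmean(x_ref)
    sigma, sigma_ref = pstdev(x) + eps, pstdev(x_ref) + eps
    gamma_mix = lam * sigma + (1 - lam) * sigma_ref  # mixed standard deviation
    beta_mix = lam * mu + (1 - lam) * mu_ref         # mixed mean
    return [gamma_mix * (v - mu) / sigma + beta_mix for v in x]


def mix_style_batch(batch, alpha=0.1, seed=0):
    """Apply MixStyle to a batch, using a shuffled copy of the batch as reference."""
    rng = random.Random(seed)
    lam = rng.betavariate(alpha, alpha)  # mixing weight drawn from Beta(alpha, alpha)
    reference = list(batch)
    rng.shuffle(reference)
    return [mix_style(x, ref, lam) for x, ref in zip(batch, reference)]


# Toy single-channel features (hypothetical values).
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.5, 6.0]]
mixed = mix_style_batch(batch)
```
        </preformat>
        <p>With a mixing weight of 1 the feature is returned unchanged, and with a weight of 0 it takes on the mean and standard deviation of the reference sample. Because Beta(0.1, 0.1) concentrates its mass near 0 and 1, most draws fall close to one of these two extremes.</p>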
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Loss Function</title>
        <p>BCE loss. Our main task is binary classification, that is, determining whether the obtained features are genuine or spoofed. We use several FC layers to reduce the feature dimension from 512 to 1 and compute the Binary Cross Entropy (BCE) loss for classification. It is worth mentioning that feature normalization and weight normalization are used in this process; they balance the numerical values of the features and weights of speech signals across different domains, facilitating the convergence of the model.</p>
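        <p>Triplet loss. Our proposed ASDG strategy is that real speech from different domains should be aggregated, while spoofed speech should be separated. The triplet mining method is suitable for this goal, and is defined as follows:
          <disp-formula id="eq-3"><label>(3)</label><tex-math>\mathcal{L}_{\mathrm{trip}} = \sum_{i} \left( \left\| f(x_i^{a}) - f(x_i^{r}) \right\|_2^2 - \left\| f(x_i^{a}) - f(x_i^{f}) \right\|_2^2 + m \right),</tex-math></disp-formula>
          where <inline-formula><tex-math>x_i^{a}</tex-math></inline-formula>, <inline-formula><tex-math>x_i^{r}</tex-math></inline-formula> and <inline-formula><tex-math>x_i^{f}</tex-math></inline-formula> represent the anchor sample, the real sample and the fake sample, respectively. By minimizing <inline-formula><tex-math>\mathcal{L}_{\mathrm{trip}}</tex-math></inline-formula>, the Euclidean distance between the anchor and the real sample becomes smaller, while the anchor moves further away from the fake sample. We set the margin <inline-formula><tex-math>m</tex-math></inline-formula> to 0.1.</p>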
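        <p>As a worked illustration of the triplet objective, the sketch below evaluates it on plain Python lists. It follows the formula as written in the text, that is, without clamping each term at zero, and the helper names are hypothetical.</p>
        <preformat>
```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, reals, fakes, margin=0.1):
    """Sum over triplets of d(anchor, real) - d(anchor, fake) + margin."""
    return sum(
        squared_l2(a, r) - squared_l2(a, f) + margin
        for a, r, f in zip(anchors, reals, fakes)
    )


# One toy triplet: the real sample is at squared distance 1 from the anchor,
# the fake sample at squared distance 4.
loss = triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 2.0]])
print(loss)
```
        </preformat>
        <p>For this triplet the value is 1 - 4 + 0.1 = -2.9. The term becomes more negative as the fake sample moves away from the anchor, which is the separation behaviour the loss is meant to encourage.</p>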
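        <p>Adversarial loss. In the feature space, the distribution of real speech should be aggregated regardless of domain. Thus, we design a single-side domain discriminator with a Gradient Reversal Layer (GRL) [17]. Let <inline-formula><tex-math>p(x)</tex-math></inline-formula> denote the distribution of the real features and <inline-formula><tex-math>Y</tex-math></inline-formula> denote the domains of <inline-formula><tex-math>x</tex-math></inline-formula>. The adversarial loss function of the domain discriminator is defined as follows:
          <disp-formula id="eq-4"><label>(4)</label><tex-math>\min_{D}\max_{G}\ \mathcal{L}_{\mathrm{adv}}(G, D) = -\mathbb{E}_{x \sim p(x),\, y \sim Y} \sum_{n=1}^{3} \mathbb{1}[n = y]\, \log D(G(x)),</tex-math></disp-formula>
          where <inline-formula><tex-math>y</tex-math></inline-formula> denotes the domain label, <inline-formula><tex-math>G</tex-math></inline-formula> the feature generator and <inline-formula><tex-math>D</tex-math></inline-formula> the domain discriminator. The feature generator is trained to learn a robust feature that fools the domain discriminator, that is, to maximize <inline-formula><tex-math>\mathcal{L}_{\mathrm{adv}}</tex-math></inline-formula>. Meanwhile, the discriminator is trained to identify the feature domain by minimizing <inline-formula><tex-math>\mathcal{L}_{\mathrm{adv}}</tex-math></inline-formula>. To achieve this goal, we use the GRL, which reverses the gradient during back-propagation by multiplying it by a negative dynamic coefficient. This makes the discriminator unable to identify the domain of the real features, which leads to the aggregation of genuine speech in the feature space without being divided by domain.</p>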
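        <p>Total loss. The total loss <inline-formula><tex-math>\mathcal{L}</tex-math></inline-formula> for our system is defined as follows:
          <disp-formula id="eq-5"><label>(5)</label><tex-math>\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda_1 \mathcal{L}_{\mathrm{trip}} + \lambda_2 \mathcal{L}_{\mathrm{adv}},</tex-math></disp-formula>
          where <inline-formula><tex-math>\lambda_1</tex-math></inline-formula> and <inline-formula><tex-math>\lambda_2</tex-math></inline-formula> are both set to 0.1 to balance the values of the three losses. By utilizing the total loss, we can construct an optimal classification feature space for the ADD task, in which genuine speech signals from diverse domains are clustered together while fake speech signals are separated from them.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption>
            <p>Performance comparison with the state-of-the-art ADD models on the ADD-FG dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Method</th><th>Feature</th><th>EER, round 1 (%)</th><th>EER, round 2 (%)</th></tr>
            </thead>
            <tbody>
              <tr><td>AASIST [5]</td><td>Raw Audio</td><td/><td/></tr>
              <tr><td>ResNet18 [18]</td><td>LFCC</td><td/><td/></tr>
              <tr><td>LCNN [19]</td><td>Mel</td><td/><td/></tr>
              <tr><td>LCNN [19]</td><td>W2V2</td><td/><td/></tr>
              <tr><td>SM-ASDG</td><td>W2V2</td><td/><td/></tr>
            </tbody>
          </table>
        </table-wrap>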
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and metrics</title>
        <p>All experiments are conducted on the ADD 2023 Audio FG-D dataset. There are 27,084 audio clips in the training set and 28,324 audio clips in the development set. We divide the dataset 90%/10% for training and validation, respectively. The audio amplitude in the training set is inconsistent, the audio contains noise, and there are repeated tail segments without valid information. The audio in the test set is much more complex, including noise, reverberation, background music, and a large number of silent clips. Therefore, improving the generalization and robustness of detection methods is the core challenge.</p>
        <p>Weighted equal error rate (WEER) is used as the evaluation metric, which is defined as
          <disp-formula id="eq-6"><label>(6)</label><tex-math>\mathrm{WEER} = \alpha\,\mathrm{EER}_{R1} + \beta\,\mathrm{EER}_{R2},</tex-math></disp-formula>
          where <inline-formula><tex-math>\alpha = 0.4</tex-math></inline-formula> and <inline-formula><tex-math>\beta = 0.6</tex-math></inline-formula> are the weights for the equal error rate (EER) obtained in round 1 (<inline-formula><tex-math>\mathrm{EER}_{R1}</tex-math></inline-formula>) and round 2 (<inline-formula><tex-math>\mathrm{EER}_{R2}</tex-math></inline-formula>) of ADD Challenge Track 1.2, respectively.</p>
      </sec>
      <sec id="sec-3-impl">
        <title>3.2. Implementation details</title>
        <p>All training audio files are trimmed or padded to 4 s. For the AASIST baseline, the input is the raw waveform of about 4 s (64,000 samples). For the ResNet18 baseline, we use 80-dimensional LFCCs with a shape of (80, 404) as the front-end. During training, the parameters of the W2V2 front-end are frozen. After the front-end, we obtain the last hidden states with a shape of (201, 1024) as the input of the back-end. For training, the Adam optimizer is adopted with <inline-formula><tex-math>\beta_1 = 0.9</tex-math></inline-formula>, <inline-formula><tex-math>\beta_2 = 0.999</tex-math></inline-formula>, <inline-formula><tex-math>\varepsilon = 10^{-9}</tex-math></inline-formula> and a weight decay of <inline-formula><tex-math>10^{-4}</tex-math></inline-formula>. The learning rate is initialized to <inline-formula><tex-math>10^{-5}</tex-math></inline-formula> and halved every 5 epochs.</p>
      </sec>
      <sec id="sec-3-arch">
        <title>3.3. Ablation studies on architecture</title>
        <p>Impact of backbone models and features. We first investigate the impact of the backbone models and features. As shown in Table 2, we compare with three baseline backbone models: AASIST [<xref ref-type="bibr" rid="ref10">5</xref>], ResNet18 [18] and LCNN [19]. Furthermore, we compare the W2V2-based feature with a hand-crafted feature, each connected to the same LCNN back-end. It can be observed that the W2V2-based feature performs better than the hand-crafted feature. This is because W2V2 is trained on a large amount of real utterances from different source domains, which enhances its discriminative capability in complex scenarios. Moreover, the results show that our SM-ASDG model outperforms all backbone models.</p>
        <p>Impact of MixStyle. We further investigate the impact of the MixStyle units. As shown in Table 3, “SM-ASDG w/o MIX” denotes removing the MixStyle layer from our full model. It can be observed that the EER of the model increases by 2.91 percentage points in round 1 and 4.50 percentage points in round 2 when MixStyle is removed, which demonstrates the effectiveness of our MixStyle domain classifier module. This is because the bottom layers of a CNN correspond to style information and the top layers correspond to label information. MixStyle diversifies the bottom-layer style information of the LCNN. That is, our model can generate diverse new styles of real and fake speech to enhance its domain generalization ability.</p>
        <p>Visualization of features. To analyze the effect of MixStyle and our proposed ASDG backbone, we visualize the distribution of different hidden features using t-SNE [20]. As shown in Figure 2, we randomly select 360 samples from the three source data domains. In each domain, we select 60 real utterances and 60 fake utterances. Figure 2 (a) and (b) demonstrate that the hidden feature distributions become more distinct after applying MixStyle, indicating that MixStyle facilitates the diversification of the bottom-layer style information in the LCNN. The feature space depicted in Figure 2 (c) aligns with our conception of an ideal feature space under ASDG, in which genuine speech signals are clustered together while synthetic ones are separated.</p>
        <p>[Figure 2: t-SNE visualization of the hidden features of real and fake utterances, panels (a), (b) and (c).]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.4. Ablation studies on pre-processing</title>
        <p>To improve the robustness and generalization of the model, we explore a series of pre-processing strategies, including data shuffle, noise augmentation, low-pass filtering, amplitude adjustment, and region of interest (ROI) detection.</p>
        <p>Does data shuffle help? We first explore the efficacy of the data shuffle strategy. As shown in Table 4, we design two domain segmentation schemes, namely data shuffle and direct division. Direct division refers to directly using the test set and validation set as two separate source domains. The two domain segmentation schemes are applied to four variants, namely the ASDG model and the ASDG model with different data augmentation strategies. “Rawboost” denotes the raw data boosting and augmentation strategy [<xref ref-type="bibr" rid="ref8">3</xref>]. We utilize its best-performing configuration on ASVspoof2021 LA, which combines linear and non-linear convolutive noise with impulsive, signal-dependent noise. “RM” means adding noise and reverberation from the RIRs [15] and MUSAN [14] datasets to the training audio in a Kaldi-like [21] manner. In each pair of compared results (the red row in Table 4 and the row above it), we can observe that the shuffle strategy effectively improves the detection performance. Data shuffle reduces the order and pattern in the dataset, thus improving the generalization of the model.</p>
        <p>Does noise augmentation help? Since the test data contain a large amount of noise and background music that are not available in the source domain dataset, we incorporate a noise augmentation strategy in the pre-processing stage, that is, we introduce noise during training to improve the robustness of the model. It can be seen from Table 5 (the fourth row from the top) that when the noise augmentation strategy is removed, the EER increases by 9.02 percentage points in round 1 and 9.43 percentage points in round 2. In addition, the effectiveness of different noise augmentation strategies can be seen in Table 4 (the red rows): the RM strategy maximizes the robustness of the model. We therefore choose RM as the noise augmentation strategy.</p>
        <p>Does low-pass filtering help? To cope with complex speech scenarios, we add low-pass filters to focus on the core frequencies of the speech. The result shown in Table 5 (the third row from the top) indicates that low-pass filters help improve the forgery detection performance.</p>
        <p>Does amplitude adjustment help? The amplitude level of the data samples varies greatly in the training and testing datasets. We find in our experiments that the amplitude of the audio is learned by the model and affects the classification results. Therefore, we adjust the amplitudes uniformly to the same range during both training and testing. As shown in Table 5 (the second row from the top), audio normalization has a clear effect on the performance.</p>
        <p>Does ROI detection help? Because the test data contain a large number of silent segments without speech content, a straightforward idea for improving performance is to detect only speech segments, that is, to detect ROIs. However, as shown in Table 5 (the “SM-ASDG with ROI” row), when ROI detection is added, the EER of the model increases by 2.19 percentage points in round 1 and 3.86 percentage points in round 2. This is mainly because silent segments also contain artifact information for distinguishing real and fake audio [22]. Therefore, simply eliminating silent segments does not improve the generalization and robustness of the model.</p>
        <table-wrap id="tab-5">
          <label>Table 5</label>
          <table>
            <thead>
              <tr><th>Pre-processing strategy</th><th>EER, round 1 (%)</th><th>EER, round 2 (%)</th></tr>
            </thead>
            <tbody>
              <tr><td>SM-ASDG w/o MIX</td><td>26.97</td><td>27.09</td></tr>
              <tr><td>SM-ASDG w/o [MIX+AMP]</td><td>27.53</td><td>27.80</td></tr>
              <tr><td>SM-ASDG w/o [MIX+AMP+LP]</td><td>28.05</td><td>29.24</td></tr>
              <tr><td>SM-ASDG w/o [MIX+AMP+LP+NA]</td><td>37.07</td><td>38.67</td></tr>
              <tr><td>SM-ASDG with ROI</td><td>26.25</td><td>26.45</td></tr>
              <tr><td>SM-ASDG</td><td>24.06</td><td>22.59</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we propose SM-ASDG, a novel shuffle mix aggregation and separation domain generalization method for single-domain ADD. The proposed method achieves a WEER of 23.17% in the final ranking of ADD 2023 Track 1.2, placing it among the top-5 methods. The robustness and generalization of the proposed SM-ASDG model stem from our carefully designed pre-processing module, data shuffle and MixStyle domain classification module. In future work, we plan to embed more high-level semantic features of audio, such as sentiment features, into the model to further improve generalization.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leng</surname>
          </string-name>
          ,
          <article-title>Selective domain-invariant feature alignment network for face anti-spoofing</article-title>
          ,
          <source>IEEE Transactions on Information Forensics and Security</source>
          <volume>16</volume>
          (
          <year>2021</year>
          )
          <fpage>5352</fpage>
          -
          <lpage>5365</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Domain generalization via shuffled style assembly for face anti-spoofing</article-title>
          ,
          <source>in: Proceedings of the CVPR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4123</fpage>
          -
          <lpage>4133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiangyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jianhua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruibo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xinrui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chenglong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chuyuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiaohui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Junzuo</surname>
          </string-name>
          , G. Hao,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhengqi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhen</surname>
          </string-name>
          , L. Haizhou,
          <article-title>ADD 2023: the second audio deepfake detection challenge</article-title>
          ,
          <source>in: Proceedings of the IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Babu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tjandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>von Platen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Saraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          , et al.,
          <article-title>XLS-R: Self-supervised cross-lingual speech representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2111.09296</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nishizaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Low pass filtering and bandwidth extension for robust anti-spoofing countermeasure against codec variabilities</article-title>
          ,
          <source>arXiv preprint arXiv:2211.06546</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guoguo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Daniel</surname>
          </string-name>
          ,
          <article-title>Musan: A music, speech, and noise corpus</article-title>
          ,
          <source>arXiv preprint arXiv:1510.08484</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vijayaditya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Daniel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Michael L</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sanjeev</surname>
          </string-name>
          ,
          <article-title>A study on data augmentation of reverberant speech for robust speech recognition</article-title>
          , in:
          <source>Proceedings of the ICASSP</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5220</fpage>
          -
          <lpage>5224</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <article-title>A comparative study on recent neural spoofing countermeasures for synthetic speech detection</article-title>
          ,
          <source>arXiv preprint arXiv:2103.11326</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ganin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          ,
          <article-title>Unsupervised domain adaptation by backpropagation</article-title>
          , in:
          <source>Proceedings of the ICML</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1180</fpage>
          -
          <lpage>1189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform</article-title>
          , in:
          <source>Proceedings of the Interspeech</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4144</fpage>
          -
          <lpage>4148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tseren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Volkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gorlanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kozlov</surname>
          </string-name>
          ,
          <article-title>Stc antispoofing systems for the asvspoof2019 challenge</article-title>
          ,
          <source>arXiv preprint arXiv:1904.05576</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Visualizing data using t-sne</article-title>
          ,
          <source>Journal of machine learning research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mirco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Titouan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yoshua</surname>
          </string-name>
          ,
          <article-title>The pytorch-kaldi speech recognition toolkit</article-title>
          , in:
          <source>Proceedings of the ICASSP</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6465</fpage>
          -
          <lpage>6469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuxiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wenchao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pengyuan</surname>
          </string-name>
          ,
          <article-title>The effect of silence and dual-band fusion in anti-spoofing system</article-title>
          , in:
          <source>Proceedings of the Interspeech</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saravanan</surname>
          </string-name>
          ,
          <article-title>Voice spoofing countermeasures: Taxonomy, state-of-the-art, experimental analysis of generalizability, open challenges, and the way forward</article-title>
          ,
          <source>arXiv preprint arXiv:2210.00417</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Aflalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Permuter</surname>
          </string-name>
          ,
          <article-title>A study on data augmentation in voice anti-spoofing</article-title>
          ,
          <source>Speech Communication</source>
          <volume>141</volume>
          (
          <year>2022</year>
          )
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kamble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing</article-title>
          , in:
          <source>Proceedings of the ICASSP</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>6382</fpage>
          -
          <lpage>6386</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <article-title>End-to-end anti-spoofing with rawnet2</article-title>
          , in:
          <source>Proceedings of the ICASSP</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6369</fpage>
          -
          <lpage>6373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.-w.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Heo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-j.</given-names>
            <surname>Shim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks</article-title>
          , in:
          <source>Proceedings of the ICASSP</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>6367</fpage>
          -
          <lpage>6371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Czempin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diekmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Froghyar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Böttinger</surname>
          </string-name>
          ,
          <article-title>Does Audio Deepfake Detection Generalize?</article-title>
          , in:
          <source>Proceedings of the Interspeech</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2783</fpage>
          -
          <lpage>2787</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Piotr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marcin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piotr</surname>
          </string-name>
          ,
          <article-title>Attack agnostic dataset: Towards generalization and stabilization of audio deepfake detection</article-title>
          , in:
          <source>Proceedings of the Interspeech</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4023</fpage>
          -
          <lpage>4027</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Matsuura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <article-title>Domain generalization using a mixture of multiple latent domains</article-title>
          , in:
          <source>Proceedings of the AAAI</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>11749</fpage>
          -
          <lpage>11756</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>