<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-perspective Information Fusion Res2Net with Random Specmix for Fake Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shunbo Dong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun Xue</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cunhang Fan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kang Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujie Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhao Lv</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University</institution>
          ,
          <addr-line>AHU</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jiulong Road, Hefei</institution>
          ,
          <addr-line>230601</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>19</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In this paper, we propose the multi-perspective information fusion (MPIF) Res2Net with random Specmix for fake speech detection (FSD). The main purpose of this system is to improve the model's ability to learn precise forgery information for the FSD task in low-quality scenarios. Random Specmix, a data augmentation method, improves the generalization ability of the model and enhances its ability to locate discriminative information. Specmix cuts and pastes the frequency dimension information of the spectrogram within the same batch of samples without introducing other data, which helps the model to locate the really useful information. At the same time, we randomly select samples for augmentation to reduce the impact of data augmentation directly changing all the data. Once the purpose of helping the model to locate information is achieved, it is also important to reduce unnecessary information. The role of MPIF-Res2Net is to reduce redundant interference information. Deceptive information from a single perspective is always similar, so a model learning this similar information will produce redundant spoofing clues that interfere with truly discriminative information. The proposed MPIF-Res2Net fuses information from different perspectives, making the information learned by the model more diverse, thereby reducing the redundancy caused by similar information and avoiding interference with the learning of discriminative information. The results on the ASVspoof 2021 LA dataset demonstrate the effectiveness of our proposed method, achieving an EER of 3.29% and a min-tDCF of 0.2557.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-perspective information fusion</kwd>
        <kwd>fake speech detection task</kwd>
        <kwd>random Specmix strategy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Figure 1: (a) Res2Net backbone; (b) MPIF-Res2Net.</p>
      <p>This enables the model to acquire more comprehensive feature information. Li et al. [18] investigated the effectiveness of Res2Net in conjunction with different acoustic features. Li et al. [19] proposed a channel-wise gating mechanism to suppress channels with lower correlations, which they considered not useful. However, the models mentioned above may not perform as well in low-quality scenarios, as their experiments were conducted only in clean scenarios.</p>
      <p>In this work, we propose the multi-perspective information fusion Res2Net (MPIF-Res2Net) with random Specmix. Spoofing information from a single perspective is always similar during the learning process, which causes redundant information and blurs the truly discriminative information. The MPIF module fuses information from different receptive fields to reduce the redundant spoofing cues and enhance the robustness of the system in poor-quality scenarios. Specmix increases the diversity of the training data, thereby improving the generalization ability of the model. The generated spectrogram incorporates information from other spectrograms, allowing the model to pay attention to the noteworthy information. It performs cut and paste among spectrograms without introducing data that was not present in the original dataset, so its impact on the original dataset is modest. However, a data augmentation (DA) method applied to all the samples may affect the distribution characteristics of the original data. To address this issue, we randomly choose samples according to a preset probability threshold to conduct Specmix, which prevents excessive augmentation from weakening the fitting ability of the system. Specifically, we randomly cover part of the frequency information with the corresponding frequency information of another sample in the same batch. This approach improves the performance considerably. Our proposed method is shown to be effective on the ASVspoof 2021 LA dataset, achieving an EER of 3.29% and a min-tDCF of 0.2557.</p>
      <p>2. Methodology</p>
      <p>2.1. Proposed method</p>
      <p>In this section, we introduce the structure of the proposed MPIF-Res2Net, which reduces the redundancy caused by learning single-perspective forgery information by integrating information from multiple perspectives. Convolutional operations with a single kernel size learn similar forgery clues, producing too much redundant information and obscuring the important discriminative information. Therefore, the MPIF-Res2Net shown on the left of Figure 1(b) is proposed to fuse information from different convolutional operations with different kernel sizes. The architecture of Res2Net is shown in Figure 1(a): the output of the 1 × 1 convolution is split into <italic>n</italic> equal parts along the channel dimension, denoted <italic>x<sub>i</sub></italic>, where <italic>i</italic> is an integer from 1 to <italic>n</italic>. Each part has <italic>w</italic> channels (Eq. 1).</p>
      <disp-formula>
        <label>(1)</label>
        <tex-math><![CDATA[w = \frac{\#channel}{n}]]></tex-math>
      </disp-formula>
      <p>where #channel denotes the total number of channels. Res2Net uses a residual-like connection to perform addition between the channel groups. This process can be described as follows:</p>
      <disp-formula>
        <label>(2)</label>
        <tex-math><![CDATA[y_i = \begin{cases} x_i, & i = 1 \\ K_i(x_i), & i = 2 \\ K_i(x_i + y_{i-1}), & 2 < i \le n \end{cases}]]></tex-math>
      </disp-formula>
      <p>Here <italic>y<sub>i</sub></italic> is the output of the <italic>i</italic>-th channel group and <italic>K<sub>i</sub></italic> denotes the operation applied to that group.</p>
      <p>As shown on the right of Figure 1(b), the MPIF module in the current channel group <italic>i</italic> performs different convolution operations on <italic>x<sub>i</sub></italic> or <italic>x<sub>i</sub></italic> + <italic>y</italic><sub><italic>i</italic>−1</sub> to obtain spoofing information from different perspectives. Firstly, the input is sent into convolution operations with different dilation parameters <italic>d</italic>, where <italic>d</italic> ∈ [1, 2], at the beginning of the MPIF module (Eq. 4). The results are then passed through a dilated convolution to recalculate the energy distribution of each channel and normalized by the Sigmoid function, after which an average pooling layer produces the weight of each channel of the Conv2D output.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>The proposed MPIF-Res2Net model architecture and configuration. Dimensions are arranged in the order of channels, frequency, and time. BN denotes batch normalization and ReLU denotes rectified linear unit; MPIF and SE are the multi-perspective information module and the squeeze-and-excitation layer, respectively.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Layer</th><th>Input: 27000 samples</th><th>Output shape</th></tr>
          </thead>
          <tbody>
            <tr><td>Front-end</td><td>F0 subband</td><td>(45, 600) (F, T)</td></tr>
            <tr><td>Pre-processing</td><td>Channel expansion</td><td>(1, 45, 600)</td></tr>
            <tr><td>Pre-processing</td><td>Conv2D_1, BN &amp; ReLU</td><td>(16, 45, 600)</td></tr>
            <tr><td>Layer1 &amp; Layer3</td><td>Blocks of Conv2D_1, Conv2D_3, MPIF and SE units</td><td>Layer1: (32, 45, 600); Layer3: (128, 12, 150)</td></tr>
            <tr><td>Layer2 &amp; Layer4</td><td>Blocks of Conv2D_1, Conv2D_3, MPIF and SE units</td><td>Layer2: (64, 23, 300); Layer4: (256, 6, 75)</td></tr>
            <tr><td>Output</td><td>Avgpool2D(1, 1); AngleLinear</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
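      <p>The channel split of Eq. 1 and the hierarchical connection of Eq. 2 can be illustrated with a minimal sketch. It is not the authors' implementation: each channel is reduced to a single number, and the names split_channels, hierarchical_connect and double are hypothetical, with double standing in for a real convolution.</p>
      <preformat>
```python
from typing import Callable, List, Sequence

Vector = List[float]
Op = Callable[[Vector], Vector]


def split_channels(channels: Sequence[float], n: int) -> List[Vector]:
    """Split a flat list of channel values into n groups of w = total / n (Eq. 1)."""
    total = len(channels)
    if 1 > n or total % n != 0:
        raise ValueError("the channel count must be a positive multiple of n")
    w = total // n
    return [list(channels[k * w:(k + 1) * w]) for k in range(n)]


def hierarchical_connect(groups: Sequence[Vector], ops: Sequence[Op]) -> List[Vector]:
    """Apply Eq. 2; ops[k] is the operation of group k + 2 (group 1 has none)."""
    if len(ops) != len(groups) - 1:
        raise ValueError("one operation is needed for every group except the first")
    outputs: List[Vector] = []
    for idx, x in enumerate(groups):
        if idx == 0:
            y = list(x)                      # y_1 = x_1
        elif idx == 1:
            y = ops[0](list(x))              # y_2 = K_2(x_2)
        else:
            mixed = [a + b for a, b in zip(x, outputs[-1])]
            y = ops[idx - 1](mixed)          # y_i = K_i(x_i + y_{i-1})
        outputs.append(y)
    return outputs


def double(values: Vector) -> Vector:
    """Toy stand-in for a convolution (assumption): scales every value by two."""
    return [2.0 * v for v in values]


groups = split_channels([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n=3)
print(hierarchical_connect(groups, [double, double]))
# prints [[1.0, 2.0], [6.0, 8.0], [22.0, 28.0]]
```
      </preformat>
      <p>In this toy run the third group receives the sum of its own input and the second group's output before its operation is applied, which is the residual-like addition between channel groups described in the text.</p>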
      <p>The purpose of using dilated convolution is to increase the receptive field, ensuring that each convolution output contains information from a larger range while keeping the number of parameters and the computation cost constant.</p>
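      <p>The trade-off can be checked with a short calculation. The sketch below uses the standard relation between kernel size, dilation and the span covered by one kernel; the function names and the choice of 16 input channels are illustrative assumptions, not values taken from the model.</p>
      <preformat>
```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel along an axis: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int, in_channels: int) -> int:
    """Weights in one 2D filter; the dilation does not enter this count."""
    return kernel_size * kernel_size * in_channels


for dilation in (1, 2):
    span = effective_kernel_size(3, dilation)
    weights = weights_per_filter(3, in_channels=16)  # 16 is a hypothetical value
    print(dilation, span, weights)
# prints:
# 1 3 144
# 2 5 144
```
      </preformat>
      <p>A 3 × 3 kernel with dilation 2 therefore covers the same span as a 5 × 5 kernel while keeping the weight count of a 3 × 3 kernel.</p>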
      <p>The weighting factor <inline-formula><tex-math>\alpha_c^{d}</tex-math></inline-formula> of each channel <inline-formula><tex-math>u_c^{d}</tex-math></inline-formula> is calculated by Eq. 5.</p>
      <disp-formula>
        <label>(4)</label>
        <tex-math><![CDATA[U^{d} = \mathrm{Conv2D}_{d}(x_i)]]></tex-math>
      </disp-formula>
      <disp-formula>
        <label>(5)</label>
        <tex-math><![CDATA[\alpha_c^{d} = \mathrm{AvgPool}\left(\sigma\left(\mathrm{Conv2D}\left(u_c^{d}\right)\right)\right)]]></tex-math>
      </disp-formula>
      <p>The <inline-formula><tex-math>\mathrm{Conv2D}_{d}</tex-math></inline-formula> at the beginning of the MPIF module takes <italic>x<sub>i</sub></italic> as input and outputs <inline-formula><tex-math>U^{d}</tex-math></inline-formula>, and <inline-formula><tex-math>u_c^{d}</tex-math></inline-formula> is the <italic>c</italic>-th channel of <inline-formula><tex-math>U^{d}</tex-math></inline-formula>. Conv2D is a convolutional operation that recalculates the energy distribution, σ is the Sigmoid function, and AvgPool denotes the AvgPool2D function.</p>
      <p>After obtaining the weighting factors, we multiply them with the corresponding feature maps and sum up the results. The result MPIF(<italic>x<sub>i</sub></italic>) of the <italic>i</italic>-th channel group is obtained as follows:</p>
      <disp-formula>
        <label>(6)</label>
        <tex-math><![CDATA[\mathrm{MPIF}(x_i) = \sum_{d=1}^{2} A^{d} \times \left(\Omega^{d} U^{d}\right)]]></tex-math>
      </disp-formula>
      <p>where <inline-formula><tex-math>A^{d}</tex-math></inline-formula> is a matrix composed of the <inline-formula><tex-math>\alpha_c^{d}</tex-math></inline-formula>, <inline-formula><tex-math>\Omega^{d} \in \mathbb{R}^{C \times 1 \times 1}</tex-math></inline-formula> is the weight matrix associated with <inline-formula><tex-math>U^{d}</tex-math></inline-formula>, and <italic>C</italic> is the number of channels of <inline-formula><tex-math>U^{d}</tex-math></inline-formula>.</p>
    </sec>
    <sec id="sec-2">
      <title>2.3. Random Specmix Strategy</title>
      <p>In this work, we use a random Specmix strategy to help the model locate the discriminative information and to enhance the generalization of the model. When training deep neural networks, the raw audio is commonly transformed from the time domain into the time-frequency domain. Inspired by [20], we conduct Specmix on the frequency dimension of the F0 subband [30], which is a subband of the amplitude spectrum, and the maximum span of the Specmix operation is no more than 10. Specmix cuts and pastes spectrograms among themselves within the same batch to help the model focus on the discriminative regions that are worth attending to. Different from [20], no Specmix operation is applied to the labels. We cover the information along the frequency dimension with the corresponding parts of other samples in the same batch.</p>
      <p>At the same time, to avoid conducting Specmix on all the samples, and inspired by [21], we randomly choose speech samples according to a preset probability-threshold hyperparameter. For a batch of speech samples, the probability of undergoing random Specmix is <italic>p</italic>: when <italic>p</italic> is larger than the threshold, Specmix is conducted on the batch; otherwise it is not. In the evaluation phase, we do not use the random Specmix strategy.</p>
      <p>3. Experiments and Results</p>
      <p>3.1. Experimental Setup</p>
      <p>Learning a robust countermeasure suited to low-quality scenarios is challenging when the training set does not contain the same interference conditions. In this work, we use the Rawboost [11] DA method to train the model; this technique can enhance the accuracy in low-quality scenarios. More precisely, impulsive signal-dependent (ISD) additive noise and stationary signal-independent (SSI) additive noise are added to the raw waveform. After the STFT operation with a window length of 1728 and a hop length of 130, we obtain a spectrogram of size 865. We then truncate or concatenate the spectrogram to fix the number of frames at 600. We use the 0-400 Hz LPS feature, corresponding to the first 0-45 dimensions, as our F0 subband feature. The resulting feature size of the F0 subband is 45 × 600. Then, we determine whether to conduct random Specmix on the samples in the current batch by setting the probability-threshold hyperparameter, which controls whether the random Specmix strategy is applied. Considering that the F0 feature is a subband of the amplitude spectrum, we set the maximum span for Specmix to be no more than 10.</p>
      <p>In this article, we propose MPIF-Res2Net to fuse information from different perspectives and reduce the redundant spoofing cues, and we introduce random Specmix to improve the generalization ability of the model. Table 1 presents the design of MPIF-Res2Net, including details on channels, convolution kernels, and repetition frequency. In our experiments, Adam is used as the optimizer with <italic>β</italic><sub>1</sub> = 0.9, <italic>β</italic><sub>2</sub> = 0.98, <italic>ε</italic> = 10<sup>−9</sup>, and a weight decay of 10<sup>−4</sup>. The number of epochs is set to 32, the number of channel groups to 8, and the batch size to 16.</p>
      <p>3.2. Dataset</p>
      <p>The data in the ASVspoof 2019 logical access (LA) dataset is divided into three subsets: a training set, a development set, and an evaluation set. The spoofed speech in the training and development sets comes from six speech synthesis and speech conversion technologies, which are known attack types. The evaluation set contains audio generated by 11 unknown attack types. We trained our model on the ASVspoof 2019 training set and selected the best performing model on the development set. As stated in [33], the ASVspoof 2021 LA dataset is designed for developing anti-spoofing methods that can effectively adapt to unknown channel variations, and it does not provide new matching training or development data. The speech samples from the ASVspoof 2021 evaluation set were transmitted via actual telephone systems using various bandwidths and codecs. The data in the 2019 LA training and development subsets does not have similar encoding and transmission, and these subsets contain only clean data. Equal error rate (EER) and minimum tandem detection cost function (min t-DCF) are used as the metrics.</p>
      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Comparison of results with fusion systems on the ASVspoof 2021 dataset.</p>
        </caption>
        <table>
          <thead>
            <tr><th>System</th><th>t-DCF</th><th>EER (%)</th></tr>
          </thead>
          <tbody>
            <tr><td>T23 [22]</td><td>0.2177</td><td>1.32</td></tr>
            <tr><td>T20 [23]</td><td>0.2608</td><td>3.21</td></tr>
            <tr><td>T04 [24]</td><td>0.2747</td><td>5.58</td></tr>
            <tr><td>T06 [25]</td><td>0.2853</td><td>5.66</td></tr>
            <tr><td>T35 [22]</td><td>0.2480</td><td>2.77</td></tr>
            <tr><td>T19 [22]</td><td>0.2495</td><td>3.13</td></tr>
            <tr><td>Fusion systems [27]</td><td>0.2882</td><td>4.66</td></tr>
            <tr><td>MPIF-Res2Net (ours)</td><td>0.2557</td><td>3.29</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 4. Comparison of results with single systems on the ASVspoof 2021 dataset.</p>
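      <p>For readers unfamiliar with the EER metric used in this section, a simplified threshold sweep is sketched below. It is an illustration only, not the official ASVspoof scoring tool (which interpolates between operating points), it does not cover min t-DCF, and it assumes that higher scores mean "more likely bona fide". The function name and the toy scores are hypothetical.</p>
      <preformat>
```python
from typing import Sequence, Tuple


def equal_error_rate(bonafide: Sequence[float], spoof: Sequence[float]) -> Tuple[float, float]:
    """Return (eer, threshold) at the point where FAR and FRR are closest."""
    if not bonafide or not spoof:
        raise ValueError("both score lists must be non-empty")
    candidates = sorted(set(bonafide) | set(spoof))
    candidates.append(candidates[-1] + 1.0)  # a threshold that rejects everything
    best_gap, best_eer, best_thr = None, 0.0, 0.0
    for thr in candidates:
        # False rejection: a bona fide trial scored below the threshold.
        frr = sum(1 for s in bonafide if thr > s) / len(bonafide)
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(1 for s in spoof if s >= thr) / len(spoof)
        gap = abs(far - frr)
        if best_gap is None or best_gap > gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2.0, thr
    return best_eer, best_thr


eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(eer, thr)
# prints 0.25 0.6
```
      </preformat>
      <p>With these toy scores, one of four bona fide trials is rejected and one of four spoofed trials is accepted at the threshold 0.6, so both error rates equal 25%.</p>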
    </sec>
    <sec id="sec-3">
      <title>3.3. Experimental Results</title>
      <p>3.3.1. Ablation Study</p>
      <p>Firstly, different values of the probability threshold are tried in order to find the best setting. Table 2 shows the EER results of the random Specmix strategy for different threshold values. The MPIF-Res2Net with a threshold of 0.5 performs best, with an EER of 3.29% and a min t-DCF of 0.2557, which indicates a relatively higher reliability of the countermeasure system when it is applied together with an ASV system.</p>
      <p>For experiments involving information from a single receptive field, we set up two models, Res2Net_k3 and Res2Net_k5, whose kernel sizes and dilations are 3 and 1, and 3 and 2, respectively. The EER of MPIF-Res2Net with the threshold equal to 1 is 3.70%, whereas the corresponding EERs of the other two systems are 4.58% and 4.49%, respectively. This verifies that the proposed MPIF-Res2Net, by fusing information from different perspectives, reduces the redundancy caused by learning similar spoofing clues with a single kernel size. The EER results of Res2Net_k3 and Res2Net_k5 with Specmix demonstrate that Specmix helps the model to locate the forgery information and improves generalization performance.</p>
      <p>For the random Specmix strategy, the MPIF-Res2Net with a threshold of 0 obtains an EER of 4.04%. A threshold of 0 means that Specmix is conducted on all of the samples, which indicates that applying Specmix to all the samples causes a serious performance degradation of the system. Overall, random Specmix improves the model's generalization ability and enhances its performance. Experimental results show that our proposed MPIF-Res2Net with random Specmix can improve performance for the FSD task in low-quality scenarios.</p>
      <p>3.3.2. Performance Comparison With Other Systems</p>
      <p>Table 3 shows the results of different fusion systems on the ASVspoof 2021 LA dataset. Although the fusion systems T23 [22], T20 [23], T35 [22] and T19 [22] outperform our proposed MPIF-Res2Net, their fusion schemes are very complicated. For example, T23 is composed of 12 separately trained systems that are fused at the score stage with finely adjusted weights. The method we propose is based on a single system, which is less complicated than the fusion systems. Table 4 shows the EER results of single systems: the best EER among the other systems is 8.05%, and our method improves the performance by 59% relative to the RawNet2 [29] system.</p>
      <p>4. Conclusion</p>
      <p>In this paper, we achieve accurate and useful information discrimination from two aspects. On the one hand, Specmix helps the model to focus on the location of key information in the sample by mixing information between samples, and randomly selects samples for Specmix operations, effectively avoiding the performance degradation caused by the destruction of the original data. On the other hand, MPIF-Res2Net reduces the redundant information caused by learning similar information from a single perspective by fusing information from multiple perspectives, removing the influence of redundant information on the learning of key information. The effectiveness of our proposed method was verified by the experimental results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref4">
        <mixed-citation>[1] Naika R. An overview of automatic speaker verification system[C]//Intelligent Computing and Information and Communication: Proceedings of 2nd International Conference, ICICC 2017. Springer Singapore, 2018: 603-610.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[2] Wu Z, Kinnunen T, Evans N, et al. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge[C]//Sixteenth annual conference of the international speech communication association. 2015.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[3] Kinnunen T, Sahidullah M, Delgado H, et al. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection[J]. 2017.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[4] Todisco M, Wang X, Vestman V, et al. ASVspoof 2019: Future horizons in spoofed and fake audio detection[J]. arXiv preprint arXiv:1904.05441, 2019.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[5] Yamagishi J, Wang X, Todisco M, et al. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection[J]. arXiv preprint arXiv:2109.00537, 2021.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[6] Yi J, Fu R, Tao J, Nie S, Ma H, Wang C, Wang T, Tian Z, Bai Y, Fan C, et al. ADD 2022: the first audio deep synthesis detection challenge[C]//ICASSP 2022. IEEE, 2022: 9216-9220.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[7] Arif T, Javed A, Alhameed M, et al. Voice spoofing countermeasure for logical access attacks detection[J]. IEEE Access, 2021, 9: 162857-162868.</mixed-citation>
      </ref>
      <ref id="ref-8">
        <mixed-citation>[8] Zhang Y, Jiang F, Duan Z. One-class learning towards synthetic voice spoofing detection[J]. IEEE Signal Processing Letters, 2021, 28: 937-941.</mixed-citation>
      </ref>
      <ref id="ref1">
        <mixed-citation>[11] A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 6382-6386.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[12] Park D S, Chan W, Zhang Y, et al. Specaugment: A simple data augmentation method for automatic speech recognition[J]. arXiv preprint arXiv:1904.08779, 2019.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[13] Kwak I Y, Choi S, Yang J, et al. CAU_KU team's submission to ADD 2022 Challenge task 1: Low-quality fake audio detection through frequency feature masking[J]. arXiv preprint arXiv:2202.04328, 2022.</mixed-citation>
      </ref>
      <ref id="ref-14">
        <mixed-citation>[14] Lavrentyeva G, Novoselov S, Malykh E, et al. Audio replay attack detection with deep learning frameworks[C]//Proc of Interspeech 2017. Grenoble, France: ISCA, 2017: 82-86.</mixed-citation>
      </ref>
      <ref id="ref-15">
        <mixed-citation>[15] Alzantot M, Wang Z, Srivastava M B. Deep residual neural networks for audio spoofing detection[J]. arXiv preprint arXiv:1907.00501, 2019.</mixed-citation>
      </ref>
      <ref id="ref-16">
        <mixed-citation>[16] Lai C I, Chen N, Villalba J, et al. ASSERT: Anti-spoofing with squeeze-excitation and residual networks[J]. arXiv preprint arXiv:1904.01120, 2019.</mixed-citation>
      </ref>
      <ref id="ref-17">
        <mixed-citation>[17] Gao S H, Cheng M M, Zhao K, et al. Res2net: A new multi-scale backbone architecture[J]. IEEE transactions on pattern analysis and machine intelligence, 2019, 43(2): 652-662.</mixed-citation>
      </ref>
      <ref id="ref-18">
        <mixed-citation>[18] Li X, Li N, Weng C, et al. Replay and synthetic speech detection with res2net architecture[C]//ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2021: 6354-6358.</mixed-citation>
      </ref>
      <ref id="ref-19">
        <mixed-citation>[19] Li X, Wu X, Lu H, Liu X, Meng H. Channel-wise gated res2net: Towards robust detection of synthetic speech attacks[C]//Proc. Interspeech 2021, 2021.</mixed-citation>
      </ref>
      <ref id="ref-20">
        <mixed-citation>[20] Kim G, Han D K, Ko H. Specmix: A mixed sample data augmentation method for training with time-frequency domain features[J]. arXiv preprint arXiv:2108.03020, 2021.</mixed-citation>
      </ref>
      <ref id="ref-21">
        <mixed-citation>[21] Zhong Z, Zheng L, Kang G, et al. Random erasing data augmentation[C]//Proceedings of the AAAI conference on artificial intelligence. 2020, 34(07): 13001-13008.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Svishchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Volkova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Das</surname>
            <given-names>R K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>H.</given-names>
          </string-name>
          <article-title>Long range acoustic and deep Chirkovskiy, A</article-title>
          . Kondratev,and G. Lavrentyeva, features perspective on
          <source>ASVspoof</source>
          <year>2019</year>
          [C]//2019 “
          <article-title>STC Antispoofing Systems for the ASVspoof2021 IEEE Automatic Speech Recognition</article-title>
          and Under- Challenge,”
          <source>in Proc. 2021 Edition of the Automatic standing Workshop</source>
          (ASRU). IEEE,
          <year>2019</year>
          :
          <fpage>1018</fpage>
          -
          <lpage>1025</lpage>
          . Speaker Verification and Spoofing Countermea-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nautsch</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            <given-names>N</given-names>
          </string-name>
          , et al.
          <source>ASVspoof</source>
          <year>2019</year>
          : sures Challenge,
          <year>2021</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>67</lpage>
          .
          <article-title>spoofing countermeasures for the detection of syn-</article-title>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Phatak</surname>
          </string-name>
          , and
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Sivaraman, thesized, converted and replayed speech[J]</article-title>
          . IEEE “Pindrop Labs'
          <article-title>Submission to the ASVspoof 2021 Transactions on Biometrics, Behavior,</article-title>
          and Identity Challenge,”
          <source>in Proc. 2021 Edition of the Automatic Science</source>
          ,
          <year>2021</year>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>252</fpage>
          -
          <lpage>265</lpage>
          . Speaker Verification and Spoofing Countermea-
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Tak</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamble</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patino</surname>
            <given-names>J</given-names>
          </string-name>
          , et al.
          <source>Rawboost: sures Challenge</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cáceres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Font</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Molina</surname>
          </string-name>
          , “
          <article-title>The Biometric Vox System for the ASVspoof 2021 Challenge</article-title>
          ,” in
          <source>Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Fathan</surname>
          </string-name>
          , “
          <article-title>CRIM's System Description for the ASVSpoof2021 Challenge</article-title>
          ,” in
          <source>Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Delgado</surname>
          </string-name>
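          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,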
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          et al., “
          <article-title>ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild</article-title>
          ,”
          <source>arXiv preprint arXiv:2210.02437</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Aflalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Permuter</surname>
          </string-name>
          , “
          <article-title>A study on data augmentation in voice anti-spoofing</article-title>
          ,”
          <source>Speech Communication</source>
          , vol.
          <volume>141</volume>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>67</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          et al., “
          <article-title>ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection</article-title>
          ,” in
          <source>ASVspoof 2021 Workshop-Automatic Speaker Verification and Spoofing Countermeasures Challenge</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          , “
          <article-title>The DKU-CMRI system for the ASVspoof 2021 challenge: vocoder based replay channel response estimation</article-title>
          ,”
          <source>Proc. 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge</source>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Xue</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lv</surname>
            <given-names>Z</given-names>
          </string-name>
          , et al.
          <article-title>Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features</article-title>
          [C]//
          <source>Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia</source>
          .
          <year>2022</year>
          :
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          , “
          <article-title>Squeeze-and-excitation networks</article-title>
          ,” in
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Xue</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            <given-names>J</given-names>
          </string-name>
          , et al.
          <article-title>Learning from yourself: A self-distillation method for fake speech detection</article-title>
          [C]//
          <source>ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE,
          <year>2023</year>
          :
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Liu</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahidullah</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild</article-title>
          [J].
          <source>arXiv preprint arXiv:2210.02437</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>