<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Unknown Speech Spoofing Algorithms with Nearest Neighbors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jingze Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuxiang Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhuo Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zengqiang Shang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>WenChao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pengyuan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics</institution>
          ,
          <addr-line>CAS, No.21 North 4th Ring West Road, Haidian District, Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Chinese Academy of Sciences</institution>
          ,
          <addr-line>No.1 Yanqihu East Road, Huairou District, Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>89</fpage>
      <lpage>94</lpage>
      <abstract>
        <p>The development of deep speech generation technology has increased the risk of people being exposed to malicious or misleading information. From a defensive perspective, merely distinguishing between genuine and fake utterances is not enough. At the vocoder level, the artifacts in different frequency bands make it possible to distinguish between different synthesis methods. A reliable model should not only classify synthesis algorithms correctly, but also be able to identify samples that have not been seen. The second Audio Deepfake Detection Challenge (ADD2023) set up Track3 (Deepfake Algorithms Recognition) to simulate such a scenario. The challenge motivates researchers to construct systems that are robust enough for In-Distribution (ID) and Out-Of-Distribution (OOD) utterances. Cosine similarity based kNN distance is introduced in this work to separate unknown samples from known ones. Together with data augmentation methods and logits based model fusion, our system wins first place in ADD2023 Track3.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake Detection</kwd>
        <kwd>Algorithms Recognition</kwd>
        <kwd>Out-Of-Distribution Detection</kwd>
        <kwd>ADD Challenge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Nearest Neighbors (kNN) distance is adopted to detect OOD data in [17]. We find that, for the task of detecting spoofing algorithms, the kNN distance based on cosine similarity can effectively detect samples from OOD algorithms. Therefore, the kNN distance is introduced in this work to construct a class calibration module, which improves the performance of the basic models significantly. In addition, we investigate different data augmentation and model fusion methods. All these methods help us achieve first place in ADD2023 Track3.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>The proposed work is based on Track 3 (Deepfake Algorithms Recognition) of the ADD2023 Challenge. In this section, we investigate the basis of deepfake algorithms recognition, namely the artifacts introduced by vocoders, which are located in different frequency bands. In addition, OOD samples exist in the Track3 test set, so a kNN-based OOD detection method is also proposed to identify samples from the unknown counterfeit class.</p>
      <sec id="sec-2-1">
        <title>2.1. Vocoder Artifacts</title>
        <p>Before recognizing deepfake algorithms, what needs to be demonstrated is whether the utterances generated by different synthesis methods are distinguishable, and on what level they are distinguishable.</p>
        <p>The vocoder is a key component in the process of generating fake utterances: it converts acoustic features into sampling points, and its quality determines the quality of the generated utterance. Vocoder residual artifacts located in different frequency bands can serve as markers for deepfake algorithms. For instance, non-ideal upsampling filters leave aliasing artifacts in the high-frequency part [18]. Figure 1 (the average amplitude of different vocoders in each frequency band) shows the impact of different vocoders on utterances at the frequency level. We reconstruct the same batch of natural speech with different vocoders and calculate the average energy of the frames at each frequency point. From Figure 1, it can be seen that the artifacts carried by different vocoders are located in different frequency bands. Therefore, features that encode time-frequency information can be utilized to recognize deepfake algorithms.</p>
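        <p>As a rough illustration of this analysis, the following Python sketch averages the magnitude spectrum of a set of vocoded utterances per frequency bin; the STFT parameters are assumptions, not necessarily the values used for Figure 1.</p>
        <preformat>
import numpy as np
import librosa

def average_band_energy(wav_paths, n_fft=1024, hop_length=256):
    """Average magnitude per frequency bin over all frames of a set of
    utterances (illustrative sketch of the analysis behind Figure 1)."""
    acc = np.zeros(n_fft // 2 + 1)
    frames = 0
    for path in wav_paths:
        y, _ = librosa.load(path, sr=None)                       # load one utterance
        spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
        acc += spec.sum(axis=1)                                   # accumulate per-bin magnitude
        frames += spec.shape[1]
    return acc / frames                                           # one value per frequency bin
        </preformat>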
      </sec>
      <sec id="sec-2-2">
        <title>2.2. kNN-based OOD Detection</title>
        <p>The proposed kNN-based OOD detection method is a distance-based method, which leverages the distance between embeddings extracted by trained DNN-based models. The basis of the proposed method is an intuitive assumption that samples from the same class are closer in distance, while samples from different classes are farther apart.</p>
        <p>We denote the sets of In-Distribution (ID) data and Out-Of-Distribution (OOD) data as D_in and D_out, respectively. The purpose of the algorithm is to determine whether a sample x ∈ X, where X denotes the input space, comes from D_in or D_out. For this binary classification task, a direct solution is to set a mapping function S(x) and a threshold λ:</p>
        <disp-formula>
          <tex-math>x \in \mathcal{D}_{\mathrm{out}} \ \text{if} \ S(x) \ge \lambda, \qquad x \in \mathcal{D}_{\mathrm{in}} \ \text{if} \ S(x) &lt; \lambda .</tex-math>
        </disp-formula>
        <p>Based on the assumption that samples from different classes are farther apart, in this work we leverage the k-th nearest neighbor (kNN) distance as the output of the mapping function S(x), inspired by [17]. Compared to the 1st nearest neighbor (1NN) distance, the kNN distance under an appropriate k-value is less susceptible to noisy samples. Cosine similarity is adopted to calculate the distance between the feature embeddings and is defined as</p>
        <disp-formula>
          <tex-math>S_{i,j} = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert},</tex-math>
        </disp-formula>
        <p>where e_i and e_j are the embeddings of utterances extracted by the models. Figure 2 shows the density of the kNN distance between embeddings for ID data and OOD data, where the ID data comes from one known class of the ADD2023 Track3 training set and the OOD data comes from the other classes. The kNN cosine distance of ID data is smaller than that of OOD data. Therefore, a threshold-based criterion can be used to determine whether an input utterance is OOD. The pipeline of the method can be summarized as follows: (1) train a multi-class DNN-based classifier with the training dataset D_train; (2) use the trained model to pre-classify the test set D_test; (3) extract the feature embeddings of each sample from D_train and D_test; (4) select an appropriate k-value, calculate the kNN cosine distance within D_train for each class, and estimate a threshold; (5) calculate the kNN cosine distance between D_test and D_train for each class, and attribute the OOD samples to a new unknown class based on the threshold-based criterion.</p>
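        <p>A minimal sketch of step (5) is given below, assuming the embeddings of one known class from D_train form a reference bank and that the k-value and threshold have already been estimated from the training data.</p>
        <preformat>
import numpy as np

def knn_cosine_distance(query, bank, k):
    """k-th smallest cosine distance from each query embedding to a bank of
    embeddings from one known class (shapes: query (n, d), bank (m, d))."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    dist = 1.0 - q @ b.T                        # cosine distance matrix, shape (n, m)
    return np.sort(dist, axis=1)[:, k - 1]      # k-th nearest-neighbor distance

def flag_ood(query, bank, k, threshold):
    # Samples whose kNN cosine distance exceeds the threshold are attributed
    # to the new unknown class.
    return knn_cosine_distance(query, bank, k) > threshold
        </preformat>
      </sec>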
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Metrics</title>
        <sec id="sec-3-1-1">
          <p>We used the training, development and test datasets of ADD2023 Challenge Track 3 (Deepfake Algorithms Recognition) [14] to validate our work. The training and development sets include 6 types of counterfeit speech generated with different deepfake algorithms and 1 type of genuine speech. The test set includes the 7 classes from the training and development sets, and an unknown counterfeit speech class. The detailed information about the dataset is shown in Table 1.</p>
          <p>The F1-score, defined as F1 = 2PR/(P+R), is adopted as the evaluation metric, where P and R represent precision and recall, respectively, averaged over the N classes:</p>
          <disp-formula>
            <tex-math>P = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i}, \qquad R = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i}.</tex-math>
          </disp-formula>
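          <p>For concreteness, a small sketch of these class-averaged metrics is shown below; it is illustrative only and assumes integer class labels.</p>
          <preformat>
import numpy as np

def macro_precision_recall_f1(y_true, y_pred, num_classes):
    """Class-averaged precision and recall, and the resulting F1-score."""
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    precision = np.mean(tp / np.maximum(tp + fp, 1))   # guard against empty classes
    recall = np.mean(tp / np.maximum(tp + fn, 1))
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
          </preformat>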
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Augmentation</title>
        <p>To augment the training data, we utilized a data augmentation method that is common in speech spoofing detection tasks: adding noise and reverberation from the MUSAN [19] and RIRs [20] datasets to the original speech. In addition, we add acoustic scenes as additive noise to improve the robustness of the methods under various noisy scenarios. The acoustic scenes are randomly selected from the TAU Urban Acoustic Scenes database [21].</p>
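        <p>A minimal sketch of these two operations, assuming noise segments and room impulse responses have already been loaded as NumPy arrays and that the SNR is chosen by the training pipeline, is given below.</p>
        <preformat>
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise segment (e.g. from MUSAN or TAU scenes) into speech at a
    given signal-to-noise ratio in dB."""
    noise = np.resize(noise, speech.shape)               # repeat/trim to speech length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response (e.g. from the RIRs
    dataset) and keep the original length."""
    return np.convolve(speech, rir)[: len(speech)]
        </preformat>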
        <p>Since ADD2023 Track 3 includes OOD data, it is necessary to mitigate the common issue of overconfidence in deep neural networks. Therefore, we also introduce CutMix [<xref ref-type="bibr" rid="ref22">22</xref>] as a data augmentation method. The operation of CutMix can be described as</p>
        <disp-formula>
          <tex-math>\tilde{x} = M \odot x_A + (\mathbf{1}_{F \times T} - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B,</tex-math>
        </disp-formula>
        <p>where x_A and x_B ∈ R^{F×T} are two-dimensional time-frequency features extracted from utterances randomly selected from the training set, and y_A and y_B are the labels of the selected samples. M ∈ {0, 1}^{F×T} denotes a binary mask indicating where to drop out and fill in from the two features, 1_{F×T} is a binary mask filled with ones, and ⊙ is element-wise multiplication. (x̃, ỹ) denotes the newly generated training sample. CutMix cuts and pastes two speech samples from different classes at the two-dimensional time-frequency feature level, allowing the DNN model to learn a better decision boundary. In addition, CutMix can improve the model’s ability to distinguish OOD data [<xref ref-type="bibr" rid="ref22">22</xref>].</p>
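        <p>A brief sketch of CutMix applied to a pair of time-frequency features is shown below; the Beta(1, 1) mixing ratio and the rectangular region sampling follow the common CutMix recipe and are assumptions rather than the exact settings used in this work.</p>
        <preformat>
import numpy as np

def cutmix(x_a, x_b, y_a, y_b, rng=None):
    """CutMix for (F, T) time-frequency features with one-hot labels."""
    rng = rng or np.random.default_rng()
    f_dim, t_dim = x_a.shape
    lam = rng.beta(1.0, 1.0)                             # combination ratio
    cut_f = int(f_dim * np.sqrt(1.0 - lam))              # size of the replaced region
    cut_t = int(t_dim * np.sqrt(1.0 - lam))
    f0 = rng.integers(0, f_dim - cut_f + 1)
    t0 = rng.integers(0, t_dim - cut_t + 1)

    mask = np.ones((f_dim, t_dim), dtype=float)          # M: 1 keeps x_a, 0 takes x_b
    mask[f0:f0 + cut_f, t0:t0 + cut_t] = 0.0

    x_new = mask * x_a + (1.0 - mask) * x_b              # x~ = M (*) x_A + (1 - M) (*) x_B
    lam = mask.mean()                                     # ratio from the actual kept area
    y_new = lam * y_a + (1.0 - lam) * y_b                 # y~ = lam * y_A + (1 - lam) * y_B
    return x_new, y_new
        </preformat>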
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Architecture</title>
        <sec id="sec-3-3-1">
          <p>Since the method introduced in this work, detecting OOD data based on kNN, is model-agnostic, we train various model architectures. By doing so, the complementarity between different models can be exploited through model fusion in order to enhance performance.</p>
          <p>Similar to the traditional pipeline of speech spoofing detection tasks, in this deepfake algorithm recognition task we divide the model into a front-end for feature extraction and a back-end for classification. For the front-end, we choose a hand-crafted feature, STFT, and an unsupervised pre-trained feature extractor, Wav2Vec2 [23]. For the back-end, three kinds of model architectures are adopted: SENet [4], LCNN-LSTM [24] and TDNN [25]. SENet is an integration of ResNet with the squeeze-and-excitation (SE) [26] block; SENet18 and SENet34, which differ in the number of blocks, are both adopted in our work.</p>
          <p>The STFT feature is a two-dimensional time-frequency feature, so convolution-based models can learn the patterns that exist in both dimensions. Although Wav2Vec2 also extracts two-dimensional features from an utterance, the features at each time frame are context representations rather than patterns that can be learned by convolutional kernels. Therefore, SENet-based back-ends are cascaded to the STFT front-end, while LCNN-LSTM and TDNN, which contain structures able to extract temporal information, are cascaded to the Wav2Vec2-based front-end.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training Strategy</title>
        <p>All DNN-based models are trained with the Adam optimizer [27], adopted with β_1 = 0.9, β_2 = 0.9, ε = 10^-8 and weight decay 10^-4. The angular margin based softmax loss (A-softmax) [28] is adopted as the loss function to be optimized. For the models with the STFT-based front-end, the learning rate is initialized as 3 × 10^-4; as a scheduler, StepLR is used with a step size of 10 epochs and a coefficient of 0.5. For the Wav2Vec2-based feature extractor, the learning rate is fixed at 10^-6. All models are trained for 100 epochs, and the model with the lowest loss on the dev set is selected as the final model.</p>
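        <p>A rough PyTorch sketch of this optimizer and scheduler setup is shown below; the model and the A-softmax loss are placeholders, and whether a scheduler is applied to the Wav2Vec2 extractor is not specified here.</p>
        <preformat>
import torch

def build_optimizer(model, stft_frontend=True):
    """Optimizer/scheduler setup following the hyperparameters above."""
    lr = 3e-4 if stft_frontend else 1e-6           # STFT-based models vs. Wav2Vec2 extractor
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.9),
        eps=1e-8,
        weight_decay=1e-4,
    )
    # Halve the learning rate every 10 epochs (used for the STFT-based models).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    return optimizer, scheduler
        </preformat>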
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Model Fusion</title>
        <sec id="sec-3-5-1">
          <p>Since the proposed OOD detection method is model-agnostic, we introduce a logits-based model fusion method to leverage the complementarity between different models. Logits output by different models are weighted and then added. For the samples that are identified as OOD data by the kNN-based detector, the original maximum logit value is assigned to the new unknown class, and the logit of the original maximum class index is set to zero.</p>
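          <p>A minimal sketch of this fusion rule is given below; the per-model weights and the OOD flags from the kNN-based detector are inputs, and appending an extra column for the unknown class is an implementation detail assumed here.</p>
          <preformat>
import numpy as np

def fuse_logits(logits_list, weights, is_ood):
    """Weighted logit fusion with kNN-based reassignment to the unknown class.

    logits_list: per-model arrays of shape (num_samples, num_known_classes).
    is_ood: boolean array from the kNN-based detector, one entry per sample.
    Returns class indices, where index num_known_classes denotes the unknown class.
    """
    fused = sum(w * l for w, l in zip(weights, logits_list))      # weighted sum of logits
    n, c = fused.shape
    out = np.concatenate([fused, np.full((n, 1), -np.inf)], axis=1)
    for i in np.flatnonzero(np.asarray(is_ood)):
        top = int(np.argmax(fused[i]))
        out[i, c] = fused[i, top]        # give the unknown class the original max logit
        out[i, top] = 0.0                # zero out the original max class
    return out.argmax(axis=1)
          </preformat>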
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result and Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Results of Data Augmentation</title>
        <sec id="sec-4-1-1">
          <title>Two data augmentation methods are introduced in this work, namely additive noise and cutmix. Under the same DNN model (STFT+SENet34), the results of data augmentation are shown in Table 2. It should be noted that all</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Results of Model Fusion</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <sec id="sec-5-1">
        <title>4.4. Results of Submitted System</title>
        <sec id="sec-5-1-1">
          <title>This paper describes the system developed for ADD2023</title>
          <p>Track3. Five single-models with diferent front-ends and
back-ends are constructed as basic classifiers for the
deepfake algorithms recognition task. kNN distance is
efective in separating ID samples and OOD samples.
Therefore, an OOD detection module based on kNN distance
is introduced and improve the performance of
singlemodels significantly. Introducing additive noise during
the training process makes single-model more robust.
After fusing these models at the logits level, our final
system achieves first place in ADD2023 Track3.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>This work is partially supported by the National Key Research and Development Program of China (No. 2021YFC3320103).</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Interspeech.</surname>
          </string-name>
          <year>2022</year>
          -
          <volume>129</volume>
          . [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <year>wav2vec</year>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , H. Ma, T. Wang,
          <volume>2</volume>
          .0:
          <article-title>A framework for self-supervised learning of</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>tecting vocoder fingerprints of fake audio</article-title>
          ,
          <source>in: mation Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Proceedings of the 1st International Workshop on</source>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          , A comparative study
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery,
          <article-title>New synthetic speech detection</article-title>
          ,
          <source>in: Proc. Inter-</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>York</surname>
          </string-name>
          , NY, USA,
          <year>2022</year>
          , p.
          <fpage>61</fpage>
          -
          <lpage>68</lpage>
          . URL: https://doi.org/ speech
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>4259</fpage>
          -
          <lpage>4263</lpage>
          . doi:
          <volume>10</volume>
          .21437/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          10.1145/3552466.3556525. doi:
          <volume>10</volume>
          .1145/3552466. Interspeech.
          <year>2021</year>
          -
          <volume>702</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          3556525. [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Desplanques</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thienpondt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Demuynck</surname>
          </string-name>
          , [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chenglong</surname>
          </string-name>
          Ecapa-tdnn:
          <article-title>Emphasized channel attention</article-title>
          , propa-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , ifcation,
          <source>Proc. Interspeech</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>3830</fpage>
          -
          <lpage>3834</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Add</source>
          <year>2023</year>
          :
          <article-title>the second audio deepfake detection chal-</article-title>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <surname>Squeeze-</surname>
          </string-name>
          and
          <string-name>
            <surname>-excitation</surname>
          </string-name>
          net-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          lenge, in: IJCAI 2023 Workshop on Deepfake Audio works,
          <source>in: Proceedings of the IEEE conference on</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Detection</surname>
            and
            <given-names>Analysis (DADA</given-names>
          </string-name>
          <year>2023</year>
          ),
          <year>2023</year>
          . computer vision and pattern recognition,
          <year>2018</year>
          , pp. [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          , Enhancing the reliabil-
          <volume>7132</volume>
          -7141.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>ity of out-of-distribution image detection in neu-</article-title>
          [27]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>ral networks</article-title>
          , in: 6th International Conference on optimization, in: Y. Bengio, Y. LeCun (Eds.),
          <fpage>3rd</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Learning</given-names>
            <surname>Representations</surname>
          </string-name>
          ,
          <source>ICLR</source>
          <year>2018</year>
          , Vancouver, International Conference on Learning Representa-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>BC</surname>
          </string-name>
          , Canada, April 30 - May 3,
          <year>2018</year>
          , Conference tions,
          <source>ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-9,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Track</given-names>
            <surname>Proceedings</surname>
          </string-name>
          , OpenReview.net,
          <year>2018</year>
          . URL:
          <year>2015</year>
          , Conference Track Proceedings,
          <year>2015</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          https://openreview.net/forum?id=H1VGkIxRZ. http://arxiv.org/abs/1412.6980. [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bendale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Boult</surname>
          </string-name>
          , Towards open set deep [28]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Raj</surname>
          </string-name>
          , L. Song,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>on computer vision and pattern recognition, 2016, recognition</article-title>
          , in: Proceedings of the IEEE conference
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          pp.
          <fpage>1563</fpage>
          -
          <lpage>1572</lpage>
          .
          <article-title>on computer vision</article-title>
          and pattern recognition,
          <year>2017</year>
          , [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Out-</surname>
          </string-name>
          of-distribution pp.
          <fpage>212</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <year>2022</year>
          , pp.
          <fpage>20827</fpage>
          -
          <lpage>20840</lpage>
          . [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Wang,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          , Analy-
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>form generation models, Applied Acoustics 203</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          (
          <year>2023</year>
          )
          <fpage>109183</fpage>
          . [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , MUSAN:
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>abs/1510</source>
          .08484 (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          1510.08484. arXiv:
          <volume>1510</volume>
          .
          <fpage>08484</fpage>
          . [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peddinti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Seltzer</surname>
          </string-name>
          , S. Khudan-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <article-title>speech for robust speech recognition</article-title>
          , in: 2017 IEEE
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Signal</given-names>
            <surname>Processing (ICASSP),</surname>
          </string-name>
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>5220</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          5224. [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mesaros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heittola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          , A multi-
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>tion of Acoustic Scenes and Events 2018 Workshop</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>(DCASE2018)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>13</lpage>
          . [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          , D. Han,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>puter vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6023</fpage>
          -
          <lpage>6032</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>