<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deepfake algorithm recognition through multi-model fusion based on manifold measure</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ye Tian</string-name>
          <email>tianye_cetc3@163.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunkun Chen</string-name>
          <email>yunkun.chen@connect.polyu.hk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuezhong Tang</string-name>
          <email>tangyuezhong@cetc.com.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boyang Fu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The 3rd Research Institute of China Electronics Technology Group Corporation</institution>
          ,
          <addr-line>No.B7 Jiuxianqiao North Road, Chaoyang District, Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>76</fpage>
      <lpage>81</lpage>
      <abstract>
        <p>This paper describes a deepfake algorithm recognition system submitted to the Audio Deep Synthesis Detection (ADD) Challenge Track 3, which aims to recognize the algorithms used to generate deepfake utterances. Given the complex noise present in the testing data and the existence of unknown deepfake algorithms, we propose a manifold-based multi-model fusion approach for open-set recognition. This approach constructs a manifold space to fuse the deep embedding features extracted by different models and computes the geodesic distance between the manifold spaces of different deepfake algorithms to distinguish unknown deepfake methods. Experimental results demonstrate the effectiveness of the proposed strategy in multi-model fusion. The proposed system obtained an F1-score of 0.7934 in ADD Track 3 testing.</p>
      </abstract>
      <kwd-group>
        <kwd>Deepfake algorithm recognition</kwd>
        <kwd>model fusion</kwd>
        <kwd>manifold space</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rest of this paper is organized as follows. Section 2 describes the task. Section 3 presents the related work and illustrates our proposed method. Results and discussion are reported in Section 4. Finally, Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-task">
      <title>2. Task description and data</title>
      <p>
        The Audio Deep Synthesis Detection (ADD) Challenge Track 3 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] aims to recognize the algorithm used to generate each deepfake utterance. The testing set contains fake utterances produced by both known and unknown algorithms. The training and development sets cover 7 classes (1 real and 6 counterfeit), labeled 0 to 6. The testing set covers 8 classes: the same 7 classes plus 1 unknown counterfeit class.
      </p>
      <p>There are 22,400 training utterances, 8,400 development utterances, and 79,490 testing utterances. Besides containing an unknown category, the testing data carry much more complex noise than the training data. The challenge therefore focuses on improving the generalization ability of a model trained on limited data.</p>
      <p>The metrics for this track are the macro-averaged precision, recall, and F1-score.</p>
    </sec>
    <sec id="sec-2">
      <title>3. System description</title>
      <p>To improve the performance of the system on the testing set, we took measures at the data layer, the feature layer, the model layer, and finally the score calculation, which are described in detail below.</p>
      <sec id="sec-2-0">
        <title>3.1. Data augmentation</title>
        <p>First, by inspecting the training data, we found that the audio was sampled at 16 or 24 kHz and that the volume varied widely. All audio was therefore resampled to 16 kHz and normalized.</p>
        <p>
          Second, compared with the relatively clean training and development data, the noise interference in the testing data was more complicated. We therefore augmented the training and development data with the MUSAN [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] dataset, with the SNR set randomly between 15 and 30 dB.
        </p>
        <p>Finally, some completely silent, zero-volume segments were found in these datasets. Although this may be characteristic of some deepfake methods, the silent segments at the beginning and end of each audio file were cropped out in the interest of the general applicability of the model.</p>
      </sec>
      <sec id="sec-1-1">
        <title>3.2. Features</title>
        <p>To handle the complexity of the testing data, we explored three categories of features: raw waveform, hand-crafted features, and pre-trained features. Our expectation was that a combination of these features would capture the divergences among different deepfake algorithms.</p>
        <p>
          The findings in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] demonstrate that anti-spoofing systems can achieve good performance by using raw waveforms with an end-to-end network architecture. In our work, a unified audio duration of 3 s was applied in subsequent processing, with truncation or padding.
        </p>
        <p>
          Hand-crafted features, in contrast to raw waveforms, are extracted based on specific knowledge. Several features are widely used in anti-spoofing, such as constant-Q cepstral coefficients (CQCC), linear frequency cepstral coefficients (LFCC), and the log power magnitude spectrogram (Spec) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. While all of these features have demonstrated utility in anti-spoofing, we chose LFCC as the hand-crafted feature in Track 3 based on our previous tests with the ASVSpoof2019 dataset.
        </p>
        <p>
          Due to the complexity of the testing data and the scarcity of available training data, we utilized a pre-trained model to extract essential speech features. Recently, some pre-trained speech models, including Wav2vec 2.0 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], HuBERT [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and WavLM [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], have demonstrated significant performance improvements in downstream tasks such as automatic speech recognition, text-to-speech, and voice conversion. As some experiments have shown that HuBERT performs comparably to or better than the leading Wav2vec 2.0 on various benchmarks, we utilized a HuBERT model as a feature extractor and fed the raw waveform to the model as input.
        </p>
      </sec>
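      <p>The data preparation of Section 3.1 and the fixed 3 s input length of Section 3.2 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code used for the submission: the function names (peak_normalize, mix_at_snr, trim_silence, fix_length) are hypothetical, peak normalization is assumed because the text does not say which normalization was used, the audio is assumed to be already resampled to 16 kHz, a synthetic array stands in for a MUSAN noise clip, and the order of the steps is chosen only for the illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

SAMPLE_RATE = 16_000      # audio is assumed to be already resampled to 16 kHz
TARGET_SECONDS = 3        # unified duration stated in Section 3.2


def peak_normalize(wave: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """Scale the waveform so its largest absolute sample is about 1 (assumed scheme)."""
    return wave / (np.max(np.abs(wave)) + eps)


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech so that the speech-to-noise power ratio equals snr_db."""
    # Repeat or cut the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Gain g chosen so that speech_power / (g**2 * noise_power) == 10**(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


def trim_silence(wave: np.ndarray) -> np.ndarray:
    """Drop exactly-zero samples at the beginning and end of the waveform."""
    nonzero = np.flatnonzero(wave)
    if nonzero.size == 0:
        return wave[:0]
    return wave[nonzero[0] : nonzero[-1] + 1]


def fix_length(wave: np.ndarray, n_samples: int = SAMPLE_RATE * TARGET_SECONDS) -> np.ndarray:
    """Truncate or zero-pad the waveform to a fixed number of samples."""
    if len(wave) >= n_samples:
        return wave[:n_samples]
    return np.pad(wave, (0, n_samples - len(wave)))


rng = np.random.default_rng(0)
# Toy utterance with zero-volume segments at both ends.
speech = np.concatenate([np.zeros(800), rng.standard_normal(20_000), np.zeros(800)])
noise = rng.standard_normal(5_000)       # stand-in for a MUSAN noise clip
snr_db = rng.uniform(15, 30)             # SNR drawn at random between 15 and 30 dB

clean = peak_normalize(trim_silence(speech))
noisy = mix_at_snr(clean, noise, snr_db)
model_input = fix_length(noisy)
```
]]></preformat>
      <p>Silence is trimmed before the noise is mixed in because, once noise has been added, the zero-volume samples can no longer be detected by an exact comparison with zero.</p>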
      <sec id="sec-2-1">
        <title>3.3. Deep recognition network</title>
        <p>In our work, we utilized three different deep networks:
rawnet2, SE-Res2Net50, and HuBERT.</p>
        <p>
          rawnet2 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is an end-to-end network trained on raw audio. It consists of one sinc layer, six residual blocks with an attention mechanism, gated recurrent units (GRU), and two fully-connected layers. In our work, a softmax function was added to the output layer to produce seven-class predictions corresponding to the categories in the training dataset. The model was trained for 100 epochs with a batch size of 32 and a learning rate of 0.0001.
        </p>
        <p>
          SE-Res2Net50 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] is an improved version of the
ResNet [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] model that combines squeeze-and-excitation
(SE) with Res2Block. We trained the model using LFCC
features with cross-entropy as the loss function and
Adam as the optimizer with default parameters. The
model was trained for 40 epochs with a batch size of 48
and a learning rate of 0.0002.
        </p>
        <p>
          HuBERT is a self-supervised learning pre-trained
model and is available in several versions. We utilized
the chinese-hubert-large [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] model, which was trained using the WenetSpeech train L subset. After the final layer of the model, we added two fully-connected layers and a softmax function to generate predictions. Because of limited computing resources, we trained the model with a batch size of 24 for 40 epochs.
        </p>
        <p>To ensure the best performance, the final model used for testing was selected, from the above-mentioned models, as the one with the highest F1-score on the development set.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Manifold space and distance</title>
        <p>
          To classify the categories of deepfake audio and identify unknown deepfake methods, we adopted the manifold space and manifold distance. First, the manifold space of each deepfake category was constructed using the ONPE method [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Then, the spatial geodesic distance [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] between different manifold spaces was calculated using equation (1) and negated to serve as a similarity indicator. Finally, the softmax value was calculated using equations (2)-(4) as the final decision score.
        </p>
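        <p>
          <disp-formula id="eq-1"><label>(1)</label><tex-math>d(M_1, M_2) = \|\Theta\|_2, \quad \Theta = [\theta_1, \theta_2, \ldots, \theta_k]</tex-math></disp-formula>
          where the geodesic distance <inline-formula><tex-math>d(M_1, M_2)</tex-math></inline-formula> was calculated based on the principal angles <inline-formula><tex-math>[\theta_1, \theta_2, \ldots, \theta_k]</tex-math></inline-formula> between the spaces <inline-formula><tex-math>(M_1, M_2)</tex-math></inline-formula>, which were obtained from the orthonormal basis matrices (obtained by ONPE) and singular value decomposition.
        </p>
        <p>
          <disp-formula id="eq-2"><label>(2)</label><tex-math>s_{(t,i)} = \frac{\exp(g_{(t,i)} - m)}{\sum_{j=0}^{6} \exp(g_{(t,j)} - m)}</tex-math></disp-formula>
          <disp-formula id="eq-3"><label>(3)</label><tex-math>m = \max(g_{(t,0)}, g_{(t,1)}, \ldots, g_{(t,6)})</tex-math></disp-formula>
          <disp-formula id="eq-4"><label>(4)</label><tex-math>g_{(t,i)} = -d(M_t, M_i)</tex-math></disp-formula>
          where <inline-formula><tex-math>s_{(t,i)}</tex-math></inline-formula> represents the similarity score between the testing data <inline-formula><tex-math>t</tex-math></inline-formula> and the deepfake category <inline-formula><tex-math>i</tex-math></inline-formula>, while <inline-formula><tex-math>g_{(t,i)}\ (i = 0, 1, \ldots, 6)</tex-math></inline-formula> represents the negative of the geodesic distance between the testing data manifold space <inline-formula><tex-math>M_t</tex-math></inline-formula> and the manifold space <inline-formula><tex-math>M_i</tex-math></inline-formula> of deepfake method <inline-formula><tex-math>i</tex-math></inline-formula>.
        </p>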
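        <p>Equations (1) to (4) can be illustrated numerically as follows. This is a minimal NumPy sketch under stated assumptions rather than the submitted implementation: each manifold space is represented by a matrix with orthonormal columns, a plain SVD stands in for the ONPE projection (which is not reproduced here), the data are synthetic, and all function and variable names are hypothetical. The principal angles are taken as the arccosines of the singular values of the product of the two basis matrices.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def orthonormal_basis(features: np.ndarray, dim: int) -> np.ndarray:
    """Return `dim` orthonormal columns spanning the leading directions of `features`.

    Stand-in for ONPE: a plain SVD of the (feature_dim, n_samples) matrix.
    """
    u, _, _ = np.linalg.svd(features, full_matrices=False)
    return u[:, :dim]


def geodesic_distance(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Equation (1): 2-norm of the vector of principal angles between two subspaces."""
    singular_values = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    angles = np.arccos(np.clip(singular_values, -1.0, 1.0))
    return float(np.linalg.norm(angles))


def similarity_scores(test_basis: np.ndarray, class_bases: list) -> np.ndarray:
    """Equations (2)-(4): softmax over the negated geodesic distances."""
    g = np.array([-geodesic_distance(test_basis, b) for b in class_bases])
    shifted = np.exp(g - g.max())      # subtracting the maximum avoids overflow
    return shifted / shifted.sum()


# Synthetic data: 7 classes, each living close to its own 5-dimensional subspace.
rng = np.random.default_rng(0)
n_classes, feat_dim, sub_dim = 7, 64, 5
true_bases = [np.linalg.qr(rng.standard_normal((feat_dim, sub_dim)))[0]
              for _ in range(n_classes)]


def sample(basis: np.ndarray, n: int) -> np.ndarray:
    """Draw n noisy feature vectors (as columns) near the subspace spanned by `basis`."""
    coeffs = rng.standard_normal((sub_dim, n))
    return basis @ coeffs + 0.01 * rng.standard_normal((feat_dim, n))


class_bases = [orthonormal_basis(sample(b, 40), sub_dim) for b in true_bases]
test_basis = orthonormal_basis(sample(true_bases[3], 12), sub_dim)

scores = similarity_scores(test_basis, class_bases)
predicted = int(np.argmax(scores))     # index of the closest class subspace
```
]]></preformat>
        <p>Because softmax is invariant to a constant shift of its inputs, subtracting the maximum in equation (3) changes only the numerical behaviour, not the resulting scores.</p>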
      </sec>
      <sec id="sec-2-3">
        <title>3.5. Model fusion</title>
        <p>To effectively improve the final recognition results, we
conducted model fusion at three levels.</p>
        <sec id="sec-2-3-1">
          <title>3.5.1. Fusion on label layer</title>
          <p>First is the label-level fusion. For each of the rawnet2, SE-Res2Net50, and HuBERT models, the index corresponding to the maximum output score was taken as the output label. A threshold for open-set recognition was set based on model training and validation. The output labels were then adjusted: those with scores below the threshold were assigned the unknown label 7. Three sets of recognition labels were thus obtained for the testing data, and the mode of the three labels was used as the fused label. When all three labels differed, the result of the HuBERT model was chosen as the fused result because it had the best performance.</p>
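          <p>The label-level fusion rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the submitted code: the function names, the example scores, and the threshold value of 0.5 are hypothetical, and a single threshold is used for all three models although the text does not say whether the thresholds were shared.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

UNKNOWN_LABEL = 7   # label reserved for the unknown deepfake algorithm


def open_set_label(scores, threshold):
    """Return the arg-max class, or UNKNOWN_LABEL when the top score is below threshold."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else UNKNOWN_LABEL


def fuse_labels(rawnet2_label, se_res2net_label, hubert_label):
    """Majority vote over three labels; fall back to HuBERT when all three differ."""
    votes = Counter([rawnet2_label, se_res2net_label, hubert_label])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else hubert_label


# Hypothetical seven-class softmax outputs of the three models for one utterance.
rawnet2_scores = [0.05, 0.10, 0.60, 0.05, 0.05, 0.10, 0.05]
se_res2net_scores = [0.02, 0.03, 0.85, 0.02, 0.03, 0.03, 0.02]
hubert_scores = [0.20, 0.15, 0.15, 0.20, 0.10, 0.10, 0.10]

labels = [open_set_label(s, threshold=0.5)
          for s in (rawnet2_scores, se_res2net_scores, hubert_scores)]
fused = fuse_labels(*labels)
```
]]></preformat>
          <p>In this toy case the HuBERT scores are flat, so its label falls back to the unknown class, while the other two models agree and their shared label wins the vote.</p>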
        </sec>
        <sec id="sec-2-3-2">
          <title>3.5.2. Fusion on score layer</title>
          <p>
            Next is the score-level fusion. A common score fusion method takes the mean of multiple sets of scores. As discussed in the literature [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], when the scores show a clear polarization in the histogram, score fusion becomes difficult and the fused results may be degraded. In our work, the scores obtained on the testing data showed such polarization in the histograms, as shown in Figure 1 (left). Although this phenomenon was not as prominent as in the literature [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], we applied inference augmentation to alleviate it. If a model is trained well on the training set, the softmax function is likely to produce extreme values (0 or 1). To keep the softmax outputs away from 0 and 1, we first bounded the softmax inputs to (-20, 20) and then applied a constant multiplier of 0.1 to them. The score distribution after inference augmentation is shown in Figure 1 (right). The index corresponding to the maximum score was then set as the output label.
          </p>
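          <p>The effect of the inference augmentation on score-level fusion can be illustrated with a short sketch. It assumes that the "constant multiplier of 0.1" scales the bounded softmax inputs and that fusion is the plain mean of the per-model scores; the function names and the example logits are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tempered_softmax(logits, bound=20.0, multiplier=0.1):
    """Clip logits to [-bound, bound], scale them by `multiplier`, then apply softmax."""
    clipped = [max(-bound, min(bound, x)) for x in logits]
    return softmax([multiplier * x for x in clipped])


def mean_fusion(score_sets):
    """Average several score vectors class by class."""
    n = len(score_sets)
    return [sum(column) / n for column in zip(*score_sets)]


# Hypothetical pre-softmax outputs of two models for the same utterance.
logits_a = [35.0, 2.0, -4.0, 0.5, -30.0, 1.0, 3.0]
logits_b = [1.0, 0.5, 0.0, -1.0, 0.0, 0.5, 9.0]

plain_a = softmax(logits_a)                 # polarized: almost all mass on one class
tempered_a = tempered_softmax(logits_a)     # softer distribution, same arg-max
tempered_b = tempered_softmax(logits_b)

fused = mean_fusion([tempered_a, tempered_b])
fused_label = fused.index(max(fused))
```
]]></preformat>
          <p>Clipping and scaling do not change the ranking of the classes within a model; they only reduce the gap between the largest score and the rest, so that the mean is not dominated by whichever model happens to saturate.</p>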
        </sec>
        <sec id="sec-2-3-3">
          <title>3.5.3. Fusion on feature layer</title>
          <p>Finally, feature-level fusion was performed, as shown in Figure 2. The 256-dimensional penultimate-layer outputs of the different models were concatenated as the embedding features, which were then used to construct the manifold space for each class, and the spatial distance was calculated as the similarity score.</p>
          <p>Specifically, <italic>N</italic> training utterances for each deepfake method <italic>i</italic> were input to the <italic>K</italic> trained models to obtain <italic>N</italic> × 256 × <italic>K</italic> feature matrices. The feature matrices were then processed by ONPE to obtain the manifold space of the deepfake method. Next, each testing utterance was cut into segments of length 3 s with a shift of 1 s, giving <italic>n</italic> audio segments. The segments were input to the three trained models to obtain <italic>n</italic> × 256 × <italic>K</italic> feature matrices, which were processed by ONPE to obtain the manifold space of the testing data. The geodesic distance and softmax score between the manifold space of the training data and that of the testing data were calculated as the final fusion score. If the maximum score was higher than the threshold, the index corresponding to the maximum score was set as the output label; otherwise the label was set to 7 as a new label. The threshold was fine-tuned on the testing data.</p>
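          <p>The feature-level fusion pipeline can be sketched end to end as follows. This is a toy NumPy illustration under several assumptions, not the authors' implementation: two FFT-based functions stand in for the trained networks and return 256-dimensional embeddings, the embeddings are concatenated along the feature axis, a plain SVD replaces ONPE, the segment length and shift are taken to be 3 s and 1 s, sine tones play the role of deepfake classes, and the threshold of 0.3 and all names are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

SAMPLE_RATE = 16_000


def sliding_segments(wave, win_s=3, hop_s=1, sr=SAMPLE_RATE):
    """Cut a waveform into windows of win_s seconds taken every hop_s seconds."""
    win, hop = win_s * sr, hop_s * sr
    if len(wave) < win:
        wave = np.pad(wave, (0, win - len(wave)))
    starts = range(0, len(wave) - win + 1, hop)
    return np.stack([wave[s : s + win] for s in starts])


def fused_embeddings(segments, models):
    """Concatenate the 256-dimensional embeddings of every model for each segment."""
    return np.concatenate([m(segments) for m in models], axis=1)


def subspace(embeddings, dim):
    """Orthonormal basis of the leading `dim` directions (SVD stand-in for ONPE)."""
    u, _, _ = np.linalg.svd(embeddings.T, full_matrices=False)
    return u[:, :dim]


def geodesic_distance(basis_a, basis_b):
    """2-norm of the principal angles between two subspaces."""
    s = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.linalg.norm(np.arccos(np.clip(s, -1.0, 1.0))))


def classify(test_basis, class_bases, threshold, unknown_label=7):
    """Softmax over negated distances, with an open-set threshold on the top score."""
    g = np.array([-geodesic_distance(test_basis, b) for b in class_bases])
    scores = np.exp(g - g.max())
    scores /= scores.sum()
    best = int(scores.argmax())
    return best if scores[best] > threshold else unknown_label


# Stand-ins for two trained networks: each maps (n, samples) to (n, 256).
models = [
    lambda seg: np.abs(np.fft.rfft(seg, axis=1))[:, :256],
    lambda seg: np.abs(np.fft.rfft(seg, axis=1))[:, 256:512],
]
rng = np.random.default_rng(0)


def toy_audio(freq_hz, seconds=6):
    """Noisy sine tone used in place of real speech."""
    t = np.arange(seconds * SAMPLE_RATE) / SAMPLE_RATE
    return np.sin(2 * np.pi * freq_hz * t) + 0.01 * rng.standard_normal(t.size)


def audio_to_basis(wave, dim=1):
    return subspace(fused_embeddings(sliding_segments(wave), models), dim)


# Seven toy classes (tones at 10, 20, ..., 70 Hz) and two test utterances.
class_bases = [audio_to_basis(toy_audio(10 * (i + 1))) for i in range(7)]
known = classify(audio_to_basis(toy_audio(30)), class_bases, threshold=0.3)
unknown = classify(audio_to_basis(toy_audio(45)), class_bases, threshold=0.3)
```
]]></preformat>
          <p>A 6 s utterance yields four overlapping 3 s segments, so even a single test file provides several embedding vectors from which a subspace can be estimated. A test tone that matches none of the class tones is roughly equidistant from all class subspaces, which flattens the softmax scores and triggers the unknown label.</p>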
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results and discussion</title>
      <p>
        results, in fact it was less effective than the approach of SE-Res2Net50 with LFCC, probably because the model was relatively simple and overfitted more severely. Matching the model with an appropriate subsequent classification network might yield a model with better discriminative ability. To visualize the effectiveness of the proposed method, we also used t-SNE [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] to visualize the embedding features of the three models on the development data, as shown in Figure 3. The embeddings of rawnet2 were more separable than those of SE-Res2Net50 on the development data, which also indicates, from another perspective, that the trained rawnet2 model was overfitted.
      </p>
      <p>Secondly, in terms of fusion strategies, manifold-based feature-level fusion achieved the best performance, while score-level fusion with inference augmentation outperformed the common score fusion method (F21 vs. F22 and F41 vs. F42). Because our trained rawnet2 model performed poorly and pulled down the overall performance of label-level fusion with the other two models (F1), it was excluded from the subsequent score-level and feature-level fusion.</p>
      <p>The results of score-level and feature-level fusion indicate that the different models carry complementary information, and that constructing the manifold space and measuring the geodesic distance extracts further discriminative information, thus enhancing the overall recognition performance.</p>
      <p>Thirdly, regarding data augmentation (B11 to B22), because the background noise differs between the training and testing data, adding noise to the training data was effective in improving model performance. Unexpectedly, however, the HuBERT model trained on the augmented data performed worse than the HuBERT model trained on the original training data (B31 vs. B32). One possible reason is that the data used to pre-train the model already contain rich noisy data, which by itself shields speech from the effect of noise. In addition, owing to the time constraint of the competition, every model was trained with a single set of parameters and no parameter tuning was performed, which may also be a reason.</p>
      <p>Finally, it should be noted that F3 in Table 1, with an F1-score of 0.7352, was the best result we submitted to ADD Track 3 during the competition and ranked 5th. After the competition, when we conducted supplementary experiments on data augmentation, a better result was found (B31). We then conducted the relevant fusion experiments and obtained the results shown as F4 and F5, with the best result reaching 0.7934, which would so far rank 3rd in the competition. Despite this, the conclusion that feature-level fusion was better than score-level fusion remained consistent.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>Existing fake audio recognition systems often rely on three types of architectures: hand-crafted features with classifiers, end-to-end classification models, and pre-trained feature extractors with classifiers. In ADD Track 3, we explored three models and three multi-model fusion strategies. Experiments demonstrated the effectiveness of the proposed manifold-based feature-level fusion strategy. The proposed score-level fusion with inference augmentation is an attempt to address the fusion of models that tend to overfit. In addition, we examined the effect of data augmentation on model performance. Finally, the proposed model fusion method obtained an F1-score of 0.7934 in ADD Track 3 testing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sisman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An overview of voice conversion and its challenges: From statistical modeling to deep learning</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Soong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A survey on neural speech synthesis</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Muhammad</surname>
          </string-name>
          ,
          <article-title>Akbar, A overview of spoof speech detection for automatic speaker verification</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Replay and synthetic speech detection with res2net architecture</article-title>
          , in:
          <source>International Conference on Acoustics, Speech, and Signal Processing</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <article-title>ASSERT: Anti-spoofing with squeeze-excitation and residual networks</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Light convolutional neural network with feature genuinization for detection of synthetic speech attacks</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <article-title>End-to-end anti-spoofing with rawnet2</article-title>
          ,
          <year>2021</year>
          , pp.
          <fpage>6369</fpage>
          -
          <lpage>6373</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICASSP39728.2021.9414234</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <article-title>Deep spectro-temporal artifacts for detecting synthesized speech</article-title>
          (
          <year>2022</year>
          ). doi:
          <pub-id pub-id-type="doi">10.48550/arXiv.2210.05254</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Deepfake detection system for the ADD challenge track 3.2 based on score fusion</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3552466.3556528</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Recent advances in open set recognition: A survey</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>43</volume>
          (
          <year>2021</year>
          )
          <fpage>3614</fpage>
          -
          <lpage>3631</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/TPAMI.2020.2981604</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bendale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Boult</surname>
          </string-name>
          ,
          <article-title>Towards open set deep networks</article-title>
          , in:
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1563</fpage>
          -
          <lpage>1572</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR.2016.173</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
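          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Belyaev</surname>
          </string-name>
          ,
          <article-title>Spatiotemporal attention on manifold space for 3d human action recognition</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>51</volume>
          (
          <year>2021</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1007/s10489-020-01803-3</pub-id>
          .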
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>ADD 2022: the first audio deep synthesis detection challenge</article-title>
          ,
          <source>in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>9216</fpage>
          -
          <lpage>9220</lpage>
          . doi:10.1109/ICASSP43922.2022.9746939.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <article-title>MUSAN: A music, speech, and noise corpus</article-title>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>CoRR abs/2006.11477</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2006.11477. arXiv:2006.11477.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bolte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>HuBERT: Self-supervised speech representation learning by masked prediction of hidden units</article-title>
          ,
          <source>CoRR abs/2106.07447</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2106.07447. arXiv:2106.07447.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>WavLM: Large-scale self-supervised pretraining for full stack speech processing</article-title>
          ,
          <source>CoRR abs/2110.13900</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2110.13900. arXiv:2110.13900.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <article-title>rawnet2-antispoofing</article-title>
          (
          <year>2021</year>
          ). URL: https://github.com/eurecom-asp/rawnet2-antispoofing.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weng</surname>
          </string-name>
          , et al.,
          <article-title>ASV-anti-spoofing-with-Res2Net</article-title>
          (
          <year>2020</year>
          ). URL: https://github.com/lixucuhk/ASV-anti-spoofing-with-Res2Net.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:10.1109/CVPR.2016.90.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>chinese-speech-pretrain</article-title>
          (
          <year>2022</year>
          ). URL: https://github.com/TencentGameMate/chinese_speech_pretrain.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Orthogonal neighborhood preserving embedding for face recognition</article-title>
          ,
          <source>in: 2007 IEEE International Conference on Image Processing</source>
          , volume
          <volume>1</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>I-133</fpage>
          -
          <lpage>I-136</lpage>
          . doi:10.1109/ICIP.2007.4378909.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Kernel Grassmannian distances and discriminant analysis for face recognition from image sets</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          <volume>30</volume>
          (
          <year>2009</year>
          )
          <fpage>1161</fpage>
          -
          <lpage>1165</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0167865509001391. doi:10.1016/j.patrec.2009.06.002.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Visualizing data using t-SNE</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          . URL: http://jmlr.org/papers/v9/vandermaaten08a.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>