<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptive Fake Audio Detection with Low-Rank Model Squeezing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaohui Zhang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiangyan Yi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianhua Tao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chenglong Wang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Xu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruibo Fu</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Automation, Tsinghua University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Artificial Intelligence, University of Chinese Academy of Sciences</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer and Information Technology, Beijing Jiaotong University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>State Key Laboratory of Multimodal Artificial Intelligence System, Institute of Automation, Chinese Academy of Sciences</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Science and Technology of China</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>95</fpage>
      <lpage>100</lpage>
      <abstract>
<p>The rapid advancement of spoofing algorithms necessitates the development of robust detection methods capable of accurately identifying emerging fake audio. Traditional approaches, such as finetuning on new datasets containing these novel spoofing algorithms, are computationally intensive and pose a risk of impairing the acquired knowledge of known fake audio types. To address these challenges, this paper proposes an innovative approach that mitigates the limitations associated with finetuning. We introduce the concept of training low-rank adaptation matrices tailored specifically to the newly emerging fake audio types. During the inference stage, these adaptation matrices are combined with the existing model to generate the final prediction output. Extensive experimentation is conducted to evaluate the efficacy of the proposed method. The results demonstrate that our approach effectively preserves the prediction accuracy of the existing model for known fake audio types. Furthermore, our approach offers several advantages, including reduced storage memory requirements and lower equal error rates compared to conventional finetuning methods, particularly on specific spoofing algorithms.</p>
      </abstract>
      <kwd-group>
<kwd>fake audio detection</kwd>
        <kwd>low-rank adaption</kwd>
        <kwd>finetuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>
        In recent years, there has been significant concern surrounding the issue of audio forgery attacks. Models for detecting fake audio, based on handcrafted features [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and large-scale pre-trained models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], have achieved promising performance on multiple competition datasets [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8">4, 5, 6, 7, 8</xref>
        ]. However, when faced with audio generated by spoofing algorithms that were not encountered during training, these models experience a significant decrease in their discrimination accuracy [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. This issue has become one of the crucial factors hindering the practical application of fake audio detection models. As new audio spoofing techniques continue to emerge, there is a need for a method to improve the discriminative ability of fake audio detection models against new spoofing attacks.
      </p>
      <p>
        The most intuitive way to improve the detection accuracy of the model against new spoofing algorithms is to finetune the model on a new dataset including those unseen types of fake audio. However, finetuning the model on the new dataset can disrupt the knowledge the model learned from the old dataset, leading to a decrease in its recognition accuracy for fake audio generated by known spoofing algorithms, which is known as catastrophic forgetting [
        <xref ref-type="bibr" rid="ref11 ref9">11, 9</xref>
        ]. In addition, if the model has a large number of parameters, finetuning will not only require long training times and high computational memory consumption, but also result in a large saved model that is difficult to use in scenarios with storage space limitations.
      </p>
      <p>
        To mitigate the detrimental impact of finetuning on acquired knowledge, we propose a novel training approach based on Low-Rank Adaption (LoRA) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Our method tackles the issue of poor model performance on unseen types of fake audio. The core of our approach lies in training two low-rank adaptive matrices rather than finetuning the whole model to improve the recognition accuracy on the unseen fake audio. During training on the new dataset that includes the unseen fake audio, we load the source model (SoM), i.e., the model saved after training on the old dataset, and freeze all its parameters. This allows us to focus solely on training the two adaptive matrices, namely A and B, introduced by the LoRA algorithm. When performing inference on the new dataset, we load the SoM together with the two adaptive matrices.
      </p>
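      <p>As an illustrative sketch (not the authors' implementation; the layer size and rank below are invented for illustration), the training step above amounts to freezing the source-model weight and updating only the two low-rank matrices:</p>
      <preformat>
```python
import numpy as np

# Hypothetical sizes for illustration: a frozen d-by-d weight from the
# source model (SoM) and two rank-r adaptation matrices.
d, r = 8, 2
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))   # frozen SoM weight (never updated)
A = np.zeros((d, r))              # low-rank matrix A, trainable
B = np.zeros((r, d))              # low-rank matrix B, trainable

def forward(x):
    # The output is the sum of the frozen path and the low-rank path,
    # mirroring h = Wx + (A B)x.
    return x @ W + x @ (A @ B)

x = rng.standard_normal((4, d))   # a batch of 4 inputs
h = forward(x)

# With A and B initialized to zero, the adapted model reproduces the
# source model exactly before any training step on the new dataset.
assert np.allclose(h, x @ W)

# Storage comparison: only A and B need to be saved per new dataset.
print(W.size, A.size + B.size)
```
      </preformat>
      <p>Because the adaptation starts at zero, training on the new dataset begins from exactly the source model's behavior, and only the small A and B matrices are ever written to disk.</p>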
      <p>[Figure 1: An overview of our method. (a) Training: the SoM is saved after training on Dataset A; for Datasets B and C, the frozen SoM is loaded and only the adaptive matrix pairs (A<sub>B</sub>, B<sub>B</sub>) and (A<sub>C</sub>, B<sub>C</sub>) are trained and saved. (b) Inference: Dataset A is predicted by the SoM alone, while Datasets B and C load the SoM together with their corresponding adaptive matrices.]</p>
      <p>Conversely, when dealing with the old dataset, we load only the SoM, effectively evading the risk of damage to the knowledge obtained from known instances of fake audio in the old dataset. Additionally, our approach boasts an advantage in terms of storage memory consumption, as it only necessitates storing the two low-rank adaptive matrices A and B for the new and unseen fake audio. Furthermore, experimental results demonstrate that our method achieves lower equal error rates (EER) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] on certain types of unseen fake audio compared to finetuning.</p>
      <p>Contribution: We propose a method based on Low-Rank Adaption to address the issue of low recognition accuracy when models encounter new and unknown types of fake audio in fake audio detection. Compared to the commonly used finetuning approach, our method requires less storage space and avoids forgetting the knowledge learned from existing known types of fake audio. Additionally, experimental results demonstrate that our method achieves higher recognition accuracy for certain unknown types of spoofing algorithms compared to finetuning.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Low-Rank Adaption (LoRA)[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a method proposed to
significantly reduce GPU and storage memory
consumption when finetuning large-scale pre-trained transformer
models for specific tasks. The core idea of LoRA is that
the learned over-parametrized models actually reside on
a low intrinsic dimension. Therefore, for each specific
downstream task, LoRA introduces two low-rank
matrices, A and B, to replace the entire model during training.
      </p>
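      <p>The reduction in trainable parameters can be made concrete with a simple count (the layer sizes and rank here are invented for illustration and are not taken from the cited work):</p>
      <preformat>
```python
# Rough illustration of why low-rank adaptation shrinks the trainable
# parameter count for a single dense layer.
def full_params(d_in, d_out):
    # Finetuning updates every entry of the weight matrix W.
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    # LoRA trains only A (d_in x rank) and B (rank x d_out).
    return d_in * rank + rank * d_out

d_in = d_out = 1024            # hypothetical layer size
print(full_params(d_in, d_out))                                  # 1048576
print(lora_params(d_in, d_out, 4))                               # 8192
print(full_params(d_in, d_out) // lora_params(d_in, d_out, 4))   # 128
```
      </preformat>
      <p>Even at this single-layer scale, a rank-4 adaptation trains two orders of magnitude fewer parameters than full finetuning; the savings compound across all adapted layers of a large model.</p>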
      <p>In each downstream task, LoRA first loads the large-scale pre-trained model but freezes all its parameters. Then, it initializes the matrices A and B as zero matrices and trains only these two matrices using the training dataset of the downstream task. During model inference, LoRA simultaneously loads the large-scale pre-trained model and the two matrices A and B trained specifically for that downstream task. Compared to existing methods that introduce adapter layers [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14, 15, 16</xref>
        ], LoRA exhibits lower inference latency [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This method also reduced the number of trainable parameters for training the large-scale language model GPT-3 [17] by a factor of 10,000 and decreased GPU memory requirements by a factor of 3.</p>
      <p>3. Methodology</p>
      <p>When facing unknown types of fake audio generated by unknown algorithms, the accuracy of deep neural networks significantly decreases compared to the known types included in the training set. Finetuning-based methods may damage the learned knowledge of the model, leading to a reduction in the detection accuracy of known types of fake audio. To address this problem, we propose a new method based on LoRA to improve the ability of the model to detect unknown types of fake audio. The training and inference processes of our method are illustrated in Fig. 1a and 1b, respectively.</p>
      <p>We consider a model designed for fake speech detection, where there exists an initial dataset A containing fake audio generated by known spoofing algorithms, and two additional datasets B and C, both of which contain new types of fake audio not present in dataset A. We first train a source model (SoM) on dataset A. As the SoM has not seen the new types of fake audio in datasets B and C, it can achieve good recognition performance on dataset A, but its performance decreases significantly on datasets B and C. To improve the detection accuracy on the new types of fake audio in dataset B, we load the saved SoM trained on dataset A, freeze all its parameters, and introduce two low-rank adaptive matrices A<sub>B</sub> and B<sub>B</sub> specifically designed for dataset B. Both A<sub>B</sub> and B<sub>B</sub> are initialized as all-zero matrices with rank r, which is much lower than the rank of the SoM. During the training process on dataset B, we simultaneously feed data into both the SoM and the adaptive matrices A<sub>B</sub> and B<sub>B</sub>. The outputs from both components are summed to generate the output h of the model, as shown in Equation 1:</p>
      <p>h = Wx + A<sub>B</sub>B<sub>B</sub>x (1)</p>
      <p>where x represents the input batch of data and h is the output state of the model. While the parameters of the SoM remain unchanged during the training on dataset B, we optimize the parameters of the two low-rank adaptive matrices A<sub>B</sub> and B<sub>B</sub> to learn features of the new types of fake audio and to optimize the detection performance of the model on these types. Once the training on dataset B is completed, we only need to save the two low-rank adaptive matrices A<sub>B</sub> and B<sub>B</sub> instead of the entire model. The same process is repeated on dataset C, allowing us to train two additional low-rank adaptive matrices, A<sub>C</sub> and B<sub>C</sub>, specifically tailored to the new types of fake audio in dataset C.</p>
      <p>Our method follows a similar inference process across different datasets, as shown in Fig. 1b. When the model predicts a fake audio type belonging to Dataset A, we only load the SoM. Since the parameters of the SoM are frozen and not involved in the training process on Datasets B and C, the detection performance on Dataset A is not disrupted by the features learned from the new fake audio in Datasets B and C. When the model predicts a fake audio type from Dataset B, both the SoM and the low-rank adaptive matrices A<sub>B</sub> and B<sub>B</sub> are loaded into the model, and the output follows Equation 1. Although the parameters of the SoM are not updated during training, the model can learn the features of the new fake audio type through the parameters of the two adaptive matrices. As a result, compared to using only the SoM, the detection accuracy of the model is significantly improved. Similarly, when the model faces fake audio types from Dataset C, we load the SoM and the adaptive matrices A<sub>C</sub> and B<sub>C</sub> trained specifically for Dataset C.</p>
      <p>Overall, our algorithm provides a low-cost incremental learning method for the model. As audio spoofing algorithms continue to evolve, we can view Dataset A as consisting of the known spoofing algorithms and set a time period t, after which we collect new fake audio generated by emerging spoofing algorithms to build Dataset B. We apply our method to incrementally learn the model on Dataset B. After another time period 2t, we repeat the process to construct Dataset C and perform incremental learning on it. Through this approach, we enable our model to achieve self-incremental learning within limited storage space, thus defending against emerging attacks from new spoofing algorithms.</p>
      <p>4. Experiment</p>
      <p>4.1. Datasets</p>
      <p>Three fake audio datasets are selected for our experiments: ASVspoof2019LA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], ASVspoof2015 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and In-the-Wild [18]. All of the experiments are trained on the training sets and evaluated on the evaluation sets of these datasets.</p>
      <p>ASVspoof2019LA is a dataset widely used in the field of fake audio detection. It was created as part of an international challenge that aimed to evaluate the performance of automatic speaker verification systems in detecting spoofing attacks. The dataset consists of a large collection of both genuine and spoofed speech recordings, where spoofed speech refers to artificially generated or manipulated audio designed to deceive speaker verification systems.</p>
      <p>ASVspoof2015 is another important dataset used in fake audio detection research. It contains both genuine and spoofed speech recordings, with various types of spoofing attacks, such as speech synthesis, voice conversion, and replay attacks. ASVspoof2015 offers a diverse range of spoofing techniques, making it a valuable resource for studying and developing robust countermeasures against fake audio.</p>
      <p>In-the-Wild is a commonly used collection of real-world audio recordings that encompass a broad range of environments and scenarios. Unlike the aforementioned datasets, which focus on specific spoofing attacks, In-the-Wild captures audio data from various sources and situations encountered in everyday life. This dataset aims to simulate the challenges faced by fake audio detection systems when dealing with uncontrolled and unpredictable acoustic conditions. We divide the genuine and fake audios of the In-the-Wild dataset into two subsets: one-third is used to build the training set, and the rest is used as the evaluation set.</p>
      <p>4.2. Experimental Setup</p>
      <p>In our experiments, the Linear Frequency Cepstral Coefficients (LFCC) [19] feature is selected as the input feature extracted from each audio. The classifier is the Squeeze-and-Excitation Network (SENet) [20] with three sub-layers, each of which includes three basic blocks introduced by the SENet. There is one conv2d layer before each sub-layer. The input and output dimensions of the first conv2d are 1 and 128, respectively; the second and the third conv2d have input and output dimensions of 128 and 256, and 256 and 512, respectively. Their kernel sizes are 9, 7,</p>
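      <p>The per-dataset inference dispatch of Fig. 1b can be sketched as follows (a toy illustration with invented sizes; for dataset A only the SoM is loaded, so its predictions cannot be disturbed by later training):</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 2

W = rng.standard_normal((d, d))  # frozen SoM weight, trained on dataset A
adapters = {                     # the only artifacts saved per new dataset
    "B": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "C": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

def predict(x, dataset):
    # Dataset A uses the SoM alone; B and C add their own low-rank
    # pair, following h = Wx + (A B)x.
    h = x @ W
    if dataset in adapters:
        A, B = adapters[dataset]
        h = h + x @ (A @ B)
    return h

x = rng.standard_normal((3, d))
# Predictions for dataset A depend only on the frozen SoM, so training
# the adapters for B and C leaves them untouched.
assert np.allclose(predict(x, "A"), x @ W)
```
      </preformat>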
      <p>Table 1: EER (%) of the model trained only on the ASVspoof2019LA training set. ASVspoof2019LA: 6.51; ASVspoof2015: 51.77; In-the-Wild: 51.71.</p>
      <p>4.3. Only trained on ASVspoof2019LA</p>
      <p>We first test the performance of the deep neural network against fake audio generated by known and unknown algorithms, respectively. We consider the datasets ASVspoof2019LA, ASVspoof2015, and In-the-Wild as datasets A, B, and C, respectively, as shown in Fig. 1. We train the model only on the training set of ASVspoof2019LA and evaluate it on the evaluation sets of the three datasets. The experimental results are shown in Table 1. The results indicate that the model has high accuracy when faced with known types of fake audio that appeared in the training set, but its performance degrades considerably when faced with fake audio generated by new and unknown spoofing algorithms.</p>
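      <p>The EER metric used throughout these comparisons can be computed from detection scores by a simple threshold sweep; the following is a minimal sketch (not the official ASVspoof scoring tool, and the score values are invented):</p>
      <preformat>
```python
import numpy as np

def eer(bona, spoof):
    # Sweep a decision threshold over all observed scores; the EER is
    # read off where false-acceptance and false-rejection rates meet.
    thresholds = np.sort(np.concatenate([bona, spoof]))
    candidates = []
    for t in thresholds:
        far = float(np.mean(np.greater_equal(spoof, t)))  # spoof accepted
        frr = float(np.mean(np.less(bona, t)))            # bona fide rejected
        candidates.append((abs(far - frr), (far + frr) / 2.0))
    # Pick the threshold where the two error rates are closest.
    return min(candidates)[1]

bona = np.array([0.9, 0.8, 0.7, 0.6])   # invented bona fide scores
spoof = np.array([0.4, 0.3, 0.2, 0.5])  # invented spoof scores
print(eer(bona, spoof))                 # perfectly separable scores give 0.0
```
      </preformat>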
      <p>Table 2: EER (%) comparison between the SoM, finetuning, and our method on two-dataset learning. ASVspoof2019LA: SoM 6.51, Finetuning 35.31, Our method 6.51; ASVspoof2015: SoM 51.77, Finetuning 45.13, Our method 2.38; In-the-Wild: SoM 51.71, Finetuning 1.39, Our method 1.25.</p>
      <p>4.4. The comparison on EER between finetuning and our method on learning between two datasets</p>
      <p>In this section, we compare the performance of our method and finetuning under a two-dataset learning condition. We set two experimental situations: in the first, the model is first trained on ASVspoof2019LA and then trained on ASVspoof2015; in the second, the model is first trained on ASVspoof2019LA and then trained on In-the-Wild. The results of these two experiments are shown in Tables 2a and 2b, respectively.</p>
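      <p>The difference between the finetuning row and the frozen-SoM route can be illustrated with a toy simulation (invented weights; a finetuning step is modeled here as an arbitrary perturbation of W, not a real gradient update):</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 5, 2

W = rng.standard_normal((d, d))      # SoM weight learned on dataset A
x_old = rng.standard_normal((3, d))  # inputs from the known dataset A
baseline = x_old @ W                 # SoM predictions on dataset A

# Finetuning updates W itself (modeled as a small perturbation), so
# predictions on the old data drift: catastrophic forgetting.
W_finetuned = W + 0.1 * rng.standard_normal((d, d))
drift = float(np.abs(x_old @ W_finetuned - baseline).max())
assert drift != 0.0

# Our method leaves W frozen and trains only a low-rank pair for the
# new dataset; old-dataset inference still loads the SoM alone.
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))

def predict_new(x):
    # Used only for the new dataset, following h = Wx + (A B)x.
    return x @ W + x @ (A @ B)

assert np.allclose(x_old @ W, baseline)  # old-data predictions unchanged
```
      </preformat>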
      <p>From the second column of the two tables, we can easily observe that training on the new dataset is indeed beneficial for the detection of new fake audio generated by new spoofing algorithms. However, from the comparison in the first column, we can observe that finetuning on the new dataset disrupts the learned knowledge of the known types of fake audio (6.51 → 49.03, 6.51 → 33.05). Compared to finetuning, our method freezes the parameters of the SoM and trains only two adaptive matrices to learn new knowledge from the new dataset. In this case, the performance on the known fake audio types remains unchanged even after training on the new dataset, as evaluated in our results shown in Table 2. From the comparison of the learning performance between our method and finetuning in the second column, we can also see that our method achieves higher accuracy in ASVspoof2019LA → ASVspoof2015, which shows that our method has a positive effect on learning specific unknown spoofing algorithms.</p>
      <p>4.5. The comparison on EER between finetuning and our method on learning among three datasets</p>
      <p>To evaluate the effectiveness of our method in multi-dataset learning, we also compare the performance of our method and finetuning under a three-dataset learning condition. In our experiment, we first trained our model on ASVspoof2019LA and saved the completed source model SoM. After that, we trained</p>
      <p>the SoM first on ASVspoof2015 and then on In-the-Wild, and saved the adaptive matrices A<sub>B</sub>, B<sub>B</sub> and A<sub>C</sub>, B<sub>C</sub>, respectively. The inference process is shown in Fig. 1b and the comparison result is illustrated in Table 4. From the comparison of the first two rows of the result, we can observe that finetuning on new datasets reduces the accuracy on old datasets, which shows that the detection accuracy on the known types of fake audio decreases considerably after finetuning on the unknown types of fake audio. However, our method still remains unchanged on the old datasets ASVspoof2019LA and ASVspoof2015 and achieves a lower EER than finetuning on the final dataset.</p>
      <p>4.6. The comparison on storage memory between finetuning and our method</p>
      <p>In order to improve the recognition performance of the model, we train it according to the process illustrated in Fig. 1. After training on the new datasets ASVspoof2015 and In-the-Wild, we compare the storage memory of the whole model saved by finetuning with that of the adaptive matrices saved by our method, as shown in Table 3. The experimental result shows that our method achieves a marked success in squeezing storage memory. Under the setting illustrated in Sec. 4.2, our method reduces the number of trainable parameters and the storage memory requirement by about 30 times, which allows the model to be easily applied in situations with strict memory constraints.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61831022, No. U21B2010, No. 62101553, No. 61971419, No. 62006223, No. 62276259, No. 62201572, No. 62206278), the Beijing Municipal Science and Technology Commission, the Administrative Commission of Zhongguancun Science Park (No. Z211100004821013), and Open Research Projects of Zhejiang Lab (No. 2021KH0AB06).</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this paper, we propose a method to address the
problem of low detection accuracy of models facing newly
emerging fake audio generated by new spoofing
algorithms. In the training process, we train two low-rank
adaptation matrices A and B specifically for these new
types of fake audio. During inference, we simultaneously
load the existing model and these adaptation matrices,
and combine their prediction outputs as our final
prediction output. The experimental results demonstrate that
our method does not degrade the prediction accuracy of
the existing model for known types of fake audio because
the existing model parameters are not modified during
training on the new dataset. Moreover, our method has a
lower storage memory requirement and lower equal
error rates on some specific spoofing algorithms compared
to finetuning. These findings encourage further
investigation into countering the ever-evolving landscape of
audio spoofing while maintaining the learned knowledge
of known types of fake audio.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W. D.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>A new feature for automatic speaker verification antispoofing: Constant Q cepstral coefficients</article-title>
          , in: L. J.
          <string-name>
            <surname>Rodríguez-Fuentes</surname>
          </string-name>
          , E. Lleida (Eds.),
          <year>Odyssey 2016</year>
          :
          <article-title>The Speaker</article-title>
          and Language Recognition Workshop, Bilbao, Spain, June 21-24,
          <year>2016</year>
          , ISCA,
          <year>2016</year>
          , pp.
          <fpage>283</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Assessing the scope of generalized countermeasures for anti-spoofing</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          ICASSP
          <year>2020</year>
          , Barcelona, Spain, May 4-
          <issue>8</issue>
          ,
          <year>2020</year>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>6589</fpage>
          -
          <lpage>6593</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <article-title>Investigating selfsupervised front ends for speech spoofing countermeasures</article-title>
          , in: T. F. Zheng (Ed.),
          <source>Odyssey 2022: The Speaker and Language Recognition Workshop</source>
          , 28 June - 1
          <source>July</source>
          <year>2022</year>
          , Beijing, China,
          ISCA
          ,
          <year>2022</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W. D.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hanilçi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sizov</surname>
          </string-name>
          ,
          <year>Asvspoof 2015</year>
          :
          <article-title>the first automatic speaker verification spoofing and countermeasures challenge</article-title>
          ,
          <source>in: INTERSPEECH</source>
          <year>2015</year>
          ,
          <article-title>16th Annual Conference of the International Speech Communication Association</article-title>
          , Dresden, Germany, September 6-
          <issue>10</issue>
          ,
          <year>2015</year>
          , ISCA,
          <year>2015</year>
          , pp.
          <fpage>2037</fpage>
          -
          <lpage>2041</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W. D.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection</article-title>
          , in: F. Lacerda (Ed.),
          <source>Interspeech</source>
          <year>2017</year>
          , 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden,
          <source>August 20-24</source>
          ,
          <year>2017</year>
          , ISCA,
          <year>2017</year>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Todisco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vestman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahidullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nautsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W. D.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kinnunen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <year>Asvspoof 2019</year>
          :
          <article-title>Future horizons in spoofed and fake audio detection</article-title>
          , in: G. Kubin,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          Kacic (Eds.),
          <source>Interspeech</source>
          <year>2019</year>
          , 20th Annual Conference of the International Speech Communication Association, Graz, Austria,
          <fpage>15</fpage>
          -
          <lpage>19</lpage>
          September
          <year>2019</year>
          , ISCA,
          <year>2019</year>
          , pp.
          <fpage>1008</fpage>
          -
          <lpage>1012</lpage>
          . nual Conference on Neural Information Processing
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>J.</given-names> <surname>Yamagishi</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Todisco</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sahidullah</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Patino</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Nautsch</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>K. A.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Kinnunen</surname></string-name>,
          <string-name><given-names>N. W. D.</given-names> <surname>Evans</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Delgado</surname></string-name>,
          <article-title>ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection</article-title>,
          <source>CoRR abs/2109.00537</source> (<year>2021</year>). arXiv:2109.00537.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>J.</given-names> <surname>Yi</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tao</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Nie</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Tian</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bai</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Fan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Liang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Yan</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wen</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Li</surname></string-name>,
          <article-title>ADD 2022: the first audio deep synthesis detection challenge</article-title>,
          in:
          <source>IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022</source>,
          IEEE, <year>2022</year>, pp. <fpage>9216</fpage>-<lpage>9220</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>H.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yi</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Bai</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Tian</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Continual learning for fake audio detection</article-title>,
          in: H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, P. Motlícek (Eds.),
          <source>Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021</source>,
          ISCA, <year>2021</year>, pp. <fpage>886</fpage>-<lpage>890</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Duan</surname></string-name>,
          <article-title>An empirical study on channel effects for synthetic voice spoofing countermeasure systems</article-title>,
          in: H. Hermansky, H. Cernocký, L. Burget, L. Lamel, O. Scharenborg, P. Motlícek (Eds.),
          <source>Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021</source>,
          ISCA, <year>2021</year>, pp. <fpage>4309</fpage>-<lpage>4313</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>J.</given-names> <surname>Kirkpatrick</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Pascanu</surname></string-name>,
          <string-name><given-names>N. C.</given-names> <surname>Rabinowitz</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Veness</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Desjardins</surname></string-name>,
          <string-name><given-names>A. A.</given-names> <surname>Rusu</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Milan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Quan</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ramalho</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Grabska-Barwinska</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Hassabis</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Clopath</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Kumaran</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hadsell</surname></string-name>,
          <article-title>Overcoming catastrophic forgetting in neural networks</article-title>,
          <source>CoRR abs/1612.00796</source> (<year>2016</year>). arXiv:1612.00796.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>E. J.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wallis</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Allen-Zhu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Chen</surname></string-name>,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>,
          in:
          <source>The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022</source>,
          OpenReview.net, <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>N.</given-names> <surname>Houlsby</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Giurgiu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jastrzebski</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Morrone</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>de Laroussilhe</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Gesmundo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Attariyan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gelly</surname></string-name>,
          <article-title>Parameter-efficient transfer learning for NLP</article-title>,
          in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA</source>,
          volume <volume>97</volume> of Proceedings of Machine Learning Research, PMLR, <year>2019</year>, pp. <fpage>2790</fpage>-<lpage>2799</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>S.</given-names> <surname>Rebuffi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Bilen</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Vedaldi</surname></string-name>,
          <article-title>Learning multiple visual domains with residual adapters</article-title>,
          in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA</source>,
          <year>2017</year>, pp. <fpage>506</fpage>-<lpage>516</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>J.</given-names> <surname>Pfeiffer</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kamath</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rücklé</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Cho</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <article-title>AdapterFusion: Non-destructive task composition for transfer learning</article-title>,
          in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.),
          <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021</source>,
          Association for Computational Linguistics, <year>2021</year>, pp. <fpage>487</fpage>-<lpage>503</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>A.</given-names> <surname>Rücklé</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Geigle</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Glockner</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Beck</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Pfeiffer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Reimers</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <article-title>AdapterDrop: On the efficiency of adapters in transformers</article-title>,
          in: M. Moens, X. Huang, L. Specia, S. W. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November 2021</source>,
          Association for Computational Linguistics, <year>2021</year>, pp. <fpage>7930</fpage>-<lpage>7946</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>T. B.</given-names> <surname>Brown</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>D. M.</given-names> <surname>Ziegler</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>,
          <article-title>Language models are few-shot learners</article-title>,
          in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.),
          <source>Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>N. M.</given-names> <surname>Müller</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Czempin</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Dieckmann</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Froghyar</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Böttinger</surname></string-name>,
          <article-title>Does audio deepfake detection generalize?</article-title>,
          in: H. Ko, J. H. L. Hansen (Eds.),
          <source>Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022</source>,
          ISCA, <year>2022</year>, pp. <fpage>2783</fpage>-<lpage>2787</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>Y.</given-names> <surname>Lei</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Scheffer</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Ferrer</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>McLaren</surname></string-name>,
          <article-title>A novel scheme for speaker recognition using a phonetically-aware deep neural network</article-title>,
          in:
          <source>IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4-9, 2014</source>,
          IEEE, <year>2014</year>, pp. <fpage>1695</fpage>-<lpage>1699</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>J.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sun</surname></string-name>,
          <article-title>Squeeze-and-excitation networks</article-title>,
          in:
          <source>2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018</source>,
          Computer Vision Foundation / IEEE Computer Society, <year>2018</year>, pp. <fpage>7132</fpage>-<lpage>7141</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>