<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Adaptability of Large Speech Models to Non-Verbal Vocalization Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juan José Márquez Villacís</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico D'Asaro</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Rizzo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Bottino</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>LINKS Foundation - AI</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Space (ADS)</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large Speech Models (LSMs), pre-trained on extensive speech corpora, have recently emerged as powerful foundations in the audio processing field, demonstrating strong transfer capabilities to downstream tasks such as speaker identification and emotion recognition. However, while these models excel on speech-centric tasks, limited research has investigated their adaptability to Non-Verbal Vocalization (NVV) tasks, which involve vocal bursts like laughter, sighs, shrieks, and moans. In this work, we examine how well LSMs, specifically Wav2Vec 2.0, HuBERT, WavLM, and Whisper, can be adapted to NVV tasks. We conduct experiments using both linear probing to evaluate the pre-trained knowledge relevant to NVVs, and Parameter-Eficient Fine-Tuning (PEFT) techniques, including LoRA, Adapters, and Prompt Tuning. Experimental results on NVV datasets-ASVP-ESD, CNVVE, Non-Verbal Vocalization Dataset, ReCANVo, VIVAE-indicate that Whisper-based models consistently achieve superior performance, which is further enhanced through the application of LoRA. Additionally, our layer-wise analysis reveals that applying PEFT specifically to layers with lower NVV information is key to efective model adaptation, providing valuable insights for optimizing fine-tuning strategies in future work. The repository associated with this work can be found here: https://github.com/links-ads/kk-nonverbal-vocal-class</p>
      </abstract>
      <kwd-group>
        <kwd>Non-Verbal Vocalization Large Speech Models Parameter Eficient Fine-Tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        and emotion recognition from prosody [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the domain
of Non-Verbal Vocalizations (NVV) has received
comUnderstanding and correctly identifying emotional cues paratively little attention [
        <xref ref-type="bibr" rid="ref1 ref7">7, 1</xref>
        ]. Early approaches for
in human vocalizations is essential for building conver- NVV analysis often relied on Hidden Markov Models or
sational systems capable of engaging with people in an Convolutional Neural Networks. However, the advent
emotionally aware and natural manner [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Emotional of Transformer architectures [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] has led to the
developinformation in the human voice is transmitted mainly ment of Large Speech Models (LSMs), including Wav2Vec
through two distinct pathways: speech prosody—which 2.0 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], HuBERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], WavLM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and Whisper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
encompasses features such as intonation, rhythm, and vo- which have demonstrated impressive transfer learning
cal quality [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]—and non-verbal vocal sounds, commonly capabilities on speech-based tasks. Despite this success,
referred to as vocal bursts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which include expressions the adaptability of these models to NVV tasks remains
like laughter, sighs, screams, and moans. Importantly, largely unexplored.
these non-speech sounds serve as critical communicative In this work, we systematically investigate how
varitools, particularly for individuals with profound disabili- ous LSMs perform as feature extractors for NVV
recognities or speech limitations, since more than 96% of people tion, aiming to understand the extent to which non-verbal
with speech impairments are still able to produce non- knowledge is already embedded in their pre-trained
repverbal vocalizations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. resentations. To further enhance their adaptation to NVV
      </p>
      <p>
        While much research has focused on speech-related tasks, we apply Parameter-Eficient Fine-Tuning (PEFT)
tasks such as speaker recognition, speaker diarization, strategies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], including Adapters [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Prompt Tuning
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and LoRA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
tCicLsi,CS-eitpt2e0m25b:eErl2e4ve—nt2h6I,t2a0li2a5n, CCaognlfiearrein,cIteaolyn Computational Linguis- Our experimental results, conducted across five
* Corresponding author. NVV datasets—ASVP-ESD, CNVVE, Non-Verbal
Vocaliza† These authors contributed equally. tion Dataset, ReCANVo, VIVAE—indicate that Whisper
$ juan.marquez@linksfoundation.com (J. J. Márquez Villacís); consistently outperforms Wav2Vec 2.0, HuBERT, and
federico.dasaro@polito.it (F. D’Asaro); WavLM, especially when fine-tuned with PEFT
techgiuseppe.rizzo@linksfoundation.com (G. Rizzo); niques. Among these, LoRA achieves the best overall
andhrtetap.:b//octotinncoe@ptpboalsieto.s.oitu(rAce.fBoorgttei.nnoe)t/mjf/ (A. Bottino) performance. Further analysis of the Transformer layers
0009-0008-3098-5492 (J. J. Márquez Villacís); reveals that non-verbal information is primarily captured
0009-0003-8727-3393 (F. D’Asaro); 0000-0003-0083-813X (G. Rizzo); in the later layers of Whisper. Interestingly, we find that
0000-0002-8894-5089 (A. Bottino) applying LoRA exclusively to earlier, less important
lay© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
ers yields better adaptation compared to focusing on the
layers already rich in non-verbal knowledge. This
counterintuitive result suggests that adjusting the layers with
initially limited task relevance is crucial, as these layers
benefit most from targeted adaptation.
      </p>
      <p>The main contributions of this work are:</p>
      <p>
        Recent advancements in natural language processing
(NLP) and computer vision (CV) have leveraged vast
amounts of unlabeled data using Self-Supervised
Learning [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ] and Weakly Supervised Learning [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
Inspired by techniques such as masked language modeling
• We evaluate the adaptability of Large Speech Mod- in NLP and image modeling in CV, Wav2Vec 2.0 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
inels (Wav2Vec 2.0, HuBERT, WavLM, and Whisper) troduced a Large Speech Model (LSM) trained through
to Non-Verbal Vocalization tasks using both lin- masked speech modeling on large-scale audio datasets,
ear probing and Parameter-Eficient Fine-Tuning including the LibriSpeech corpus [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and LibriVox [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
techniques on five NVV datasets. Following Wav2Vec 2.0, subsequent LSMs such as
• We demonstrate that Whisper achieves the HuBERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and WavLM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] further advanced
selfstrongest performance across all datasets, and supervised pretraining approaches. In parallel, Whisper
that LoRA is the most efective PEFT method [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] was introduced, trained with large-scale weak
suwhen compared to Adapters and Prompt Tuning. pervision from paired audio and transcription data using
• Through layer-wise importance analysis, we ob- an encoder-decoder transformer architecture.
serve that non-verbal information is predomi- These large speech models have demonstrated strong
nantly encoded in the later layers of Whisper. capabilities in learning rich and robust speech
representaSurprisingly, we find that adapting less impor- tions from large datasets, leading to significant
improvetant layers is more beneficial for task-specific ments in various tasks, including language modeling,
performance than focusing solely on the most audio classification, and speech-to-text transcription.
informative layers.
      </p>
      <sec id="sec-1-1">
        <title>2.3. Parameter Eficient Finetuning</title>
        <p>
          2. Related Work Large-scale models demonstrate strong adaptability
across a wide range of downstream tasks, but this
of2.1. Non Verbal Vocalization ten comes at a significant computational cost. To address
this, Parameter-Eficient Fine-Tuning (PEFT) techniques
Early approaches to recognizing Non-Verbal Vocaliza- have emerged, aiming to introduce minimal task-specific
tions (NVVs) primarily relied on Hidden Markov Models parameters while keeping the majority of the pretrained
(HMMs), which analyzed vocal signals based on acoustic model unchanged. This approach preserves the model’s
features such as intensity, pitch, and vowel articulation generalization ability and reduces the number of
parampatterns [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]. Despite their initial success, these mod- eters that require modification.
els were limited by their dependence on linear modeling, As outlined by Han et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], PEFT methods can be
susceptibility to noise interference, and challenges in broadly categorized into two types: Additive PEFT and
handling large or complex datasets. Reparameterized PEFT. Additive PEFT methods include
        </p>
        <p>
          To address these limitations, subsequent research tran- techniques such as Adapters [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and Prompt Tuning
sitioned towards employing convolutional neural net- [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], which introduce additional learnable components
works (CNNs) that process time-frequency representa- at either the activation level or through prompt-based
tions like Mel spectrograms and Mel-Frequency Cep- conditioning without altering the core model
paramestral Coeficients (MFCCs) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Recent progress has been ters. Reparameterized PEFT approaches, such as LoRA
driven by the adoption of Transformer-based frameworks [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], apply low-rank adaptations to the weight matrices,
capable of learning from massive audio datasets. Draw- efectively transforming the model’s parameter space
ing inspiration from large-scale speech models such as while maintaining the original architecture and
inferWav2Vec 2.0 and Whisper, these state-of-the-art systems ence speed.
have enabled the classification of up to 67 distinct types These parameter-eficient strategies have shown
of vocal expressions [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. strong results in English Speech Emotion Recognition
        </p>
        <p>
          Following this research direction, Koudounas et al. tasks [
          <xref ref-type="bibr" rid="ref25 ref26 ref27">25, 26, 27</xref>
          ], with LoRA in particular demonstrating
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposed a new foundation model trained on 125 notable performance. In this work, we investigate the
hours of non-verbal vocalization data, demonstrating application of Adapters, Prompt Tuning, and LoRA for
significantly improved performance on downstream clas- adapting Large Speech Models to the classification of
sification tasks. Non-Verbal Vocalizations.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Non-Verbal Vocalization Classifier</title>
      <sec id="sec-2-1">
        <title>HuBERT HuBERT [10] introduced the use of an acous</title>
        <p>
          tic unit discovery system, such as k-means clustering
applied to MFCC features, to generate frame-level
tarIn this section, we describe the architecture of the Non- gets for both masked and unmasked tokens. By adjusting
Verbal Vocalization classifier illustrated in Figure 1. The the number of clusters (), the system produces targets of
model is composed of a Large Speech Model serving as varying granularity, ranging from broad vowel categories
the backbone ℬ, with a classifier  stacked on top. Addi- to more fine-grained senones. Similar to Wav2Vec 2.0,
tionally, we describe the integration of PEFT techniques, the HuBERT architecture employs a 1D convolutional
which can be selectively applied to the Transformer lay- feature encoder with seven layers, using a frame size of
ers of the LSM to enhance adaptability while minimizing 20 ms, followed by a series of Transformer blocks for
the number of trainable parameters. contextual representation learning.
3.1. Large Speech Models WavLM The WavLM framework [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] further extends
the pretraining approach introduced by Wav2Vec 2.0 by
Wav2Vec 2.0 Wav2Vec 2.0 demonstrated, for the first integrating both masked speech prediction and speech
detime, that it is possible to learn powerful speech represen- noising into the pretraining process. Specifically, WavLM
tations directly from raw audio without requiring labels. introduces masked speech denoising, where portions of
The architecture consists of a multi-layer 1D convolu- the input are artificially corrupted with simulated noise
tional feature encoder, which takes raw audio input  or overlapping speech. The model is then tasked with
and produces latent representations  = {1, . . . ,  }, predicting the pseudo-labels of the original clean speech
where  denotes the number of frames, each correspond- in the masked regions, similar to the approach used in
ing to 25 ms of audio. These latent representations  are HuBERT. This strategy enhances the model’s robustness
then passed through a Transformer network to obtain in complex acoustic environments.
contextualized representations  = {1, . . . ,  }. Addi- Like previous models, WavLM employs a 1D
convolutionally, the output of the feature encoder is discretized tional feature encoder followed by a Transformer encoder.
using product quantization in the latent space [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. This The Transformer in WavLM is augmented with gated
discretization enables the application of masked speech relative position bias, which improves the modeling of
modeling, the core innovation of Wav2Vec 2.0’s self- interactions between speech segments and enhances the
supervised learning strategy. The model is trained to model’s ability to capture long-range dependencies.
solve a contrastive task, where it must correctly identify
the true quantized latent representation of a masked time
step from a set of distractor candidates.
where  is a scaling coeficient that balances the
adaptation impact. At initialization, up is set to zero and
down is randomly initialized, ensuring that the model
dicts raw text transcripts directly from audio without initially behaves as the pretrained base without
modifirequiring significant text standardization. Whisper
emcation. This strategy allows LoRA to inject task-specific
ploys an encoder-decoder Transformer architecture,
conknowledge while preserving the original model’s
struc
        </p>
        <p>In this work, we utilize the Whisper model solely as a  = 1, . . . ,  indexes the layers and  = 1, . . . ,  indexes
sisting of an encoder  and a decoder , which processes</p>
      </sec>
      <sec id="sec-2-2">
        <title>Mel spectrograms instead of raw waveforms as used in</title>
        <p>earlier models. Formally, given an input audio signal
, the model first applies two 1D convolutional layers
with GELU activation as a feature encoder, followed by</p>
      </sec>
      <sec id="sec-2-3">
        <title>Transformer blocks to produce contextualized internal</title>
        <p>representations. These representations are then used by
the BERT-like decoder  to generate the output text.
and discarding the decoder .
feature extractor by using the encoder  as backbone ℬ</p>
        <sec id="sec-2-3-1">
          <title>3.2. PEFT Methods</title>
          <p>Adapter</p>
          <p>Adapters introduce small, trainable mod- as:
ules within Transformer layers to enable eficient
finesize and  is the bottleneck dimension.
tuning. Each adapter consists of a down-projection
matrix down ∈ R× , a non-linear activation  (· ), and an
up-projection matrix up ∈ R× , where  is the hidden</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Given input ℎin, the adapter output with residual con</title>
        <p>nection is:</p>
        <p>Adapter() = up  (down ) + 
ture and maintaining fast inference.</p>
        <sec id="sec-2-4-1">
          <title>3.3. Classifier Head</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>To perform non-verbal event classification, we append</title>
        <p>a classifier</p>
        <p>to the backbone ℬ of the Large Speech</p>
      </sec>
      <sec id="sec-2-6">
        <title>Model. From the Transformer encoder, we obtain hidden</title>
        <p>representations across all layers denoted by {ℎ}, where
the sequence frames.
a unified sequence</p>
        <p>We aggregate these multi-layer representations into
weighted sum across layers. This aggregation is
formalized by the function  : R×  × 
→ R ×</p>
        <p>, defined

{ℎ* }=1 by applying a learnable

=1
ℎ* = ∑︁  · ℎ,

∀ ∈ {1, . . . ,  }
(4)
(1)
(2)
are normalized such that ∑︀
where each weight  satisfies  ≥ 0 and the weights
=1
 = 1.</p>
        <p>
          The resulting sequence {ℎ* } is first projected using a
frame-wise linear transformation ℒ1 : R
lowing standard practices in speech emotion recognition
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], we apply temporal aggregation via average pooling

→
        </p>
        <p>R.
Fol over the  frames to produce a single vector
summarizing the input audio. This pooled representation is then
fed into a classification layer
outputs the logits corresponding to the target classes.
 : R
→</p>
        <p>R, which
The overall classifier  can be concisely expressed as:
︁(
 {ℎ* }=1
 ︁)
︁(
︁(</p>
        <p>︁(
=   ℒ1 {ℎ* }=1
 ︁)
(5)
Prompt Tuning</p>
        <p>Unlike adapters, embedding prompts
introduce learnable prompt vectors that are prepended to
the input sequence at each Transformer layer. Formally,
the input sequence to layer  is:
() = 1
[︁ (), . . . , ()
, 
(), . . . , 
1
() ]︁
where 
() are the continuous prompt tokens and 
() are
the original input tokens. Here,  denotes the number
of continuous prompt tokens, and  is the length of
the original input. This approach allows task-specific
information to be injected directly into the model without
modifying its internal weights.
smaller than min(, ).</p>
        <p>LoRA</p>
        <p>LoRA enhances each Transformer layer by
applying a low-rank decomposition to the pretrained
weight matrix 0 ∈ R× , enabling parameter-eficient
ifne-tuning without altering the original model weights.
It adds two additional trainable matrices: down ∈ R× 
and up ∈ R× , where  is the rank, typically much</p>
        <p>Given an input ℎin, the original output 0ℎin is
updated with a task-specific adjustment:


ℎout = 0ℎin +
updownℎin
(3)</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Datasets</title>
        <p>ASVP-ESD</p>
        <sec id="sec-3-1-1">
          <title>The ASVP-ESD (Audio, Speech and Vision</title>
          <p>
            Processing Lab Emotional Sound Database) [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ]
comprises 12,625 emotion-related audio samples, including
both speech and non-speech vocalizations. These
samples were collected from movies, YouTube channels, and
various other online sources. Each recording is
annotated with one of 12 emotion categories, plus an
additional "breath" label. All audio files are mono-channel
and sampled at 16 kHz.
ReCANVo The Real-World Communicative and
Afective Nonverbal Vocalizations (ReCANVo) dataset [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]
contains over 7,000 vocalizations produced by minimally
speaking individuals aged between 6 and 25 years. Each
vocalization is annotated with one of six communicative
or afective labels.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Experimental Details</title>
        <sec id="sec-3-2-1">
          <title>CNVVE The Dataset and Benchmark for Classifying</title>
          <p>
            Non-verbal Voice Expressions (CNVVE) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] consists of
950 audio recordings from 42 participants. Each
recording is labeled with one of six non-verbal voice expression
categories. The audio samples are mono-channel and
sampled at 16 kHz.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>All experiments were conducted using a consistent setup</title>
          <p>across datasets. Each dataset was split into training,
validation, and test sets, with 80% of the audio samples used
for training, 10% for validation, and the remaining 10%
for testing.</p>
          <p>The Large Speech Models evaluated in this study
inNon-verbal Vocalization Dataset The Non-verbal clude: Whisper Tiny2, Whisper Base3, Whisper Small4,
Vocalization Dataset1 includes crowdsourced audio HuBERT Base5, WavLM Base Plus6, and Wav2Vec2 Base7.
recordings of non-verbal vocalizations categorized into Training was performed for 50 epochs with the
follow16 distinct labels. All recordings are sampled at 16 kHz, ing hyperparameters: an initial learning rate of 1e− 4,
with 16-bit resolution and mono-channel format. weight decay of 0.01,  1 = 0.9,  2 = 0.999, and
 = 1e− 8 for the Adam optimizer. A batch size of 16
was used along with a gradient accumulation step of 2.</p>
          <p>All experiments were executed on a single NVIDIA
A100 GPU.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Results</title>
        <p>
          VIVAE The Variably Intense Vocalizations of Afect
and Emotion (VIVAE) dataset [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] comprises 1,085 audio
recordings from 11 speakers. The recordings are sampled
at 42 kHz with 16-bit resolution and are annotated with
six emotion labels. These labels capture both positive and
negative afective states, as well as emotional intensity.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2. Metrics</title>
        <sec id="sec-3-4-1">
          <title>For the experimental evaluation, we report both Accuracy</title>
          <p>and Macro F1 score. Since the datasets are imbalanced,
the macro F1 score ofers a more reliable assessment of
the model’s performance across all classes.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>1https://www.openslr.org/99/</title>
          <p>4.4.1. Linear Probing on Large Speech Models</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>To compare the Large Speech Models introduced in Sec</title>
          <p>tion 3.1, we adopt a linear probing setup where the
backbone ℬ is kept frozen, and only the classiefir  is trained.
In this configuration, each model—Wav2Vec 2.0, HuBERT,
WavLM, and Whisper—is used purely as a feature
extractor for the Non-Verbal Vocalization task. This approach
allows us to evaluate the extent to which task-relevant
representations are already captured in the pre-trained
models.</p>
          <p>Table 1 reports the performance of each model across
all datasets, using Accuracy and Macro F1 as evaluation
metrics. Results indicate that Wav2Vec 2.0, HuBERT, and</p>
        </sec>
        <sec id="sec-3-4-4">
          <title>2https://huggingface.co/openai/whisper-tiny</title>
          <p>3https://huggingface.co/openai/whisper-base
4https://huggingface.co/openai/whisper-small
5https://huggingface.co/facebook/hubert-base-ls960
6https://huggingface.co/microsoft/wavlm-base-plus
7https://huggingface.co/facebook/wav2vec2-base</p>
        </sec>
        <sec id="sec-3-4-5">
          <title>For evaluating Parameter-Eficient Fine-Tuning (PEFT)</title>
          <p>techniques, we focus on Whisper models, which
demonstrated the strongest performance in the previous section.</p>
          <p>Table 2 presents the results across diferent fine-tuning
strategies applied to Whisper: Frozen Backbone, LoRA,
Adapters, and Prompt Tuning.</p>
          <p>
            Consistent with prior findings in audio classification 4.4.4. Optimizing PEFT via Layer Importance
tasks [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ], LoRA emerges as the most efective PEFT
method across various datasets and model sizes. How- In Section 4.4.2, we applied PEFT techniques uniformly
ever, an exception is observed in the Non-Verbal Vocal- across all Whisper layers, without considering their
relaization dataset, where Adapters achieve superior perfor- tive importance. However, as observed in the previous
mance for both the Whisper Base and Small models. section, diferent layers contribute unevenly to the
Non
          </p>
          <p>LoRA’s strength lies in its ability to eficiently intro- Verbal Vocalization task. Therefore, in this subsection, we
duce minimal task-specific parameters while selectively investigate whether the efectiveness of PEFT depends
modeling the non-verbal specific update ∆  , allowing it on layer importance, and if focusing on specific layers
to efectively integrate pre-trained knowledge with new can further reduce adaptation parameters.
task-specific information. Table 3 presents diferent strategies for applying LoRA
to Whisper models, as LoRA showed the best
perfor4.4.3. Analysis of Transformer Layers mance in most cases. For each model, LoRA refers to
applying the technique to all Transformer layers, LoRA[-]
This subsection examines the contribution of each Trans- applies LoRA only to the less important layers, and
former encoder layer within the Whisper backbone to LoRA[+] applies it exclusively to the important layers,
the Non-Verbal Vocalization task. We concentrate on the as determined in Section 4.4.3.</p>
          <p>CNVVE</p>
          <p>Nonverbal</p>
          <p>ReCanVo</p>
          <p>ViVAE
Acc</p>
          <p>Overall, we find that full LoRA adaptation typically tasks using both linear probing and Parameter-Eficient
yields the best results, followed by LoRA[-]. This sug- Fine-Tuning (PEFT) techniques. Our experimental results
gests that adapting the less important layers has a greater demonstrate that Whisper models consistently
outperpositive impact than focusing solely on the important form Wav2Vec 2.0, HuBERT, and WavLM across multiple
layers, for which performance is often significantly lower. NVV datasets.</p>
          <p>Although this may seem counterintuitive, we hypothe- Furthermore, we observe that applying PEFT methods
size that adaptation is more necessary where the network significantly improves performance, with LoRA
emergretains less prior knowledge relevant to the task. Impor- ing as the most efective strategy compared to Adapters
tant layers already encode useful features, thus requiring and Prompt Tuning. Through a detailed analysis of the
less adjustment, while ignoring the less important layers Transformer layer weights in Whisper models, we find
limits the model’s adaptability. that non-verbal information is predominantly captured</p>
          <p>Hence, we propose that focusing on the less impor- in the later layers.
tant layers is more beneficial than concentrating exclu- Interestingly, we discover that fine-tuning only these
sively on the important ones. This insight ofers valuable later layers yields limited gains compared to adapting
guidance for future work aimed at improving PEFT tech- the layers that initially contain less non-verbal
knowlniques by targeting the parts of the network that need edge. We hypothesize that this is because the layers with
the most adaptation. less task-relevant information require a larger degree of
adaptation to bridge the knowledge gap. This
observation suggests a valuable pathway for optimizing PEFT
5. Conclusion methods by selectively targeting particular transformer
layers based on the knowledge they embed, potentially
minimizing the need for additional task-specific
parame</p>
        </sec>
        <sec id="sec-3-4-6">
          <title>In this work, we investigated the adaptability of Large Speech Models (LSMs) to Non-Verbal Vocalization (NVV) ters even further.</title>
          <p>Declaration on Generative AI</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Tzirakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gagne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Opara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gregory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Metrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boseck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tiruvadi</surname>
          </string-name>
          , et al.,
          <article-title>Large-scale nonverbal vocalization detection using transformers</article-title>
          ,
          <source>in: ICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Foundation model assisted automatic speech emotion recognition: Transcribing, annotating, and augmenting</article-title>
          ,
          <source>in: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>12116</fpage>
          -
          <lpage>12120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Liebenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Silbersweig</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. Stern,</surname>
          </string-name>
          <article-title>The language, tone and prosody of emotions: neural substrates and dynamics of spoken-word emotion perception</article-title>
          ,
          <source>Frontiers in neuroscience 10</source>
          (
          <year>2016</year>
          )
          <fpage>506</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cowen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sauter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Tracy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keltner</surname>
          </string-name>
          ,
          <article-title>Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression</article-title>
          ,
          <source>Psychological Science in the Public Interest</source>
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <fpage>69</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McLeod</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. McAllister</surname>
          </string-name>
          ,
          <article-title>The impact of speech impairment in early childhood: Investigating parents' and speech-language pathologists' perspectives using the icf-cy</article-title>
          ,
          <source>Journal of communication disorders 43</source>
          (
          <year>2010</year>
          )
          <fpage>378</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wöllmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>
          ,
          <source>in: Proceedings of the 18th ACM international conference on Multimedia</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1459</fpage>
          -
          <lpage>1462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hedeshy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Menges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <article-title>Cnvve: Dataset and benchmark for classifying non-verbal voice (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          , wav2vec
          <volume>2</volume>
          .
          <article-title>0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.-N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bolte</surname>
          </string-name>
          , Y.
          <string-name>
            <surname>-H. H. Tsai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Lakhotia</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mohamed</surname>
          </string-name>
          , Hubert:
          <article-title>Self-supervised speech representation learning by masked prediction of hidden units</article-title>
          ,
          <source>IEEE/ACM transactions on audio, speech, and language processing 29</source>
          (
          <year>2021</year>
          )
          <fpage>3451</fpage>
          -
          <lpage>3460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , et al.,
          <article-title>Wavlm: Large-scale self-supervised pre-training for full stack speech processing</article-title>
          ,
          <source>IEEE Journal of Selected Topics in Signal Processing</source>
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <fpage>1505</fpage>
          -
          <lpage>1518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLeavey</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Robust speech recognition via large-scale weak supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>28492</fpage>
          -
          <lpage>28518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>S. Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Parametereficient fine-tuning for large models: A comprehensive survey</article-title>
          ,
          <source>arXiv preprint arXiv:2403.14608</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giurgiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jastrzebski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Morrone</surname>
          </string-name>
          ,
          <string-name>
            <surname>Q. De Laroussilhe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Gesmundo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Attariyan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gelly</surname>
          </string-name>
          ,
          <article-title>Parameter-eficient transfer learning for nlp</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2790</fpage>
          -
          <lpage>2799</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <article-title>The power of scale for parameter-eficient prompt tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08691</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Lora:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2106.09685</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kilanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kirchhof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Subramanya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Landay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dowden</surname>
          </string-name>
          , et al.,
          <article-title>The vocal joystick: A voice-based human-computer interface for individuals with motor impairments</article-title>
          ,
          <source>in: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>995</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hawley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Enderby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brownsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carmichael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatzis</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. O'Neill</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Palmer</surname>
          </string-name>
          ,
          <article-title>A speech-controlled environmental control system for people with severe dysarthria</article-title>
          ,
          <source>Medical Engineering &amp; Physics</source>
          <volume>29</volume>
          (
          <year>2007</year>
          )
          <fpage>586</fpage>
          -
          <lpage>593</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. Baralis,</surname>
          </string-name>
          <article-title>voc2vec: A foundation model for nonverbal vocalization</article-title>
          ,
          <source>in: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:
          <year>2010</year>
          .
          <volume>11929</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          , in: International conference on machine learning,
          <source>PmLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          ,
          <article-title>Librispeech: an asr corpus based on public domain audio books</article-title>
          ,
          <source>in: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>5206</fpage>
          -
          <lpage>5210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riviere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharitonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-E.</given-names>
            <surname>Mazaré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karadayi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liptchinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fuegen</surname>
          </string-name>
          , et al.,
          <article-title>Libri-light: A benchmark for asr with limited or no supervision</article-title>
          ,
          <source>in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>7669</fpage>
          -
          <lpage>7673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pepino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Riera</surname>
          </string-name>
          , L. Ferrer,
          <article-title>Emotion recognition from speech using wav2vec 2.0 embeddings</article-title>
          , arXiv preprint arXiv:
          <volume>2104</volume>
          .03502 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Peft-ser: On the use of parameter eficient transfer learning approaches for speech emotion recognition using pre-trained speech models</article-title>
          ,
          <source>in: 2023 11th International Conference on Afective Computing and Intelligent Interaction (ACII)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hebbar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Trust-ser: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>11201</fpage>
          -
          <lpage>11205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Product quantization for nearest neighbor search</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>33</volume>
          (
          <year>2010</year>
          )
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Landry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Asvp-esd: A dataset and its benchmark for emotion recognition using both speech and non-speech utterances</article-title>
          ,
          <source>Global Scientific Journals</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>1793</fpage>
          -
          <lpage>1798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>K. T. Johnson</surname>
            , J. Narain,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Quatieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Maes</surname>
          </string-name>
          , R. W. Picard,
          <article-title>Recanvo: A database of real-world communicative and afective nonverbal vocalizations</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>523</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>N.</given-names>
            <surname>Holz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Larrouy-Maestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Poeppel</surname>
          </string-name>
          ,
          <article-title>The variably intense vocalizations of afect and emotion (vivae) corpus prompts new perspective on nonspeech perception</article-title>
          .,
          <source>Emotion</source>
          <volume>22</volume>
          (
          <year>2022</year>
          )
          <fpage>213</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>F. D'Asaro</surname>
            ,
            <given-names>J. J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Villacís</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bottino</surname>
          </string-name>
          ,
          <article-title>Using large speech models for feature extraction in cross-lingual speech emotion recognition</article-title>
          ,
          <source>in: Titolo volume non avvalorato</source>
          , Accademia University Press,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>