1. Introduction

Exploring the Adaptability of Large Speech Models to Non-Verbal Vocalization Task

Juan José Márquez Villacís

Federico D'Asaro

Giuseppe Rizzo

Andrea Bottino

LINKS Foundation - AI

Space (ADS)

2025

Large Speech Models (LSMs), pre-trained on extensive speech corpora, have recently emerged as powerful foundations in the audio processing field, demonstrating strong transfer capabilities to downstream tasks such as speaker identification and emotion recognition. However, while these models excel on speech-centric tasks, limited research has investigated their adaptability to Non-Verbal Vocalization (NVV) tasks, which involve vocal bursts like laughter, sighs, shrieks, and moans. In this work, we examine how well LSMs, specifically Wav2Vec 2.0, HuBERT, WavLM, and Whisper, can be adapted to NVV tasks. We conduct experiments using both linear probing to evaluate the pre-trained knowledge relevant to NVVs, and Parameter-Eficient Fine-Tuning (PEFT) techniques, including LoRA, Adapters, and Prompt Tuning. Experimental results on NVV datasets-ASVP-ESD, CNVVE, Non-Verbal Vocalization Dataset, ReCANVo, VIVAE-indicate that Whisper-based models consistently achieve superior performance, which is further enhanced through the application of LoRA. Additionally, our layer-wise analysis reveals that applying PEFT specifically to layers with lower NVV information is key to efective model adaptation, providing valuable insights for optimizing fine-tuning strategies in future work. The repository associated with this work can be found here: https://github.com/links-ads/kk-nonverbal-vocal-class

Non-Verbal Vocalization Large Speech Models Parameter Eficient Fine-Tuning

1. Introduction

and emotion recognition from prosody [ 6 ], the domain of Non-Verbal Vocalizations (NVV) has received comUnderstanding and correctly identifying emotional cues paratively little attention [ 7, 1 ]. Early approaches for in human vocalizations is essential for building conver- NVV analysis often relied on Hidden Markov Models or sational systems capable of engaging with people in an Convolutional Neural Networks. However, the advent emotionally aware and natural manner [ 1, 2 ]. Emotional of Transformer architectures [ 8 ] has led to the developinformation in the human voice is transmitted mainly ment of Large Speech Models (LSMs), including Wav2Vec through two distinct pathways: speech prosody—which 2.0 [ 9 ], HuBERT [ 10 ], WavLM [ 11 ], and Whisper [ 12 ], encompasses features such as intonation, rhythm, and vo- which have demonstrated impressive transfer learning cal quality [ 3 ]—and non-verbal vocal sounds, commonly capabilities on speech-based tasks. Despite this success, referred to as vocal bursts [ 4 ], which include expressions the adaptability of these models to NVV tasks remains like laughter, sighs, screams, and moans. Importantly, largely unexplored. these non-speech sounds serve as critical communicative In this work, we systematically investigate how varitools, particularly for individuals with profound disabili- ous LSMs perform as feature extractors for NVV recognities or speech limitations, since more than 96% of people tion, aiming to understand the extent to which non-verbal with speech impairments are still able to produce non- knowledge is already embedded in their pre-trained repverbal vocalizations [ 5 ]. resentations. To further enhance their adaptation to NVV

While much research has focused on speech-related tasks, we apply Parameter-Eficient Fine-Tuning (PEFT) tasks such as speaker recognition, speaker diarization, strategies [ 13 ], including Adapters [ 14 ], Prompt Tuning [ 15 ], and LoRA [ 16 ]. tCicLsi,CS-eitpt2e0m25b:eErl2e4ve—nt2h6I,t2a0li2a5n, CCaognlfiearrein,cIteaolyn Computational Linguis- Our experimental results, conducted across five * Corresponding author. NVV datasets—ASVP-ESD, CNVVE, Non-Verbal Vocaliza† These authors contributed equally. tion Dataset, ReCANVo, VIVAE—indicate that Whisper $ juan.marquez@linksfoundation.com (J. J. Márquez Villacís); consistently outperforms Wav2Vec 2.0, HuBERT, and federico.dasaro@polito.it (F. D’Asaro); WavLM, especially when fine-tuned with PEFT techgiuseppe.rizzo@linksfoundation.com (G. Rizzo); niques. Among these, LoRA achieves the best overall andhrtetap.:b//octotinncoe@ptpboalsieto.s.oitu(rAce.fBoorgttei.nnoe)t/mjf/ (A. Bottino) performance. Further analysis of the Transformer layers 0009-0008-3098-5492 (J. J. Márquez Villacís); reveals that non-verbal information is primarily captured 0009-0003-8727-3393 (F. D’Asaro); 0000-0003-0083-813X (G. Rizzo); in the later layers of Whisper. Interestingly, we find that 0000-0002-8894-5089 (A. Bottino) applying LoRA exclusively to earlier, less important lay© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). ers yields better adaptation compared to focusing on the layers already rich in non-verbal knowledge. This counterintuitive result suggests that adjusting the layers with initially limited task relevance is crucial, as these layers benefit most from targeted adaptation.

The main contributions of this work are:

Recent advancements in natural language processing (NLP) and computer vision (CV) have leveraged vast amounts of unlabeled data using Self-Supervised Learning [ 20, 21 ] and Weakly Supervised Learning [ 22 ]. Inspired by techniques such as masked language modeling • We evaluate the adaptability of Large Speech Mod- in NLP and image modeling in CV, Wav2Vec 2.0 [ 9 ] inels (Wav2Vec 2.0, HuBERT, WavLM, and Whisper) troduced a Large Speech Model (LSM) trained through to Non-Verbal Vocalization tasks using both lin- masked speech modeling on large-scale audio datasets, ear probing and Parameter-Eficient Fine-Tuning including the LibriSpeech corpus [ 23 ] and LibriVox [ 24 ]. techniques on five NVV datasets. Following Wav2Vec 2.0, subsequent LSMs such as • We demonstrate that Whisper achieves the HuBERT [ 10 ] and WavLM [ 11 ] further advanced selfstrongest performance across all datasets, and supervised pretraining approaches. In parallel, Whisper that LoRA is the most efective PEFT method [ 12 ] was introduced, trained with large-scale weak suwhen compared to Adapters and Prompt Tuning. pervision from paired audio and transcription data using • Through layer-wise importance analysis, we ob- an encoder-decoder transformer architecture. serve that non-verbal information is predomi- These large speech models have demonstrated strong nantly encoded in the later layers of Whisper. capabilities in learning rich and robust speech representaSurprisingly, we find that adapting less impor- tions from large datasets, leading to significant improvetant layers is more beneficial for task-specific ments in various tasks, including language modeling, performance than focusing solely on the most audio classification, and speech-to-text transcription. informative layers.

2.3. Parameter Eficient Finetuning

2. Related Work Large-scale models demonstrate strong adaptability across a wide range of downstream tasks, but this of2.1. Non Verbal Vocalization ten comes at a significant computational cost. To address this, Parameter-Eficient Fine-Tuning (PEFT) techniques Early approaches to recognizing Non-Verbal Vocaliza- have emerged, aiming to introduce minimal task-specific tions (NVVs) primarily relied on Hidden Markov Models parameters while keeping the majority of the pretrained (HMMs), which analyzed vocal signals based on acoustic model unchanged. This approach preserves the model’s features such as intensity, pitch, and vowel articulation generalization ability and reduces the number of parampatterns [ 17, 18 ]. Despite their initial success, these mod- eters that require modification. els were limited by their dependence on linear modeling, As outlined by Han et al. [ 13 ], PEFT methods can be susceptibility to noise interference, and challenges in broadly categorized into two types: Additive PEFT and handling large or complex datasets. Reparameterized PEFT. Additive PEFT methods include

To address these limitations, subsequent research tran- techniques such as Adapters [ 14 ] and Prompt Tuning sitioned towards employing convolutional neural net- [ 15 ], which introduce additional learnable components works (CNNs) that process time-frequency representa- at either the activation level or through prompt-based tions like Mel spectrograms and Mel-Frequency Cep- conditioning without altering the core model paramestral Coeficients (MFCCs) [ 7 ]. Recent progress has been ters. Reparameterized PEFT approaches, such as LoRA driven by the adoption of Transformer-based frameworks [ 16 ], apply low-rank adaptations to the weight matrices, capable of learning from massive audio datasets. Draw- efectively transforming the model’s parameter space ing inspiration from large-scale speech models such as while maintaining the original architecture and inferWav2Vec 2.0 and Whisper, these state-of-the-art systems ence speed. have enabled the classification of up to 67 distinct types These parameter-eficient strategies have shown of vocal expressions [ 1 ]. strong results in English Speech Emotion Recognition

Following this research direction, Koudounas et al. tasks [ 25, 26, 27 ], with LoRA in particular demonstrating [ 19 ] proposed a new foundation model trained on 125 notable performance. In this work, we investigate the hours of non-verbal vocalization data, demonstrating application of Adapters, Prompt Tuning, and LoRA for significantly improved performance on downstream clas- adapting Large Speech Models to the classification of sification tasks. Non-Verbal Vocalizations.

3. Non-Verbal Vocalization Classifier HuBERT HuBERT [10] introduced the use of an acous

tic unit discovery system, such as k-means clustering applied to MFCC features, to generate frame-level tarIn this section, we describe the architecture of the Non- gets for both masked and unmasked tokens. By adjusting Verbal Vocalization classifier illustrated in Figure 1. The the number of clusters (), the system produces targets of model is composed of a Large Speech Model serving as varying granularity, ranging from broad vowel categories the backbone ℬ, with a classifier stacked on top. Addi- to more fine-grained senones. Similar to Wav2Vec 2.0, tionally, we describe the integration of PEFT techniques, the HuBERT architecture employs a 1D convolutional which can be selectively applied to the Transformer lay- feature encoder with seven layers, using a frame size of ers of the LSM to enhance adaptability while minimizing 20 ms, followed by a series of Transformer blocks for the number of trainable parameters. contextual representation learning. 3.1. Large Speech Models WavLM The WavLM framework [ 11 ] further extends the pretraining approach introduced by Wav2Vec 2.0 by Wav2Vec 2.0 Wav2Vec 2.0 demonstrated, for the first integrating both masked speech prediction and speech detime, that it is possible to learn powerful speech represen- noising into the pretraining process. Specifically, WavLM tations directly from raw audio without requiring labels. introduces masked speech denoising, where portions of The architecture consists of a multi-layer 1D convolu- the input are artificially corrupted with simulated noise tional feature encoder, which takes raw audio input or overlapping speech. The model is then tasked with and produces latent representations = {1, . . . , }, predicting the pseudo-labels of the original clean speech where denotes the number of frames, each correspond- in the masked regions, similar to the approach used in ing to 25 ms of audio. These latent representations are HuBERT. This strategy enhances the model’s robustness then passed through a Transformer network to obtain in complex acoustic environments. contextualized representations = {1, . . . , }. Addi- Like previous models, WavLM employs a 1D convolutionally, the output of the feature encoder is discretized tional feature encoder followed by a Transformer encoder. using product quantization in the latent space [ 28 ]. This The Transformer in WavLM is augmented with gated discretization enables the application of masked speech relative position bias, which improves the modeling of modeling, the core innovation of Wav2Vec 2.0’s self- interactions between speech segments and enhances the supervised learning strategy. The model is trained to model’s ability to capture long-range dependencies. solve a contrastive task, where it must correctly identify the true quantized latent representation of a masked time step from a set of distractor candidates. where is a scaling coeficient that balances the adaptation impact. At initialization, up is set to zero and down is randomly initialized, ensuring that the model dicts raw text transcripts directly from audio without initially behaves as the pretrained base without modifirequiring significant text standardization. Whisper emcation. This strategy allows LoRA to inject task-specific ploys an encoder-decoder Transformer architecture, conknowledge while preserving the original model’s struc

In this work, we utilize the Whisper model solely as a = 1, . . . , indexes the layers and = 1, . . . , indexes sisting of an encoder and a decoder , which processes

Mel spectrograms instead of raw waveforms as used in

earlier models. Formally, given an input audio signal , the model first applies two 1D convolutional layers with GELU activation as a feature encoder, followed by

Transformer blocks to produce contextualized internal

representations. These representations are then used by the BERT-like decoder to generate the output text. and discarding the decoder . feature extractor by using the encoder as backbone ℬ

3.2. PEFT Methods

Adapter

Adapters introduce small, trainable mod- as: ules within Transformer layers to enable eficient finesize and is the bottleneck dimension. tuning. Each adapter consists of a down-projection matrix down ∈ R× , a non-linear activation (· ), and an up-projection matrix up ∈ R× , where is the hidden

Given input ℎin, the adapter output with residual con

nection is:

Adapter() = up (down ) + ture and maintaining fast inference.

3.3. Classifier Head To perform non-verbal event classification, we append

a classifier

to the backbone ℬ of the Large Speech

Model. From the Transformer encoder, we obtain hidden

representations across all layers denoted by {ℎ}, where the sequence frames. a unified sequence

We aggregate these multi-layer representations into weighted sum across layers. This aggregation is formalized by the function : R× × → R ×

, defined {ℎ* }=1 by applying a learnable =1 ℎ* = ∑︁ · ℎ, ∀ ∈ {1, . . . , } (4) (1) (2) are normalized such that ∑︀ where each weight satisfies ≥ 0 and the weights =1 = 1.

The resulting sequence {ℎ* } is first projected using a frame-wise linear transformation ℒ1 : R lowing standard practices in speech emotion recognition [ 26 ], we apply temporal aggregation via average pooling →

R. Fol over the frames to produce a single vector summarizing the input audio. This pooled representation is then fed into a classification layer outputs the logits corresponding to the target classes. : R →

R, which The overall classifier can be concisely expressed as: ︁( {ℎ* }=1 ︁) ︁( ︁(

︁( = ℒ1 {ℎ* }=1 ︁) (5) Prompt Tuning

Unlike adapters, embedding prompts introduce learnable prompt vectors that are prepended to the input sequence at each Transformer layer. Formally, the input sequence to layer is: () = 1 [︁ (), . . . , () , (), . . . , 1 () ]︁ where () are the continuous prompt tokens and () are the original input tokens. Here, denotes the number of continuous prompt tokens, and is the length of the original input. This approach allows task-specific information to be injected directly into the model without modifying its internal weights. smaller than min(, ).

LoRA

LoRA enhances each Transformer layer by applying a low-rank decomposition to the pretrained weight matrix 0 ∈ R× , enabling parameter-eficient ifne-tuning without altering the original model weights. It adds two additional trainable matrices: down ∈ R× and up ∈ R× , where is the rank, typically much

Given an input ℎin, the original output 0ℎin is updated with a task-specific adjustment: ℎout = 0ℎin + updownℎin (3)

4. Experiments 4.1. Datasets

ASVP-ESD

The ASVP-ESD (Audio, Speech and Vision

Processing Lab Emotional Sound Database) [ 29 ] comprises 12,625 emotion-related audio samples, including both speech and non-speech vocalizations. These samples were collected from movies, YouTube channels, and various other online sources. Each recording is annotated with one of 12 emotion categories, plus an additional "breath" label. All audio files are mono-channel and sampled at 16 kHz. ReCANVo The Real-World Communicative and Afective Nonverbal Vocalizations (ReCANVo) dataset [ 30 ] contains over 7,000 vocalizations produced by minimally speaking individuals aged between 6 and 25 years. Each vocalization is annotated with one of six communicative or afective labels.

4.3. Experimental Details CNVVE The Dataset and Benchmark for Classifying

Non-verbal Voice Expressions (CNVVE) [ 7 ] consists of 950 audio recordings from 42 participants. Each recording is labeled with one of six non-verbal voice expression categories. The audio samples are mono-channel and sampled at 16 kHz.

All experiments were conducted using a consistent setup

across datasets. Each dataset was split into training, validation, and test sets, with 80% of the audio samples used for training, 10% for validation, and the remaining 10% for testing.

The Large Speech Models evaluated in this study inNon-verbal Vocalization Dataset The Non-verbal clude: Whisper Tiny2, Whisper Base3, Whisper Small4, Vocalization Dataset1 includes crowdsourced audio HuBERT Base5, WavLM Base Plus6, and Wav2Vec2 Base7. recordings of non-verbal vocalizations categorized into Training was performed for 50 epochs with the follow16 distinct labels. All recordings are sampled at 16 kHz, ing hyperparameters: an initial learning rate of 1e− 4, with 16-bit resolution and mono-channel format. weight decay of 0.01, 1 = 0.9, 2 = 0.999, and = 1e− 8 for the Adam optimizer. A batch size of 16 was used along with a gradient accumulation step of 2.

All experiments were executed on a single NVIDIA A100 GPU.

4.4. Results

VIVAE The Variably Intense Vocalizations of Afect and Emotion (VIVAE) dataset [ 31 ] comprises 1,085 audio recordings from 11 speakers. The recordings are sampled at 42 kHz with 16-bit resolution and are annotated with six emotion labels. These labels capture both positive and negative afective states, as well as emotional intensity.

4.2. Metrics For the experimental evaluation, we report both Accuracy

and Macro F1 score. Since the datasets are imbalanced, the macro F1 score ofers a more reliable assessment of the model’s performance across all classes.

1https://www.openslr.org/99/

4.4.1. Linear Probing on Large Speech Models

To compare the Large Speech Models introduced in Sec

tion 3.1, we adopt a linear probing setup where the backbone ℬ is kept frozen, and only the classiefir is trained. In this configuration, each model—Wav2Vec 2.0, HuBERT, WavLM, and Whisper—is used purely as a feature extractor for the Non-Verbal Vocalization task. This approach allows us to evaluate the extent to which task-relevant representations are already captured in the pre-trained models.

Table 1 reports the performance of each model across all datasets, using Accuracy and Macro F1 as evaluation metrics. Results indicate that Wav2Vec 2.0, HuBERT, and

2https://huggingface.co/openai/whisper-tiny

3https://huggingface.co/openai/whisper-base 4https://huggingface.co/openai/whisper-small 5https://huggingface.co/facebook/hubert-base-ls960 6https://huggingface.co/microsoft/wavlm-base-plus 7https://huggingface.co/facebook/wav2vec2-base

For evaluating Parameter-Eficient Fine-Tuning (PEFT)

techniques, we focus on Whisper models, which demonstrated the strongest performance in the previous section.

Table 2 presents the results across diferent fine-tuning strategies applied to Whisper: Frozen Backbone, LoRA, Adapters, and Prompt Tuning.

Consistent with prior findings in audio classification 4.4.4. Optimizing PEFT via Layer Importance tasks [ 26 ], LoRA emerges as the most efective PEFT method across various datasets and model sizes. How- In Section 4.4.2, we applied PEFT techniques uniformly ever, an exception is observed in the Non-Verbal Vocal- across all Whisper layers, without considering their relaization dataset, where Adapters achieve superior perfor- tive importance. However, as observed in the previous mance for both the Whisper Base and Small models. section, diferent layers contribute unevenly to the Non

LoRA’s strength lies in its ability to eficiently intro- Verbal Vocalization task. Therefore, in this subsection, we duce minimal task-specific parameters while selectively investigate whether the efectiveness of PEFT depends modeling the non-verbal specific update ∆ , allowing it on layer importance, and if focusing on specific layers to efectively integrate pre-trained knowledge with new can further reduce adaptation parameters. task-specific information. Table 3 presents diferent strategies for applying LoRA to Whisper models, as LoRA showed the best perfor4.4.3. Analysis of Transformer Layers mance in most cases. For each model, LoRA refers to applying the technique to all Transformer layers, LoRA[-] This subsection examines the contribution of each Trans- applies LoRA only to the less important layers, and former encoder layer within the Whisper backbone to LoRA[+] applies it exclusively to the important layers, the Non-Verbal Vocalization task. We concentrate on the as determined in Section 4.4.3.

CNVVE

Nonverbal

ReCanVo

ViVAE Acc

Overall, we find that full LoRA adaptation typically tasks using both linear probing and Parameter-Eficient yields the best results, followed by LoRA[-]. This sug- Fine-Tuning (PEFT) techniques. Our experimental results gests that adapting the less important layers has a greater demonstrate that Whisper models consistently outperpositive impact than focusing solely on the important form Wav2Vec 2.0, HuBERT, and WavLM across multiple layers, for which performance is often significantly lower. NVV datasets.

Although this may seem counterintuitive, we hypothe- Furthermore, we observe that applying PEFT methods size that adaptation is more necessary where the network significantly improves performance, with LoRA emergretains less prior knowledge relevant to the task. Impor- ing as the most efective strategy compared to Adapters tant layers already encode useful features, thus requiring and Prompt Tuning. Through a detailed analysis of the less adjustment, while ignoring the less important layers Transformer layer weights in Whisper models, we find limits the model’s adaptability. that non-verbal information is predominantly captured

Hence, we propose that focusing on the less impor- in the later layers. tant layers is more beneficial than concentrating exclu- Interestingly, we discover that fine-tuning only these sively on the important ones. This insight ofers valuable later layers yields limited gains compared to adapting guidance for future work aimed at improving PEFT tech- the layers that initially contain less non-verbal knowlniques by targeting the parts of the network that need edge. We hypothesize that this is because the layers with the most adaptation. less task-relevant information require a larger degree of adaptation to bridge the knowledge gap. This observation suggests a valuable pathway for optimizing PEFT 5. Conclusion methods by selectively targeting particular transformer layers based on the knowledge they embed, potentially minimizing the need for additional task-specific parame

In this work, we investigated the adaptability of Large Speech Models (LSMs) to Non-Verbal Vocalization (NVV) ters even further.

Declaration on Generative AI

[1]

Tzirakis ,

Baird ,

Brooks ,

Gagne ,

Kim ,

Opara ,

Gregory ,

Metrick ,

Boseck ,

Tiruvadi , et al., Large-scale nonverbal vocalization detection using transformers , in: ICASSP 2023- 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , IEEE, 2023 , pp. 1 - 5 .

[2]

Feng ,

Narayanan , Foundation model assisted automatic speech emotion recognition: Transcribing, annotating, and augmenting , in: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , IEEE, 2024 , pp. 12116 - 12120 .

[3]

Liebenthal ,

D. A.

Silbersweig , E. Stern, The language, tone and prosody of emotions: neural substrates and dynamics of spoken-word emotion perception , Frontiers in neuroscience 10 ( 2016 ) 506 .

[4]

Cowen ,

Sauter ,

J. L.

Tracy ,

Keltner , Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression , Psychological Science in the Public Interest 20 ( 2019 ) 69 - 90 .

[5]

McCormack ,

McLeod ,

L. J.

Harrison , L. McAllister , The impact of speech impairment in early childhood: Investigating parents' and speech-language pathologists' perspectives using the icf-cy , Journal of communication disorders 43 ( 2010 ) 378 - 396 .

[6]

Eyben ,

Wöllmer ,

Schuller , Opensmile: the munich versatile and fast open-source audio feature extractor , in: Proceedings of the 18th ACM international conference on Multimedia , 2010 , pp. 1459 - 1462 .

[7]

Hedeshy ,

Menges ,

Staab , Cnvve: Dataset and benchmark for classifying non-verbal voice ( 2023 ).

[8]

Vaswani ,

Shazeer ,

Parmar ,

Uszkoreit ,

Jones ,

A. N.

Gomez , Ł. Kaiser, I. Polosukhin , Attention is all you need , Advances in neural information processing systems 30 ( 2017 ).

[9]

Baevski ,

Zhou ,

Mohamed ,

Auli , wav2vec 2 . 0: A framework for self-supervised learning of speech representations , Advances in neural information processing systems 33 ( 2020 ) 12449 - 12460 .

[10]

W.-N.

Hsu ,

Bolte , Y. -H. H. Tsai , K.

Lakhotia , R.

Salakhutdinov , A.

Mohamed , Hubert: Self-supervised speech representation learning by masked prediction of hidden units , IEEE/ACM transactions on audio, speech, and language processing 29 ( 2021 ) 3451 - 3460 .

[11]

Chen ,

Wang ,

Chen ,

Wu ,

Liu ,

Chen ,

Li ,

Kanda ,

Yoshioka ,

Xiao , et al., Wavlm: Large-scale self-supervised pre-training for full stack speech processing , IEEE Journal of Selected Topics in Signal Processing 16 ( 2022 ) 1505 - 1518 .

[12]

Radford ,

J. W.

Kim , T. Xu,

Brockman ,

McLeavey , I. Sutskever , Robust speech recognition via large-scale weak supervision , in: International Conference on Machine Learning, PMLR , 2023 , pp. 28492 - 28518 .

[13]

Han ,

Gao , J. Liu,

S. Q.

Zhang , et al., Parametereficient fine-tuning for large models: A comprehensive survey , arXiv preprint arXiv:2403.14608 ( 2024 ).

[14]

Houlsby ,

Giurgiu ,

Jastrzebski ,

Morrone , Q. De Laroussilhe , A.

Gesmundo , M.

Attariyan , S.

Gelly , Parameter-eficient transfer learning for nlp , in: International conference on machine learning, PMLR , 2019 , pp. 2790 - 2799 .

[15]

Lester ,

Al-Rfou ,

Constant , The power of scale for parameter-eficient prompt tuning , arXiv preprint arXiv:2104.08691 ( 2021 ).

[16]

E. J.

Hu ,

Shen ,

Wallis ,

Allen-Zhu ,

Li ,

Wang ,

Chen , Lora: Low-rank adaptation of large language models , arXiv preprint arXiv:2106.09685 ( 2021 ).

[17]

Bilmes ,

Li ,

Malkin ,

Kilanski ,

Wright ,

Kirchhof ,

Subramanya ,

Harada ,

Landay ,

Dowden , et al., The vocal joystick: A voice-based human-computer interface for individuals with motor impairments , in: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing , 2005 , pp. 995 - 1002 .

[18]

M. S.

Hawley ,

Enderby ,

Green ,

Cunningham ,

Brownsell ,

Carmichael ,

Parker ,

Hatzis , P. O'Neill , R. Palmer , A speech-controlled environmental control system for people with severe dysarthria , Medical Engineering & Physics 29 ( 2007 ) 586 - 593 .

[19]

Koudounas ,

M. La

Quatra ,

S. M.

Siniscalchi , E. Baralis, voc2vec: A foundation model for nonverbal vocalization , in: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , IEEE, 2025 , pp. 1 - 5 .

[20]

Devlin , M.-

Chang ,

Lee ,

Toutanova , Bert: Pre-training of deep bidirectional transformers for language understanding , arXiv preprint arXiv: 1810 . 04805 ( 2018 ).

[21]

Dosovitskiy ,

Beyer ,

Kolesnikov ,

Weissenborn ,

Zhai ,

Unterthiner ,

Dehghani ,

Minderer , G. Heigold,

Gelly , et al., An image is worth 16x16 words: Transformers for image recognition at scale , arXiv preprint arXiv: 2010 . 11929 ( 2020 ).

[22]

Radford ,

J. W.

Kim ,

Hallacy ,

Ramesh , G. Goh,

Agarwal ,

Sastry ,

Askell ,

Mishkin ,

Clark , et al., Learning transferable visual models from natural language supervision , in: International conference on machine learning, PmLR , 2021 , pp. 8748 - 8763 .

[23]

Panayotov ,

Chen ,

Povey ,

Khudanpur , Librispeech: an asr corpus based on public domain audio books , in: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) , IEEE, 2015 , pp. 5206 - 5210 .

[24]

Kahn ,

Riviere ,

Zheng ,

Kharitonov ,

Xu ,

P.-E.

Mazaré ,

Karadayi ,

Liptchinsky ,

Collobert ,

Fuegen , et al., Libri-light: A benchmark for asr with limited or no supervision , in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , IEEE, 2020 , pp. 7669 - 7673 .

[25]

Pepino ,

Riera , L. Ferrer, Emotion recognition from speech using wav2vec 2.0 embeddings , arXiv preprint arXiv: 2104 .03502 ( 2021 ).

[26]

Feng ,

Narayanan , Peft-ser: On the use of parameter eficient transfer learning approaches for speech emotion recognition using pre-trained speech models , in: 2023 11th International Conference on Afective Computing and Intelligent Interaction (ACII) , IEEE, 2023 , pp. 1 - 8 .

[27]

Feng ,

Hebbar ,

Narayanan , Trust-ser: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition , in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , IEEE, 2024 , pp. 11201 - 11205 .

[28]

Jegou ,

Douze ,

Schmid , Product quantization for nearest neighbor search , IEEE transactions on pattern analysis and machine intelligence 33 ( 2010 ) 117 - 128 .

[29]

Landry ,

He ,

Yan ,

Li , Asvp-esd: A dataset and its benchmark for emotion recognition using both speech and non-speech utterances , Global Scientific Journals 8 ( 2020 ) 1793 - 1798 .

[30] K. T. Johnson , J. Narain, T.

Quatieri , P.

Maes , R. W. Picard, Recanvo: A database of real-world communicative and afective nonverbal vocalizations , Scientific Data 10 ( 2023 ) 523 .

[31]

Holz ,

Larrouy-Maestri ,

Poeppel , The variably intense vocalizations of afect and emotion (vivae) corpus prompts new perspective on nonspeech perception ., Emotion 22 ( 2022 ) 213 .

[32] F. D'Asaro , J. J. M.

Villacís , G.

Rizzo , A.

Bottino , Using large speech models for feature extraction in cross-lingual speech emotion recognition , in: Titolo volume non avvalorato , Accademia University Press, 2024 .