<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico D'Asaro</string-name>
          <email>federico.dasaro@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan José Márquez Villacís</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Rizzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Bottino</string-name>
          <email>andrea.bottino@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LINKS Foundation - AI, Data and Space (ADS)</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We stack three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, and German, and performing competitively on Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-lingual Speech Emotion Recognition</kwd>
        <kwd>Large Speech models</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Speech Emotion Recognition (SER) identifies emotions from speech audio, enhancing Human-AI interaction in fields such as healthcare, education, and security [<xref ref-type="bibr" rid="ref1">1</xref>]. Traditional methods rely on Low-Level Descriptor features [<xref ref-type="bibr" rid="ref2">2</xref>], using classifiers such as KNN, SVM, or Naïve Bayes [<xref ref-type="bibr" rid="ref3">3</xref>]. Deep learning has introduced advanced techniques, including Convolutional Neural Networks (CNNs) [<xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>], eventually followed by Recurrent Neural Networks (RNNs) [<xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>], and Transformers [9, 10, 11]. Transformers' ability to learn from extensive datasets has led to Large Speech Models (LSMs), which generalize across various speech tasks. Common training approaches for these models include Self-Supervised Learning (SSL), which uses the data itself to learn general-purpose features [12], and Weakly-Supervised Learning (WSL), which pairs audio with text for tasks like transcription and translation [13]. The general-purpose knowledge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English, while work on other languages remains limited, focusing on Wav2Vec 2.0 [18] for cross-lingual SER.</p>
      <p>This study examines how effective LSMs are as feature extractors for SER across eight languages: Italian, German, French, Canadian French, Spanish, Greek, Persian, and Egyptian Arabic. Specifically, we utilize LSMs from the Wav2Vec 2.0 and Whisper [13] model families, pre-trained with SSL and WSL approaches, respectively. We introduce Whisper due to its underexplored use in cross-lingual SER. To assess the effectiveness of LSMs as feature extractors, we test three classifiers of increasing complexity (Linear, Non-Linear, and Multi-Layer) across nine datasets. This evaluation determines which classifier best suits each LSM across different languages. Moreover, our study includes both English-Only and Multilingual models from the Wav2Vec 2.0 and Whisper families, aiming to evaluate the effectiveness of multilingual pre-training for cross-lingual SER.</p>
      <p>The main contributions of this work are:</p>
      <list list-type="bullet">
        <list-item>
          <p>We evaluate LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for cross-lingual SER across eight languages.</p>
        </list-item>
        <list-item>
          <p>We test three types of downstream classifiers (Linear, Non-Linear, and Multi-Layer) and find that Whisper models' last Transformer layer features are well-suited for a Linear classifier, whereas Wav2Vec 2.0 models perform better with features from the middle and early Transformer layers.</p>
        </list-item>
        <list-item>
          <p>We compare English-Only and Multilingual LSMs, revealing that Whisper models benefit from multilingual pre-training, performing best on Italian, Spanish, Canadian French, French, and German, and competitively on Greek, Egyptian Arabic, and Persian. Conversely, English-Only Wav2Vec 2.0 models surpass multilingual XLS-R in most languages, achieving the highest performance in Greek and Egyptian Arabic.</p>
        </list-item>
      </list>
      <p>0009-0003-8727-3393 (F. D'Asaro); 0009-0008-3098-5492 (J. J. Márquez Villacís); 0000-0003-0083-813X (G. Rizzo); 0000-0002-8894-5089 (A. Bottino). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org).</p>
    </sec>
    <sec id="sec-11">
      <title>2. Background</title>
      <sec id="sec-11-1">
        <title>2.1. Large Speech Models</title>
      </sec>
    </sec>
    <sec id="sec-12">
      <p>Recent developments in natural language processing and computer vision have harnessed large volumes of unlabeled data through Self-Supervised Learning [22, 23, 24]. Building on techniques such as masked language and image modeling, Wav2Vec 2.0 [18] introduced an LSM trained on extensive audio datasets using masked speech modeling. Wav2Vec 2.0 features seven 1D convolutional blocks for initial feature extraction, followed by 12 or 24 transformer blocks (depending on the model variant) for contextual processing. The model masks part of the latent features and reconstructs them using the surrounding context. To further refine LSMs for tasks like emotion recognition, methods such as WavLM [25] have been developed. WavLM incorporates speech denoising alongside masked modeling, demonstrating broad effectiveness across various tasks in the SUPERB benchmark [26]. Moreover, XLSR-53 [27] extends the Wav2Vec 2.0 framework to cover 53 languages, sharing the latent space across these languages. This approach has shown superior performance over monolingual pretraining for automatic speech recognition. XLS-R [28] further advances this by scaling to 128 languages, excelling in speech translation and language identification. In comparison, Whisper [13] leverages large-scale weak supervision from audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is trained in a multitask framework that includes multilingual transcription and translation, establishing itself as an effective zero-shot model for multilingual tasks.</p>
    </sec>
    <sec id="sec-13">
      <title>2.2. Cross-Language Speech Emotion Recognition</title>
      <p>Emotion recognition in languages beyond English, like Italian [29], French [30], Persian [31, 32], and Spanish [33], is crucial but often limited by data availability. Recent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like dual attention [21] and tensor fusion [34] enhance audio and text interaction in languages such as Italian, German, and Urdu. Self-supervised pre-training methods, including variational autoencoders, have also been effective in transferring knowledge across languages like German [35, 36]. The advent of LSMs pre-trained with self-supervision has further increased the potential for transfer learning due to their high generalization capabilities [15]. However, most research primarily focuses on adapting multilingual Wav2Vec 2.0 models (XLSR-53) [19, 37, 20, 21]. This work expands the scope of analyzed LSMs, including WSL models such as Whisper. Additionally, we evaluate the ability of English-only models to transfer knowledge to other languages, beyond just multilingual models.</p>
    </sec>
    <sec id="sec-14">
      <title>3. Method</title>
      <p>In this section, we describe the methodology for evaluating the effectiveness of LSMs as feature extractors for downstream SER in various languages. We stack a classification model on top of the LSM backbone, with its parameters frozen. All LSMs used in this work share the same overall architecture, which we describe below along with the stacked classification model.</p>
      <p>Formally, the input audio x (raw waveform or log-mel spectrogram) passes through a convolutional encoder f : X → Z, mapping the audio to latent features z = {z_1, ..., z_T}, where T is the sequence length and each frame z_t typically corresponds to 25 ms, with z_t ∈ R^d. Then, z passes through a Transformer encoder consisting of L layers g_l : Z → H, enriching the latent features with contextual information, resulting in {h_1^l, ..., h_T^l} for each of the l = 1, ..., L Transformer layers. Here, l = L corresponds to the output features of the last layer, with h_t^l ∈ R^D. The features {h_1^l, ..., h_T^l}, l = 1, ..., L, are considered the extracted features from the LSM and are fed into a downstream classifier c : H → R^C, which maps these features to the output class logits {o_1, ..., o_C}. The output class label y* for audio x is given by:</p>
      <p>y* = arg max softmax(c(g(f(x))))   (1)</p>
    </sec>
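As an illustration of the mapping in Eq. (1), the sketch below mocks the frozen backbone with random hidden states (the shapes L, T, D, C and all weights are illustrative assumptions, not the paper's trained models; in practice the hidden states would come from a pre-trained Wav2Vec 2.0 or Whisper encoder) and applies a minimal mean-pool-plus-linear classifier followed by softmax and argmax:

```python
import numpy as np

rng = np.random.default_rng(0)

L, T, D, C = 12, 50, 768, 5                 # layers, frames, feature dim, emotion classes
hidden_states = rng.normal(size=(L, T, D))  # stand-in for {h_t^l} from the frozen LSM

def softmax(v):
    # numerically stable softmax over the class logits
    e = np.exp(v - v.max())
    return e / e.sum()

# A minimal downstream classifier c: mean-pool the last layer, then a linear map.
W = rng.normal(size=(D, C)) * 0.01
b = np.zeros(C)

pooled = hidden_states[-1].mean(axis=0)     # average over the T frames of layer L
logits = pooled @ W + b
probs = softmax(logits)
y_star = int(np.argmax(probs))              # predicted emotion label, as in Eq. (1)

print(y_star, probs.shape)
```

The frozen-backbone assumption means only `W` and `b` would receive gradients during training.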
    <sec id="sec-17">
      <p>Inspired by previous work that uses probing to evaluate the quality of features extracted from backbone models [38, 39], we evaluate three different downstream classifiers of increasing complexity: Linear Classifier (g_LIN), Non-Linear Classifier (g_NL), and Multi-Layer Classifier (g_ML).</p>
      <sec id="sec-17-1">
        <title>3.1. Linear Classifier</title>
        <p>The Linear Classifier is a neural network that consists solely of linear projections. Given the features from the last Transformer layer {h_1^L, ..., h_T^L}, they are first projected by a linear layer W_1 : R^D → R^P that is shared across all frames, then aggregated by average pooling P, and finally passed through the classification layer O : R^P → R^C to obtain the output class logits. The function g_LIN is compactly defined as:</p>
        <p>g_LIN(h_1^L, ..., h_T^L) = O(P(W_1(h_1^L, ..., h_T^L)))   (2)</p>
        <p>The absence of non-linear activations allows us to evaluate the quality of the features extracted from the LSM based on the linear classifier model's ability to handle the SER task.</p>
      </sec>
      <sec id="sec-17-2">
        <title>3.2. Non-Linear Classifier</title>
        <p>To increase the complexity of the classification model, we utilize a series of linear layers interleaved with ReLU activations both before and after feature pooling. We follow the same architecture as in [14, 15], but unlike them, we only feed the features from the last Transformer layer L to the model. Each {h_1^L, ..., h_T^L} passes through two shared linear layers, ReLU, and dropout blocks (B), followed by a linear layer (W_1). Linear layers are functions W : R^D → R^P. Projected features are averaged, pass through W_2 and ReLU, and are classified by O. Thus, g_NL is:</p>
        <p>g_NL(h_1^L, ..., h_T^L) = O(ReLU(W_2(P(W_1(B(h_1^L, ..., h_T^L))))))   (3)</p>
      </sec>
      <sec id="sec-17-3">
        <title>3.3. Multi-Layer Classifier</title>
        <p>As a third option, we adopt the approach from [14, 15], which utilizes all hidden states of the Transformer encoder. The features {h_1^l, ..., h_T^l}, l = 1, ..., L, are combined into a new sequence {h_1*, ..., h_T*} using a learnable weighted sum S : R^(L×T×D) → R^(T×D):</p>
        <p>h_t* = Σ_(l=1..L) w_l · h_t^l,  for t = 1, ..., T   (4)</p>
        <p>where w_1, ..., w_L are the weights assigned to each Transformer layer, ensuring w_l ∈ [0, 1] and Σ_(l=1..L) w_l = 1. The resulting sequence {h_1*, ..., h_T*} is then processed by the same pipeline as the Non-Linear Classifier, resulting in:</p>
        <p>g_ML({h_1^l, ..., h_T^l}, l = 1, ..., L) = g_NL(S({h_1^l, ..., h_T^l}))   (5)</p>
      </sec>
    </sec>
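The Linear and Non-Linear probes can be sketched as a minimal NumPy forward pass. The weights here are randomly initialised (the paper trains them with Adam), the dropout blocks B are omitted for brevity, and the projection dimension P = 256 is taken from the experimental details; everything else (shapes, names) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, P, C = 50, 768, 256, 5   # frames, feature dim, projection dim, emotion classes

def linear(x, w, b):
    return x @ w + b

# Randomly initialised parameters standing in for the trained probe weights.
W1, b1 = rng.normal(size=(D, P)) * 0.01, np.zeros(P)
W2, b2 = rng.normal(size=(P, P)) * 0.01, np.zeros(P)
O, bo = rng.normal(size=(P, C)) * 0.01, np.zeros(C)

h_last = rng.normal(size=(T, D))   # stand-in for {h_1^L, ..., h_T^L} from the frozen LSM

def g_lin(h):
    # Eq. (2): shared projection W1, average pooling over frames, classifier O.
    return linear(linear(h, W1, b1).mean(axis=0), O, bo)

def g_nl(h):
    # Eq. (3), simplified: projection, pooling, then W2 + ReLU before the classifier.
    z = linear(h, W1, b1).mean(axis=0)
    return linear(np.maximum(linear(z, W2, b2), 0.0), O, bo)

print(g_lin(h_last).shape, g_nl(h_last).shape)
```

Both probes return C logits; only their depth differs, which is what the comparison in Section 4.3.1 isolates.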
    <sec id="sec-18">
      <p>This classifier leverages internal layer information, which has proven beneficial for paralinguistic and linguistic downstream tasks [39, 40, 41, 42]. By investigating the contribution of internal LSM layers for SER across various languages, we corroborate previous findings for Wav2Vec 2.0 models and provide new insights for Whisper models.</p>
    </sec>
    <sec id="sec-19">
      <title>4. Experiments</title>
      <sec id="sec-19-1">
        <title>4.1. Datasets and Metrics</title>
      </sec>
    </sec>
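A minimal sketch of the learnable weighted sum in Eq. (4): softmax-normalising one scalar logit per layer is a common way to satisfy the constraints w_l ∈ [0, 1] and Σ w_l = 1 (the paper does not state which normalisation it uses, so the softmax here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
L, T, D = 12, 50, 768                        # layers, frames, feature dim

hidden_states = rng.normal(size=(L, T, D))   # stand-in for {h_t^l} over all L layers

# One learnable scalar logit per Transformer layer; the softmax keeps the
# resulting weights w_l in [0, 1] with sum 1, as required by Eq. (4).
layer_logits = rng.normal(size=L)
w = np.exp(layer_logits - layer_logits.max())
w /= w.sum()

# h_t^* = sum_l w_l * h_t^l (Eq. 4), computed for every frame at once.
h_star = np.tensordot(w, hidden_states, axes=(0, 0))   # shape (T, D)

print(h_star.shape, float(w.sum()))
```

During training, only `layer_logits` would be updated; inspecting the learned `w` is exactly the layer-importance analysis reported in Section 4.3.1.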
    <sec id="sec-20">
      <p>In this study, we conduct experiments using 9 distinct datasets spanning 8 different languages: Greek, French, Canadian French, Italian, German, Spanish, Egyptian Arabic, and Persian.</p>
    </sec>
    <sec id="sec-21">
      <p>The datasets vary in their collection methodologies, such as acted emotions and elicitation methods. The participant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our experiments in a speaker-independent setting to prevent evaluation on speaker-dependent features. Table 1 provides an overview of the dataset statistics, with a more detailed description given below.</p>
    </sec>
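The speaker-independent protocol can be illustrated with a small pure-Python sketch that splits by speaker rather than by clip, so no speaker contributes to more than one subset (the speaker counts and ratios below are toy values, not the datasets' actual statistics):

```python
import random

# Toy metadata: (clip_id, speaker_id) pairs, 10 clips per speaker.
clips = [(f"clip{i:03d}", f"spk{i % 6}") for i in range(60)]

speakers = sorted({spk for _, spk in clips})
random.Random(0).shuffle(speakers)

# Hold out whole speakers (here 4/1/1 of the 6 speakers for train/val/test),
# so evaluation cannot exploit speaker-dependent features.
train_spk = set(speakers[:4])
val_spk = {speakers[4]}
test_spk = {speakers[5]}

train = [c for c, s in clips if s in train_spk]
val = [c for c, s in clips if s in val_spk]
test = [c for c, s in clips if s in test_spk]

print(len(train), len(val), len(test))
```

The key property is that the three speaker sets are disjoint by construction, which is what "speaker-independent" guarantees.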
    <sec id="sec-23">
      <p>AESDD [43]: The Acted Emotional Speech Dynamic Database comprises 500 recorded samples from 5 actors (3 females, 2 males) expressing 5 distinct emotions in Greek. Each actor performed 20 utterances per emotion, with some utterances recorded multiple times. In later versions, additional actors were included, bringing the total to 604 recordings from 6 actors.</p>
      <p>CaFE [44]: This dataset includes recordings of 6 different sentences delivered by 12 actors (6 female, 6 male) portraying the Big Six emotions and a neutral state in Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as a down-sampled version at 48 kHz and 16 bits per sample. The total number of samples amounts to 936.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Overview of the datasets used in this study.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Language</th><th>Emotions</th></tr>
          </thead>
          <tbody>
            <tr><td>AESDD</td><td>Greek</td><td>anger, disgust, fear, happiness, and sadness</td></tr>
            <tr><td>CaFE</td><td>Canadian French</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>DEMoS</td><td>Italian</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>EmoDB</td><td>German</td><td>anger, disgust, fear, happiness, boredom, sadness, and neutrality</td></tr>
            <tr><td>EmoMatch</td><td>Spanish</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>EMOVO</td><td>Italian</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>EYASE</td><td>Egyptian Arabic</td><td>anger, happiness, sadness, and neutrality</td></tr>
            <tr><td>Oréau</td><td>French</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>ShEMO</td><td>Persian</td><td>anger, happiness, sadness, and neutrality</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>DEMoS [45]: DEMoS contains 9697 audio samples from 68 volunteer students (299 females, 131 males) expressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were generated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1 s), are provided in 48 kHz, 16-bit, mono format.</p>
      <p>EmoDB [46]: This collection includes 535 utterances across 7 emotional states, spoken in German by 5 female and 5 male actors. Each actor performed a set of 10 sentences, which were down-sampled from the original 48 kHz to 16 kHz.</p>
      <p>EmoMatch [33]: Consisting of 2005 recordings, EmoMatch features samples from 50 non-actor Spanish speakers (20 females, 30 males) expressing the Big Six emotions and a neutral state. The dataset is a subset of the larger EmoSpanishDB and contains recordings sampled at 48 kHz with a 16-bit mono format.</p>
      <p>EMOVO [47]: EMOVO presents 588 Italian audio recordings from 3 male and 3 female actors simulating the Big Six emotions plus a neutral state. Each actor voiced 14 utterances, and the recordings are provided in 48 kHz, 16-bit stereo WAV format.</p>
      <p>EYASE [48]: EYASE contains 579 utterances in Egyptian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz.</p>
      <p>Oréau [49]: The Oréau dataset features 502 audio samples from 32 non-professional actors (25 male, 7 female) who voiced 10 to 13 utterances in French for the Big Six emotions plus a neutral state.</p>
      <p>ShEMO [50]: ShEMO comprises 3000 semi-natural recordings from 87 native Persian speakers (31 female, 56 male). The dataset captures 5 of the Big Six emotions (sadness, anger, happiness, surprise, and fear) plus a neutral state. The samples were up-sampled to a frequency of 44.1 kHz in mono-channel format, with an average length of 4.11 seconds (std: 3.41 s).</p>
      <p>The audio is resampled to 16 kHz, and a stratified train/validation/test split is performed with ratios of 80/10/10. All results are reported using the macro F1 score, expressed as a percentage. We conducted 3 runs, presenting the mean ± standard deviation.</p>
    </sec>
    <sec id="sec-27">
      <title>4.2. Experimental Details</title>
      <p>Baseline. As a baseline to evaluate LSM transfer learning capabilities, we adopt the Audio Spectrogram Transformer (AST) [51], a fully transformer-based architecture recently proposed as a substitute for CNNs [9, 10, 11]. We train AST from scratch on each of the 9 datasets using the same hyperparameters as [51].</p>
      <p>LSM Models. We use pre-trained checkpoints for both English-Only and Multilingual models: Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R from the Wav2Vec 2.0 family, and Whisper Small (EN) (Whisper Small pre-trained only on English data), Whisper Small, and Whisper Medium from the Whisper family. The LSM backbones are kept frozen and used exclusively as feature extractors.</p>
      <p>Training. We follow the same hyperparameter settings as [15] to train the downstream classifiers. Specifically, we train for 30 epochs using the Adam optimizer with a learning rate of 5.0e-04, weight decay of 1.0e-04, betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimension of the classifier projection P is 256.</p>
      <sec id="sec-27-1">
        <title>4.3. Results</title>
        <p>To present our results, we first compare the performance of the various classifiers (see Section 3) for each LSM utilized. This analysis provides insights into the characteristics of features extracted from Wav2Vec 2.0 and Whisper models for downstream SER tasks. After identifying the best classifier for each LSM, we then compare the performance of English-Only and Multilingual LSMs across the 8 languages covered in this study.</p>
      </sec>
      <sec id="sec-27-2">
        <title>4.3.1. Comparison between downstream classifiers</title>
        <p>We examine the results in Table 2, comparing the three classifier methods for Wav2Vec 2.0 and Whisper models. The table shows average F1 scores across the 9 datasets, highlighting the most effective classifier for each LSM in cross-lingual SER tasks.</p>
        <p>For Wav2Vec 2.0 models, the Multi-Layer Classifier performs best, with F1 scores of 53.42, 57.50, and 40.89 for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R. The Linear and Non-Linear classifiers perform similarly, especially for Wav2Vec 2.0 Large and XLS-R, suggesting improvements are due to using features from internal Transformer layers rather than non-linear activations. For Whisper models, the Linear Classifier performs best, with F1 scores of 58.16, 60.87, and 60.72 for Whisper Small (EN), Whisper Small, and Whisper Medium. Increasing classifier complexity with non-linear activations decreases performance, likely due to general information loss caused by complex transformations. The Multi-Layer Classifier performs worse, indicating that also using features from internal layers is less effective than using features from the last layer alone.</p>
        <p>This comparison reveals that Wav2Vec 2.0 models benefit from features extracted from internal Transformer layers and exhibit less sensitivity to classifier complexity, consistent with prior research [41, 39]. Conversely, Whisper models achieve better performance with features from the last Transformer layer when using a simple linear classifier, offering new insights into their effectiveness for SER across multiple languages. We hypothesize that this differing behavior may be related to their respective Self-Supervised and Weakly-Supervised pre-training approaches, which warrant further investigation. To gain further insights into the importance of Transformer layers in Wav2Vec 2.0 and Whisper for SER, we leverage the weights learned in the Multi-Layer classifier as follows.</p>
        <p>Transformer Layer Weights. We analyze the weights w_1, ..., w_L from the Multi-Layer Classifier to assess Transformer layer importance. Figure 2 illustrates that Wav2Vec 2.0 models assign greater weight to the early and middle layers, whereas Whisper models emphasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic information in Whisper models is embedded in the features of the later Transformer layers.</p>
      </sec>
    </sec>
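The reported metric can be reproduced with a small sketch: the macro F1 is the unweighted average of per-class F1 scores, and scores over the 3 runs are summarised as mean ± standard deviation (the labels below are toy values for illustration, not results from the paper):

```python
from statistics import mean, stdev

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, expressed as a percentage."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return 100.0 * sum(f1s) / len(f1s)

# Three runs on toy predictions, summarised as mean ± standard deviation.
runs = [macro_f1([0, 0, 1, 1], p) for p in ([0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 1, 0])]
print(f"{mean(runs):.2f} (± {stdev(runs):.2f})")
```

Because the average is unweighted, rare emotion classes count as much as frequent ones, which is why macro F1 is a sensible choice for the imbalanced datasets above.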
    <sec id="sec-28">
      <title>4.3.2. Comparing English-Only and Multilingual LSMs Across Different Languages</title>
      <p>Figure 2: Greyscale map of layer weight distribution from the Multi-Layer classification method. Weights are averaged over all 9 datasets for each model. Darker shades indicate higher weights.</p>
      <p>In this section, we compare English-Only and Multilingual LSMs with the AST baseline across 9 datasets. Table 3 displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 and Linear for Whisper models.</p>
      <p>Transferring knowledge from LSMs proves to be effective across all datasets compared to the baseline. For instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores 33.23. This indicates that LSMs are effective feature extractors for cross-lingual SER on multiple languages.</p>
      <p>When comparing English-only and Multilingual models, we differentiate between the Wav2Vec 2.0 and Whisper families. For Wav2Vec 2.0, we observe that Wav2Vec 2.0 Base and Large generally outperform XLS-R (e.g., 87.85 and 88.31 vs. 67.71 for DEMoS), except in Persian, where their performance is comparable. This indicates that multilingual pre-training may not be as effective for Wav2Vec 2.0 models across various languages. We speculate that this may be due to the limitations of SSL pre-training, which might struggle with the diverse range of languages and lose important paralinguistic features that are retained in English-only models. Further investigation with a wider range of SSL-pretrained LSMs could provide more insights. As regards Whisper, Multilingual Whisper Small outperforms its English-only version, with the exception of Greek and Persian, likely due to limited pretraining data for these languages, which resulted in higher word error rates compared to other languages in this study [13]. Multilingual Whisper models achieve the best performance in Canadian French and Spanish (66.71 and 73.13 with Whisper Small) and in Italian, German, and French (91.17, 90.64, and 95.22 with Whisper Medium). This improvement is likely due to the larger pretraining datasets for these languages and the similarities between Canadian French and French. We believe that multilingual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL and multitask learning. However, further research is needed to evaluate the effectiveness of multilingual pretraining with WSL compared to SSL across a broader range of LSMs.</p>
      <p>Table 3: Macro F1 scores (mean ± standard deviation over 3 runs) of AST and of the English-Only (Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, Whisper Small (EN)) and Multilingual (XLS-R, Whisper Small, Whisper Medium) models on the 9 datasets: AESDD (el), CaFE (fr-ca), DEMoS (it), EmoDB (de), EmoMatch (es), EMOVO (it), EYASE (ar-eg), Oréau (fr), and ShEMO (fa).</p>
      <sec id="sec-28-1">
        <title>5. Conclusion</title>
      </sec>
    </sec>
    <sec id="sec-29">
      <p>This paper examines the capabilities of Wav2Vec 2.0 and Whisper models as feature extractors for cross-lingual SER across eight languages, considering both English-Only and Multilingual variants. Our findings reveal that LSMs are effective feature extractors compared to a full Transformer baseline trained from scratch. We observe that Whisper models encode acoustic information primarily in the features of the last Transformer layer, whereas Wav2Vec 2.0 models rely on features from middle and early layers. Furthermore, we show that multilingual pre-training benefits Whisper models, leading to strong performance in Italian, Canadian French, French, Spanish, and German, and competitive results in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving top performance in Greek and Egyptian Arabic. We attribute the disparity in multilingual pre-training effectiveness to the differences between SSL and WSL strategies, which should be explored further.</p>
    </sec>
    <sec id="sec-30">
      <title>References</title>
      <p>in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, 2016, pp. 1–4.</p>
      <p>[9] N.-C. Ristea, R. T. Ionescu, F. S. Khan, SepTr: Separable transformer for audio spectrogram processing, arXiv preprint arXiv:2203.09581 (2022).</p>
      <p>[10] J.-Y. Kim, S.-H. Lee, CoordViT: A novel method of improve vision transformer-based speech emotion recognition using coordinate information concatenate, in: 2023 International Conference on Electronics, Information, and Communication (ICEIC), IEEE, 2023, pp. 1–4.</p>
      <p>[11] S. Akinpelu, S. Viriri, A. Adegun, An enhanced speech emotion recognition using vision transformer, Scientific Reports 14 (2024) 13126.</p>
      <p>[12] S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, B. W. Schuller, Audio self-supervised learning: A survey, Patterns 3 (2022).</p>
      <p>[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.</p>
      <p>[14] L. Pepino, P. Riera, L. Ferrer, Emotion recognition from speech using wav2vec 2.0 embeddings, arXiv preprint arXiv:2104.03502 (2021).</p>
      <p>[15] T. Feng, S. Narayanan, PEFT-SER: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2023, pp. 1–8.</p>
      <p>[16] Y. Li, A. Mehrish, R. Bhardwaj, N. Majumder, B. Cheng, S. Zhao, A. Zadeh, R. Mihalcea, S. Poria, Evaluating parameter-efficient transfer learning approaches on SURE benchmark for speech understanding, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.</p>
      <p>[17] T. Feng, R. Hebbar, S. Narayanan, Trust-SER: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 11201–11205.</p>
      <p>[18] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.</p>
      <p>[19] M. Sharma, Multi-lingual multi-task speech emotion recognition using wav2vec 2.0, in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 6907–6911.</p>
      <p>[20] S. G. Upadhyay, L. Martinez-Lucas, B.-H. Su, W.-C. Lin, W.-S. Chien, Y.-T. Wu, W. Katz, C. Busso, C.-C. Lee, Phonetic anchor-based transfer learning to facilitate unsupervised cross-lingual speech emotion recognition, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.</p>
      <p>[21] S. A. M. Zaidi, S. Latif, J. Qadir, Cross-language speech emotion recognition using multimodal dual attention transformers, arXiv preprint arXiv:2306.13804 (2023).</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[23] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).</p>
      <p>[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).</p>
      <p>[25] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.</p>
      <p>[26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, et al., SUPERB: Speech processing universal performance benchmark, arXiv preprint arXiv:2105.01051 (2021).</p>
      <p>[27] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition, arXiv preprint arXiv:2006.13979 (2020).</p>
      <p>[28] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. Von Platen, Y. Saraf, J. Pino, et al., XLS-R: Self-supervised cross-lingual speech representation learning at scale, arXiv preprint arXiv:2111.09296 (2021).</p>
      <p>[29] A. Wurst, M. Hopwood, S. Wu, F. Li, Y.-D. Yao, Deep learning for the detection of emotion in human speech: The impact of audio sample duration and English versus Italian languages, in: 2023 32nd Wireless and Optical Communications Conference (WOCC), IEEE, 2023, pp. 1–6.</p>
      <p>[30] M. Neumann, et al., Cross-lingual and multilingual speech emotion recognition on English and French, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5769–5773.</p>
      <p>[31] S. Deng, N. Zhang, Z. Sun, J. Chen, H. Chen, When low resource NLP meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract), in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 13773–13774.</p>
      <p>[32] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross lingual speech emotion recognition: Urdu vs. western languages, in: 2018 International conference</p>
      <p>ing Society 66 (2018) 457–467.</p>
      <p>[44] P. Gournay, O. Lahaie, R. Lefebvre, A Canadian French emotional speech dataset, in: Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 399–402.</p>
      <p>[45] E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, B. W. Schuller, DEMoS: An Italian emotional speech corpus: Elicitation methods, machine
on frontiers of information technology (FIT), IEEE, learning, and perception, Language Resources and
2018, pp. 88–93. Evaluation 54 (2020) 341–383.
[33] E. Garcia-Cuesta, A. B. Salvador, D. G. Pãez, Emo- [46] F. Burkhardt, A. Paeschke, M. Rolfes, W. F.
matchspanishdb: study of speech emotion recog- Sendlmeier, B. Weiss, et al., A database of german
nition machine learning models in a new spanish emotional speech., in: Interspeech, volume 5, 2005,
elicited database, Multimedia Tools and Applica- pp. 1517–1520.</p>
      <p>tions 83 (2024) 13093–13112. [47] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco,
[34] A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. et al., Emovo corpus: an italian emotional speech
Morency, Tensor fusion network for multimodal database, in: Proceedings of the ninth international
sentiment analysis, arXiv preprint arXiv:1707.07250 conference on language resources and evaluation
(2017). (LREC’14), European Language Resources
Associa[35] H. H. Mao, A survey on self-supervised pre-training tion (ELRA), 2014, pp. 3501–3504.
for sequential transfer learning in neural networks, [48] L. Abdel-Hamid, Egyptian arabic speech emotion
arXiv preprint arXiv:2007.00800 (2020). recognition using prosodic, spectral and wavelet
[36] S. Sadok, S. Leglaive, R. Séguier, A vector quantized features, Speech Communication 122 (2020) 19–30.
masked autoencoder for speech emotion recogni- [49] S. Oréau, French emotional speech database - oréau,
tion, in: 2023 IEEE International conference on Zenodo, 2021. URL: https://zenodo.org/records/
acoustics, speech, and signal processing workshops 4405783.</p>
      <p>(ICASSPW), IEEE, 2023, pp. 1–5. [50] O. Mohamad Nezami, P. Jamshid Lou, M. Karami,
[37] F. Catania, Speech emotion recognition in italian Shemo: a large-scale validated database for persian
using wav2vec 2, Authorea Preprints (2023). speech emotion detection, Language Resources and
[38] Y. Belinkov, J. Glass, Analyzing hidden representa- Evaluation 53 (2019) 1–16.</p>
      <p>tions in end-to-end automatic speech recognition [51] Y. Gong, Y.-A. Chung, J. Glass, Ast: Audio
spectrosystems, Advances in Neural Information Process- gram transformer, arXiv preprint arXiv:2104.01778
ing Systems 30 (2017). (2021).
[39] J. Shah, Y. K. Singla, C. Chen, R. R. Shah, What all
do audio transformer models hear? probing
acoustic representations for language delivery and its
structure, arXiv preprint arXiv:2101.00387 (2021).
[40] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise
analysis of a self-supervised speech representation model,
in: 2021 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), IEEE, 2021, pp.</p>
      <p>914–921.
[41] Y. Li, Y. Mohamied, P. Bell, C. Lai, Exploration
of a self-supervised speech model: A study on
emotional corpora, in: 2022 IEEE Spoken
Language Technology Workshop (SLT), IEEE, 2023, pp.</p>
      <p>868–875.
[42] A. Pasad, B. Shi, K. Livescu, Comparative
layerwise analysis of self-supervised speech models,
in: ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, 2023, pp. 1–5.
[43] N. Vryzas, R. Kotsakis, A. Liatsou, C. A. Dimoulas,</p>
      <p>G. Kalliris, Speech emotion recognition for
performance interaction, Journal of the Audio
Engineer</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <article-title>Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities</article-title>
          ,
          <source>IEEE Signal Processing Magazine</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>22</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <article-title>A survey of deep learning-based multimodal emotion recognition: Speech, text, and face</article-title>
          ,
          <source>Entropy</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>1440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Wani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Gunawan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. A.</given-names>
            <surname>Qadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kartiwi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ambikairajah</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of speech emotion recognition systems</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>47795</fpage>
          -
          <lpage>47814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using cnn</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM international conference on Multimedia</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>801</fpage>
          -
          <lpage>804</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Badshah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Baik</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition from spectrograms with deep convolutional neural network</article-title>
          ,
          <source>in: 2017 international conference on platform technology and service (PlatCon)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using deep 1d &amp; 2d cnn lstm networks</article-title>
          ,
          <source>Biomedical signal processing and control</source>
          <volume>47</volume>
          (
          <year>2019</year>
          )
          <fpage>312</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hashemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Annavaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Enhancing privacy through domain adaptive noise injection for speech emotion recognition</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>7702</fpage>
          -
          <lpage>7706</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using convolutional and recurrent neural networks,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>