<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Federico</forename><surname>D'asaro</surname></persName>
							<email>federico.dasaro@polito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Data &amp; Space</orgName>
								<orgName type="institution">LINKS Foundation -AI</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Politecnico di Torino</orgName>
								<orgName type="department" key="dep2">Dipartimento di Automatica e Informatica (DAUIN)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Juan</forename><forename type="middle">José Márquez</forename><surname>Villacís</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Data &amp; Space</orgName>
								<orgName type="institution">LINKS Foundation -AI</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Rizzo</surname></persName>
							<email>giuseppe.rizzo@linksfoundation.com</email>
							<affiliation key="aff0">
								<orgName type="department">Data &amp; Space</orgName>
								<orgName type="institution">LINKS Foundation -AI</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Politecnico di Torino</orgName>
								<orgName type="department" key="dep2">Dipartimento di Automatica e Informatica (DAUIN)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Bottino</surname></persName>
							<email>andrea.bottino@polito.it</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Politecnico di Torino</orgName>
								<orgName type="department" key="dep2">Dipartimento di Automatica e Informatica (DAUIN)</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">60A12A0A7BDC475E882925E064E6DEDC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Cross-lingual Speech Emotion Recognition</term>
					<term>Large Speech models</term>
					<term>Transfer Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We have stacked three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, German and competitively on Greek, Egyptian Arabic, Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek, Egyptian Arabic.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Speech Emotion Recognition (SER) aims to identify emotions from speech audio, enhancing Human-AI interaction in fields such as healthcare, education, and security <ref type="bibr" target="#b0">[1]</ref>. Traditional methods rely on Low-Level Descriptors (LLD) like spectral, prosodic, and voice quality features <ref type="bibr" target="#b1">[2]</ref>, using classifiers such as KNN, SVM, or Naïve Bayes <ref type="bibr" target="#b2">[3]</ref>. Deep learning has introduced advanced techniques, including Convolutional Neural Networks (CNNs) <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>, eventually followed by Recurrent Neural Networks (RNNs) <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>, and Transformers <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. Transformers' ability to learn from extensive datasets has led to Large Speech Models (LSMs), which generalize across various speech tasks. Common training approaches for these models include Self-Supervised Learning (SSL), which uses data itself to learn generalpurpose features <ref type="bibr" target="#b11">[12]</ref>, and Weakly-Supervised Learning (WSL), which pairs audio with text for tasks like transcription and translation <ref type="bibr" target="#b12">[13]</ref>. The general-purpose knowl-edge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17]</ref>, but efforts for other languages are limited, focusing on Wav2Vec 2.0 <ref type="bibr" target="#b17">[18]</ref> for cross-lingual SER <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b21">21]</ref>.</p><p>This study examines how effective LSMs are as feature extractors for cross-lingual SER, using nine datasets across eight languages: Italian, German, French, Canadian French, Spanish, Greek, Persian, and Egyptian Arabic. Specifically, we utilize LSMs from the Wav2Vec 2.0 and Whisper <ref type="bibr" target="#b12">[13]</ref> model families, pre-trained with SSL and WSL approaches, respectively. We introduce Whisper due to its underexplored use in cross-lingual SER. To assess the effectiveness of LSMs as feature extractors, we test three classifiers of increasing complexity-Linear, Non-Linear, and Multi-Layer-across nine datasets. This evaluation determines which classifier best suits each LSM across different languages. Moreover, our study includes both English-Only and Multilingual models from the Wav2Vec 2.0 and Whisper families, aiming to evaluate the effectiveness of multilingual pre-training for cross-lingual SER.</p><p>The main contributions of this work are:</p><p>• We evaluate LSMs from the Wav2Vec 2.0 and Whisper models as feature extractors for crosslingual SER across eight languages. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Large Speech Models</head><p>Recent developments in natural language processing and computer vision have harnessed large volumes of unlabeled data through Self-Supervised Learning <ref type="bibr" target="#b22">[22,</ref><ref type="bibr" target="#b23">23,</ref><ref type="bibr" target="#b24">24]</ref>.</p><p>Building on techniques such as masked language and image modeling, Wav2Vec 2.0 <ref type="bibr" target="#b17">[18]</ref> introduced a LSM trained on extensive audio datasets using masked speech modeling. Wav2Vec 2.0 features seven 1D convolutional blocks for initial feature extraction, followed by 12 or 24 transformer blocks (depending on the model variant) for contextual processing. The model masks part of the latent features and reconstructs them using the surrounding context. To further refine LSMs for tasks like emotion recognition, methods such as WavLM <ref type="bibr" target="#b25">[25]</ref> have been developed. WavLM incorporates speech denoising alongside masked modeling, demonstrating broad effectiveness across various tasks in the SUPERB benchmark <ref type="bibr" target="#b26">[26]</ref>. Moreover, XLSR-53 <ref type="bibr" target="#b27">[27]</ref> extends the Wav2Vec 2.0 framework to cover 53 languages, sharing the latent space across these languages. This approach has shown superior performance over monolingual pretraining for automatic speech recognition. XLS-R <ref type="bibr" target="#b28">[28]</ref> further advances this by scaling to 128 languages, excelling in speech translation and language identification. In comparison, Whisper <ref type="bibr" target="#b12">[13]</ref> leverages large-scale weak supervision from audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is trained in a multitask framework that includes multilingual transcription and translation, establishing itself as an effective zero-shot model for multilingual tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Cross-Language Speech Emotion Recognition</head><p>Emotion recognition in languages beyond English, like Italian <ref type="bibr" target="#b29">[29]</ref>, French <ref type="bibr" target="#b30">[30]</ref>, Persian <ref type="bibr" target="#b31">[31,</ref><ref type="bibr" target="#b32">32]</ref>, and Spanish <ref type="bibr" target="#b33">[33]</ref>, is crucial but often limited by data availability. Recent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like dual attention <ref type="bibr" target="#b21">[21]</ref> and tensor fusion <ref type="bibr" target="#b34">[34]</ref> enhance audio and text interaction in languages such as Italian, German, and Urdu. Self-supervised pre-training methods, including variational autoencoders, have also been effective in transferring knowledge across languages like German <ref type="bibr" target="#b35">[35,</ref><ref type="bibr" target="#b36">36]</ref>. The advent of LSMs pre-trained with self-supervision has further increased the potential for transfer learning due to their high generalization capabilities <ref type="bibr" target="#b14">[15]</ref>. However, most research primarily focuses on adapting multilingual Wav2Vec 2.0 models (XLSR-53) <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b37">37,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b21">21]</ref>. This work expands the scope of analyzed LSMs including WSL models as Whisper. Additionally, we evaluate the ability of English-only models to transfer knowledge to other languages, beyond just multilingual models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>In this section, we describe the methodology for evaluating the effectiveness of LSMs as feature extractors for downstream SER in various languages. We stack a classification model on top of the LSM backbone, with its parameters frozen. All LSMs used in this work share the same overall architecture, which we describe below along with the stacked classification model. Formally, the input audio 𝐴 (raw waveform or logmel spectrogram) passes through a convolutional encoder 𝓏 ∶ 𝐴 → 𝑍, mapping the audio to latent features 𝑍 = {𝑧 1 , … , 𝑧 𝑇 }, where 𝑇 is the sequence length and each frame 𝑧 𝑖 typically corresponds to 25 ms with 𝑧 𝑖 ∈ ℝ 𝑑 . Then, 𝑍 passes through a Transformer encoder consisting of 𝑙 layers 𝒽 𝑙 ∶ 𝑍 → 𝐻, enriching the latent features with contextual information, resulting in {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } for each of the 𝑙 = 1, … , 𝐿 Transformer layers. Here, 𝑙 = 𝐿 corresponds to the output features of the last layer, with ℎ 𝑙 𝑖 ∈ ℝ 𝑑 . The features {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 are considered the extracted features from the LSM and are fed into a downstream classifier 𝓎 ∶ 𝐻 → 𝑌, which maps these features to the output class logits {𝑦 1 , … , 𝑦 𝑘 }. The output class label 𝑦 * for audio 𝐴 is given by:</p><formula xml:id="formula_0">𝑦 * = arg max 𝑘 softmax (𝓎 (𝒽 (𝓏(𝐴))))<label>(1)</label></formula><p>Inspired by previous work that uses probing to evaluate the quality of features extracted from backbone models <ref type="bibr" target="#b38">[38,</ref><ref type="bibr" target="#b39">39]</ref>, we evaluate three different downstream classifiers of increasing complexity: Linear Classifier (ℊ 𝑙 ), Non-Linear Classifier (ℊ 𝑛𝑙 ), and Multi-layer Classifier (ℊ 𝑚𝑙 ). Figure <ref type="figure" target="#fig_0">1</ref> illustrates their architecture, which is detailed below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Linear Classifier</head><p>For the linear classifier, we use a simple feed-forward neural network that consists solely of linear projections. The snowflake icon represents frozen weights, while the fire icon denotes trainable weights.</p><p>Specifically, given the features from the last Transformer layer {ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 }, they are first projected by a linear layer 𝓁 1 ∶ ℝ 𝑑 → ℝ 𝑚 that is shared across all frames, then aggregated by average pooling 𝓅, and finally pass through the classification layer ℴ ∶ ℝ 𝑚 → ℝ 𝑘 to obtain the output class logits. The function ℊ 𝑙 is compactly defined as:</p><formula xml:id="formula_1">ℊ 𝑙 (ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 ) = ℴ (𝓅 (𝓁 1 (ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 )))<label>(2)</label></formula><p>The absence of non-linear activations allows us to evaluate the quality of the features extracted from the LSM based on the linear classifier model's ability to handle the SER task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Non-Linear Classifier</head><p>To increase the complexity of the classification model, we utilize a series of linear layers interleaved with ReLU activations both before and after feature pooling. We follow the same architecture as in <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, but unlike them, we only feed the features from the last Transformer layer 𝐿 to the model. Each {ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 } passes through two shared linear layers, ReLU, and dropout blocks (𝒷), followed by a linear layer (𝓁 1 ). Linear layers are functions 𝓁 ∶ ℝ 𝑑 → ℝ 𝑚 . Projected features are averaged, pass through 𝓁 2 and ReLU, and are classified by ℴ. Thus, ℊ 𝑛𝑙 is:</p><formula xml:id="formula_2">ℊ 𝑛𝑙 (𝑥 = ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 ) = ℴ (ReLU (𝓁 2 (𝓅 (𝓁 1 (𝒷 (𝑥))))))<label>(3)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Multi-Layer Classifier</head><p>As a third option, we adopt the approach from <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, which utilizes all hidden states of the Transformer encoder. </p><formula xml:id="formula_3">ℎ * 𝑡 = 𝐿 ∑ 𝑙=1 𝑤 𝑙 ⋅ ℎ 𝑙 𝑡 for 𝑡 = 1, … , 𝑇<label>(4)</label></formula><p>where 𝑤 1 , … , 𝑤 𝐿 are the weights assigned to each Transformer layer, ensuring 𝑤 𝑙 ∈ [0, 1] and ∑ 𝐿 𝑙=1 𝑤 𝑙 = 1. The resulting sequence {ℎ * 1 , … , ℎ * 𝑇 } is then processed by the same pipeline as the Non-Linear Classifier, resulting in:</p><formula xml:id="formula_4">ℊ 𝑚𝑙 (𝑥 = {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 ) = ℊ 𝑛𝑙 (𝓈(𝑥))<label>(5)</label></formula><p>This classifier leverages internal layer information, which has proven beneficial for paralinguistic and linguistic downstream tasks <ref type="bibr" target="#b39">[39,</ref><ref type="bibr" target="#b40">40,</ref><ref type="bibr" target="#b41">41,</ref><ref type="bibr" target="#b42">42]</ref>. By investigating the contribution of internal LSM layers for SER across various languages, we corroborates previous findings for Wav2Vec 2.0 models and provide new insights for Whisper models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets and Metrics</head><p>In this study, we conduct experiments using 9 distinct datasets spanning 8 different languages: Greek, French, Italian, German, Spanish, Egyptian Arabic, and Persian.</p><p>The datasets vary in their collection methodologies, such as acted emotions and elicitation methods. The participant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our experiments in a speaker-independent setting to prevent evaluation on speaker-dependent features. Table <ref type="table" target="#tab_2">1</ref> provides an overview of the dataset statistics, with a more detailed description given below. AESDD <ref type="bibr" target="#b43">[43]</ref>: The Acted Emotional Speech Dynamic Database comprises 500 recorded samples from 5 actors (3 females, 2 males) expressing 5 distinct emotions in Greek. Each actor performed 20 utterances per emotion, with some utterances recorded multiple times. In later versions, additional actors were included, bringing the total to 604 recordings from 6 actors.</p><p>CaFE <ref type="bibr" target="#b44">[44]</ref> DEMoS <ref type="bibr" target="#b45">[45]</ref>: DEMoS contains 9697 audio samples from 68 volunteer students (299 females, 131 males) expressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were generated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1s), are provided in 48 kHz, 16-bit, mono format.</p><p>EmoDB <ref type="bibr" target="#b46">[46]</ref>: This collection includes 535 utterances across 7 emotional states, spoken in German by 5 female and 5 male actors. Each actor performed a set of 10 sentences, which were down-sampled from the original 48 kHz to 16 kHz.</p><p>EmoMatch <ref type="bibr" target="#b33">[33]</ref>: Consisting of 2005 recordings, Emo-Match features samples from 50 non-actor Spanish speakers (20 females, 30 males) expressing the Big Six emotions and a neutral state. The dataset is a subset of the larger EmoSpanishDB and contains recordings sampled at 48 kHz with a 16-bit mono format.</p><p>EMOVO <ref type="bibr" target="#b47">[47]</ref>: EMOVO presents 588 Italian audio recordings from 3 male and 3 female actors simulating the Big Six emotions plus a neutral state. Each actor voiced 14 utterances, and the recordings are provided in 48 kHz, 16-bit stereo WAV format.</p><p>EYASE <ref type="bibr" target="#b48">[48]</ref>: EYASE contains 579 utterances in Egyptian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz.</p><p>Oréau <ref type="bibr" target="#b49">[49]</ref>: The Oréau dataset features 502 audio samples from 32 non-professional actors (25 male, 7 female) who voiced 10 to 13 utterances in French for the Big Six emotions plus a neutral state.</p><p>ShEMO <ref type="bibr" target="#b50">[50]</ref>: ShEMO comprises 3000 semi-natural recordings from 87 native Persian speakers (31 female, 56 male). The dataset captures 5 of the Big Six emotions-sadness, anger, happiness, surprise, and fear-plus a neutral state. 
The samples were up-sampled to a frequency of 44.1 kHz in mono-channel format, with an average length of 4.11 seconds (std: 3.41s).</p><p>The audio is resampled to 16 kHz, and a stratified train/validation/test split is performed with ratios of 80/10/10. All results are reported using the macro F1 score, expressed as a percentage. We conduct 3 runs and report the mean ± standard deviation.</p></div>
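A hedged sketch of this evaluation protocol, assuming torchaudio for the resampling and scikit-learn for the split and the metric, is given below.

```python
# Resample to 16 kHz, make a stratified 80/10/10 split, and report macro F1 (%).
import torchaudio
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def resample_to_16k(waveform, orig_sr):
    return torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)

def stratified_split(files, labels, seed=0):
    # 80% train, then split the remaining 20% evenly into validation and test.
    train_f, rest_f, train_y, rest_y = train_test_split(
        files, labels, test_size=0.2, stratify=labels, random_state=seed)
    val_f, test_f, val_y, test_y = train_test_split(
        rest_f, rest_y, test_size=0.5, stratify=rest_y, random_state=seed)
    return (train_f, train_y), (val_f, val_y), (test_f, test_y)

def macro_f1(y_true, y_pred):
    return 100.0 * f1_score(y_true, y_pred, average="macro")  # reported as a percentage
```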
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Details</head><p>Baseline As a baseline to evaluate LSM transfer learning capabilities, we adopt the Audio Spectrogram Transformer (AST) <ref type="bibr" target="#b51">[51]</ref>, a fully transformer-based architecture recently proposed as a substitute for CNNs <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>We train AST from scratch on each of the 9 datasets using the same hyperparameters as <ref type="bibr" target="#b51">[51]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LSM Models</head><p>We use pre-trained checkpoints for both English-Only and Multilingual models: Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, XLS-R from the Wav2Vec 2.0 family, and Whisper Small (EN) (Whisper Small pretrained only on English data), Whisper Small, Whisper Medium from the Whisper family. The LSM backbones are kept frozen and used exclusively as feature extractors.</p><p>Training We follow the same hyperparameters settings as <ref type="bibr" target="#b14">[15]</ref> to train the downstream classifiers. Specifically, we train for 30 epochs using the Adam optimizer with a learning rate of 5.0e-04, weight decay of 1.0e-04, betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimension of the classifier projection 𝑚 is 256.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Results</head><p>To present our results, we first compare the performance of the various classifiers (see Section 3) for each LSM utilized. This analysis provides insights into the characteristics of features extracted from Wav2Vec 2.0 and Whisper models for downstream SER tasks. After identifying the best classifier for each LSM, we then compare the performance of English-Only and Multilingual LSMs across the 8 languages covered in this study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Comparison between downstream classifiers</head><p>We examine the results in Table <ref type="table">2</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Performance of various LSM backbones using Linear, Non-Linear, and Multi-Layer classification methods. F1 scores are averaged across all 9 datasets. For each LSM, the best classifier is highlighted in bold. table shows average F1 scores across 9 datasets, highlighting the most effective classifier for each LSM in crosslingual SER tasks. For Wav2Vec 2.0 models, the Multi-Layer Classifier performs best, with F1 scores of 53.42, 57.50, and 40.89 for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R. The Linear and Non-Linear classifiers perform similarly, especially for Wav2Vec 2.0 Large and XLS-R, suggesting improvements are due to using features from internal Transformer layers rather than non-linear activations. For Whisper models, the Linear Classifier performs best, with F1 scores of 58.16, 60.87, and 60.72 for Whisper Small (EN), Whisper Small, and Whisper Medium. Increasing classifier complexity with non-linear activations decreases performance, likely due to general information loss caused by complex transformations. The Multi-Layer Classifier performs worse, indicating that using also features from internal layers is less effective than using features from the last layer alone.</p><p>This comparison reveals that Wav2Vec 2.0 models benefit from features extracted from internal Transformer layers and exhibit less sensitivity to classifier complexity, consistent with prior research <ref type="bibr" target="#b41">[41,</ref><ref type="bibr" target="#b39">39]</ref>. Conversely, Whisper models achieve better performance with features from the last Transformer layer when using a simple linear classifier, offering new insights into their effective-ness for SER across multiple languages. We hypothesize that this differing behavior may be related to their respective Self-Supervised and Weakly-Supervised pre-training approaches, which warrant further investigation. To gain further insights into the importance of Transformer layers in Wav2Vec 2.0 and Whisper for SER, we leverage the weights learned in the Multi-Layer classifier as follows.</p><p>Transformer Layer Weights. We analyze the weights 𝑤 1 , … , 𝑤 𝐿 from the Multi-Layer Classifier to assess Transformer layer importance. Figure <ref type="figure" target="#fig_1">2</ref> illustrates that Wav2Vec 2.0 models assign greater weight to the early and middle layers, whereas Whisper models emphasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic information in Whisper models is embedded in the features of the later Transformer layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Comparing English-Only and Multilingual LSMs Across Different Languages</head><p>In this section, we compare English-Only and Multilingual LSMs with the AST baseline across 9 datasets. Table <ref type="table" target="#tab_5">3</ref> displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 and Linear for Whisper models.</p><p>Transferring knowledge from LSMs proves to be effective across all datasets compared to the baseline. For instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores 33.23. This indicates that LSMs are effective feature extractors for cross-lingual SER on multiple languages.</p><p>When comparing English-only and Multilingual models, we differentiate between the Wav2Vec 2.0 and Whisper families. For Wav2Vec 2.0, we observe that Wav2Vec 2.0 Base and Large generally outperform XLS-R (e.g., 87.85 and 88.31 vs. 67.71 for DEMos), except in Persian, where their performance is comparable. This indicates that multilingual pre-training may not be as effective for Wav2Vec 2.0 models across various languages. We speculate that this may be due to the limitations of SSL pre-training, which might struggle with the diverse range of languages and lose important paralinguistic features that are retained in English-only models. Further investigation with a wider range of SSL-pretrained LSMs could provide more insights. As regards to Whisper, Multilingual Whisper Small outperforms its English-only version, with the exception of Greek and Persian, likely due to limited pretraining data for these languages, which resulted in higher word error rates compared to other languages in this study <ref type="bibr" target="#b12">[13]</ref>. Multilingual Whisper models achieve best performance in Canadian French, Spanish (66.71, 73.13 with Whisper Small), Italian, German, and French (91.17, 90.64, 95.22 with Whisper Medium). This improvement is likely due to the larger pretraining datasets for these languages and the similarities between  Canadian French and French. We believe that multilingual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL and multitask learning. However, further research is needed to evaluate the effectiveness of multilingual pretraining with WSL compared to SSL across a broader range of LSMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper examines the capabilities of Wav2Vec 2.0 and Whisper models as feature extractors for cross-lingual SER across eight languages, considering both English-Only and Multilingual variants. Our findings reveal that LSMs are effective feature extractors compared to a full Transformer baseline trained from scratch. We observe that Whisper models encode acoustic information primarily in the features of the last Transformer layer, whereas Wav2Vec 2.0 models rely on features from middle and early layers. Furthermore, we show that multilingual pre-training benefits Whisper models, leading to strong performance in Italian, Canadian French, French, Spanish, German, and competitive results in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving top performance in Greek and Egyptian Arabic. We attribute the disparity in multilingual pre-training effectiveness to the differences between SSL and WSL strategies, which should be explored further.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The three downstream classifiers used in this work are: Linear (red), Non-Linear (purple), and Multi-Layer (green).The snowflake icon represents frozen weights, while the fire icon denotes trainable weights.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Greyscale map of layer weight distribution from the Multi-Layer classification method. Weights are averaged over all 9 datasets for each model. Darker shades indicate higher weights.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>The features {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 are combined into a new sequence {ℎ * 1 , … , ℎ * 𝑇 } using a learnable weighted sum. The function 𝓈 ∶ ℝ 𝐿×𝑇 ×𝑑 → ℝ 𝑇 ×𝑑 maps {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 to {ℎ</figDesc><table /><note>* 1 , … , ℎ * 𝑇 } as follows:</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1</head><label>1</label><figDesc>: This dataset includes recordings of 6 different sentences delivered by 12 actors (6 female, 6 male) portraying the Big Six emotions and a neutral state in Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as Summary statistics of the 9 datasets used in this work.</figDesc><table><row><cell>Dataset</cell><cell>Language</cell><cell># Samples</cell><cell>Emotions</cell></row><row><cell>AESDD</cell><cell>Greek</cell><cell>500</cell><cell>anger, disgust, fear, happiness, and sadness</cell></row><row><cell>CaFE</cell><cell>Canadian French</cell><cell>936</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>DEMoS</cell><cell>Italian</cell><cell>9697</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>EmoDB</cell><cell>German</cell><cell>535</cell><cell>anger, disgust, fear, happiness, boredom, sadness, and neutrality</cell></row><row><cell>EmoMatch</cell><cell>Spanish</cell><cell>2005</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>EMOVO</cell><cell>Italian</cell><cell>588</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>EYASE</cell><cell>Egyptian Arabic</cell><cell>579</cell><cell>anger, happiness, sadness, and neutrality</cell></row><row><cell>Oréau</cell><cell>French</cell><cell>502</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>ShEMO</cell><cell>Persian</cell><cell>400</cell><cell>anger, happiness, sadness, and neutrality</cell></row><row><cell cols="4">a down-sampled version at 48 kHz and 16 bits per sample.</cell></row><row><cell cols="3">The total number of samples amounts to 936.</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>, comparing three classifier methods for Wav2Vec 2.0 and Whisper models. The</figDesc><table><row><cell>Backbone</cell><cell>Linear</cell><cell>Non-Linear Multi-Layer</cell></row><row><cell cols="3">Wav2Vec 2.0 Base 47.87 (± 0.93) 42.07 (± 5.27) 53.42 (± 1.27)</cell></row><row><cell cols="3">Wav2Vec 2.0 Large 12.09 (± 1.50) 12.93 (± 3.31) 57.50 (± 0.03)</cell></row><row><cell>XLS-R</cell><cell cols="2">5.43 (± 0.40) 5.86 (± 0.07) 40.89 (± 2.00)</cell></row></table><note>Whisper Small (EN) 58.<ref type="bibr" target="#b15">16</ref> (± 0.15) 53.50 (± 0.98) 49.73 (± 2.02) Whisper Small 60.87 (± 0.26) 54.86 (± 0.93) 45.14 (± 1.54) Whisper Medium 60.72 (± 0.16) 55.56 (± 1.09) 37.95 (± 2.27)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head></head><label></label><figDesc>.35) 52.86 (± 0.07) 58.42 (± 4.14) 82.27 (± 0.23) 32.51 (± 4.89) 92.70 (± 1.67) 95.22 (± 0.84) ShEMO (fa) 36.15 (± 0.85) 60.55 (± 3.90) 57.52 (± 9.09) 67.93 (± 0.37) 61.24 (± 8.93) 63.88 (± 1.21) 63.85 (± 1.58)</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell>English-Only</cell><cell></cell><cell></cell><cell>Multilingual</cell><cell></cell></row><row><cell>Dataset/Model</cell><cell>AST</cell><cell>Wav2Vec 2.0 Base  ‡</cell><cell>Wav2Vec 2.0 Large  ‡</cell><cell>Whisper Small  †</cell><cell>XLS-R  ‡</cell><cell>Whisper Small  †</cell><cell>Whisper Medium  †</cell></row><row><cell>AESDD (el)</cell><cell cols="7">19.84 (± 0.16) 25.45 (± 0.98) 28.89 (± 2.64) 28.04 (± 0.99) 9.16 (± 1.25) 26.34 (± 1.65) 27.62 (± 0.62)</cell></row><row><cell>CaFE (fr-ca)</cell><cell cols="7">10.96 (± 6.26) 50.52 (± 3.54) 47.74 (± 0.33) 60.66 (± 0.76) 18.66 (± 0.01) 66.71 (± 0.72) 55.03 (± 0.38)</cell></row><row><cell>DEMoS (it)</cell><cell cols="7">13.75 (± 4.26) 87.85 (± 0.01) 88.31 (± 0.74) 88.24 (± 0.21) 67.71 (± 1.47) 90.61 (± 0.14) 91.17 (± 0.20)</cell></row><row><cell>EmoDB (de)</cell><cell cols="7">46.11 (± 6.55) 81.75 (± 7.30) 88.84 (± 7.48) 83.31 (± 0.18) 67.39 (± 4.33) 87.21 (± 1.11) 90.64 (± 1.47)</cell></row><row><cell cols="8">EmoMatch (es) 36.10 (± 2.63) 69.84 (± 0.69) 71.85 (± 1.55) 67.59 (± 0.35) 44.14 (± 0.25) 73.13 (± 2.54) 68.23 (± 0.78)</cell></row><row><cell>EMOVO (it)</cell><cell cols="7">15.74 (± 1.24) 16.47 (± 0.61) 20.33 (± 1.31) 27.30 (± 0.16) 14.86 (± 2.11) 41.05 (± 1.21) 50.19 (± 0.29)</cell></row><row><cell>EYASE (ar-eg)</cell><cell cols="7">33.23 (± 4.58) 46.31 (± 3.62) 53.40 (± 1.56) 42.65 (± 0.70) 47.27 (± 1.36) 51.98 (± 0.88) 37.32 (± 3.62)</cell></row><row><cell>Oréau (fr)</cell><cell>19.01 (± 2</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 3</head><label>3</label><figDesc>Performance of Wav2Vec and Whisper models across 9 datasets, divided into English-Only and Multilingual LSMs. AST is the baseline. † indicates a Linear Classifier, ‡ a Multi-Layer Classifier. Bold values are the highest scores, and underlined values highlight the best between English-Only and Multilingual models.</figDesc><table /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities</title>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sridhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Busso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="22" to="38" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of deep learning-based multimodal emotion recognition: Speech, text, and face</title>
		<author>
			<persName><forename type="first">H</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Entropy</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page">1440</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A comprehensive review of speech emotion recognition systems</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Wani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Gunawan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A A</forename><surname>Qadri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kartiwi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ambikairajah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="47795" to="47814" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Speech emotion recognition using cnn</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM international conference on Multimedia</title>
				<meeting>the 22nd ACM international conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="801" to="804" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Speech emotion recognition from spectrograms with deep convolutional neural network</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Badshah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Baik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 international conference on platform technology and service (PlatCon)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Speech emotion recognition using deep 1d &amp; 2d cnn lstm networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biomedical signal processing and control</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="312" to="323" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Enhancing privacy through domain adaptive noise injection for speech emotion recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hashemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Annavaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Narayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="7702" to="7706" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Speech emotion recognition using convolutional and recurrent neural networks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), IEEE</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">N.-C</forename><surname>Ristea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Khan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.09581</idno>
		<title level="m">Septr: Separable transformer for audio spectrogram processing</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Coordvit: a novel method of improve vision transformer-based speech emotion recognition using coordinate information concatenate</title>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-H</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 International conference on electronics, information, and communication (ICEIC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An enhanced speech emotion recognition using vision transformer</title>
		<author>
			<persName><forename type="first">S</forename><surname>Akinpelu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Viriri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Adegun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">13126</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Audio self-supervised learning: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mallol-Ragolta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Parada-Cabaleiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Patterns</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Robust speech recognition via large-scale weak supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mcleavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="28492" to="28518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Pepino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Riera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ferrer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.03502</idno>
		<title level="m">Emotion recognition from speech using wav2vec 2.0 embeddings</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Peft-ser: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE</title>
				<imprint>
			<date type="published" when="2023">2023. 2023</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Evaluating parameter-efficient transfer learning approaches on sure benchmark for speech understanding</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mehrish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhardwaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Trust-ser: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hebbar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="11201" to="11205" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">wav2vec 2.0: A framework for self-supervised learning of speech representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="12449" to="12460" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Multi-lingual multi-task speech emotion recognition using wav2vec 2.0</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sharma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="6907" to="6911" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Upadhyay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martinez-Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Phonetic anchor-based transfer learning to facilitate unsupervised cross-lingual speech emotion recognition</title>
		<author>
			<persName><forename type="first">W.-S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-T</forename><surname>Chien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Busso</surname></persName>
		</author>
		<author>
			<persName><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A M</forename><surname>Zaidi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Latif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qadir</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.13804</idno>
		<title level="m">Cross-language speech emotion recognition using multimodal dual attention transformers</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<title level="m">Improving language understanding by generative pre-training</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<title level="m">An image is worth 16x16 words: Transformers for image recognition at scale</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Wavlm: Large-scale self-supervised pre-training for full stack speech processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yoshioka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Selected Topics in Signal Processing</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="1505" to="1518" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-H</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-S</forename><surname>Chuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-I</forename><forename type="middle">J</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G.-T</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.01051</idno>
		<title level="m">Superb: Speech processing universal performance benchmark</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Unsupervised cross-lingual representation learning for speech recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Collobert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.13979</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Babu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tjandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Von Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Saraf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.09296</idno>
		<title level="m">Xls-r: Self-supervised cross-lingual speech representation learning at scale</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Deep learning for the detection of emotion in human speech: The impact of audio sample duration and english versus italian languages</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wurst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hopwood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-D</forename><surname>Yao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 32nd Wireless and Optical Communications Conference (WOCC), IEEE</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Cross-lingual and multilingual speech emotion recognition on english and french</title>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="5769" to="5773" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">When low resource nlp meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract)</title>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="13773" to="13774" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Cross lingual speech emotion recognition: Urdu vs. western languages</title>
		<author>
			<persName><forename type="first">S</forename><surname>Latif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Qayyum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Usman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qadir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 International conference on frontiers of information technology (FIT), IEEE</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="88" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Emomatchspanishdb: study of speech emotion recognition machine learning models in a new spanish elicited database</title>
		<author>
			<persName><forename type="first">E</forename><surname>Garcia-Cuesta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Salvador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Pãez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">83</biblScope>
			<biblScope unit="page" from="13093" to="13112" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Morency</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.07250</idno>
		<title level="m">Tensor fusion network for multimodal sentiment analysis</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Mao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.00800</idno>
		<title level="m">A survey on self-supervised pre-training for sequential transfer learning in neural networks</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A vector quantized masked autoencoder for speech emotion recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sadok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Leglaive</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Séguier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 IEEE International conference on acoustics, speech, and signal processing workshops (ICASSPW)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<author>
			<persName><forename type="first">F</forename><surname>Catania</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Speech emotion recognition in italian using wav2vec 2</title>
		<title level="s">Authorea Preprints</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Analyzing hidden representations in end-to-end automatic speech recognition systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Singla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Shah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.00387</idno>
		<title level="m">What all do audio transformer models hear? probing acoustic representations for language delivery and its structure</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Layer-wise analysis of a self-supervised speech representation model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Livescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="914" to="921" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Exploration of a self-supervised speech model: A study on emotional corpora</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mohamied</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Spoken Language Technology Workshop (SLT), IEEE</title>
				<imprint>
			<date type="published" when="2022">2022. 2023</date>
			<biblScope unit="page" from="868" to="875" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Comparative layerwise analysis of self-supervised speech models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Livescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Speech emotion recognition for performance interaction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Vryzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kotsakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liatsou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Dimoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kalliris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Audio Engineering Society</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="page" from="457" to="467" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">A canadian french emotional speech dataset</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gournay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lahaie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lefebvre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th ACM multimedia systems conference</title>
				<meeting>the 9th ACM multimedia systems conference</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="399" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Demos: An italian emotional speech corpus: Elicitation methods, machine learning, and perception</title>
		<author>
			<persName><forename type="first">E</forename><surname>Parada-Cabaleiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Costantini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Batliner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schmitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="341" to="383" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">A database of german emotional speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Burkhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paeschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rolfes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Sendlmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Weiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="1517" to="1520" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">Emovo corpus: an italian emotional speech database</title>
		<author>
			<persName><forename type="first">G</forename><surname>Costantini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Iaderola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paoloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Todisco</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ninth international conference on language resources and evaluation (LREC&apos;14), European Language Resources Association (ELRA)</title>
				<meeting>the ninth international conference on language resources and evaluation (LREC&apos;14), European Language Resources Association (ELRA)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3501" to="3504" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Egyptian arabic speech emotion recognition using prosodic, spectral and wavelet features</title>
		<author>
			<persName><forename type="first">L</forename><surname>Abdel-Hamid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Speech Communication</title>
		<imprint>
			<biblScope unit="volume">122</biblScope>
			<biblScope unit="page" from="19" to="30" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<monogr>
		<title level="m" type="main">French emotional speech database -oréau</title>
		<author>
			<persName><forename type="first">S</forename><surname>Oréau</surname></persName>
		</author>
		<ptr target="https://zenodo.org/records/4405783" />
		<imprint>
			<date type="published" when="2021">2021</date>
			<pubPlace>Zenodo</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Shemo: a large-scale validated database for persian speech emotion detection</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">Mohamad</forename><surname>Nezami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Karami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-A</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.01778</idno>
		<title level="m">Ast: Audio spectrogram transformer</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
