Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition Federico D’Asaro1,2,∗,† , Juan José Márquez Villacís1,† , Giuseppe Rizzo1,2 and Andrea Bottino2 1 LINKS Foundation – AI, Data & Space (ADS) 2 Politecnico di Torino – Dipartimento di Automatica e Informatica (DAUIN) Abstract Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly- Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We have stacked three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, German and competitively on Greek, Egyptian Arabic, Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek, Egyptian Arabic. Keywords Cross-lingual Speech Emotion Recognition, Large Speech models, Transfer Learning 1. Introduction edge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English Speech Emotion Recognition (SER) aims to identify emo- [14, 15, 16, 17], but efforts for other languages are lim- tions from speech audio, enhancing Human-AI inter- ited, focusing on Wav2Vec 2.0 [18] for cross-lingual SER action in fields such as healthcare, education, and se- [19, 20, 21]. curity [1]. Traditional methods rely on Low-Level De- This study examines how effective LSMs are as fea- scriptors (LLD) like spectral, prosodic, and voice qual- ture extractors for cross-lingual SER, using nine datasets ity features [2], using classifiers such as KNN, SVM, across eight languages: Italian, German, French, Cana- or Naïve Bayes [3]. Deep learning has introduced ad- dian French, Spanish, Greek, Persian, and Egyptian Arabic. vanced techniques, including Convolutional Neural Net- Specifically, we utilize LSMs from the Wav2Vec 2.0 and works (CNNs) [4, 5, 6], eventually followed by Recur- Whisper [13] model families, pre-trained with SSL and rent Neural Networks (RNNs) [7, 8], and Transformers WSL approaches, respectively. We introduce Whisper [9, 10, 11]. Transformers’ ability to learn from extensive due to its underexplored use in cross-lingual SER. To datasets has led to Large Speech Models (LSMs), which assess the effectiveness of LSMs as feature extractors, generalize across various speech tasks. Common train- we test three classifiers of increasing complexity—Linear, ing approaches for these models include Self-Supervised Non-Linear, and Multi-Layer—across nine datasets. This Learning (SSL), which uses data itself to learn general- evaluation determines which classifier best suits each purpose features [12], and Weakly-Supervised Learning LSM across different languages. Moreover, our study in- (WSL), which pairs audio with text for tasks like transcrip- cludes both English-Only and Multilingual models from tion and translation [13]. The general-purpose knowl- the Wav2Vec 2.0 and Whisper families, aiming to eval- CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, uate the effectiveness of multilingual pre-training for Dec 04 — 06, 2024, Pisa, Italy cross-lingual SER. ∗ Corresponding author. The main contributions of this work are: † These authors contributed equally. Envelope-Open federico.dasaro@polito.it (F. D’Asaro); • We evaluate LSMs from the Wav2Vec 2.0 and juan.marquez@linksfoundation.com (J. J. Márquez Villacís); Whisper models as feature extractors for cross- giuseppe.rizzo@linksfoundation.com (G. Rizzo); lingual SER across eight languages. andrea.bottino@polito.it (A. Bottino) • We test three types of downstream classi- GLOBE http://conceptbase.sourceforge.net/mjf/ (A. Bottino) fiers—Linear, Non-Linear, and Multi-Layer—and Orcid 0009-0003-8727-3393 (F. D’Asaro); 0009-0008-3098-5492 (J. J. Márquez Villacís); 0000-0003-0083-813X (G. Rizzo); find that Whisper models’ last Transformer layer 0000-0002-8894-5089 (A. Bottino) features are well-suited for a Linear classifier, © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). whereas Wav2Vec 2.0 models perform better with CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings features from the middle and early Transformer dual attention [21] and tensor fusion [34] enhance audio layers. and text interaction in languages such as Italian, Ger- • We compare English-Only and Multilingual LSMs, man, and Urdu. Self-supervised pre-training methods, revealing that Whisper models benefit from mul- including variational autoencoders, have also been ef- tilingual pre-training performing best on Italian, fective in transferring knowledge across languages like Spanish, Canadian French, French, and German German [35, 36]. The advent of LSMs pre-trained with and competitively on Greek, Egyptian Arabic, self-supervision has further increased the potential for Persian. Conversely, English-Only Wav2Vec 2.0 transfer learning due to their high generalization capa- models surpass multilingual XLS-R in most lan- bilities [15]. However, most research primarily focuses guages, achieving the highest performance in on adapting multilingual Wav2Vec 2.0 models (XLSR-53) Greek, Egyptian Arabic. [19, 37, 20, 21]. This work expands the scope of analyzed LSMs including WSL models as Whisper. Additionally, we evaluate the ability of English-only models to transfer 2. Background knowledge to other languages, beyond just multilingual models. 2.1. Large Speech Models Recent developments in natural language processing and 3. Method computer vision have harnessed large volumes of unla- beled data through Self-Supervised Learning [22, 23, 24]. In this section, we describe the methodology for eval- Building on techniques such as masked language and uating the effectiveness of LSMs as feature extractors image modeling, Wav2Vec 2.0 [18] introduced a LSM for downstream SER in various languages. We stack a trained on extensive audio datasets using masked speech classification model on top of the LSM backbone, with modeling. Wav2Vec 2.0 features seven 1D convolutional its parameters frozen. All LSMs used in this work share blocks for initial feature extraction, followed by 12 or 24 the same overall architecture, which we describe below transformer blocks (depending on the model variant) for along with the stacked classification model. contextual processing. The model masks part of the latent Formally, the input audio 𝐴 (raw waveform or log- features and reconstructs them using the surrounding mel spectrogram) passes through a convolutional en- context. To further refine LSMs for tasks like emotion coder 𝓏 ∶ 𝐴 → 𝑍, mapping the audio to latent features recognition, methods such as WavLM [25] have been 𝑍 = {𝑧1 , … , 𝑧𝑇 }, where 𝑇 is the sequence length and each developed. WavLM incorporates speech denoising along- frame 𝑧𝑖 typically corresponds to 25 ms with 𝑧𝑖 ∈ ℝ𝑑 . side masked modeling, demonstrating broad effective- Then, 𝑍 passes through a Transformer encoder consist- ness across various tasks in the SUPERB benchmark [26]. ing of 𝑙 layers 𝒽𝑙 ∶ 𝑍 → 𝐻, enriching the latent features Moreover, XLSR-53 [27] extends the Wav2Vec 2.0 frame- with contextual information, resulting in {ℎ𝑙1 , … , ℎ𝑙𝑇 } for work to cover 53 languages, sharing the latent space each of the 𝑙 = 1, … , 𝐿 Transformer layers. Here, 𝑙 = 𝐿 across these languages. This approach has shown supe- corresponds to the output features of the last layer, with rior performance over monolingual pretraining for auto- ℎ𝑙𝑖 ∈ ℝ𝑑 . The features {ℎ𝑙1 , … , ℎ𝑙𝑇 }𝑙=1,..,𝐿 are considered the matic speech recognition. XLS-R [28] further advances extracted features from the LSM and are fed into a down- this by scaling to 128 languages, excelling in speech trans- stream classifier 𝓎 ∶ 𝐻 → 𝑌, which maps these features lation and language identification. In comparison, Whis- to the output class logits {𝑦1 , … , 𝑦𝑘 }. The output class per [13] leverages large-scale weak supervision from label 𝑦 ∗ for audio 𝐴 is given by: audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is 𝑦 ∗ = arg max softmax (𝓎 (𝒽 (𝓏(𝐴)))) (1) 𝑘 trained in a multitask framework that includes multilin- gual transcription and translation, establishing itself as Inspired by previous work that uses probing to evalu- an effective zero-shot model for multilingual tasks. ate the quality of features extracted from backbone mod- els [38, 39], we evaluate three different downstream classi- fiers of increasing complexity: Linear Classifier (ℊ𝑙 ), Non- 2.2. Cross-Language Speech Emotion Linear Classifier (ℊ𝑛𝑙 ), and Multi-layer Classifier (ℊ𝑚𝑙 ). Recognition Figure 1 illustrates their architecture, which is detailed Emotion recognition in languages beyond English, like below. Italian [29], French [30], Persian [31, 32], and Spanish [33], is crucial but often limited by data availability. Re- 3.1. Linear Classifier cent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like For the linear classifier, we use a simple feed-forward neural network that consists solely of linear projections. 3.3. Multi-Layer Classifier As a third option, we adopt the approach from [14, 15], which utilizes all hidden states of the Transformer en- coder. The features {ℎ𝑙1 , … , ℎ𝑙𝑇 }𝑙=1,..,𝐿 are combined into a new sequence {ℎ∗1 , … , ℎ∗𝑇 } using a learnable weighted sum. The function 𝓈 ∶ ℝ𝐿×𝑇 ×𝑑 → ℝ𝑇 ×𝑑 maps {ℎ𝑙1 , … , ℎ𝑙𝑇 }𝑙=1,..,𝐿 to {ℎ∗1 , … , ℎ∗𝑇 } as follows: 𝐿 ℎ∗𝑡 = ∑ 𝑤𝑙 ⋅ ℎ𝑙𝑡 for 𝑡 = 1, … , 𝑇 (4) 𝑙=1 where 𝑤1 , … , 𝑤𝐿 are the weights assigned to each Trans- 𝐿 former layer, ensuring 𝑤𝑙 ∈ [0, 1] and ∑𝑙=1 𝑤𝑙 = 1. The ∗ ∗ resulting sequence {ℎ1 , … , ℎ𝑇 } is then processed by the same pipeline as the Non-Linear Classifier, resulting in: ℊ𝑚𝑙 (𝑥 = {ℎ𝑙1 , … , ℎ𝑙𝑇 }𝑙=1,..,𝐿 ) = ℊ𝑛𝑙 (𝓈(𝑥)) (5) This classifier leverages internal layer information, which Figure 1: The three downstream classifiers used in this work has proven beneficial for paralinguistic and linguistic are: Linear (red), Non-Linear (purple), and Multi-Layer (green). downstream tasks [39, 40, 41, 42]. By investigating the The snowflake icon represents frozen weights, while the fire icon denotes trainable weights. contribution of internal LSM layers for SER across var- ious languages, we corroborates previous findings for Wav2Vec 2.0 models and provide new insights for Whis- Specifically, given the features from the last Transformer per models. layer {ℎ𝐿1 , … , ℎ𝐿𝑇 }, they are first projected by a linear layer 𝓁1 ∶ ℝ𝑑 → ℝ𝑚 that is shared across all frames, then ag- gregated by average pooling 𝓅, and finally pass through 4. Experiments the classification layer ℴ ∶ ℝ𝑚 → ℝ𝑘 to obtain the output class logits. The function ℊ𝑙 is compactly defined as: 4.1. Datasets and Metrics In this study, we conduct experiments using 9 distinct ℊ𝑙 (ℎ𝐿1 , … , ℎ𝐿𝑇 ) = ℴ (𝓅 (𝓁1 (ℎ𝐿1 , … , ℎ𝐿𝑇 ))) (2) datasets spanning 8 different languages: Greek, French, The absence of non-linear activations allows us to eval- Italian, German, Spanish, Egyptian Arabic, and Persian. uate the quality of the features extracted from the LSM The datasets vary in their collection methodologies, such based on the linear classifier model’s ability to handle as acted emotions and elicitation methods. The partic- the SER task. ipant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our ex- 3.2. Non-Linear Classifier periments in a speaker-independent setting to prevent To increase the complexity of the classification model, evaluation on speaker-dependent features. Table 1 pro- we utilize a series of linear layers interleaved with ReLU vides an overview of the dataset statistics, with a more activations both before and after feature pooling. We detailed description given below. follow the same architecture as in [14, 15], but unlike AESDD [43]: The Acted Emotional Speech Dynamic them, we only feed the features from the last Transformer Database comprises 500 recorded samples from 5 actors layer 𝐿 to the model. Each {ℎ𝐿1 , … , ℎ𝐿𝑇 } passes through (3 females, 2 males) expressing 5 distinct emotions in two shared linear layers, ReLU, and dropout blocks (𝒷), Greek. Each actor performed 20 utterances per emotion, followed by a linear layer (𝓁1 ). Linear layers are functions with some utterances recorded multiple times. In later 𝓁 ∶ ℝ𝑑 → ℝ𝑚 . Projected features are averaged, pass versions, additional actors were included, bringing the through 𝓁2 and ReLU, and are classified by ℴ. Thus, ℊ𝑛𝑙 total to 604 recordings from 6 actors. is: CaFE [44]: This dataset includes recordings of 6 dif- ferent sentences delivered by 12 actors (6 female, 6 male) ℊ𝑛𝑙 (𝑥 = ℎ𝐿1 , … , ℎ𝐿𝑇 ) = ℴ (ReLU (𝓁2 (𝓅 (𝓁1 (𝒷 (𝑥)))))) portraying the Big Six emotions and a neutral state in (3) Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as Dataset Language # Samples Emotions AESDD Greek 500 anger, disgust, fear, happiness, and sadness CaFE Canadian French 936 anger, disgust, fear, happiness, surprise, sadness, and neutrality DEMoS Italian 9697 anger, disgust, fear, happiness, surprise, sadness, and neutrality EmoDB German 535 anger, disgust, fear, happiness, boredom, sadness, and neutrality EmoMatch Spanish 2005 anger, disgust, fear, happiness, surprise, sadness, and neutrality EMOVO Italian 588 anger, disgust, fear, happiness, surprise, sadness, and neutrality EYASE Egyptian Arabic 579 anger, happiness, sadness, and neutrality Oréau French 502 anger, disgust, fear, happiness, surprise, sadness, and neutrality ShEMO Persian 400 anger, happiness, sadness, and neutrality Table 1 Summary statistics of the 9 datasets used in this work. a down-sampled version at 48 kHz and 16 bits per sample. validation/test split is performed with ratios of 80/10/10. The total number of samples amounts to 936. All results are reported using the macro F1 score, ex- DEMoS [45]: DEMoS contains 9697 audio samples pressed as a percentage. We conducted 3 runs, presenting from 68 volunteer students (299 females, 131 males) ex- the mean ± standard deviation. pressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were gener- 4.2. Experimental Details ated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1s), are provided Baseline As a baseline to evaluate LSM transfer learn- in 48 kHz, 16-bit, mono format. ing capabilities, we adopt the Audio Spectrogram Trans- EmoDB [46]: This collection includes 535 utterances former (AST) [51], a fully transformer-based architecture across 7 emotional states, spoken in German by 5 female recently proposed as a substitute for CNNs [9, 10, 11]. and 5 male actors. Each actor performed a set of 10 We train AST from scratch on each of the 9 datasets using sentences, which were down-sampled from the original the same hyperparameters as [51]. 48 kHz to 16 kHz. LSM Models We use pre-trained checkpoints for both EmoMatch [33]: Consisting of 2005 recordings, Emo- English-Only and Multilingual models: Wav2Vec 2.0 Match features samples from 50 non-actor Spanish speak- Base, Wav2Vec 2.0 Large, XLS-R from the Wav2Vec 2.0 ers (20 females, 30 males) expressing the Big Six emotions family, and Whisper Small (EN) (Whisper Small pre- and a neutral state. The dataset is a subset of the larger trained only on English data), Whisper Small, Whisper EmoSpanishDB and contains recordings sampled at 48 Medium from the Whisper family. The LSM backbones kHz with a 16-bit mono format. are kept frozen and used exclusively as feature extractors. EMOVO [47]: EMOVO presents 588 Italian audio Training We follow the same hyperparameters set- recordings from 3 male and 3 female actors simulating tings as [15] to train the downstream classifiers. Specifi- the Big Six emotions plus a neutral state. Each actor cally, we train for 30 epochs using the Adam optimizer voiced 14 utterances, and the recordings are provided in with a learning rate of 5.0e-04, weight decay of 1.0e-04, 48 kHz, 16-bit stereo WAV format. betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimen- EYASE [48]: EYASE contains 579 utterances in Egyp- sion of the classifier projection 𝑚 is 256. tian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in 4.3. Results duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz. To present our results, we first compare the performance Oréau [49]: The Oréau dataset features 502 audio sam- of the various classifiers (see Section 3) for each LSM ples from 32 non-professional actors (25 male, 7 female) utilized. This analysis provides insights into the char- who voiced 10 to 13 utterances in French for the Big Six acteristics of features extracted from Wav2Vec 2.0 and emotions plus a neutral state. Whisper models for downstream SER tasks. After identi- ShEMO [50]: ShEMO comprises 3000 semi-natural fying the best classifier for each LSM, we then compare recordings from 87 native Persian speakers (31 female, the performance of English-Only and Multilingual LSMs 56 male). The dataset captures 5 of the Big Six emo- across the 8 languages covered in this study. tions—sadness, anger, happiness, surprise, and fear—plus a neutral state. The samples were up-sampled to a fre- 4.3.1. Comparison between downstream classifiers quency of 44.1 kHz in mono-channel format, with an We examine the results in Table 2, comparing three clas- average length of 4.11 seconds (std: 3.41s). sifier methods for Wav2Vec 2.0 and Whisper models. The The audio is resampled to 16 kHz, and a stratified train/- Backbone Linear Non-Linear Multi-Layer ness for SER across multiple languages. We hypothesize Wav2Vec 2.0 Base 47.87 (± 0.93) 42.07 (± 5.27) 53.42 (± 1.27) that this differing behavior may be related to their respec- Wav2Vec 2.0 Large 12.09 (± 1.50) 12.93 (± 3.31) 57.50 (± 0.03) tive Self-Supervised and Weakly-Supervised pre-training XLS-R 5.43 (± 0.40) 5.86 (± 0.07) 40.89 (± 2.00) approaches, which warrant further investigation. To gain Whisper Small (EN) 58.16 (± 0.15) 53.50 (± 0.98) 49.73 (± 2.02) further insights into the importance of Transformer lay- Whisper Small 60.87 (± 0.26) 54.86 (± 0.93) 45.14 (± 1.54) ers in Wav2Vec 2.0 and Whisper for SER, we leverage the Whisper Medium 60.72 (± 0.16) 55.56 (± 1.09) 37.95 (± 2.27) weights learned in the Multi-Layer classifier as follows. Transformer Layer Weights. We analyze the weights 𝑤1 , … , 𝑤𝐿 from the Multi-Layer Classifier to as- Table 2 Performance of various LSM backbones using Linear, Non- sess Transformer layer importance. Figure 2 illustrates Linear, and Multi-Layer classification methods. F1 scores are that Wav2Vec 2.0 models assign greater weight to the averaged across all 9 datasets. For each LSM, the best classifier early and middle layers, whereas Whisper models em- is highlighted in bold. phasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic informa- tion in Whisper models is embedded in the features of the later Transformer layers. 4.3.2. Comparing English-Only and Multilingual LSMs Across Different Languages In this section, we compare English-Only and Multilin- gual LSMs with the AST baseline across 9 datasets. Table 3 displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 and Figure 2: Greyscale map of layer weight distribution from the Linear for Whisper models. Multi-Layer classification method. Weights are averaged over Transferring knowledge from LSMs proves to be ef- all 9 datasets for each model. Darker shades indicate higher fective across all datasets compared to the baseline. For weights. instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores table shows average F1 scores across 9 datasets, highlight- 33.23. This indicates that LSMs are effective feature ex- ing the most effective classifier for each LSM in cross- tractors for cross-lingual SER on multiple languages. lingual SER tasks. When comparing English-only and Multilingual mod- For Wav2Vec 2.0 models, the Multi-Layer Classifier els, we differentiate between the Wav2Vec 2.0 and Whis- performs best, with F1 scores of 53.42, 57.50, and 40.89 per families. For Wav2Vec 2.0, we observe that Wav2Vec for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R. 2.0 Base and Large generally outperform XLS-R (e.g., The Linear and Non-Linear classifiers perform similarly, 87.85 and 88.31 vs. 67.71 for DEMos), except in Persian, especially for Wav2Vec 2.0 Large and XLS-R, suggesting where their performance is comparable. This indicates improvements are due to using features from internal that multilingual pre-training may not be as effective Transformer layers rather than non-linear activations. for Wav2Vec 2.0 models across various languages. We For Whisper models, the Linear Classifier performs best, speculate that this may be due to the limitations of SSL with F1 scores of 58.16, 60.87, and 60.72 for Whisper pre-training, which might struggle with the diverse range Small (EN), Whisper Small, and Whisper Medium. In- of languages and lose important paralinguistic features creasing classifier complexity with non-linear activations that are retained in English-only models. Further investi- decreases performance, likely due to general information gation with a wider range of SSL-pretrained LSMs could loss caused by complex transformations. The Multi-Layer provide more insights. As regards to Whisper, Multilin- Classifier performs worse, indicating that using also fea- gual Whisper Small outperforms its English-only ver- tures from internal layers is less effective than using sion, with the exception of Greek and Persian, likely due features from the last layer alone. to limited pretraining data for these languages, which This comparison reveals that Wav2Vec 2.0 models ben- resulted in higher word error rates compared to other efit from features extracted from internal Transformer languages in this study [13]. Multilingual Whisper mod- layers and exhibit less sensitivity to classifier complex- els achieve best performance in Canadian French, Span- ity, consistent with prior research [41, 39]. Conversely, ish (66.71, 73.13 with Whisper Small), Italian, German, Whisper models achieve better performance with fea- and French (91.17, 90.64, 95.22 with Whisper Medium). tures from the last Transformer layer when using a simple This improvement is likely due to the larger pretraining linear classifier, offering new insights into their effective- datasets for these languages and the similarities between English-Only Multilingual Wav2Vec 2.0 Wav2Vec 2.0 Whisper Whisper Whisper Dataset/Model AST XLS-R‡ Base‡ Large‡ Small† Small† Medium† AESDD (el) 19.84 (± 0.16) 25.45 (± 0.98) 28.89 (± 2.64) 28.04 (± 0.99) 9.16 (± 1.25) 26.34 (± 1.65) 27.62 (± 0.62) CaFE (fr-ca) 10.96 (± 6.26) 50.52 (± 3.54) 47.74 (± 0.33) 60.66 (± 0.76) 18.66 (± 0.01) 66.71 (± 0.72) 55.03 (± 0.38) DEMoS (it) 13.75 (± 4.26) 87.85 (± 0.01) 88.31 (± 0.74) 88.24 (± 0.21) 67.71 (± 1.47) 90.61 (± 0.14) 91.17 (± 0.20) EmoDB (de) 46.11 (± 6.55) 81.75 (± 7.30) 88.84 (± 7.48) 83.31 (± 0.18) 67.39 (± 4.33) 87.21 (± 1.11) 90.64 (± 1.47) EmoMatch (es) 36.10 (± 2.63) 69.84 (± 0.69) 71.85 (± 1.55) 67.59 (± 0.35) 44.14 (± 0.25) 73.13 (± 2.54) 68.23 (± 0.78) EMOVO (it) 15.74 (± 1.24) 16.47 (± 0.61) 20.33 (± 1.31) 27.30 (± 0.16) 14.86 (± 2.11) 41.05 (± 1.21) 50.19 (± 0.29) EYASE (ar-eg) 33.23 (± 4.58) 46.31 (± 3.62) 53.40 (± 1.56) 42.65 (± 0.70) 47.27 (± 1.36) 51.98 (± 0.88) 37.32 (± 3.62) Oréau (fr) 19.01 (± 2.35) 52.86 (± 0.07) 58.42 (± 4.14) 82.27 (± 0.23) 32.51 (± 4.89) 92.70 (± 1.67) 95.22 (± 0.84) ShEMO (fa) 36.15 (± 0.85) 60.55 (± 3.90) 57.52 (± 9.09) 67.93 (± 0.37) 61.24 (± 8.93) 63.88 (± 1.21) 63.85 (± 1.58) Table 3 Performance of Wav2Vec and Whisper models across 9 datasets, divided into English-Only and Multilingual LSMs. AST is the baseline. † indicates a Linear Classifier, ‡ a Multi-Layer Classifier. Bold values are the highest scores, and underlined values highlight the best between English-Only and Multilingual models. Canadian French and French. We believe that multilin- References gual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL [1] C.-C. Lee, K. Sridhar, J.-L. Li, W.-C. Lin, B.-H. Su, and multitask learning. However, further research is C. Busso, Deep representation learning for affective needed to evaluate the effectiveness of multilingual pre- speech signal analysis and processing: Preventing training with WSL compared to SSL across a broader unwanted signal disparities, IEEE Signal Processing range of LSMs. Magazine 38 (2021) 22–38. [2] H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, Y. Zong, A survey of deep learning-based multimodal emotion 5. Conclusion recognition: Speech, text, and face, Entropy 25 (2023) 1440. This paper examines the capabilities of Wav2Vec 2.0 and [3] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kar- Whisper models as feature extractors for cross-lingual tiwi, E. Ambikairajah, A comprehensive review of SER across eight languages, considering both English- speech emotion recognition systems, IEEE access 9 Only and Multilingual variants. Our findings reveal that (2021) 47795–47814. LSMs are effective feature extractors compared to a full [4] Z. Huang, M. Dong, Q. Mao, Y. Zhan, Speech emo- Transformer baseline trained from scratch. We observe tion recognition using cnn, in: Proceedings of the that Whisper models encode acoustic information primar- 22nd ACM international conference on Multimedia, ily in the features of the last Transformer layer, whereas 2014, pp. 801–804. Wav2Vec 2.0 models rely on features from middle and [5] A. M. Badshah, J. Ahmad, N. Rahim, S. W. Baik, early layers. Furthermore, we show that multilingual Speech emotion recognition from spectrograms pre-training benefits Whisper models, leading to strong with deep convolutional neural network, in: 2017 performance in Italian, Canadian French, French, Span- international conference on platform technology ish, German, and competitive results in Greek, Egyptian and service (PlatCon), IEEE, 2017, pp. 1–5. Arabic, and Persian. In contrast, English-Only Wav2Vec [6] J. Zhao, X. Mao, L. Chen, Speech emotion recogni- 2.0 models outperform their multilingual counterpart, tion using deep 1d & 2d cnn lstm networks, Biomed- XLS-R, in most languages, achieving top performance in ical signal processing and control 47 (2019) 312–323. Greek and Egyptian Arabic. We attribute the disparity [7] T. Feng, H. Hashemi, M. Annavaram, S. S. in multilingual pre-training effectiveness to the differ- Narayanan, Enhancing privacy through domain ences between SSL and WSL strategies, which should be adaptive noise injection for speech emotion recog- explored further. nition, in: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal process- ing (ICASSP), IEEE, 2022, pp. 7702–7706. [8] W. Lim, D. Jang, T. Lee, Speech emotion recognition using convolutional and recurrent neural networks, in: 2016 Asia-Pacific signal and information pro- [20] S. G. Upadhyay, L. Martinez-Lucas, B.-H. Su, W.-C. cessing association annual summit and conference Lin, W.-S. Chien, Y.-T. Wu, W. Katz, C. Busso, C.-C. (APSIPA), IEEE, 2016, pp. 1–4. Lee, Phonetic anchor-based transfer learning to fa- [9] N.-C. Ristea, R. T. Ionescu, F. S. Khan, Septr: Separa- cilitate unsupervised cross-lingual speech emotion ble transformer for audio spectrogram processing, recognition, in: ICASSP 2023-2023 IEEE Interna- arXiv preprint arXiv:2203.09581 (2022). tional Conference on Acoustics, Speech and Signal [10] J.-Y. Kim, S.-H. Lee, Coordvit: a novel method of Processing (ICASSP), IEEE, 2023, pp. 1–5. improve vision transformer-based speech emotion [21] S. A. M. Zaidi, S. Latif, J. Qadir, Cross-language recognition using coordinate information concate- speech emotion recognition using multimodal nate, in: 2023 International conference on electron- dual attention transformers, arXiv preprint ics, information, and communication (ICEIC), IEEE, arXiv:2306.13804 (2023). 2023, pp. 1–4. [22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, [11] S. Akinpelu, S. Viriri, A. Adegun, An enhanced Bert: Pre-training of deep bidirectional transform- speech emotion recognition using vision trans- ers for language understanding, arXiv preprint former, Scientific Reports 14 (2024) 13126. arXiv:1810.04805 (2018). [12] S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, [23] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, K. Qian, X. Jing, A. Kathan, B. Hu, B. W. Schuller, et al., Improving language understanding by gen- Audio self-supervised learning: A survey, Patterns erative pre-training (2018). 3 (2022). [24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weis- [13] A. Radford, J. W. Kim, T. Xu, G. Brockman, senborn, X. Zhai, T. Unterthiner, M. Dehghani, C. McLeavey, I. Sutskever, Robust speech recog- M. Minderer, G. Heigold, S. Gelly, et al., An image is nition via large-scale weak supervision, in: Inter- worth 16x16 words: Transformers for image recog- national Conference on Machine Learning, PMLR, nition at scale, arXiv preprint arXiv:2010.11929 2023, pp. 28492–28518. (2020). [14] L. Pepino, P. Riera, L. Ferrer, Emotion recognition [25] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, from speech using wav2vec 2.0 embeddings, arXiv J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., Wavlm: preprint arXiv:2104.03502 (2021). Large-scale self-supervised pre-training for full [15] T. Feng, S. Narayanan, Peft-ser: On the use of stack speech processing, IEEE Journal of Selected parameter efficient transfer learning approaches Topics in Signal Processing 16 (2022) 1505–1518. for speech emotion recognition using pre-trained [26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, speech models, in: 2023 11th International Confer- K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, ence on Affective Computing and Intelligent Inter- G.-T. Lin, et al., Superb: Speech processing uni- action (ACII), IEEE, 2023, pp. 1–8. versal performance benchmark, arXiv preprint [16] Y. Li, A. Mehrish, R. Bhardwaj, N. Majumder, arXiv:2105.01051 (2021). B. Cheng, S. Zhao, A. Zadeh, R. Mihalcea, S. Po- [27] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, ria, Evaluating parameter-efficient transfer learning M. Auli, Unsupervised cross-lingual representation approaches on sure benchmark for speech under- learning for speech recognition, arXiv preprint standing, in: ICASSP 2023-2023 IEEE International arXiv:2006.13979 (2020). Conference on Acoustics, Speech and Signal Pro- [28] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, cessing (ICASSP), IEEE, 2023, pp. 1–5. N. Goyal, K. Singh, P. Von Platen, Y. Saraf, J. Pino, [17] T. Feng, R. Hebbar, S. Narayanan, Trust-ser: et al., Xls-r: Self-supervised cross-lingual speech On the trustworthiness of fine-tuning pre-trained representation learning at scale, arXiv preprint speech embeddings for speech emotion recognition, arXiv:2111.09296 (2021). in: ICASSP 2024-2024 IEEE International Confer- [29] A. Wurst, M. Hopwood, S. Wu, F. Li, Y.-D. Yao, Deep ence on Acoustics, Speech and Signal Processing learning for the detection of emotion in human (ICASSP), IEEE, 2024, pp. 11201–11205. speech: The impact of audio sample duration and [18] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec english versus italian languages, in: 2023 32nd 2.0: A framework for self-supervised learning of Wireless and Optical Communications Conference speech representations, Advances in neural infor- (WOCC), IEEE, 2023, pp. 1–6. mation processing systems 33 (2020) 12449–12460. [30] M. Neumann, et al., Cross-lingual and multilingual [19] M. Sharma, Multi-lingual multi-task speech emo- speech emotion recognition on english and french, tion recognition using wav2vec 2.0, in: ICASSP in: 2018 IEEE international conference on acoustics, 2022-2022 IEEE International Conference on Acous- speech and signal processing (ICASSP), IEEE, 2018, tics, Speech and Signal Processing (ICASSP), IEEE, pp. 5769–5773. 2022, pp. 6907–6911. [31] S. Deng, N. Zhang, Z. Sun, J. Chen, H. Chen, When low resource nlp meets unsupervised language ing Society 66 (2018) 457–467. model: Meta-pretraining then meta-learning for [44] P. Gournay, O. Lahaie, R. Lefebvre, A canadian few-shot text classification (student abstract), in: french emotional speech dataset, in: Proceedings Proceedings of the AAAI Conference on Artificial of the 9th ACM multimedia systems conference, Intelligence, volume 34, 2020, pp. 13773–13774. 2018, pp. 399–402. [32] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross [45] E. Parada-Cabaleiro, G. Costantini, A. Batliner, lingual speech emotion recognition: Urdu vs. west- M. Schmitt, B. W. Schuller, Demos: An italian emo- ern languages, in: 2018 International conference tional speech corpus: Elicitation methods, machine on frontiers of information technology (FIT), IEEE, learning, and perception, Language Resources and 2018, pp. 88–93. Evaluation 54 (2020) 341–383. [33] E. Garcia-Cuesta, A. B. Salvador, D. G. Pãez, Emo- [46] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. matchspanishdb: study of speech emotion recog- Sendlmeier, B. Weiss, et al., A database of german nition machine learning models in a new spanish emotional speech., in: Interspeech, volume 5, 2005, elicited database, Multimedia Tools and Applica- pp. 1517–1520. tions 83 (2024) 13093–13112. [47] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco, [34] A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. et al., Emovo corpus: an italian emotional speech Morency, Tensor fusion network for multimodal database, in: Proceedings of the ninth international sentiment analysis, arXiv preprint arXiv:1707.07250 conference on language resources and evaluation (2017). (LREC’14), European Language Resources Associa- [35] H. H. Mao, A survey on self-supervised pre-training tion (ELRA), 2014, pp. 3501–3504. for sequential transfer learning in neural networks, [48] L. Abdel-Hamid, Egyptian arabic speech emotion arXiv preprint arXiv:2007.00800 (2020). recognition using prosodic, spectral and wavelet [36] S. Sadok, S. Leglaive, R. Séguier, A vector quantized features, Speech Communication 122 (2020) 19–30. masked autoencoder for speech emotion recogni- [49] S. Oréau, French emotional speech database - oréau, tion, in: 2023 IEEE International conference on Zenodo, 2021. URL: https://zenodo.org/records/ acoustics, speech, and signal processing workshops 4405783. (ICASSPW), IEEE, 2023, pp. 1–5. [50] O. Mohamad Nezami, P. Jamshid Lou, M. Karami, [37] F. Catania, Speech emotion recognition in italian Shemo: a large-scale validated database for persian using wav2vec 2, Authorea Preprints (2023). speech emotion detection, Language Resources and [38] Y. Belinkov, J. Glass, Analyzing hidden representa- Evaluation 53 (2019) 1–16. tions in end-to-end automatic speech recognition [51] Y. Gong, Y.-A. Chung, J. Glass, Ast: Audio spectro- systems, Advances in Neural Information Process- gram transformer, arXiv preprint arXiv:2104.01778 ing Systems 30 (2017). (2021). [39] J. Shah, Y. K. Singla, C. Chen, R. R. Shah, What all do audio transformer models hear? probing acous- tic representations for language delivery and its structure, arXiv preprint arXiv:2101.00387 (2021). [40] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analy- sis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921. [41] Y. Li, Y. Mohamied, P. Bell, C. Lai, Exploration of a self-supervised speech model: A study on emotional corpora, in: 2022 IEEE Spoken Lan- guage Technology Workshop (SLT), IEEE, 2023, pp. 868–875. [42] A. Pasad, B. Shi, K. Livescu, Comparative layer- wise analysis of self-supervised speech models, in: ICASSP 2023-2023 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5. [43] N. Vryzas, R. Kotsakis, A. Liatsou, C. A. Dimoulas, G. Kalliris, Speech emotion recognition for perfor- mance interaction, Journal of the Audio Engineer-