=Paper=
{{Paper
|id=Vol-3878/30_main_long
|storemode=property
|title=Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition
|pdfUrl=https://ceur-ws.org/Vol-3878/30_main_long.pdf
|volume=Vol-3878
|authors=Federico D'Asaro,Juan José Márquez Villacís,Giuseppe Rizzo,Andrea Bottino
|dblpUrl=https://dblp.org/rec/conf/clic-it/DasaroV0B24
}}
==Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition==
Federico D'Asaro^1,2,∗,†, Juan José Márquez Villacís^1,†, Giuseppe Rizzo^1,2 and Andrea Bottino^2
^1 LINKS Foundation – AI, Data & Space (ADS)
^2 Politecnico di Torino – Dipartimento di Automatica e Informatica (DAUIN)
Abstract
Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We stack three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, and German, and performing competitively on Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.
Keywords
Cross-lingual Speech Emotion Recognition, Large Speech Models, Transfer Learning
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author. † These authors contributed equally.
Email: federico.dasaro@polito.it (F. D'Asaro); juan.marquez@linksfoundation.com (J. J. Márquez Villacís); giuseppe.rizzo@linksfoundation.com (G. Rizzo); andrea.bottino@polito.it (A. Bottino)
Homepage: http://conceptbase.sourceforge.net/mjf/ (A. Bottino)
ORCID: 0009-0003-8727-3393 (F. D'Asaro); 0009-0008-3098-5492 (J. J. Márquez Villacís); 0000-0003-0083-813X (G. Rizzo); 0000-0002-8894-5089 (A. Bottino)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Speech Emotion Recognition (SER) aims to identify emotions from speech audio, enhancing Human-AI interaction in fields such as healthcare, education, and security [1]. Traditional methods rely on Low-Level Descriptors (LLD) like spectral, prosodic, and voice quality features [2], using classifiers such as KNN, SVM, or Naïve Bayes [3]. Deep learning has introduced advanced techniques, including Convolutional Neural Networks (CNNs) [4, 5, 6], eventually followed by Recurrent Neural Networks (RNNs) [7, 8] and Transformers [9, 10, 11]. Transformers' ability to learn from extensive datasets has led to Large Speech Models (LSMs), which generalize across various speech tasks. Common training approaches for these models include Self-Supervised Learning (SSL), which uses the data itself to learn general-purpose features [12], and Weakly-Supervised Learning (WSL), which pairs audio with text for tasks like transcription and translation [13]. The general-purpose knowledge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English [14, 15, 16, 17], but efforts for other languages are limited, focusing on Wav2Vec 2.0 [18] for cross-lingual SER [19, 20, 21].

This study examines how effective LSMs are as feature extractors for cross-lingual SER, using nine datasets across eight languages: Italian, German, French, Canadian French, Spanish, Greek, Persian, and Egyptian Arabic. Specifically, we utilize LSMs from the Wav2Vec 2.0 and Whisper [13] model families, pre-trained with SSL and WSL approaches, respectively. We introduce Whisper due to its underexplored use in cross-lingual SER. To assess the effectiveness of LSMs as feature extractors, we test three classifiers of increasing complexity (Linear, Non-Linear, and Multi-Layer) across nine datasets. This evaluation determines which classifier best suits each LSM across different languages. Moreover, our study includes both English-Only and Multilingual models from the Wav2Vec 2.0 and Whisper families, aiming to evaluate the effectiveness of multilingual pre-training for cross-lingual SER.

The main contributions of this work are:

• We evaluate LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for cross-lingual SER across eight languages.
• We test three types of downstream classifiers (Linear, Non-Linear, and Multi-Layer) and find that the features of the Whisper models' last Transformer layer are well-suited to a Linear classifier, whereas Wav2Vec 2.0 models perform better with features from the middle and early Transformer layers.
• We compare English-Only and Multilingual LSMs, revealing that Whisper models benefit from multilingual pre-training, performing best on Italian, Spanish, Canadian French, French, and German, and competitively on Greek, Egyptian Arabic, and Persian. Conversely, English-Only Wav2Vec 2.0 models surpass the multilingual XLS-R in most languages, achieving the highest performance in Greek and Egyptian Arabic.

2. Background

2.1. Large Speech Models

Recent developments in natural language processing and computer vision have harnessed large volumes of unlabeled data through Self-Supervised Learning [22, 23, 24]. Building on techniques such as masked language and image modeling, Wav2Vec 2.0 [18] introduced an LSM trained on extensive audio datasets using masked speech modeling. Wav2Vec 2.0 features seven 1D convolutional blocks for initial feature extraction, followed by 12 or 24 transformer blocks (depending on the model variant) for contextual processing. The model masks part of the latent features and reconstructs them using the surrounding context. To further refine LSMs for tasks like emotion recognition, methods such as WavLM [25] have been developed. WavLM incorporates speech denoising alongside masked modeling, demonstrating broad effectiveness across various tasks in the SUPERB benchmark [26]. Moreover, XLSR-53 [27] extends the Wav2Vec 2.0 framework to cover 53 languages, sharing the latent space across these languages. This approach has shown superior performance over monolingual pretraining for automatic speech recognition. XLS-R [28] further advances this by scaling to 128 languages, excelling in speech translation and language identification. In comparison, Whisper [13] leverages large-scale weak supervision from audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is trained in a multitask framework that includes multilingual transcription and translation, establishing itself as an effective zero-shot model for multilingual tasks.

2.2. Cross-Language Speech Emotion Recognition

Emotion recognition in languages beyond English, like Italian [29], French [30], Persian [31, 32], and Spanish [33], is crucial but often limited by data availability. Recent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like dual attention [21] and tensor fusion [34] enhance audio and text interaction in languages such as Italian, German, and Urdu. Self-supervised pre-training methods, including variational autoencoders, have also been effective in transferring knowledge across languages like German [35, 36]. The advent of LSMs pre-trained with self-supervision has further increased the potential for transfer learning due to their high generalization capabilities [15]. However, most research primarily focuses on adapting multilingual Wav2Vec 2.0 models (XLSR-53) [19, 37, 20, 21]. This work expands the scope of analyzed LSMs by including WSL models such as Whisper. Additionally, we evaluate the ability of English-only models, and not just multilingual ones, to transfer knowledge to other languages.

3. Method

In this section, we describe the methodology for evaluating the effectiveness of LSMs as feature extractors for downstream SER in various languages. We stack a classification model on top of the LSM backbone, whose parameters are frozen. All LSMs used in this work share the same overall architecture, which we describe below along with the stacked classification model.

Formally, the input audio 𝐴 (raw waveform or log-mel spectrogram) passes through a convolutional encoder 𝓏 ∶ 𝐴 → 𝑍, mapping the audio to latent features 𝑍 = {𝑧_1, …, 𝑧_𝑇}, where 𝑇 is the sequence length and each frame 𝑧_𝑖 typically corresponds to 25 ms, with 𝑧_𝑖 ∈ ℝ^𝑑. Then, 𝑍 passes through a Transformer encoder consisting of 𝐿 layers 𝒽^𝑙 ∶ 𝑍 → 𝐻, enriching the latent features with contextual information and resulting in {ℎ^𝑙_1, …, ℎ^𝑙_𝑇} for each of the 𝑙 = 1, …, 𝐿 Transformer layers. Here, 𝑙 = 𝐿 corresponds to the output features of the last layer, with ℎ^𝑙_𝑖 ∈ ℝ^𝑑. The features {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_𝑙=1,…,𝐿 are considered the extracted features from the LSM and are fed into a downstream classifier 𝓎 ∶ 𝐻 → 𝑌, which maps these features to the output class logits {𝑦_1, …, 𝑦_𝑘}. The output class label 𝑦^∗ for audio 𝐴 is given by:

𝑦^∗ = arg max_𝑘 softmax(𝓎(𝒽(𝓏(𝐴))))   (1)

Inspired by previous work that uses probing to evaluate the quality of features extracted from backbone models [38, 39], we evaluate three different downstream classifiers of increasing complexity: Linear Classifier (ℊ_𝑙), Non-Linear Classifier (ℊ_𝑛𝑙), and Multi-Layer Classifier (ℊ_𝑚𝑙). Figure 1 illustrates their architecture, which is detailed below.

3.1. Linear Classifier

For the linear classifier, we use a simple feed-forward neural network that consists solely of linear projections.
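To make the probing setup concrete, the following is a minimal numpy sketch of such a linear probe on frozen last-layer features (Eq. (2) formalizes it). This is illustrative only, not the authors' implementation: the feature dimension d = 768 and the number of classes k = 7 are hypothetical stand-ins, while m = 256 matches the projection size reported in Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T frames of d-dim frozen LSM features,
# projected to m dims and classified into k emotion classes.
T, d, m, k = 50, 768, 256, 7

# Stand-in for the last-layer features {h^L_1, ..., h^L_T} of one utterance.
H = rng.standard_normal((T, d))

# Trainable parameters of the linear probe.
W1, b1 = rng.standard_normal((d, m)) * 0.01, np.zeros(m)  # l1 : R^d -> R^m, shared across frames
Wo, bo = rng.standard_normal((m, k)) * 0.01, np.zeros(k)  # o  : R^m -> R^k, classification layer

def linear_probe(H):
    """g_l(h_1, ..., h_T) = o(p(l1(h_1, ..., h_T))) with average pooling p."""
    projected = H @ W1 + b1          # per-frame shared linear projection, shape (T, m)
    pooled = projected.mean(axis=0)  # average pooling over the T frames, shape (m,)
    return pooled @ Wo + bo          # class logits, shape (k,)

logits = linear_probe(H)
# Eq. (1): softmax is monotone, so argmax over logits gives the predicted label.
y_star = int(np.argmax(logits))
```

Because the probe contains no non-linearities, any classification ability it shows comes directly from the linear separability of the frozen features.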
Specifically, given the features from the last Transformer layer {ℎ^𝐿_1, …, ℎ^𝐿_𝑇}, they are first projected by a linear layer 𝓁_1 ∶ ℝ^𝑑 → ℝ^𝑚 that is shared across all frames, then aggregated by average pooling 𝓅, and finally passed through the classification layer ℴ ∶ ℝ^𝑚 → ℝ^𝑘 to obtain the output class logits. The function ℊ_𝑙 is compactly defined as:

ℊ_𝑙(ℎ^𝐿_1, …, ℎ^𝐿_𝑇) = ℴ(𝓅(𝓁_1(ℎ^𝐿_1, …, ℎ^𝐿_𝑇)))   (2)

The absence of non-linear activations allows us to evaluate the quality of the features extracted from the LSM based on the linear classifier's ability to handle the SER task.

3.2. Non-Linear Classifier

To increase the complexity of the classification model, we utilize a series of linear layers interleaved with ReLU activations both before and after feature pooling. We follow the same architecture as in [14, 15], but unlike them, we only feed the features from the last Transformer layer 𝐿 to the model. Each {ℎ^𝐿_1, …, ℎ^𝐿_𝑇} passes through two shared linear layer, ReLU, and dropout blocks (𝒷), followed by a linear layer (𝓁_1). Linear layers are functions 𝓁 ∶ ℝ^𝑑 → ℝ^𝑚. Projected features are averaged, pass through 𝓁_2 and ReLU, and are classified by ℴ. Thus, ℊ_𝑛𝑙 is:

ℊ_𝑛𝑙(𝑥 = ℎ^𝐿_1, …, ℎ^𝐿_𝑇) = ℴ(ReLU(𝓁_2(𝓅(𝓁_1(𝒷(𝑥))))))   (3)

3.3. Multi-Layer Classifier

As a third option, we adopt the approach from [14, 15], which utilizes all hidden states of the Transformer encoder. The features {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_𝑙=1,…,𝐿 are combined into a new sequence {ℎ^∗_1, …, ℎ^∗_𝑇} using a learnable weighted sum. The function 𝓈 ∶ ℝ^(𝐿×𝑇×𝑑) → ℝ^(𝑇×𝑑) maps {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_𝑙=1,…,𝐿 to {ℎ^∗_1, …, ℎ^∗_𝑇} as follows:

ℎ^∗_𝑡 = ∑_{𝑙=1}^{𝐿} 𝑤_𝑙 ⋅ ℎ^𝑙_𝑡   for 𝑡 = 1, …, 𝑇   (4)

where 𝑤_1, …, 𝑤_𝐿 are the weights assigned to each Transformer layer, ensuring 𝑤_𝑙 ∈ [0, 1] and ∑_{𝑙=1}^{𝐿} 𝑤_𝑙 = 1. The resulting sequence {ℎ^∗_1, …, ℎ^∗_𝑇} is then processed by the same pipeline as the Non-Linear Classifier, resulting in:

ℊ_𝑚𝑙(𝑥 = {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_𝑙=1,…,𝐿) = ℊ_𝑛𝑙(𝓈(𝑥))   (5)

This classifier leverages internal layer information, which has proven beneficial for paralinguistic and linguistic downstream tasks [39, 40, 41, 42]. By investigating the contribution of internal LSM layers for SER across various languages, we corroborate previous findings for Wav2Vec 2.0 models and provide new insights for Whisper models.

Figure 1: The three downstream classifiers used in this work: Linear (red), Non-Linear (purple), and Multi-Layer (green). The snowflake icon represents frozen weights, while the fire icon denotes trainable weights.

4. Experiments

4.1. Datasets and Metrics

In this study, we conduct experiments using 9 distinct datasets spanning 8 different languages: Greek, French, Canadian French, Italian, German, Spanish, Egyptian Arabic, and Persian. The datasets vary in their collection methodologies, such as acted emotions and elicitation methods. The participant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our experiments in a speaker-independent setting to prevent evaluation on speaker-dependent features. Table 1 provides an overview of the dataset statistics, with a more detailed description given below.

Dataset | Language | # Samples | Emotions
AESDD | Greek | 500 | anger, disgust, fear, happiness, and sadness
CaFE | Canadian French | 936 | anger, disgust, fear, happiness, surprise, sadness, and neutrality
DEMoS | Italian | 9697 | anger, disgust, fear, happiness, surprise, sadness, and neutrality
EmoDB | German | 535 | anger, disgust, fear, happiness, boredom, sadness, and neutrality
EmoMatch | Spanish | 2005 | anger, disgust, fear, happiness, surprise, sadness, and neutrality
EMOVO | Italian | 588 | anger, disgust, fear, happiness, surprise, sadness, and neutrality
EYASE | Egyptian Arabic | 579 | anger, happiness, sadness, and neutrality
Oréau | French | 502 | anger, disgust, fear, happiness, surprise, sadness, and neutrality
ShEMO | Persian | 400 | anger, happiness, sadness, and neutrality

Table 1: Summary statistics of the 9 datasets used in this work.

AESDD [43]: The Acted Emotional Speech Dynamic Database comprises 500 recorded samples from 5 actors (3 females, 2 males) expressing 5 distinct emotions in Greek. Each actor performed 20 utterances per emotion, with some utterances recorded multiple times. In later versions, additional actors were included, bringing the total to 604 recordings from 6 actors.

CaFE [44]: This dataset includes recordings of 6 different sentences delivered by 12 actors (6 female, 6 male) portraying the Big Six emotions and a neutral state in Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as a down-sampled version at 48 kHz and 16 bits per sample. The total number of samples amounts to 936.

DEMoS [45]: DEMoS contains 9697 audio samples from 68 volunteer students (299 females, 131 males) expressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were generated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1 s), are provided in 48 kHz, 16-bit, mono format.

EmoDB [46]: This collection includes 535 utterances across 7 emotional states, spoken in German by 5 female and 5 male actors. Each actor performed a set of 10 sentences, which were down-sampled from the original 48 kHz to 16 kHz.

EmoMatch [33]: Consisting of 2005 recordings, EmoMatch features samples from 50 non-actor Spanish speakers (20 females, 30 males) expressing the Big Six emotions and a neutral state. The dataset is a subset of the larger EmoSpanishDB and contains recordings sampled at 48 kHz in a 16-bit mono format.

EMOVO [47]: EMOVO presents 588 Italian audio recordings from 3 male and 3 female actors simulating the Big Six emotions plus a neutral state. Each actor voiced 14 utterances, and the recordings are provided in 48 kHz, 16-bit stereo WAV format.

EYASE [48]: EYASE contains 579 utterances in Egyptian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz.

Oréau [49]: The Oréau dataset features 502 audio samples from 32 non-professional actors (25 male, 7 female) who voiced 10 to 13 utterances in French for the Big Six emotions plus a neutral state.

ShEMO [50]: ShEMO comprises 3000 semi-natural recordings from 87 native Persian speakers (31 female, 56 male). The dataset captures 5 of the Big Six emotions (sadness, anger, happiness, surprise, and fear) plus a neutral state. The samples were up-sampled to a frequency of 44.1 kHz in mono-channel format, with an average length of 4.11 seconds (std: 3.41 s).

The audio is resampled to 16 kHz, and a stratified train/validation/test split is performed with ratios of 80/10/10. All results are reported using the macro F1 score, expressed as a percentage. We conducted 3 runs, presenting the mean ± standard deviation.

4.2. Experimental Details

Baseline. As a baseline to evaluate LSM transfer learning capabilities, we adopt the Audio Spectrogram Transformer (AST) [51], a fully transformer-based architecture recently proposed as a substitute for CNNs [9, 10, 11]. We train AST from scratch on each of the 9 datasets using the same hyperparameters as [51].

LSM Models. We use pre-trained checkpoints for both English-Only and Multilingual models: Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R from the Wav2Vec 2.0 family, and Whisper Small (EN) (Whisper Small pre-trained only on English data), Whisper Small, and Whisper Medium from the Whisper family. The LSM backbones are kept frozen and used exclusively as feature extractors.

Training. We follow the same hyperparameter settings as [15] to train the downstream classifiers. Specifically, we train for 30 epochs using the Adam optimizer with a learning rate of 5.0e-04, weight decay of 1.0e-04, betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimension of the classifier projection 𝑚 is 256.

4.3. Results

To present our results, we first compare the performance of the various classifiers (see Section 3) for each LSM utilized. This analysis provides insights into the characteristics of features extracted from Wav2Vec 2.0 and Whisper models for downstream SER tasks. After identifying the best classifier for each LSM, we then compare the performance of English-Only and Multilingual LSMs across the 8 languages covered in this study.

4.3.1. Comparison between downstream classifiers

We examine the results in Table 2, comparing the three classifier methods for Wav2Vec 2.0 and Whisper models.
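As a reminder of what the Multi-Layer entries in the comparison involve, the learnable weighted sum of Eq. (4) can be sketched in a few lines of numpy. This is illustrative only, not the authors' code: the dimensions are hypothetical, and the softmax parameterization is one common way (implementations following [14, 15] may differ) to enforce w_l ∈ [0, 1] and ∑ w_l = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, d = 12, 50, 768  # hypothetical: 12 Transformer layers, T frames, d feature dims

# Stand-in for the stacked hidden states {h^l_t} of one utterance, shape (L, T, d).
H = rng.standard_normal((L, T, d))

# Unconstrained trainable parameters; softmax maps them to valid layer weights.
alpha = np.zeros(L)

def weighted_layer_sum(H, alpha):
    """Eq. (4): h*_t = sum_l w_l * h^l_t, with w = softmax(alpha)."""
    w = np.exp(alpha - alpha.max())
    w = w / w.sum()                          # w_l in [0, 1], sum to 1
    return np.tensordot(w, H, axes=(0, 0))   # contract over the layer axis -> (T, d)

H_star = weighted_layer_sum(H, alpha)
```

With alpha initialized to zeros, every layer gets weight 1/L, so the output is simply the layer-wise average; training then shifts mass toward the most informative layers, which is exactly what the layer-weight analysis below inspects.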
Backbone | Linear | Non-Linear | Multi-Layer
Wav2Vec 2.0 Base | 47.87 (± 0.93) | 42.07 (± 5.27) | 53.42 (± 1.27)
Wav2Vec 2.0 Large | 12.09 (± 1.50) | 12.93 (± 3.31) | 57.50 (± 0.03)
XLS-R | 5.43 (± 0.40) | 5.86 (± 0.07) | 40.89 (± 2.00)
Whisper Small (EN) | 58.16 (± 0.15) | 53.50 (± 0.98) | 49.73 (± 2.02)
Whisper Small | 60.87 (± 0.26) | 54.86 (± 0.93) | 45.14 (± 1.54)
Whisper Medium | 60.72 (± 0.16) | 55.56 (± 1.09) | 37.95 (± 2.27)

Table 2: Performance of various LSM backbones using Linear, Non-Linear, and Multi-Layer classification methods. F1 scores are averaged across all 9 datasets. For each LSM, the best classifier is highlighted in bold.

The table shows average F1 scores across the 9 datasets, highlighting the most effective classifier for each LSM in cross-lingual SER tasks.

For Wav2Vec 2.0 models, the Multi-Layer Classifier performs best, with F1 scores of 53.42, 57.50, and 40.89 for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R, respectively. The Linear and Non-Linear classifiers perform similarly, especially for Wav2Vec 2.0 Large and XLS-R, suggesting that the improvements are due to using features from internal Transformer layers rather than to non-linear activations. For Whisper models, the Linear Classifier performs best, with F1 scores of 58.16, 60.87, and 60.72 for Whisper Small (EN), Whisper Small, and Whisper Medium, respectively. Increasing classifier complexity with non-linear activations decreases performance, likely due to general information loss caused by complex transformations. The Multi-Layer Classifier performs worse still, indicating that also using features from internal layers is less effective than using features from the last layer alone.

This comparison reveals that Wav2Vec 2.0 models benefit from features extracted from internal Transformer layers and exhibit less sensitivity to classifier complexity, consistent with prior research [41, 39]. Conversely, Whisper models achieve better performance with features from the last Transformer layer when using a simple linear classifier, offering new insights into their effectiveness for SER across multiple languages. We hypothesize that this differing behavior may be related to their respective Self-Supervised and Weakly-Supervised pre-training approaches, which warrant further investigation. To gain further insights into the importance of Transformer layers in Wav2Vec 2.0 and Whisper for SER, we leverage the weights learned in the Multi-Layer classifier as follows.

Transformer Layer Weights. We analyze the weights 𝑤_1, …, 𝑤_𝐿 from the Multi-Layer Classifier to assess Transformer layer importance. Figure 2 illustrates that Wav2Vec 2.0 models assign greater weight to the early and middle layers, whereas Whisper models emphasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic information in Whisper models is embedded in the features of the later Transformer layers.

Figure 2: Greyscale map of layer weight distribution from the Multi-Layer classification method. Weights are averaged over all 9 datasets for each model. Darker shades indicate higher weights.

4.3.2. Comparing English-Only and Multilingual LSMs Across Different Languages

In this section, we compare English-Only and Multilingual LSMs with the AST baseline across the 9 datasets. Table 3 displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 models and Linear for Whisper models.

Transferring knowledge from LSMs proves to be effective across all datasets compared to the baseline. For instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores 33.23. This indicates that LSMs are effective feature extractors for cross-lingual SER on multiple languages.

When comparing English-Only and Multilingual models, we differentiate between the Wav2Vec 2.0 and Whisper families. For Wav2Vec 2.0, we observe that Wav2Vec 2.0 Base and Large generally outperform XLS-R (e.g., 87.85 and 88.31 vs. 67.71 for DEMoS), except in Persian, where their performance is comparable. This indicates that multilingual pre-training may not be as effective for Wav2Vec 2.0 models across various languages. We speculate that this may be due to the limitations of SSL pre-training, which might struggle with the diverse range of languages and lose important paralinguistic features that are retained in English-only models. Further investigation with a wider range of SSL-pretrained LSMs could provide more insights. As for Whisper, the multilingual Whisper Small outperforms its English-only version, with the exception of Greek and Persian, likely due to the limited pretraining data for these languages, which resulted in higher word error rates compared to the other languages in this study [13]. Multilingual Whisper models achieve the best performance in Canadian French and Spanish (66.71 and 73.13 with Whisper Small) and in Italian, German, and French (91.17, 90.64, and 95.22 with Whisper Medium). This improvement is likely due to the larger pretraining datasets for these languages and the similarities between Canadian French and French.
Dataset/Model | AST | Wav2Vec 2.0 Base‡ | Wav2Vec 2.0 Large‡ | Whisper Small (EN)† | XLS-R‡ | Whisper Small† | Whisper Medium†
AESDD (el) | 19.84 (± 0.16) | 25.45 (± 0.98) | 28.89 (± 2.64) | 28.04 (± 0.99) | 9.16 (± 1.25) | 26.34 (± 1.65) | 27.62 (± 0.62)
CaFE (fr-ca) | 10.96 (± 6.26) | 50.52 (± 3.54) | 47.74 (± 0.33) | 60.66 (± 0.76) | 18.66 (± 0.01) | 66.71 (± 0.72) | 55.03 (± 0.38)
DEMoS (it) | 13.75 (± 4.26) | 87.85 (± 0.01) | 88.31 (± 0.74) | 88.24 (± 0.21) | 67.71 (± 1.47) | 90.61 (± 0.14) | 91.17 (± 0.20)
EmoDB (de) | 46.11 (± 6.55) | 81.75 (± 7.30) | 88.84 (± 7.48) | 83.31 (± 0.18) | 67.39 (± 4.33) | 87.21 (± 1.11) | 90.64 (± 1.47)
EmoMatch (es) | 36.10 (± 2.63) | 69.84 (± 0.69) | 71.85 (± 1.55) | 67.59 (± 0.35) | 44.14 (± 0.25) | 73.13 (± 2.54) | 68.23 (± 0.78)
EMOVO (it) | 15.74 (± 1.24) | 16.47 (± 0.61) | 20.33 (± 1.31) | 27.30 (± 0.16) | 14.86 (± 2.11) | 41.05 (± 1.21) | 50.19 (± 0.29)
EYASE (ar-eg) | 33.23 (± 4.58) | 46.31 (± 3.62) | 53.40 (± 1.56) | 42.65 (± 0.70) | 47.27 (± 1.36) | 51.98 (± 0.88) | 37.32 (± 3.62)
Oréau (fr) | 19.01 (± 2.35) | 52.86 (± 0.07) | 58.42 (± 4.14) | 82.27 (± 0.23) | 32.51 (± 4.89) | 92.70 (± 1.67) | 95.22 (± 0.84)
ShEMO (fa) | 36.15 (± 0.85) | 60.55 (± 3.90) | 57.52 (± 9.09) | 67.93 (± 0.37) | 61.24 (± 8.93) | 63.88 (± 1.21) | 63.85 (± 1.58)

Table 3: Performance of Wav2Vec 2.0 and Whisper models across 9 datasets, divided into English-Only (Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, Whisper Small (EN)) and Multilingual (XLS-R, Whisper Small, Whisper Medium) LSMs. AST is the baseline. † indicates a Linear Classifier, ‡ a Multi-Layer Classifier. Bold values are the highest scores, and underlined values highlight the best between English-Only and Multilingual models.
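All scores in Tables 2 and 3 are macro F1 percentages, reported as mean ± standard deviation over 3 runs. A self-contained sketch of that computation, using a hand-rolled macro F1 on toy labels (illustrative only; any standard metrics library would do the same):

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (macro F1), as a percentage."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return 100.0 * float(np.mean(f1s))

# Hypothetical predictions from 3 runs on the same toy test split.
y_true = np.array([0, 0, 1, 1, 2, 2])
runs = [np.array([0, 0, 1, 1, 2, 2]),
        np.array([0, 1, 1, 1, 2, 2]),
        np.array([0, 0, 1, 2, 2, 2])]

scores = [macro_f1(y_true, y_pred, num_classes=3) for y_pred in runs]
mean, std = np.mean(scores), np.std(scores)
print(f"{mean:.2f} (± {std:.2f})")
```

Because macro F1 averages per-class scores without weighting by class frequency, it penalizes a classifier that ignores rare emotions, which matters for the imbalanced datasets described in Section 4.1.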
We believe that multilingual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL and multitask learning. However, further research is needed to evaluate the effectiveness of multilingual pre-training with WSL compared to SSL across a broader range of LSMs.

5. Conclusion

This paper examines the capabilities of Wav2Vec 2.0 and Whisper models as feature extractors for cross-lingual SER across eight languages, considering both English-Only and Multilingual variants. Our findings reveal that LSMs are effective feature extractors compared to a full Transformer baseline trained from scratch. We observe that Whisper models encode acoustic information primarily in the features of the last Transformer layer, whereas Wav2Vec 2.0 models rely on features from the middle and early layers. Furthermore, we show that multilingual pre-training benefits Whisper models, leading to strong performance in Italian, Canadian French, French, Spanish, and German, and competitive results in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving top performance in Greek and Egyptian Arabic. We attribute the disparity in multilingual pre-training effectiveness to the differences between SSL and WSL strategies, which should be explored further.

References

[1] C.-C. Lee, K. Sridhar, J.-L. Li, W.-C. Lin, B.-H. Su, C. Busso, Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities, IEEE Signal Processing Magazine 38 (2021) 22–38.
[2] H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, Y. Zong, A survey of deep learning-based multimodal emotion recognition: Speech, text, and face, Entropy 25 (2023) 1440.
[3] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, A comprehensive review of speech emotion recognition systems, IEEE Access 9 (2021) 47795–47814.
[4] Z. Huang, M. Dong, Q. Mao, Y. Zhan, Speech emotion recognition using CNN, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 801–804.
[5] A. M. Badshah, J. Ahmad, N. Rahim, S. W. Baik, Speech emotion recognition from spectrograms with deep convolutional neural network, in: 2017 International Conference on Platform Technology and Service (PlatCon), IEEE, 2017, pp. 1–5.
[6] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control 47 (2019) 312–323.
[7] T. Feng, H. Hashemi, M. Annavaram, S. S. Narayanan, Enhancing privacy through domain adaptive noise injection for speech emotion recognition, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 7702–7706.
[8] W. Lim, D. Jang, T. Lee, Speech emotion recognition using convolutional and recurrent neural networks, in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, 2016, pp. 1–4.
[9] N.-C. Ristea, R. T. Ionescu, F. S. Khan, SepTr: Separable transformer for audio spectrogram processing, arXiv preprint arXiv:2203.09581 (2022).
[10] J.-Y. Kim, S.-H. Lee, CoordViT: A novel method of improve vision transformer-based speech emotion recognition using coordinate information concatenate, in: 2023 International Conference on Electronics, Information, and Communication (ICEIC), IEEE, 2023, pp. 1–4.
[11] S. Akinpelu, S. Viriri, A. Adegun, An enhanced speech emotion recognition using vision transformer, Scientific Reports 14 (2024) 13126.
[12] S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, B. W. Schuller, Audio self-supervised learning: A survey, Patterns 3 (2022).
[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.
[14] L. Pepino, P. Riera, L. Ferrer, Emotion recognition from speech using wav2vec 2.0 embeddings, arXiv preprint arXiv:2104.03502 (2021).
[15] T. Feng, S. Narayanan, PEFT-SER: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2023, pp. 1–8.
[16] Y. Li, A. Mehrish, R. Bhardwaj, N. Majumder, B. Cheng, S. Zhao, A. Zadeh, R. Mihalcea, S. Poria, Evaluating parameter-efficient transfer learning approaches on SURE benchmark for speech understanding, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[17] T. Feng, R. Hebbar, S. Narayanan, Trust-SER: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 11201–11205.
[18] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[19] M. Sharma, Multi-lingual multi-task speech emotion recognition using wav2vec 2.0, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 6907–6911.
[20] S. G. Upadhyay, L. Martinez-Lucas, B.-H. Su, W.-C. Lin, W.-S. Chien, Y.-T. Wu, W. Katz, C. Busso, C.-C. Lee, Phonetic anchor-based transfer learning to facilitate unsupervised cross-lingual speech emotion recognition, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[21] S. A. M. Zaidi, S. Latif, J. Qadir, Cross-language speech emotion recognition using multimodal dual attention transformers, arXiv preprint arXiv:2306.13804 (2023).
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[23] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[25] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.
[26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, et al., SUPERB: Speech processing universal performance benchmark, arXiv preprint arXiv:2105.01051 (2021).
[27] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition, arXiv preprint arXiv:2006.13979 (2020).
[28] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. Von Platen, Y. Saraf, J. Pino, et al., XLS-R: Self-supervised cross-lingual speech representation learning at scale, arXiv preprint arXiv:2111.09296 (2021).
[29] A. Wurst, M. Hopwood, S. Wu, F. Li, Y.-D. Yao, Deep learning for the detection of emotion in human speech: The impact of audio sample duration and English versus Italian languages, in: 2023 32nd Wireless and Optical Communications Conference (WOCC), IEEE, 2023, pp. 1–6.
[30] M. Neumann, et al., Cross-lingual and multilingual speech emotion recognition on English and French, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5769–5773.
[31] S. Deng, N. Zhang, Z. Sun, J. Chen, H. Chen, When low resource NLP meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract), in: Proceedings of the AAAI Conference on Artificial
…ing Society 66 (2018) 457–467.
[44] P. Gournay, O. Lahaie, R. Lefebvre, A Canadian French emotional speech dataset, in: Proceedings of the 9th ACM Multimedia Systems Conference,
Intelligence, volume 34, 2020, pp. 13773–13774. 2018, pp. 399–402.
[32] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross [45] E. Parada-Cabaleiro, G. Costantini, A. Batliner,
lingual speech emotion recognition: Urdu vs. west- M. Schmitt, B. W. Schuller, Demos: An italian emo-
ern languages, in: 2018 International conference tional speech corpus: Elicitation methods, machine
on frontiers of information technology (FIT), IEEE, learning, and perception, Language Resources and
2018, pp. 88–93. Evaluation 54 (2020) 341–383.
[33] E. Garcia-Cuesta, A. B. Salvador, D. G. Pãez, Emo- [46] F. Burkhardt, A. Paeschke, M. Rolfes, W. F.
matchspanishdb: study of speech emotion recog- Sendlmeier, B. Weiss, et al., A database of german
nition machine learning models in a new spanish emotional speech., in: Interspeech, volume 5, 2005,
elicited database, Multimedia Tools and Applica- pp. 1517–1520.
tions 83 (2024) 13093–13112. [47] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco,
[34] A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. et al., Emovo corpus: an italian emotional speech
Morency, Tensor fusion network for multimodal database, in: Proceedings of the ninth international
sentiment analysis, arXiv preprint arXiv:1707.07250 conference on language resources and evaluation
(2017). (LREC’14), European Language Resources Associa-
[35] H. H. Mao, A survey on self-supervised pre-training tion (ELRA), 2014, pp. 3501–3504.
for sequential transfer learning in neural networks, [48] L. Abdel-Hamid, Egyptian arabic speech emotion
arXiv preprint arXiv:2007.00800 (2020). recognition using prosodic, spectral and wavelet
[36] S. Sadok, S. Leglaive, R. Séguier, A vector quantized features, Speech Communication 122 (2020) 19–30.
masked autoencoder for speech emotion recogni- [49] S. Oréau, French emotional speech database - oréau,
tion, in: 2023 IEEE International conference on Zenodo, 2021. URL: https://zenodo.org/records/
acoustics, speech, and signal processing workshops 4405783.
(ICASSPW), IEEE, 2023, pp. 1–5. [50] O. Mohamad Nezami, P. Jamshid Lou, M. Karami,
[37] F. Catania, Speech emotion recognition in italian Shemo: a large-scale validated database for persian
using wav2vec 2, Authorea Preprints (2023). speech emotion detection, Language Resources and
[38] Y. Belinkov, J. Glass, Analyzing hidden representa- Evaluation 53 (2019) 1–16.
tions in end-to-end automatic speech recognition [51] Y. Gong, Y.-A. Chung, J. Glass, Ast: Audio spectro-
systems, Advances in Neural Information Process- gram transformer, arXiv preprint arXiv:2104.01778
ing Systems 30 (2017). (2021).
[39] J. Shah, Y. K. Singla, C. Chen, R. R. Shah, What all
do audio transformer models hear? probing acous-
tic representations for language delivery and its
structure, arXiv preprint arXiv:2101.00387 (2021).
[40] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analy-
sis of a self-supervised speech representation model,
in: 2021 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), IEEE, 2021, pp.
914–921.
[41] Y. Li, Y. Mohamied, P. Bell, C. Lai, Exploration
of a self-supervised speech model: A study on
emotional corpora, in: 2022 IEEE Spoken Lan-
guage Technology Workshop (SLT), IEEE, 2023, pp.
868–875.
[42] A. Pasad, B. Shi, K. Livescu, Comparative layer-
wise analysis of self-supervised speech models,
in: ICASSP 2023-2023 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, 2023, pp. 1–5.
[43] N. Vryzas, R. Kotsakis, A. Liatsou, C. A. Dimoulas,
G. Kalliris, Speech emotion recognition for perfor-
mance interaction, Journal of the Audio Engineer-