Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition

Federico D'Asaro1,2,∗,†, Juan José Márquez Villacís1,†, Giuseppe Rizzo1,2 and Andrea Bottino2

1 LINKS Foundation – AI, Data & Space (ADS)
2 Politecnico di Torino – Dipartimento di Automatica e Informatica (DAUIN)


Abstract
Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We stack three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, and German, and performing competitively on Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.

Keywords
Cross-lingual Speech Emotion Recognition, Large Speech Models, Transfer Learning



CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
Email: federico.dasaro@polito.it (F. D'Asaro); juan.marquez@linksfoundation.com (J. J. Márquez Villacís); giuseppe.rizzo@linksfoundation.com (G. Rizzo); andrea.bottino@polito.it (A. Bottino)
URL: http://conceptbase.sourceforge.net/mjf/ (A. Bottino)
ORCID: 0009-0003-8727-3393 (F. D'Asaro); 0009-0008-3098-5492 (J. J. Márquez Villacís); 0000-0003-0083-813X (G. Rizzo); 0000-0002-8894-5089 (A. Bottino)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

Speech Emotion Recognition (SER) aims to identify emotions from speech audio, enhancing Human-AI interaction in fields such as healthcare, education, and security [1]. Traditional methods rely on Low-Level Descriptors (LLD) like spectral, prosodic, and voice quality features [2], using classifiers such as KNN, SVM, or Naïve Bayes [3]. Deep learning has introduced advanced techniques, including Convolutional Neural Networks (CNNs) [4, 5, 6], later followed by Recurrent Neural Networks (RNNs) [7, 8] and Transformers [9, 10, 11]. Transformers' ability to learn from extensive datasets has led to Large Speech Models (LSMs), which generalize across various speech tasks. Common training approaches for these models include Self-Supervised Learning (SSL), which uses the data itself to learn general-purpose features [12], and Weakly-Supervised Learning (WSL), which pairs audio with text for tasks like transcription and translation [13]. The general-purpose knowledge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English [14, 15, 16, 17], but efforts for other languages are limited, focusing on Wav2Vec 2.0 [18] for cross-lingual SER [19, 20, 21].

This study examines how effective LSMs are as feature extractors for cross-lingual SER, using nine datasets across eight languages: Italian, German, French, Canadian French, Spanish, Greek, Persian, and Egyptian Arabic. Specifically, we utilize LSMs from the Wav2Vec 2.0 and Whisper [13] model families, pre-trained with SSL and WSL approaches, respectively. We include Whisper because its use in cross-lingual SER is still underexplored. To assess the effectiveness of LSMs as feature extractors, we test three classifiers of increasing complexity—Linear, Non-Linear, and Multi-Layer—across nine datasets. This evaluation determines which classifier best suits each LSM across different languages. Moreover, our study includes both English-Only and Multilingual models from the Wav2Vec 2.0 and Whisper families, aiming to evaluate the effectiveness of multilingual pre-training for cross-lingual SER.

The main contributions of this work are:

    • We evaluate LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for cross-lingual SER across eight languages.
    • We test three types of downstream classifiers—Linear, Non-Linear, and Multi-Layer—and find that Whisper models' last Transformer layer features are well-suited for a Linear classifier, whereas Wav2Vec 2.0 models perform better with
features from the middle and early Transformer layers.
    • We compare English-Only and Multilingual LSMs, revealing that Whisper models benefit from multilingual pre-training, performing best on Italian, Spanish, Canadian French, French, and German, and competitively on Greek, Egyptian Arabic, and Persian. Conversely, English-Only Wav2Vec 2.0 models surpass multilingual XLS-R in most languages, achieving the highest performance in Greek and Egyptian Arabic.


2. Background

2.1. Large Speech Models

Recent developments in natural language processing and computer vision have harnessed large volumes of unlabeled data through Self-Supervised Learning [22, 23, 24]. Building on techniques such as masked language and image modeling, Wav2Vec 2.0 [18] introduced an LSM trained on extensive audio datasets using masked speech modeling. Wav2Vec 2.0 features seven 1D convolutional blocks for initial feature extraction, followed by 12 or 24 transformer blocks (depending on the model variant) for contextual processing. The model masks part of the latent features and reconstructs them using the surrounding context. To further refine LSMs for tasks like emotion recognition, methods such as WavLM [25] have been developed. WavLM incorporates speech denoising alongside masked modeling, demonstrating broad effectiveness across various tasks in the SUPERB benchmark [26]. Moreover, XLSR-53 [27] extends the Wav2Vec 2.0 framework to cover 53 languages, sharing the latent space across these languages. This approach has shown superior performance over monolingual pretraining for automatic speech recognition. XLS-R [28] further advances this by scaling to 128 languages, excelling in speech translation and language identification. In comparison, Whisper [13] leverages large-scale weak supervision from audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is trained in a multitask framework that includes multilingual transcription and translation, establishing itself as an effective zero-shot model for multilingual tasks.

2.2. Cross-Language Speech Emotion Recognition

Emotion recognition in languages beyond English, like Italian [29], French [30], Persian [31, 32], and Spanish [33], is crucial but often limited by data availability. Recent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like dual attention [21] and tensor fusion [34] enhance audio and text interaction in languages such as Italian, German, and Urdu. Self-supervised pre-training methods, including variational autoencoders, have also been effective in transferring knowledge across languages like German [35, 36]. The advent of LSMs pre-trained with self-supervision has further increased the potential for transfer learning due to their high generalization capabilities [15]. However, most research primarily focuses on adapting multilingual Wav2Vec 2.0 models (XLSR-53) [19, 37, 20, 21]. This work expands the scope of analyzed LSMs by including WSL models such as Whisper. Additionally, we evaluate the ability of English-only models, beyond just multilingual models, to transfer knowledge to other languages.


3. Method

In this section, we describe the methodology for evaluating the effectiveness of LSMs as feature extractors for downstream SER in various languages. We stack a classification model on top of the LSM backbone, with its parameters frozen. All LSMs used in this work share the same overall architecture, which we describe below along with the stacked classification model.

Formally, the input audio 𝐴 (raw waveform or log-mel spectrogram) passes through a convolutional encoder 𝓏 : 𝐴 → 𝑍, mapping the audio to latent features 𝑍 = {𝑧_1, …, 𝑧_𝑇}, where 𝑇 is the sequence length and each frame 𝑧_𝑖 ∈ ℝ^𝑑 typically corresponds to 25 ms. Then, 𝑍 passes through a Transformer encoder consisting of 𝐿 layers 𝒽_𝑙 : 𝑍 → 𝐻, enriching the latent features with contextual information and yielding {ℎ^𝑙_1, …, ℎ^𝑙_𝑇} for each of the 𝑙 = 1, …, 𝐿 Transformer layers, with ℎ^𝑙_𝑖 ∈ ℝ^𝑑. Here, 𝑙 = 𝐿 corresponds to the output features of the last layer. The features {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_{𝑙=1,…,𝐿} are considered the extracted features from the LSM and are fed into a downstream classifier 𝓎 : 𝐻 → 𝑌, which maps these features to the output class logits {𝑦_1, …, 𝑦_𝑘}. The output class label 𝑦^∗ for audio 𝐴 is given by:

    𝑦^∗ = arg max_𝑘 softmax(𝓎(𝒽(𝓏(𝐴))))    (1)

Inspired by previous work that uses probing to evaluate the quality of features extracted from backbone models [38, 39], we evaluate three different downstream classifiers of increasing complexity: Linear Classifier (ℊ_𝑙), Non-Linear Classifier (ℊ_𝑛𝑙), and Multi-Layer Classifier (ℊ_𝑚𝑙). Figure 1 illustrates their architecture, which is detailed below.

Figure 1: The three downstream classifiers used in this work are: Linear (red), Non-Linear (purple), and Multi-Layer (green). The snowflake icon represents frozen weights, while the fire icon denotes trainable weights.
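To make the extraction step concrete, the following minimal sketch shows how the layer-wise features {ℎ^𝑙_1, …, ℎ^𝑙_𝑇} could be obtained from a frozen backbone with the Hugging Face transformers library. This is an illustrative reconstruction, not the authors' released code; checkpoint identifiers and function names are assumptions chosen for the example.

    # Sketch: extracting layer-wise features from a frozen LSM backbone.
    # Assumes the Hugging Face `transformers` and `torch` packages; checkpoint
    # names are illustrative and not necessarily those used in the paper.
    import torch
    from transformers import AutoFeatureExtractor, Wav2Vec2Model, WhisperModel

    def extract_wav2vec2_features(waveform_16khz: torch.Tensor,
                                  checkpoint: str = "facebook/wav2vec2-base"):
        """Return hidden states for every Wav2Vec 2.0 Transformer layer."""
        fe = AutoFeatureExtractor.from_pretrained(checkpoint)
        model = Wav2Vec2Model.from_pretrained(checkpoint).eval()
        inputs = fe(waveform_16khz.numpy(), sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():  # backbone frozen, used only as a feature extractor
            out = model(**inputs, output_hidden_states=True)
        # Tuple of (L + 1) tensors of shape (1, T, d); index 0 precedes the first
        # Transformer block (projected convolutional features).
        return out.hidden_states

    def extract_whisper_encoder_features(waveform_16khz: torch.Tensor,
                                         checkpoint: str = "openai/whisper-small"):
        """Return encoder hidden states for every Whisper Transformer layer."""
        fe = AutoFeatureExtractor.from_pretrained(checkpoint)   # log-mel features
        model = WhisperModel.from_pretrained(checkpoint).eval()
        inputs = fe(waveform_16khz.numpy(), sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            enc = model.encoder(inputs.input_features, output_hidden_states=True)
        return enc.hidden_states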
3.1. Linear Classifier

For the linear classifier, we use a simple feed-forward neural network that consists solely of linear projections. Specifically, the features from the last Transformer layer {ℎ^𝐿_1, …, ℎ^𝐿_𝑇} are first projected by a linear layer 𝓁_1 : ℝ^𝑑 → ℝ^𝑚 that is shared across all frames, then aggregated by average pooling 𝓅, and finally passed through the classification layer ℴ : ℝ^𝑚 → ℝ^𝑘 to obtain the output class logits. The function ℊ_𝑙 is compactly defined as:

    ℊ_𝑙(ℎ^𝐿_1, …, ℎ^𝐿_𝑇) = ℴ(𝓅(𝓁_1(ℎ^𝐿_1, …, ℎ^𝐿_𝑇)))    (2)

The absence of non-linear activations allows us to evaluate the quality of the features extracted from the LSM based on the linear classifier's ability to handle the SER task.

3.2. Non-Linear Classifier

To increase the complexity of the classification model, we utilize a series of linear layers interleaved with ReLU activations both before and after feature pooling. We follow the same architecture as in [14, 15], but unlike them, we only feed the features from the last Transformer layer 𝐿 to the model. Each sequence {ℎ^𝐿_1, …, ℎ^𝐿_𝑇} passes through two shared linear, ReLU, and dropout blocks (𝒷), followed by a linear layer (𝓁_1). Linear layers are functions 𝓁 : ℝ^𝑑 → ℝ^𝑚. The projected features are averaged, pass through 𝓁_2 and a ReLU, and are classified by ℴ. Thus, ℊ_𝑛𝑙 is:

    ℊ_𝑛𝑙(𝑥 = ℎ^𝐿_1, …, ℎ^𝐿_𝑇) = ℴ(ReLU(𝓁_2(𝓅(𝓁_1(𝒷(𝑥))))))    (3)

3.3. Multi-Layer Classifier

As a third option, we adopt the approach from [14, 15], which utilizes all hidden states of the Transformer encoder. The features {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_{𝑙=1,…,𝐿} are combined into a new sequence {ℎ^∗_1, …, ℎ^∗_𝑇} using a learnable weighted sum. The function 𝓈 : ℝ^{𝐿×𝑇×𝑑} → ℝ^{𝑇×𝑑} maps {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_{𝑙=1,…,𝐿} to {ℎ^∗_1, …, ℎ^∗_𝑇} as follows:

    ℎ^∗_𝑡 = Σ_{𝑙=1}^{𝐿} 𝑤_𝑙 · ℎ^𝑙_𝑡,   for 𝑡 = 1, …, 𝑇    (4)

where 𝑤_1, …, 𝑤_𝐿 are the weights assigned to each Transformer layer, ensuring 𝑤_𝑙 ∈ [0, 1] and Σ_{𝑙=1}^{𝐿} 𝑤_𝑙 = 1. The resulting sequence {ℎ^∗_1, …, ℎ^∗_𝑇} is then processed by the same pipeline as the Non-Linear Classifier, resulting in:

    ℊ_𝑚𝑙(𝑥 = {ℎ^𝑙_1, …, ℎ^𝑙_𝑇}_{𝑙=1,…,𝐿}) = ℊ_𝑛𝑙(𝓈(𝑥))    (5)

This classifier leverages internal layer information, which has proven beneficial for paralinguistic and linguistic downstream tasks [39, 40, 41, 42]. By investigating the contribution of internal LSM layers to SER across various languages, we corroborate previous findings for Wav2Vec 2.0 models and provide new insights for Whisper models.
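As a concrete reading of Eqs. (2), (4), and (5), the sketch below shows how the Linear head and the Multi-Layer head (learnable weighted sum followed by the Non-Linear pipeline) could be written in PyTorch. It is a minimal reconstruction from the definitions above; module names, the dropout rate, and the exact composition of the shared blocks 𝒷 are illustrative assumptions, not the authors' implementation.

    # Sketch of the downstream heads in PyTorch (illustrative, not the authors' code).
    import torch
    import torch.nn as nn

    class LinearHead(nn.Module):
        """Eq. (2): shared frame-wise projection, average pooling, class logits."""
        def __init__(self, d: int, k: int, m: int = 256):
            super().__init__()
            self.l1 = nn.Linear(d, m)    # l_1, shared across frames
            self.out = nn.Linear(m, k)   # o, classification layer
        def forward(self, h_last):       # h_last: (batch, T, d), last-layer features
            return self.out(self.l1(h_last).mean(dim=1))   # average pooling p

    class MultiLayerHead(nn.Module):
        """Eqs. (4)-(5): learnable weighted sum over the L layer outputs,
        followed by the Non-Linear pipeline of Eq. (3)."""
        def __init__(self, num_layers: int, d: int, k: int, m: int = 256,
                     dropout: float = 0.1):
            super().__init__()
            self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # softmax -> w_1..w_L
            self.block = nn.Sequential(                 # b: two linear/ReLU/dropout blocks
                nn.Linear(d, m), nn.ReLU(), nn.Dropout(dropout),
                nn.Linear(m, m), nn.ReLU(), nn.Dropout(dropout),
            )
            self.l1 = nn.Linear(m, m)
            self.l2 = nn.Linear(m, m)
            self.out = nn.Linear(m, k)                  # o
        def forward(self, h_all):                       # h_all: (batch, L, T, d)
            w = torch.softmax(self.layer_logits, dim=0)            # w_l in [0, 1], sums to 1
            h_star = torch.einsum("l,bltd->btd", w, h_all)         # Eq. (4): s(x)
            pooled = self.l1(self.block(h_star)).mean(dim=1)       # p(l_1(b(.)))
            return self.out(torch.relu(self.l2(pooled)))           # Eq. (3): o(ReLU(l_2(.)))

Parameterizing the mixture as a softmax over a learnable vector is one simple way to satisfy the constraints of Eq. (4), namely 𝑤_𝑙 ∈ [0, 1] and Σ_𝑙 𝑤_𝑙 = 1.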
4. Experiments

4.1. Datasets and Metrics

In this study, we conduct experiments using 9 distinct datasets spanning 8 different languages: Greek, Canadian French, French, Italian, German, Spanish, Egyptian Arabic, and Persian. The datasets vary in their collection methodologies, such as acted emotions and elicitation methods. The participant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our experiments in a speaker-independent setting to prevent evaluation on speaker-dependent features. Table 1 provides an overview of the dataset statistics, with a more detailed description given below.

    Dataset     Language           # Samples   Emotions
    AESDD       Greek                    500   anger, disgust, fear, happiness, and sadness
    CaFE        Canadian French          936   anger, disgust, fear, happiness, surprise, sadness, and neutrality
    DEMoS       Italian                 9697   anger, disgust, fear, happiness, surprise, sadness, and neutrality
    EmoDB       German                   535   anger, disgust, fear, happiness, boredom, sadness, and neutrality
    EmoMatch    Spanish                 2005   anger, disgust, fear, happiness, surprise, sadness, and neutrality
    EMOVO       Italian                  588   anger, disgust, fear, happiness, surprise, sadness, and neutrality
    EYASE       Egyptian Arabic          579   anger, happiness, sadness, and neutrality
    Oréau       French                   502   anger, disgust, fear, happiness, surprise, sadness, and neutrality
    ShEMO       Persian                  400   anger, happiness, sadness, and neutrality

Table 1
Summary statistics of the 9 datasets used in this work.

AESDD [43]: The Acted Emotional Speech Dynamic Database comprises 500 recorded samples from 5 actors (3 females, 2 males) expressing 5 distinct emotions in Greek. Each actor performed 20 utterances per emotion, with some utterances recorded multiple times. In later versions, additional actors were included, bringing the total to 604 recordings from 6 actors.

CaFE [44]: This dataset includes recordings of 6 different sentences delivered by 12 actors (6 female, 6 male) portraying the Big Six emotions and a neutral state in Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as a down-sampled version at 48 kHz and 16 bits per sample. The total number of samples amounts to 936.

DEMoS [45]: DEMoS contains 9697 audio samples from 68 volunteer students (299 females, 131 males) expressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were generated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1 s), are provided in 48 kHz, 16-bit, mono format.

EmoDB [46]: This collection includes 535 utterances across 7 emotional states, spoken in German by 5 female and 5 male actors. Each actor performed a set of 10 sentences, which were down-sampled from the original 48 kHz to 16 kHz.

EmoMatch [33]: Consisting of 2005 recordings, EmoMatch features samples from 50 non-actor Spanish speakers (20 females, 30 males) expressing the Big Six emotions and a neutral state. The dataset is a subset of the larger EmoSpanishDB and contains recordings sampled at 48 kHz with a 16-bit mono format.

EMOVO [47]: EMOVO presents 588 Italian audio recordings from 3 male and 3 female actors simulating the Big Six emotions plus a neutral state. Each actor voiced 14 utterances, and the recordings are provided in 48 kHz, 16-bit stereo WAV format.

EYASE [48]: EYASE contains 579 utterances in Egyptian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz.

Oréau [49]: The Oréau dataset features 502 audio samples from 32 non-professional actors (25 male, 7 female) who voiced 10 to 13 utterances in French for the Big Six emotions plus a neutral state.

ShEMO [50]: ShEMO comprises 3000 semi-natural recordings from 87 native Persian speakers (31 female, 56 male). The dataset captures 5 of the Big Six emotions—sadness, anger, happiness, surprise, and fear—plus a neutral state. The samples were up-sampled to a frequency of 44.1 kHz in mono-channel format, with an average length of 4.11 seconds (std: 3.41 s).

The audio is resampled to 16 kHz, and a stratified train/validation/test split is performed with ratios of 80/10/10. All results are reported using the macro F1 score, expressed as a percentage. We conducted 3 runs, presenting the mean ± standard deviation.

4.2. Experimental Details

Baseline. As a baseline to evaluate LSM transfer learning capabilities, we adopt the Audio Spectrogram Transformer (AST) [51], a fully transformer-based architecture recently proposed as a substitute for CNNs [9, 10, 11]. We train AST from scratch on each of the 9 datasets using the same hyperparameters as [51].

LSM Models. We use pre-trained checkpoints for both English-Only and Multilingual models: Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R from the Wav2Vec 2.0 family, and Whisper Small (EN) (Whisper Small pre-trained only on English data), Whisper Small, and Whisper Medium from the Whisper family. The LSM backbones are kept frozen and used exclusively as feature extractors.

Training. We follow the same hyperparameter settings as [15] to train the downstream classifiers. Specifically, we train for 30 epochs using the Adam optimizer with a learning rate of 5.0e-04, weight decay of 1.0e-04, betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimension of the classifier projection 𝑚 is 256.
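A minimal sketch of the optimizer and metric configuration just described (Adam with the listed hyperparameters, macro F1 in percent averaged over 3 runs) is given below; the data pipeline and training loop are omitted, and all names are illustrative rather than taken from the authors' code.

    # Sketch of the training/evaluation configuration of Sec. 4.2 (illustrative).
    import torch
    from sklearn.metrics import f1_score

    NUM_EPOCHS = 30
    PROJECTION_DIM = 256   # classifier projection size m

    def make_optimizer(classifier: torch.nn.Module) -> torch.optim.Optimizer:
        # Hyperparameters follow the settings reported above (taken from [15]).
        return torch.optim.Adam(classifier.parameters(), lr=5.0e-4,
                                weight_decay=1.0e-4, betas=(0.9, 0.98), eps=1.0e-8)

    def macro_f1_percent(y_true, y_pred) -> float:
        # Results are reported as macro F1 (percentage), mean +/- std over 3 runs.
        return 100.0 * f1_score(y_true, y_pred, average="macro")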
4.3. Results

To present our results, we first compare the performance of the various classifiers (see Section 3) for each LSM utilized. This analysis provides insights into the characteristics of features extracted from Wav2Vec 2.0 and Whisper models for downstream SER tasks. After identifying the best classifier for each LSM, we then compare the performance of English-Only and Multilingual LSMs across the 8 languages covered in this study.

4.3.1. Comparison between downstream classifiers

We examine the results in Table 2, comparing the three classifier methods for Wav2Vec 2.0 and Whisper models. The table shows average F1 scores across the 9 datasets, highlighting the most effective classifier for each LSM in cross-lingual SER tasks.

    Backbone               Linear           Non-Linear       Multi-Layer
    Wav2Vec 2.0 Base       47.87 (± 0.93)   42.07 (± 5.27)   53.42 (± 1.27)
    Wav2Vec 2.0 Large      12.09 (± 1.50)   12.93 (± 3.31)   57.50 (± 0.03)
    XLS-R                   5.43 (± 0.40)    5.86 (± 0.07)   40.89 (± 2.00)
    Whisper Small (EN)     58.16 (± 0.15)   53.50 (± 0.98)   49.73 (± 2.02)
    Whisper Small          60.87 (± 0.26)   54.86 (± 0.93)   45.14 (± 1.54)
    Whisper Medium         60.72 (± 0.16)   55.56 (± 1.09)   37.95 (± 2.27)

Table 2
Performance of various LSM backbones using Linear, Non-Linear, and Multi-Layer classification methods. F1 scores are averaged across all 9 datasets. For each LSM, the best classifier is highlighted in bold.

For Wav2Vec 2.0 models, the Multi-Layer Classifier performs best, with F1 scores of 53.42, 57.50, and 40.89 for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R, respectively. The Linear and Non-Linear classifiers perform similarly, especially for Wav2Vec 2.0 Large and XLS-R, suggesting that the improvements are due to using features from internal Transformer layers rather than to non-linear activations. For Whisper models, the Linear Classifier performs best, with F1 scores of 58.16, 60.87, and 60.72 for Whisper Small (EN), Whisper Small, and Whisper Medium, respectively. Increasing classifier complexity with non-linear activations decreases performance, likely due to general information loss caused by complex transformations. The Multi-Layer Classifier performs worse, indicating that also using features from internal layers is less effective than using features from the last layer alone.

This comparison reveals that Wav2Vec 2.0 models benefit from features extracted from internal Transformer layers and exhibit less sensitivity to classifier complexity, consistent with prior research [41, 39]. Conversely, Whisper models achieve better performance with features from the last Transformer layer when using a simple linear classifier, offering new insights into their effectiveness for SER across multiple languages. We hypothesize that this differing behavior may be related to their respective Self-Supervised and Weakly-Supervised pre-training approaches, which warrant further investigation. To gain further insights into the importance of Transformer layers in Wav2Vec 2.0 and Whisper for SER, we leverage the weights learned in the Multi-Layer classifier as follows.

Figure 2: Greyscale map of layer weight distribution from the Multi-Layer classification method. Weights are averaged over all 9 datasets for each model. Darker shades indicate higher weights.

Transformer Layer Weights. We analyze the weights 𝑤_1, …, 𝑤_𝐿 from the Multi-Layer Classifier to assess Transformer layer importance. Figure 2 illustrates that Wav2Vec 2.0 models assign greater weight to the early and middle layers, whereas Whisper models emphasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic information in Whisper models is embedded in the features of the later Transformer layers.
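In principle, the layer-importance profile shown in Figure 2 can be read directly from a trained Multi-Layer head, since the mixing weights are obtained by normalizing a learnable vector. The snippet below sketches this, assuming the illustrative MultiLayerHead defined in the earlier sketch; it is not the authors' analysis code.

    # Sketch: recovering normalized layer weights w_1..w_L from a trained
    # Multi-Layer head (assumes the illustrative MultiLayerHead defined earlier).
    import torch

    def layer_weights(head) -> torch.Tensor:
        """Return w_l = softmax(alpha)_l, i.e. the per-layer importance of Eq. (4)."""
        with torch.no_grad():
            return torch.softmax(head.layer_logits, dim=0).cpu()

    def average_layer_weights(heads) -> torch.Tensor:
        """Average the weight profiles of several trained heads (e.g. one per
        dataset), as done for the greyscale map in Figure 2."""
        return torch.stack([layer_weights(h) for h in heads]).mean(dim=0)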
                                                   English-Only                                           Multilingual
       Dataset/Model       AST             Wav2Vec 2.0     Wav2Vec 2.0     Whisper            XLS-R‡         Whisper          Whisper
                                           Base‡           Large‡          Small (EN)†                       Small†           Medium†
       AESDD (el)      19.84 (± 0.16) 25.45 (± 0.98) 28.89 (± 2.64) 28.04 (± 0.99)     9.16 (± 1.25)   26.34 (± 1.65)   27.62 (± 0.62)
       CaFE (fr-ca)    10.96 (± 6.26) 50.52 (± 3.54) 47.74 (± 0.33)   60.66 (± 0.76)   18.66 (± 0.01) 66.71 (± 0.72) 55.03 (± 0.38)
       DEMoS (it)      13.75 (± 4.26) 87.85 (± 0.01) 88.31 (± 0.74)   88.24 (± 0.21)   67.71 (± 1.47) 90.61 (± 0.14) 91.17 (± 0.20)
       EmoDB (de)      46.11 (± 6.55) 81.75 (± 7.30) 88.84 (± 7.48)   83.31 (± 0.18)   67.39 (± 4.33) 87.21 (± 1.11) 90.64 (± 1.47)
       EmoMatch (es)   36.10 (± 2.63) 69.84 (± 0.69) 71.85 (± 1.55)   67.59 (± 0.35)   44.14 (± 0.25) 73.13 (± 2.54) 68.23 (± 0.78)
       EMOVO (it)      15.74 (± 1.24) 16.47 (± 0.61) 20.33 (± 1.31)   27.30 (± 0.16)   14.86 (± 2.11) 41.05 (± 1.21) 50.19 (± 0.29)
       EYASE (ar-eg)   33.23 (± 4.58) 46.31 (± 3.62) 53.40 (± 1.56) 42.65 (± 0.70)     47.27 (± 1.36) 51.98 (± 0.88)    37.32 (± 3.62)
       Oréau (fr)      19.01 (± 2.35) 52.86 (± 0.07) 58.42 (± 4.14)   82.27 (± 0.23)   32.51 (± 4.89) 92.70 (± 1.67) 95.22 (± 0.84)
       ShEMO (fa)      36.15 (± 0.85) 60.55 (± 3.90) 57.52 (± 9.09) 67.93 (± 0.37) 61.24 (± 8.93) 63.88 (± 1.21)        63.85 (± 1.58)


Table 3
Performance of Wav2Vec and Whisper models across 9 datasets, divided into English-Only and Multilingual LSMs. AST is the
baseline. † indicates a Linear Classifier, ‡ a Multi-Layer Classifier. Bold values are the highest scores, and underlined values
highlight the best between English-Only and Multilingual models.

4.3.2. Comparing English-Only and Multilingual LSMs Across Different Languages

In this section, we compare English-Only and Multilingual LSMs with the AST baseline across the 9 datasets. Table 3 displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 models and Linear for Whisper models.

Transferring knowledge from LSMs proves to be effective across all datasets compared to the baseline. For instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores 33.23. This indicates that LSMs are effective feature extractors for cross-lingual SER on multiple languages.

When comparing English-Only and Multilingual models, we differentiate between the Wav2Vec 2.0 and Whisper families. For Wav2Vec 2.0, we observe that Wav2Vec 2.0 Base and Large generally outperform XLS-R (e.g., 87.85 and 88.31 vs. 67.71 for DEMoS), except in Persian, where their performance is comparable. This indicates that multilingual pre-training may not be as effective for Wav2Vec 2.0 models across various languages. We speculate that this may be due to the limitations of SSL pre-training, which might struggle with the diverse range of languages and lose important paralinguistic features that are retained in English-only models. Further investigation with a wider range of SSL-pretrained LSMs could provide more insights. As regards Whisper, multilingual Whisper Small outperforms its English-only version, with the exception of Greek and Persian, likely due to limited pretraining data for these languages, which resulted in higher word error rates compared to other languages in this study [13]. Multilingual Whisper models achieve the best performance in Canadian French and Spanish (66.71 and 73.13 with Whisper Small) and in Italian, German, and French (91.17, 90.64, and 95.22 with Whisper Medium). This improvement is likely due to the larger pretraining datasets for these languages and the similarities between Canadian French and French. We believe that multilingual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL and multitask learning. However, further research is needed to evaluate the effectiveness of multilingual pre-training with WSL compared to SSL across a broader range of LSMs.


5. Conclusion

This paper examines the capabilities of Wav2Vec 2.0 and Whisper models as feature extractors for cross-lingual SER across eight languages, considering both English-Only and Multilingual variants. Our findings reveal that LSMs are effective feature extractors compared to a full Transformer baseline trained from scratch. We observe that Whisper models encode acoustic information primarily in the features of the last Transformer layer, whereas Wav2Vec 2.0 models rely on features from middle and early layers. Furthermore, we show that multilingual pre-training benefits Whisper models, leading to strong performance in Italian, Canadian French, French, Spanish, and German, and competitive results in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving top performance in Greek and Egyptian Arabic. We attribute the disparity in multilingual pre-training effectiveness to the differences between SSL and WSL strategies, which should be explored further.


References

 [1] C.-C. Lee, K. Sridhar, J.-L. Li, W.-C. Lin, B.-H. Su, C. Busso, Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities, IEEE Signal Processing Magazine 38 (2021) 22–38.
 [2] H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, Y. Zong, A survey of deep learning-based multimodal emotion recognition: Speech, text, and face, Entropy 25 (2023) 1440.
 [3] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, A comprehensive review of speech emotion recognition systems, IEEE access 9 (2021) 47795–47814.
 [4] Z. Huang, M. Dong, Q. Mao, Y. Zhan, Speech emotion recognition using cnn, in: Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 801–804.
 [5] A. M. Badshah, J. Ahmad, N. Rahim, S. W. Baik, Speech emotion recognition from spectrograms with deep convolutional neural network, in: 2017 international conference on platform technology and service (PlatCon), IEEE, 2017, pp. 1–5.
 [6] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1d & 2d cnn lstm networks, Biomedical signal processing and control 47 (2019) 312–323.
 [7] T. Feng, H. Hashemi, M. Annavaram, S. S. Narayanan, Enhancing privacy through domain adaptive noise injection for speech emotion recognition, in: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2022, pp. 7702–7706.
 [8] W. Lim, D. Jang, T. Lee, Speech emotion recognition using convolutional and recurrent neural networks,
     in: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), IEEE, 2016, pp. 1–4.
 [9] N.-C. Ristea, R. T. Ionescu, F. S. Khan, Septr: Separable transformer for audio spectrogram processing, arXiv preprint arXiv:2203.09581 (2022).
[10] J.-Y. Kim, S.-H. Lee, Coordvit: a novel method of improve vision transformer-based speech emotion recognition using coordinate information concatenate, in: 2023 International conference on electronics, information, and communication (ICEIC), IEEE, 2023, pp. 1–4.
[11] S. Akinpelu, S. Viriri, A. Adegun, An enhanced speech emotion recognition using vision transformer, Scientific Reports 14 (2024) 13126.
[12] S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, B. W. Schuller, Audio self-supervised learning: A survey, Patterns 3 (2022).
[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.
[14] L. Pepino, P. Riera, L. Ferrer, Emotion recognition from speech using wav2vec 2.0 embeddings, arXiv preprint arXiv:2104.03502 (2021).
[15] T. Feng, S. Narayanan, Peft-ser: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2023, pp. 1–8.
[16] Y. Li, A. Mehrish, R. Bhardwaj, N. Majumder, B. Cheng, S. Zhao, A. Zadeh, R. Mihalcea, S. Poria, Evaluating parameter-efficient transfer learning approaches on sure benchmark for speech understanding, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[17] T. Feng, R. Hebbar, S. Narayanan, Trust-ser: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 11201–11205.
[18] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in neural information processing systems 33 (2020) 12449–12460.
[19] M. Sharma, Multi-lingual multi-task speech emotion recognition using wav2vec 2.0, in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 6907–6911.
[20] S. G. Upadhyay, L. Martinez-Lucas, B.-H. Su, W.-C. Lin, W.-S. Chien, Y.-T. Wu, W. Katz, C. Busso, C.-C. Lee, Phonetic anchor-based transfer learning to facilitate unsupervised cross-lingual speech emotion recognition, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[21] S. A. M. Zaidi, S. Latif, J. Qadir, Cross-language speech emotion recognition using multimodal dual attention transformers, arXiv preprint arXiv:2306.13804 (2023).
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[23] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[25] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., Wavlm: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.
[26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, et al., Superb: Speech processing universal performance benchmark, arXiv preprint arXiv:2105.01051 (2021).
[27] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition, arXiv preprint arXiv:2006.13979 (2020).
[28] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. Von Platen, Y. Saraf, J. Pino, et al., Xls-r: Self-supervised cross-lingual speech representation learning at scale, arXiv preprint arXiv:2111.09296 (2021).
[29] A. Wurst, M. Hopwood, S. Wu, F. Li, Y.-D. Yao, Deep learning for the detection of emotion in human speech: The impact of audio sample duration and english versus italian languages, in: 2023 32nd Wireless and Optical Communications Conference (WOCC), IEEE, 2023, pp. 1–6.
[30] M. Neumann, et al., Cross-lingual and multilingual speech emotion recognition on english and french, in: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2018, pp. 5769–5773.
[31] S. Deng, N. Zhang, Z. Sun, J. Chen, H. Chen, When
     low resource nlp meets unsupervised language model: meta-pretraining then meta-learning for few-shot text classification (student abstract), in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 13773–13774.
[32] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross lingual speech emotion recognition: Urdu vs. western languages, in: 2018 International conference on frontiers of information technology (FIT), IEEE, 2018, pp. 88–93.
[33] E. Garcia-Cuesta, A. B. Salvador, D. G. Pãez, Emomatchspanishdb: study of speech emotion recognition machine learning models in a new spanish elicited database, Multimedia Tools and Applications 83 (2024) 13093–13112.
[34] A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. Morency, Tensor fusion network for multimodal sentiment analysis, arXiv preprint arXiv:1707.07250 (2017).
[35] H. H. Mao, A survey on self-supervised pre-training for sequential transfer learning in neural networks, arXiv preprint arXiv:2007.00800 (2020).
[36] S. Sadok, S. Leglaive, R. Séguier, A vector quantized masked autoencoder for speech emotion recognition, in: 2023 IEEE International conference on acoustics, speech, and signal processing workshops (ICASSPW), IEEE, 2023, pp. 1–5.
[37] F. Catania, Speech emotion recognition in italian using wav2vec 2, Authorea Preprints (2023).
[38] Y. Belinkov, J. Glass, Analyzing hidden representations in end-to-end automatic speech recognition systems, Advances in Neural Information Processing Systems 30 (2017).
[39] J. Shah, Y. K. Singla, C. Chen, R. R. Shah, What all do audio transformer models hear? probing acoustic representations for language delivery and its structure, arXiv preprint arXiv:2101.00387 (2021).
[40] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.
[41] Y. Li, Y. Mohamied, P. Bell, C. Lai, Exploration of a self-supervised speech model: A study on emotional corpora, in: 2022 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2023, pp. 868–875.
[42] A. Pasad, B. Shi, K. Livescu, Comparative layer-wise analysis of self-supervised speech models, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.
[43] N. Vryzas, R. Kotsakis, A. Liatsou, C. A. Dimoulas, G. Kalliris, Speech emotion recognition for performance interaction, Journal of the Audio Engineering Society 66 (2018) 457–467.
[44] P. Gournay, O. Lahaie, R. Lefebvre, A canadian french emotional speech dataset, in: Proceedings of the 9th ACM multimedia systems conference, 2018, pp. 399–402.
[45] E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, B. W. Schuller, Demos: An italian emotional speech corpus: Elicitation methods, machine learning, and perception, Language Resources and Evaluation 54 (2020) 341–383.
[46] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss, et al., A database of german emotional speech., in: Interspeech, volume 5, 2005, pp. 1517–1520.
[47] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco, et al., Emovo corpus: an italian emotional speech database, in: Proceedings of the ninth international conference on language resources and evaluation (LREC'14), European Language Resources Association (ELRA), 2014, pp. 3501–3504.
[48] L. Abdel-Hamid, Egyptian arabic speech emotion recognition using prosodic, spectral and wavelet features, Speech Communication 122 (2020) 19–30.
[49] S. Oréau, French emotional speech database - oréau, Zenodo, 2021. URL: https://zenodo.org/records/4405783.
[50] O. Mohamad Nezami, P. Jamshid Lou, M. Karami, Shemo: a large-scale validated database for persian speech emotion detection, Language Resources and Evaluation 53 (2019) 1–16.
[51] Y. Gong, Y.-A. Chung, J. Glass, Ast: Audio spectrogram transformer, arXiv preprint arXiv:2104.01778 (2021).