<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico D'Asaro</string-name>
          <email>federico.dasaro@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan José Márquez Villacís</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Rizzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Bottino</string-name>
          <email>andrea.bottino@polito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LINKS Foundation - AI, Data and Space (ADS)</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We stack three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, and German, and performing competitively on Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek and Egyptian Arabic.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-lingual Speech Emotion Recognition</kwd>
        <kwd>Large Speech models</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Speech Emotion Recognition (SER) identifies emotions from speech audio, enhancing Human-AI interaction in fields such as healthcare, education, and security [<xref ref-type="bibr" rid="ref1">1</xref>]. Traditional methods rely on Low-Level Descriptor features [<xref ref-type="bibr" rid="ref2">2</xref>], using classifiers such as KNN, SVM, or Naïve Bayes [<xref ref-type="bibr" rid="ref3">3</xref>]. Deep learning has introduced advanced techniques, including Convolutional Neural Networks (CNNs) [<xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>], eventually followed by Recurrent Neural Networks (RNNs) [<xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>], and Transformers [9, 10, 11]. Transformers' ability to learn from extensive datasets has led to Large Speech Models (LSMs), which generalize across various speech tasks. Common training approaches for these models include Self-Supervised Learning (SSL), which uses the data itself to learn general-purpose features [12], and Weakly-Supervised Learning (WSL), which pairs audio with text for tasks like transcription and translation [13]. The general-purpose knowledge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English, while work on other languages remains limited, focusing on Wav2Vec 2.0 [18] for cross-lingual SER.</p>
      <p>This study examines how effective LSMs are as feature extractors for SER across eight languages: Italian, German, French, Canadian French, Spanish, Greek, Persian, and Egyptian Arabic. Specifically, we utilize LSMs from the Wav2Vec 2.0 and Whisper [13] model families, pre-trained with SSL and WSL approaches, respectively. We introduce Whisper due to its underexplored use in cross-lingual SER. To assess the effectiveness of LSMs as feature extractors, we test three classifiers of increasing complexity (Linear, Non-Linear, and Multi-Layer) across nine datasets. This evaluation determines which classifier best suits each LSM across different languages. Moreover, our study includes both English-Only and Multilingual models from the Wav2Vec 2.0 and Whisper families, aiming to evaluate the effectiveness of multilingual pre-training for cross-lingual SER.</p>
      <p>The main contributions of this work are:</p>
      <list list-type="bullet">
        <list-item>
          <p>We evaluate LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for cross-lingual SER across eight languages.</p>
        </list-item>
        <list-item>
          <p>We test three types of downstream classifiers (Linear, Non-Linear, and Multi-Layer) and find that Whisper models' last Transformer layer features are well-suited for a Linear classifier, whereas Wav2Vec 2.0 models perform better with features from the middle and early Transformer layers.</p>
        </list-item>
        <list-item>
          <p>We compare English-Only and Multilingual LSMs, revealing that Whisper models benefit from multilingual pre-training, performing best on Italian, Spanish, Canadian French, French, and German, and competitively on Greek, Egyptian Arabic, and Persian. Conversely, English-Only Wav2Vec 2.0 models surpass multilingual XLS-R in most languages, achieving the highest performance in Greek and Egyptian Arabic.</p>
        </list-item>
      </list>
      <p>0009-0003-8727-3393 (F. D'Asaro); 0009-0008-3098-5492 (J. J. Márquez Villacís); 0000-0003-0083-813X (G. Rizzo); 0000-0002-8894-5089 (A. Bottino). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org).</p>
    </sec>
    <sec id="sec-11">
      <title>2. Background</title>
      <sec id="sec-11-1">
        <title>2.1. Large Speech Models</title>
      </sec>
    </sec>
    <sec id="sec-12">
      <p>Recent developments in natural language processing and computer vision have harnessed large volumes of unlabeled data through Self-Supervised Learning [22, 23, 24]. Building on techniques such as masked language and image modeling, Wav2Vec 2.0 [18] introduced an LSM trained on extensive audio datasets using masked speech modeling. Wav2Vec 2.0 features seven 1D convolutional blocks for initial feature extraction, followed by 12 or 24 transformer blocks (depending on the model variant) for contextual processing. The model masks part of the latent features and reconstructs them using the surrounding context. To further refine LSMs for tasks like emotion recognition, methods such as WavLM [25] have been developed. WavLM incorporates speech denoising alongside masked modeling, demonstrating broad effectiveness across various tasks in the SUPERB benchmark [26]. Moreover, XLSR-53 [27] extends the Wav2Vec 2.0 framework to cover 53 languages, sharing the latent space across these languages. This approach has shown superior performance over monolingual pretraining for automatic speech recognition. XLS-R [28] further advances this by scaling to 128 languages, excelling in speech translation and language identification. In comparison, Whisper [13] leverages large-scale weak supervision from audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is trained in a multitask framework that includes multilingual transcription and translation, establishing itself as an effective zero-shot model for multilingual tasks.</p>
    </sec>
    <sec id="sec-13">
      <title>2.2. Cross-Language Speech Emotion Recognition</title>
      <p>Emotion recognition in languages beyond English, like Italian [29], French [30], Persian [31, 32], and Spanish [33], is crucial but often limited by data availability. Recent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like dual attention [21] and tensor fusion [34] enhance audio and text interaction in languages such as Italian, German, and Urdu. Self-supervised pre-training methods, including variational autoencoders, have also been effective in transferring knowledge across languages like German [35, 36]. The advent of LSMs pre-trained with self-supervision has further increased the potential for transfer learning due to their high generalization capabilities [15]. However, most research primarily focuses on adapting multilingual Wav2Vec 2.0 models (XLSR-53) [19, 37, 20, 21]. This work expands the scope of analyzed LSMs, including WSL models such as Whisper. Additionally, we evaluate the ability of English-only models to transfer knowledge to other languages, beyond just multilingual models.</p>
    </sec>
    <sec id="sec-14">
      <title>3. Method</title>
      <p>In this section, we describe the methodology for evaluating the effectiveness of LSMs as feature extractors for downstream SER in various languages. We stack a classification model on top of the LSM backbone, with its parameters frozen. All LSMs used in this work share the same overall architecture, which we describe below along with the stacked classification model.</p>
      <p>Formally, the input audio x (raw waveform or log-mel spectrogram) passes through a convolutional encoder f : X → Z, mapping the audio to latent features z = {z_1, ..., z_T}, where T is the sequence length and each frame z_t typically corresponds to 25 ms, with z_t ∈ R^d. Then, z passes through a Transformer encoder consisting of L layers g_l : Z → H, enriching the latent features with contextual information, resulting in {h_1^l, ..., h_T^l} for each of the l = 1, ..., L Transformer layers. Here, l = L corresponds to the output features of the last layer, with h_t^l ∈ R^D. The features {h_1^l, ..., h_T^l}, l = 1, ..., L, are considered the extracted features from the LSM and are fed into a downstream classifier c : H → R^C, which maps these features to the output class logits {o_1, ..., o_C}. The output class label y* for audio x is given by:</p>
      <p>y* = arg max softmax(c(g(f(x))))   (1)</p>
    </sec>
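As an illustration of the mapping in Eq. (1), the sketch below mocks the frozen backbone with random hidden states (the shapes L, T, D, C and all weights are illustrative assumptions, not the paper's trained models; in practice the hidden states would come from a pre-trained Wav2Vec 2.0 or Whisper encoder) and applies a minimal mean-pool-plus-linear classifier followed by softmax and argmax:

```python
import numpy as np

rng = np.random.default_rng(0)

L, T, D, C = 12, 50, 768, 5                 # layers, frames, feature dim, emotion classes
hidden_states = rng.normal(size=(L, T, D))  # stand-in for {h_t^l} from the frozen LSM

def softmax(v):
    # numerically stable softmax over the class logits
    e = np.exp(v - v.max())
    return e / e.sum()

# A minimal downstream classifier c: mean-pool the last layer, then a linear map.
W = rng.normal(size=(D, C)) * 0.01
b = np.zeros(C)

pooled = hidden_states[-1].mean(axis=0)     # average over the T frames of layer L
logits = pooled @ W + b
probs = softmax(logits)
y_star = int(np.argmax(probs))              # predicted emotion label, as in Eq. (1)

print(y_star, probs.shape)
```

The frozen-backbone assumption means only `W` and `b` would receive gradients during training.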
    <sec id="sec-17">
      <p>Inspired by previous work that uses probing to evaluate the quality of features extracted from backbone models [38, 39], we evaluate three different downstream classifiers of increasing complexity: Linear Classifier (g_LIN), Non-Linear Classifier (g_NL), and Multi-Layer Classifier (g_ML).</p>
      <sec id="sec-17-1">
        <title>3.1. Linear Classifier</title>
        <p>The Linear Classifier is a neural network that consists solely of linear projections. Given the features from the last Transformer layer {h_1^L, ..., h_T^L}, they are first projected by a linear layer W_1 : R^D → R^P that is shared across all frames, then aggregated by average pooling P, and finally passed through the classification layer O : R^P → R^C to obtain the output class logits. The function g_LIN is compactly defined as:</p>
        <p>g_LIN(h_1^L, ..., h_T^L) = O(P(W_1(h_1^L, ..., h_T^L)))   (2)</p>
        <p>The absence of non-linear activations allows us to evaluate the quality of the features extracted from the LSM based on the linear classifier model's ability to handle the SER task.</p>
      </sec>
      <sec id="sec-17-2">
        <title>3.2. Non-Linear Classifier</title>
        <p>To increase the complexity of the classification model, we utilize a series of linear layers interleaved with ReLU activations both before and after feature pooling. We follow the same architecture as in [14, 15], but unlike them, we only feed the features from the last Transformer layer L to the model. Each {h_1^L, ..., h_T^L} passes through two shared linear layers, ReLU, and dropout blocks (B), followed by a linear layer (W_1). Linear layers are functions W : R^D → R^P. Projected features are averaged, pass through W_2 and ReLU, and are classified by O. Thus, g_NL is:</p>
        <p>g_NL(h_1^L, ..., h_T^L) = O(ReLU(W_2(P(W_1(B(h_1^L, ..., h_T^L))))))   (3)</p>
      </sec>
      <sec id="sec-17-3">
        <title>3.3. Multi-Layer Classifier</title>
        <p>As a third option, we adopt the approach from [14, 15], which utilizes all hidden states of the Transformer encoder. The features {h_1^l, ..., h_T^l}, l = 1, ..., L, are combined into a new sequence {h_1*, ..., h_T*} using a learnable weighted sum S : R^(L×T×D) → R^(T×D):</p>
        <p>h_t* = Σ_(l=1..L) w_l · h_t^l,  for t = 1, ..., T   (4)</p>
        <p>where w_1, ..., w_L are the weights assigned to each Transformer layer, ensuring w_l ∈ [0, 1] and Σ_(l=1..L) w_l = 1. The resulting sequence {h_1*, ..., h_T*} is then processed by the same pipeline as the Non-Linear Classifier, resulting in:</p>
        <p>g_ML({h_1^l, ..., h_T^l}, l = 1, ..., L) = g_NL(S({h_1^l, ..., h_T^l}))   (5)</p>
      </sec>
    </sec>
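The Linear and Non-Linear probes can be sketched as a minimal NumPy forward pass. The weights here are randomly initialised (the paper trains them with Adam), the dropout blocks B are omitted for brevity, and the projection dimension P = 256 is taken from the experimental details; everything else (shapes, names) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, P, C = 50, 768, 256, 5   # frames, feature dim, projection dim, emotion classes

def linear(x, w, b):
    return x @ w + b

# Randomly initialised parameters standing in for the trained probe weights.
W1, b1 = rng.normal(size=(D, P)) * 0.01, np.zeros(P)
W2, b2 = rng.normal(size=(P, P)) * 0.01, np.zeros(P)
O, bo = rng.normal(size=(P, C)) * 0.01, np.zeros(C)

h_last = rng.normal(size=(T, D))   # stand-in for {h_1^L, ..., h_T^L} from the frozen LSM

def g_lin(h):
    # Eq. (2): shared projection W1, average pooling over frames, classifier O.
    return linear(linear(h, W1, b1).mean(axis=0), O, bo)

def g_nl(h):
    # Eq. (3), simplified: projection, pooling, then W2 + ReLU before the classifier.
    z = linear(h, W1, b1).mean(axis=0)
    return linear(np.maximum(linear(z, W2, b2), 0.0), O, bo)

print(g_lin(h_last).shape, g_nl(h_last).shape)
```

Both probes return C logits; only their depth differs, which is what the comparison in Section 4.3.1 isolates.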
    <sec id="sec-18">
      <p>This classifier leverages internal layer information, which has proven beneficial for paralinguistic and linguistic downstream tasks [39, 40, 41, 42]. By investigating the contribution of internal LSM layers for SER across various languages, we corroborate previous findings for Wav2Vec 2.0 models and provide new insights for Whisper models.</p>
    </sec>
    <sec id="sec-19">
      <title>4. Experiments</title>
      <sec id="sec-19-1">
        <title>4.1. Datasets and Metrics</title>
      </sec>
    </sec>
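A minimal sketch of the learnable weighted sum in Eq. (4): softmax-normalising one scalar logit per layer is a common way to satisfy the constraints w_l ∈ [0, 1] and Σ w_l = 1 (the paper does not state which normalisation it uses, so the softmax here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
L, T, D = 12, 50, 768                        # layers, frames, feature dim

hidden_states = rng.normal(size=(L, T, D))   # stand-in for {h_t^l} over all L layers

# One learnable scalar logit per Transformer layer; the softmax keeps the
# resulting weights w_l in [0, 1] with sum 1, as required by Eq. (4).
layer_logits = rng.normal(size=L)
w = np.exp(layer_logits - layer_logits.max())
w /= w.sum()

# h_t^* = sum_l w_l * h_t^l (Eq. 4), computed for every frame at once.
h_star = np.tensordot(w, hidden_states, axes=(0, 0))   # shape (T, D)

print(h_star.shape, float(w.sum()))
```

During training, only `layer_logits` would be updated; inspecting the learned `w` is exactly the layer-importance analysis reported in Section 4.3.1.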
    <sec id="sec-20">
      <p>In this study, we conduct experiments using 9 distinct datasets spanning 8 different languages: Greek, French, Canadian French, Italian, German, Spanish, Egyptian Arabic, and Persian.</p>
    </sec>
    <sec id="sec-21">
      <p>The datasets vary in their collection methodologies, such as acted emotions and elicitation methods. The participant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our experiments in a speaker-independent setting to prevent evaluation on speaker-dependent features. Table 1 provides an overview of the dataset statistics, with a more detailed description given below.</p>
    </sec>
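The speaker-independent protocol can be illustrated with a small pure-Python sketch that splits by speaker rather than by clip, so no speaker contributes to more than one subset (the speaker counts and ratios below are toy values, not the datasets' actual statistics):

```python
import random

# Toy metadata: (clip_id, speaker_id) pairs, 10 clips per speaker.
clips = [(f"clip{i:03d}", f"spk{i % 6}") for i in range(60)]

speakers = sorted({spk for _, spk in clips})
random.Random(0).shuffle(speakers)

# Hold out whole speakers (here 4/1/1 of the 6 speakers for train/val/test),
# so evaluation cannot exploit speaker-dependent features.
train_spk = set(speakers[:4])
val_spk = {speakers[4]}
test_spk = {speakers[5]}

train = [c for c, s in clips if s in train_spk]
val = [c for c, s in clips if s in val_spk]
test = [c for c, s in clips if s in test_spk]

print(len(train), len(val), len(test))
```

The key property is that the three speaker sets are disjoint by construction, which is what "speaker-independent" guarantees.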
    <sec id="sec-23">
      <p>AESDD [43]: The Acted Emotional Speech Dynamic Database comprises 500 recorded samples from 5 actors (3 females, 2 males) expressing 5 distinct emotions in Greek. Each actor performed 20 utterances per emotion, with some utterances recorded multiple times. In later versions, additional actors were included, bringing the total to 604 recordings from 6 actors.</p>
      <p>CaFE [44]: This dataset includes recordings of 6 different sentences delivered by 12 actors (6 female, 6 male) portraying the Big Six emotions and a neutral state in Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as a down-sampled version at 48 kHz and 16 bits per sample. The total number of samples amounts to 936.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Overview of the datasets used in this study.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Language</th><th>Emotions</th></tr>
          </thead>
          <tbody>
            <tr><td>AESDD</td><td>Greek</td><td>anger, disgust, fear, happiness, and sadness</td></tr>
            <tr><td>CaFE</td><td>Canadian French</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>DEMoS</td><td>Italian</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>EmoDB</td><td>German</td><td>anger, disgust, fear, happiness, boredom, sadness, and neutrality</td></tr>
            <tr><td>EmoMatch</td><td>Spanish</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>EMOVO</td><td>Italian</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>EYASE</td><td>Egyptian Arabic</td><td>anger, happiness, sadness, and neutrality</td></tr>
            <tr><td>Oréau</td><td>French</td><td>anger, disgust, fear, happiness, surprise, sadness, and neutrality</td></tr>
            <tr><td>ShEMO</td><td>Persian</td><td>anger, happiness, sadness, and neutrality</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>DEMoS [45]: DEMoS contains 9697 audio samples from 68 volunteer students (299 females, 131 males) expressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were generated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1 s), are provided in 48 kHz, 16-bit, mono format.</p>
      <p>EmoDB [46]: This collection includes 535 utterances across 7 emotional states, spoken in German by 5 female and 5 male actors. Each actor performed a set of 10 sentences, which were down-sampled from the original 48 kHz to 16 kHz.</p>
      <p>EmoMatch [33]: Consisting of 2005 recordings, EmoMatch features samples from 50 non-actor Spanish speakers (20 females, 30 males) expressing the Big Six emotions and a neutral state. The dataset is a subset of the larger EmoSpanishDB and contains recordings sampled at 48 kHz with a 16-bit mono format.</p>
      <p>EMOVO [47]: EMOVO presents 588 Italian audio recordings from 3 male and 3 female actors simulating the Big Six emotions plus a neutral state. Each actor voiced 14 utterances, and the recordings are provided in 48 kHz, 16-bit stereo WAV format.</p>
      <p>EYASE [48]: EYASE contains 579 utterances in Egyptian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz.</p>
      <p>Oréau [49]: The Oréau dataset features 502 audio samples from 32 non-professional actors (25 male, 7 female) who voiced 10 to 13 utterances in French for the Big Six emotions plus a neutral state.</p>
      <p>ShEMO [50]: ShEMO comprises 3000 semi-natural recordings from 87 native Persian speakers (31 female, 56 male). The dataset captures 5 of the Big Six emotions (sadness, anger, happiness, surprise, and fear) plus a neutral state. The samples were up-sampled to a frequency of 44.1 kHz in mono-channel format, with an average length of 4.11 seconds (std: 3.41 s).</p>
      <p>The audio is resampled to 16 kHz, and a stratified train/validation/test split is performed with ratios of 80/10/10. All results are reported using the macro F1 score, expressed as a percentage. We conducted 3 runs, presenting the mean ± standard deviation.</p>
    </sec>
    <sec id="sec-27">
      <title>4.2. Experimental Details</title>
      <p>Baseline. As a baseline to evaluate LSM transfer learning capabilities, we adopt the Audio Spectrogram Transformer (AST) [51], a fully transformer-based architecture recently proposed as a substitute for CNNs [9, 10, 11]. We train AST from scratch on each of the 9 datasets using the same hyperparameters as [51].</p>
      <p>LSM Models. We use pre-trained checkpoints for both English-Only and Multilingual models: Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R from the Wav2Vec 2.0 family, and Whisper Small (EN) (Whisper Small pre-trained only on English data), Whisper Small, and Whisper Medium from the Whisper family. The LSM backbones are kept frozen and used exclusively as feature extractors.</p>
      <p>Training. We follow the same hyperparameter settings as [15] to train the downstream classifiers. Specifically, we train for 30 epochs using the Adam optimizer with a learning rate of 5.0e-04, weight decay of 1.0e-04, betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimension of the classifier projection P is 256.</p>
      <sec id="sec-27-1">
        <title>4.3. Results</title>
        <p>To present our results, we first compare the performance of the various classifiers (see Section 3) for each LSM utilized. This analysis provides insights into the characteristics of features extracted from Wav2Vec 2.0 and Whisper models for downstream SER tasks. After identifying the best classifier for each LSM, we then compare the performance of English-Only and Multilingual LSMs across the 8 languages covered in this study.</p>
      </sec>
      <sec id="sec-27-2">
        <title>4.3.1. Comparison between downstream classifiers</title>
        <p>We examine the results in Table 2, comparing the three classifier methods for Wav2Vec 2.0 and Whisper models. The table shows average F1 scores across the 9 datasets, highlighting the most effective classifier for each LSM in cross-lingual SER tasks.</p>
        <p>For Wav2Vec 2.0 models, the Multi-Layer Classifier performs best, with F1 scores of 53.42, 57.50, and 40.89 for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R. The Linear and Non-Linear classifiers perform similarly, especially for Wav2Vec 2.0 Large and XLS-R, suggesting improvements are due to using features from internal Transformer layers rather than non-linear activations. For Whisper models, the Linear Classifier performs best, with F1 scores of 58.16, 60.87, and 60.72 for Whisper Small (EN), Whisper Small, and Whisper Medium. Increasing classifier complexity with non-linear activations decreases performance, likely due to general information loss caused by complex transformations. The Multi-Layer Classifier performs worse, indicating that also using features from internal layers is less effective than using features from the last layer alone.</p>
        <p>This comparison reveals that Wav2Vec 2.0 models benefit from features extracted from internal Transformer layers and exhibit less sensitivity to classifier complexity, consistent with prior research [41, 39]. Conversely, Whisper models achieve better performance with features from the last Transformer layer when using a simple linear classifier, offering new insights into their effectiveness for SER across multiple languages. We hypothesize that this differing behavior may be related to their respective Self-Supervised and Weakly-Supervised pre-training approaches, which warrant further investigation. To gain further insights into the importance of Transformer layers in Wav2Vec 2.0 and Whisper for SER, we leverage the weights learned in the Multi-Layer classifier as follows.</p>
        <p>Transformer Layer Weights. We analyze the weights w_1, ..., w_L from the Multi-Layer Classifier to assess Transformer layer importance. Figure 2 illustrates that Wav2Vec 2.0 models assign greater weight to the early and middle layers, whereas Whisper models emphasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic information in Whisper models is embedded in the features of the later Transformer layers.</p>
      </sec>
    </sec>
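The reported metric can be reproduced with a small sketch: the macro F1 is the unweighted average of per-class F1 scores, and scores over the 3 runs are summarised as mean ± standard deviation (the labels below are toy values for illustration, not results from the paper):

```python
from statistics import mean, stdev

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, expressed as a percentage."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return 100.0 * sum(f1s) / len(f1s)

# Three runs on toy predictions, summarised as mean ± standard deviation.
runs = [macro_f1([0, 0, 1, 1], p) for p in ([0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 1, 0])]
print(f"{mean(runs):.2f} (± {stdev(runs):.2f})")
```

Because the average is unweighted, rare emotion classes count as much as frequent ones, which is why macro F1 is a sensible choice for the imbalanced datasets above.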
    <sec id="sec-28">
      <title>4.3.2. Comparing English-Only and Multilingual LSMs Across Different Languages</title>
      <p>Figure 2: Greyscale map of layer weight distribution from the Multi-Layer classification method. Weights are averaged over all 9 datasets for each model. Darker shades indicate higher weights.</p>
      <p>In this section, we compare English-Only and Multilingual LSMs with the AST baseline across 9 datasets. Table 3 displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 and Linear for Whisper models.</p>
      <p>Transferring knowledge from LSMs proves to be effective across all datasets compared to the baseline. For instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores 33.23. This indicates that LSMs are effective feature extractors for cross-lingual SER on multiple languages.</p>
      <p>When comparing English-only and Multilingual models, we differentiate between the Wav2Vec 2.0 and Whisper families. For Wav2Vec 2.0, we observe that Wav2Vec 2.0 Base and Large generally outperform XLS-R (e.g., 87.85 and 88.31 vs. 67.71 for DEMoS), except in Persian, where their performance is comparable. This indicates that multilingual pre-training may not be as effective for Wav2Vec 2.0 models across various languages. We speculate that this may be due to the limitations of SSL pre-training, which might struggle with the diverse range of languages and lose important paralinguistic features that are retained in English-only models. Further investigation with a wider range of SSL-pretrained LSMs could provide more insights. As regards Whisper, Multilingual Whisper Small outperforms its English-only version, with the exception of Greek and Persian, likely due to limited pretraining data for these languages, which resulted in higher word error rates compared to other languages in this study [13]. Multilingual Whisper models achieve the best performance in Canadian French and Spanish (66.71 and 73.13 with Whisper Small) and in Italian, German, and French (91.17, 90.64, and 95.22 with Whisper Medium). This improvement is likely due to the larger pretraining datasets for these languages and the similarities between Canadian French and French. We believe that multilingual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL and multitask learning. However, further research is needed to evaluate the effectiveness of multilingual pretraining with WSL compared to SSL across a broader range of LSMs.</p>
      <p>Table 3: Macro F1 scores (mean ± standard deviation over 3 runs) of AST and of the English-Only (Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, Whisper Small (EN)) and Multilingual (XLS-R, Whisper Small, Whisper Medium) models on the 9 datasets: AESDD (el), CaFE (fr-ca), DEMoS (it), EmoDB (de), EmoMatch (es), EMOVO (it), EYASE (ar-eg), Oréau (fr), and ShEMO (fa).</p>
      <sec id="sec-28-1">
        <title>5. Conclusion</title>
      </sec>
    </sec>
    <sec id="sec-29">
      <p>This paper examines the capabilities of Wav2Vec 2.0 and Whisper models as feature extractors for cross-lingual SER across eight languages, considering both English-Only and Multilingual variants. Our findings reveal that LSMs are effective feature extractors compared to a full Transformer baseline trained from scratch. We observe that Whisper models encode acoustic information primarily in the features of the last Transformer layer, whereas Wav2Vec 2.0 models rely on features from middle and early layers. Furthermore, we show that multilingual pre-training benefits Whisper models, leading to strong performance in Italian, Canadian French, French, Spanish, and German, and competitive results in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving top performance in Greek and Egyptian Arabic. We attribute the disparity in multilingual pre-training effectiveness to the differences between SSL and WSL strategies, which should be explored further.</p>
    </sec>
    <sec id="sec-30">
      <title>References</title>
      <p>in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, 2016, pp. 1–4.</p>
      <p>[9] N.-C. Ristea, R. T. Ionescu, F. S. Khan, SepTr: Separable transformer for audio spectrogram processing, arXiv preprint arXiv:2203.09581 (2022).</p>
      <p>[10] J.-Y. Kim, S.-H. Lee, CoordViT: A novel method of improve vision transformer-based speech emotion recognition using coordinate information concatenate, in: 2023 International Conference on Electronics, Information, and Communication (ICEIC), IEEE, 2023, pp. 1–4.</p>
      <p>[11] S. Akinpelu, S. Viriri, A. Adegun, An enhanced speech emotion recognition using vision transformer, Scientific Reports 14 (2024) 13126.</p>
      <p>[12] S. Liu, A. Mallol-Ragolta, E. Parada-Cabaleiro, K. Qian, X. Jing, A. Kathan, B. Hu, B. W. Schuller, Audio self-supervised learning: A survey, Patterns 3 (2022).</p>
      <p>[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.</p>
      <p>[14] L. Pepino, P. Riera, L. Ferrer, Emotion recognition from speech using wav2vec 2.0 embeddings, arXiv preprint arXiv:2104.03502 (2021).</p>
      <p>[15] T. Feng, S. Narayanan, PEFT-SER: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2023, pp. 1–8.</p>
      <p>[16] Y. Li, A. Mehrish, R. Bhardwaj, N. Majumder, B. Cheng, S. Zhao, A. Zadeh, R. Mihalcea, S. Poria, Evaluating parameter-efficient transfer learning approaches on SURE benchmark for speech understanding, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.</p>
      <p>[17] T. Feng, R. Hebbar, S. Narayanan, Trust-SER: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 11201–11205.</p>
      <p>[18] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.</p>
      <p>[19] M. Sharma, Multi-lingual multi-task speech emotion recognition using wav2vec 2.0, in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 6907–6911.</p>
      <p>[20] S. G. Upadhyay, L. Martinez-Lucas, B.-H. Su, W.-C. Lin, W.-S. Chien, Y.-T. Wu, W. Katz, C. Busso, C.-C. Lee, Phonetic anchor-based transfer learning to facilitate unsupervised cross-lingual speech emotion recognition, in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.</p>
      <p>[21] S. A. M. Zaidi, S. Latif, J. Qadir, Cross-language speech emotion recognition using multimodal dual attention transformers, arXiv preprint arXiv:2306.13804 (2023).</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[23] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).</p>
      <p>[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).</p>
      <p>[25] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518.</p>
      <p>[26] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, et al., SUPERB: Speech processing universal performance benchmark, arXiv preprint arXiv:2105.01051 (2021).</p>
      <p>[27] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, M. Auli, Unsupervised cross-lingual representation learning for speech recognition, arXiv preprint arXiv:2006.13979 (2020).</p>
      <p>[28] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. Von Platen, Y. Saraf, J. Pino, et al., XLS-R: Self-supervised cross-lingual speech representation learning at scale, arXiv preprint arXiv:2111.09296 (2021).</p>
      <p>[29] A. Wurst, M. Hopwood, S. Wu, F. Li, Y.-D. Yao, Deep learning for the detection of emotion in human speech: The impact of audio sample duration and English versus Italian languages, in: 2023 32nd Wireless and Optical Communications Conference (WOCC), IEEE, 2023, pp. 1–6.</p>
      <p>[30] M. Neumann, et al., Cross-lingual and multilingual speech emotion recognition on English and French, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 5769–5773.</p>
      <p>[31] S. Deng, N. Zhang, Z. Sun, J. Chen, H. Chen, When low resource NLP meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract), in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 13773–13774.</p>
      <p>[32] S. Latif, A. Qayyum, M. Usman, J. Qadir, Cross lingual speech emotion recognition: Urdu vs. western languages, in: 2018 International conference</p>
      <p>ing Society 66 (2018) 457–467.</p>
      <p>[44] P. Gournay, O. Lahaie, R. Lefebvre, A Canadian French emotional speech dataset, in: Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 399–402.</p>
      <p>[45] E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, B. W. Schuller, DEMoS: An Italian emotional speech corpus: Elicitation methods, machine
on frontiers of information technology (FIT), IEEE, learning, and perception, Language Resources and
2018, pp. 88–93. Evaluation 54 (2020) 341–383.
[33] E. Garcia-Cuesta, A. B. Salvador, D. G. Pãez, Emo- [46] F. Burkhardt, A. Paeschke, M. Rolfes, W. F.
matchspanishdb: study of speech emotion recog- Sendlmeier, B. Weiss, et al., A database of german
nition machine learning models in a new spanish emotional speech., in: Interspeech, volume 5, 2005,
elicited database, Multimedia Tools and Applica- pp. 1517–1520.</p>
      <p>tions 83 (2024) 13093–13112. [47] G. Costantini, I. Iaderola, A. Paoloni, M. Todisco,
[34] A. Zadeh, M. Chen, S. Poria, E. Cambria, L.-P. et al., Emovo corpus: an italian emotional speech
Morency, Tensor fusion network for multimodal database, in: Proceedings of the ninth international
sentiment analysis, arXiv preprint arXiv:1707.07250 conference on language resources and evaluation
(2017). (LREC’14), European Language Resources
Associa[35] H. H. Mao, A survey on self-supervised pre-training tion (ELRA), 2014, pp. 3501–3504.
for sequential transfer learning in neural networks, [48] L. Abdel-Hamid, Egyptian arabic speech emotion
arXiv preprint arXiv:2007.00800 (2020). recognition using prosodic, spectral and wavelet
[36] S. Sadok, S. Leglaive, R. Séguier, A vector quantized features, Speech Communication 122 (2020) 19–30.
masked autoencoder for speech emotion recogni- [49] S. Oréau, French emotional speech database - oréau,
tion, in: 2023 IEEE International conference on Zenodo, 2021. URL: https://zenodo.org/records/
acoustics, speech, and signal processing workshops 4405783.</p>
      <p>(ICASSPW), IEEE, 2023, pp. 1–5. [50] O. Mohamad Nezami, P. Jamshid Lou, M. Karami,
[37] F. Catania, Speech emotion recognition in italian Shemo: a large-scale validated database for persian
using wav2vec 2, Authorea Preprints (2023). speech emotion detection, Language Resources and
[38] Y. Belinkov, J. Glass, Analyzing hidden representa- Evaluation 53 (2019) 1–16.</p>
      <p>tions in end-to-end automatic speech recognition [51] Y. Gong, Y.-A. Chung, J. Glass, Ast: Audio
spectrosystems, Advances in Neural Information Process- gram transformer, arXiv preprint arXiv:2104.01778
ing Systems 30 (2017). (2021).
[39] J. Shah, Y. K. Singla, C. Chen, R. R. Shah, What all
do audio transformer models hear? probing
acoustic representations for language delivery and its
structure, arXiv preprint arXiv:2101.00387 (2021).
[40] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise
analysis of a self-supervised speech representation model,
in: 2021 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), IEEE, 2021, pp.</p>
      <p>914–921.
[41] Y. Li, Y. Mohamied, P. Bell, C. Lai, Exploration
of a self-supervised speech model: A study on
emotional corpora, in: 2022 IEEE Spoken
Language Technology Workshop (SLT), IEEE, 2023, pp.</p>
      <p>868–875.
[42] A. Pasad, B. Shi, K. Livescu, Comparative
layerwise analysis of self-supervised speech models,
in: ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), IEEE, 2023, pp. 1–5.
[43] N. Vryzas, R. Kotsakis, A. Liatsou, C. A. Dimoulas,</p>
      <p>G. Kalliris, Speech emotion recognition for
performance interaction, Journal of the Audio
Engineer</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <article-title>Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities</article-title>
          ,
          <source>IEEE Signal Processing Magazine</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>22</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <article-title>A survey of deep learning-based multimodal emotion recognition: Speech, text, and face</article-title>
          ,
          <source>Entropy</source>
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>1440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Wani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Gunawan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. A.</given-names>
            <surname>Qadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kartiwi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ambikairajah</surname>
          </string-name>
          ,
          <article-title>A comprehensive review of speech emotion recognition systems</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>47795</fpage>
          -
          <lpage>47814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using cnn</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM international conference on Multimedia</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>801</fpage>
          -
          <lpage>804</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Badshah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Baik</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition from spectrograms with deep convolutional neural network</article-title>
          ,
          <source>in: 2017 international conference on platform technology and service (PlatCon)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using deep 1d &amp; 2d cnn lstm networks</article-title>
          ,
          <source>Biomedical signal processing and control</source>
          <volume>47</volume>
          (
          <year>2019</year>
          )
          <fpage>312</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hashemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Annavaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Enhancing privacy through domain adaptive noise injection for speech emotion recognition</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>7702</fpage>
          -
          <lpage>7706</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Speech emotion recognition using convolutional and recurrent neural networks,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>